IM2 Publications

[1] N. Li, O. Mubin, F. Kaplan, and P. Dillenbourg. A tabletop environment for augmenting meetings with background search. Under peer review for the ITS2011 conference, Kobe, Japan. [ bib ]
Keywords: report_X, IM2.IP2, Group Dillenbourg, unpublished
[2] L. Goldmann, A. Samour, T. Ebrahimi, and T. Sikora. Multimodal person search combining information fusion and relevance feedback. In IEEE International Workshop on Multimedia Signal Processing (MMSP 2009). [ bib | http | Abstract ]
Keywords: content based multimedia retrieval, sensing people, relevance feedback; multimodal fusion, IM2.MCA, Report_VIII
[3] F. De Simone, M. Naccari, M. Tagliasacchi, F. Dufaux, S. Tubaro, and T. Ebrahimi. Subjective assessment of H.264/AVC video sequences transmitted over a noisy channel. In First International Workshop on Quality of Multimedia Experience (QoMEX 2009). [ bib | http ]
In this paper we describe a database containing subjective assessment scores for 78 video streams encoded with H.264/AVC and corrupted by simulating transmission over an error-prone network. The data has been collected from 40 subjects at the premises of two academic institutions. Our goal is to provide a balanced and comprehensive database to enable reproducible research results in the field of video quality assessment. In order to support research on Full-Reference, Reduced-Reference and No-Reference video quality assessment algorithms, both the uncompressed files and the H.264/AVC bitstreams of each video sequence have been made publicly available to the research community, together with the subjective results of the performed evaluations.

Keywords: Subjective video quality assessment; Packet loss rate; H.264/AVC; Error resilience, IM2.MCA, Report_VIII
[4] J. S. Lee, F. De Simone, and T. Ebrahimi. Influence of audio-visual attention on perceived quality of standard definition multimedia content. In First International Workshop on Quality of Multimedia Experience (QoMEX 2009). [ bib | www: ]
When human subjects assess the quality of multimedia data, high level perceptual processes such as Focus of Attention (FoA) and eye movements are believed to play an important role in such tasks. While prior art reports incorporation of visual FoA into objective quality metrics, audio-visual FoA has rarely been addressed and utilized, in spite of the importance and presence of both audio and video information in many multimedia systems. This paper explores the influence of audio-visual FoA on the perceived quality of standard definition audio-visual sequences. Results of a subjective quality assessment study are reported, where it is shown that the sound source attracts visual attention, and thereby visual degradation in regions far from the source is perceived less than that in sound-emitting regions.

Keywords: Quality assessment, Audio-visual Focus of Attention, Cross-modal interaction, Perceived quality, IM2.MCA, Report_VIII
[5] J. S. Lee and T. Ebrahimi. Two-level bimodal association for audio-visual speech recognition. In International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS'09). [ bib ]
This paper proposes a new method for bimodal information fusion in audio-visual speech recognition, where cross-modal association is considered at two levels. First, the acoustic and the visual data streams are combined at the feature level by using canonical correlation analysis, which addresses the problem of audio-visual synchronization while exploiting the cross-modal correlation. Second, information streams are integrated at the decision level for adaptive fusion of the streams according to the noise condition of the given speech datum. Experimental results demonstrate that the proposed method is effective for producing noise-robust recognition performance without a priori knowledge about the noise conditions of the speech data.

Keywords: audio-visual speech recognition; synchronization; cross-modal correlation; canonical correlation analysis, IM2.MCA, Report_VIII
[6] D. Gatica-Perez and J. M. Odobez. Visual attention, speaking activity, and group conversational analysis in multi-sensor environments. In H. Nakashima, J. Augusto, H. Aghajan (Eds.), Handbook of Ambient Intelligence and Smart Environments, Springer, in press. [ bib ]
Keywords: IM2.MPR, Report_VIII
[7] A. Popescu-Belis. Multimodal database annotation formats and standards, software architecture for multimodal interfaces. In J. Ph. Thiran, H. Bourlard, and F. Marques, editors, Multimodal Signal Processing: Methods and Techniques to Build Multimodal Interactive Systems. Academic Press, in press. [ bib ]
Keywords: IM2.DMA, Report_VIII
[8] E. Mugellini, D. Lalanne, B. Dumas, F. Evéquoz, S. Gerardi, A. Le Calvé, A. Boder, R. Ingold, and O. Khaled. Memodules as tangible shortcuts to multimedia information. [ bib ]
Keywords: IM2.HMI, Report_VIII
[9] D. Gatica-Perez. Modeling interest in face-to-face conversations from multimodal nonverbal behavior. In J.-P. Thiran, H. Bourlard, and F. Marques, (Eds.), Multimodal Signal Processing, Academic Press, in press. [ bib ]
Keywords: IM2.MPR, Report_VIII
[10] D. Brodbeck, R. Mazza, and D. Lalanne. Interactive visualization - a survey. [ bib ]
Keywords: IM2.HMI, Report_VIII
[11] B. Noris, K. Benmachiche, and A. Billard. Calibration-free eye gaze direction detection with Gaussian processes. In International Conference on Computer Vision Theory and Applications (VISAPP 08). [ bib ]
Keywords: IM2.MPR, Report_VIII
[12] B. Dumas, D. Lalanne, and S. Oviatt. Multimodal interfaces: a survey of principles, models and frameworks. [ bib ]
Keywords: IM2.HMI, Report_VIII
[13] P. Motlicek, H. Hermansky, H. Garudadri, and N. Srinivasamurthy. Audio coding based on long temporal contexts. IDIAP-RR 30, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
We describe a novel audio coding technique designed for medium bit-rates. Unlike classical state-of-the-art audio coders that are based on short-term spectra, our approach uses relatively long temporal segments of the audio signal in critical-band-sized sub-bands. We apply an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands. Residual signals (Hilbert carriers) are demodulated, and thresholding functions are applied in the spectral domain. The Hilbert envelopes and carriers are quantized and transmitted to the decoder. Our experiments focused on designing an audio coder that provides broadcast radio-like quality at around 10-20 kbps. Objective quality measures indicate performance comparable to the 3GPP-AMR speech codec standard for both speech and non-speech signals.

Keywords: Report_VI, IM2.AP
[14] D. Zhang, D. Gatica-Perez, and S. Bengio. Exploring contextual information in a layered framework for group action recognition. In the Eighth International Conference on Multimodal Interfaces (ICMI'06), 2006. IDIAP-RR 06-41. [ bib | .ps.gz | .pdf ]
Contextual information is important for sequence modeling. Hidden Markov Models (HMMs) and their extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, and thus cannot accommodate overlapping features or long-term contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on Conditional Random Fields (CRFs), a conditional model that relaxes the independence assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparisons with a standard HMM baseline show the validity of the proposed approach.

Keywords: Report_VI, IM2.MPR
[15] D. Barber and S. Chiappa. Unified inference for variational Bayesian linear Gaussian state-space models. In NIPS, 2006. IDIAP-RR 06-50. [ bib | .ps.gz | .pdf ]
Linear Gaussian State-Space Models are widely used and a Bayesian treatment of parameters is therefore of considerable interest. The approximate Variational Bayesian method applied to these models is an attractive approach, used successfully in applications ranging from acoustics to bioinformatics. The most challenging aspect of implementing the method is in performing inference on the hidden state sequence of the model. We show how to convert the inference problem so that standard and stable Kalman Filtering/Smoothing recursions from the literature may be applied. This is in contrast to previously published approaches based on Belief Propagation. Our framework both simplifies and unifies the inference problem, so that future applications may be easily developed. We demonstrate the elegance of the approach on Bayesian temporal ICA, with an application to finding independent components in noisy EEG signals.

Keywords: Report_VI, IM2.MPR
[16] H. Ketabdar and H. Hermansky. Identifying unexpected words using in-context and out-of-context phoneme posteriors. IDIAP-RR 68, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
The paper proposes and discusses a machine approach for the identification of unexpected (zero or low probability) words. The approach is based on the use of two parallel recognition channels: one channel employs sensory information from the speech signal together with prior context information, provided by the pronunciation dictionary and grammatical constraints, to estimate `in-context' posterior probabilities of phonemes; the other channel is independent of the context information and entirely driven by the sensory data to deliver estimates of `out-of-context' posterior probabilities of phonemes. A significant mismatch between the information from these two channels indicates an unexpected word. The viability of this concept is demonstrated on the identification of out-of-vocabulary digits in continuous digit streams. The comparison of these two channels provides a confidence measure on the output of the recognizer. Unlike conventional confidence measures, this measure does not rely on phone and word segmentation (boundary detection), and is thus not affected by possibly imperfect segment boundary detection. In addition, being a relative measure, it is more discriminative than conventional posterior-based measures.

Keywords: Report_VI, IM2.AP
[17] A. Just. Two-handed gestures for human-computer interaction. IDIAP-RR, École Polytechnique Fédérale de Lausanne, 2006. PhD Thesis #3683 at the École Polytechnique Fédérale de Lausanne. [ bib | .ps.gz | .pdf ]
The present thesis is concerned with the development and evaluation (in terms of accuracy and utility) of systems using hand postures and hand gestures for enhanced Human-Computer Interaction (HCI). In our case, these systems are based on vision techniques, thus only requiring cameras, and no other specific sensors or devices. When dealing with hand movements, it is necessary to distinguish two aspects of these hand movements: the static aspect and the dynamic aspect. The static aspect is characterized by a pose or configuration of the hand in an image and is related to the Hand Posture Recognition (HPR) problem. The dynamic aspect is defined either by the trajectory of the hand, or by a series of hand postures in a sequence of images. This second aspect is related to the Hand Gesture Recognition (HGR) task. Given the recognized lack of common evaluation databases in the HGR field, a first contribution of this thesis was the collection and public distribution of two databases, containing both one- and two-handed gestures, which part of the results reported here will be based upon. On these databases, we compare two state-of-the-art models for the task of HGR. As a second contribution, we propose an HPR technique based on a new feature extraction. This method has the advantage of being faster than conventional methods while yielding good performance. In addition, we provide comparison results of this method with other state-of-the-art techniques. Finally, the most important contribution of this thesis lies in the thorough study of the state of the art, not only in HGR and HPR but also more generally in the field of HCI. The first chapter of the thesis provides an extended study of the state of the art. The second chapter of this thesis contributes to HPR. We propose to apply to HPR a technique employed with success for face detection. This method is based on the Modified Census Transform (MCT) to extract relevant features in images. 
We evaluate this technique on an existing benchmark database and provide comparison results with other state-of-the-art approaches. The third chapter is related to HGR. In this chapter we describe the first recorded database, containing both one- and two-handed gestures in the 3D space. We propose to compare two models used with success in HGR, namely Hidden Markov Models (HMM) and Input-Output Hidden Markov Model (IOHMM). The fourth chapter is also focused on HGR but more precisely on two-handed gesture recognition. For that purpose, a second database has been recorded using two cameras. The goal of these gestures is to manipulate virtual objects on a screen. We propose to investigate on this second database the state-of-the-art sequence processing techniques we used in the previous chapter. We then discuss the results obtained using different features, and using images of one or two cameras. In conclusion, we propose a method for HPR based on new feature extraction. For HGR, we provide two databases and comparison results of two major sequence processing techniques. Finally, we present a complete survey on recent state-of-the-art techniques for both HPR and HGR. We also present some possible applications of these techniques, applied to two-handed gesture interaction. We hope this research will open new directions in the field of hand posture and gesture recognition.

Keywords: Report_VI, IM2.VP, Human-Computer Interaction, Computer Vision, Hand Posture Recognition, Modified Census Transform, Hand Gesture Recognition, Hidden Markov Model, Input-Output Hidden Markov Model
[18] L. Pérez-Freire, F. Pérez-González, and S. Voloshynovskiy. An accurate analysis of scalar quantization-based data hiding. IEEE Trans. on Information Forensics and Security, 1(1):80–86, 2006. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[19] S. Chiappa. Analysis and classification of EEG signals using probabilistic models for brain computer interfaces. PhD thesis, École Polytechnique Fédérale de Lausanne, 2006. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.BMI
[20] R. Bertolami, B. Halter, and H. Bunke. Combination of multiple handwritten text line recognition systems with a recursive approach. In Proc. 10th Int. Workshop Frontiers in Handwriting Recognition, pages 61–65, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[21] S. Voloshynovskiy, O. Koval, M. K. Mihcak, and T. Pun. The edge process model and its application to information hiding capacity analysis. IEEE Trans. on Signal Processing, 54(5):1813–1825, 2006. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[22] G. Andreani, G. Di Fabbrizio, M. Gilbert, D. Gillick, D. Hakkani-Tur, and O. Lemon. Let's discoh: collecting an annotated open corpus with dialog acts and reward signals for natural language helpdesks. Proc. IEEE/ACL Workshop on Spoken Language Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.AP
[23] P. Motlicek, V. Ullal, and H. Hermansky. Wide-band perceptual audio coding based on frequency-domain linear prediction. IDIAP-RR 58, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
In this paper we propose an extension of a very low bit-rate speech coding technique, exploiting the predictability of the temporal evolution of spectral envelopes, for wide-band audio coding applications. Temporal envelopes in critical-band-sized sub-bands are estimated using frequency domain linear prediction applied to relatively long time segments. The sub-band residual signals, which play an important role in acquiring high quality reconstruction, are processed using a heterodyning-based signal analysis technique. For reconstruction, their optimal parameters are estimated using a closed-loop analysis-by-synthesis technique driven by a perceptual model emulating the simultaneous masking properties of the human auditory system. We discuss the advantages of the approach and show some of its properties on challenging audio recordings. The proposed technique is capable of encoding high quality, variable rate audio signals at bit-rates below 1 bit/sample.

Keywords: Report_VI, IM2.AP
[24] J. Richiardi and A. Drygajlo. Applying biometrics to identity documents: implementation issues. SNSF AMBAI project technical report, Swiss Federal Institute of Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.MPR
[25] J. Luo, A. Pronobis, and B. Caputo. SVM-based transfer of visual knowledge across robotic platforms. IDIAP-RR 65, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
This paper presents an SVM-based algorithm for the transfer of knowledge across robot platforms aiming to perform the same task. Our method efficiently exploits the transferred knowledge while incrementally updating the internal representation as new information becomes available. The algorithm is adaptive and tends to privilege new data when building the SV solution. This prevents the old knowledge from nesting into the model and eventually becoming a possible source of misleading information. We tested our approach in the domain of vision-based place recognition. Extensive experiments show that using transferred knowledge clearly pays off in terms of performance and stability of the solution.

Keywords: Report_VI, IM2.VP.HMI
[26] B. Leibe, N. Cornelis, K. Cornelis, and L. van Gool. Integrating recognition and reconstruction for cognitive traffic scene analysis from a moving vehicle. In DAGM Annual Pattern Recognition Symposium, volume 4174 of LNCS, pages 192–201. Springer, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[27] A. Schlapbach and H. Bunke. Off-line writer verification: a comparison of a hidden Markov model (HMM) and a Gaussian mixture model (GMM) based system. In Proc. 10th Int. Workshop Frontiers in Handwriting Recognition, pages 275–280, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[28] S. Ba and J. M. Odobez. Recognizing people's focus of attention from head poses: a study. IDIAP-RR 42, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
This paper presents a study on the recognition of the visual focus of attention (VFOA) of meeting participants based on their head pose. Contrary to previous studies on the topic, in our set-up, the potential VFOA of a person is not restricted to the other meeting participants only, but includes environmental targets (e.g. a table and a projection screen). This has two consequences. First, it increases the number of possible ambiguities in identifying the VFOA from the head pose. Secondly, in the scenario we present here, full knowledge of the head pointing direction is required to identify the VFOA; an incomplete representation of the head pointing direction (head pan only) will not suffice. In this paper, using a corpus of 8 meetings of 10 minutes average length, featuring 4 persons discussing statements projected on a screen, we analyze the above issues by evaluating, through numerical performance measures, the recognition of the VFOA from head pose information obtained either using a magnetic sensor device (the ground truth) or a vision-based tracking system (head pose estimates). The results clearly show that in such complex but realistic situations, it can be optimistic to believe that the recognition of the VFOA can be based solely on the head pose, as some previous studies had suggested.

Keywords: Report_VI, IM2.VP
[29] S. Cuendet, D. Hakkani-Tur, and G. Tur. Model adaptation for sentence segmentation from speech. Proc. IEEE/ACL Workshop on Spoken Language Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.AP
[30] S. Marcel, Y. Rodriguez, M. Guillemot, and A. Popescu-Belis. Annotation of face detection: description of XML format and files. IDIAP-COM 06, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.VP
[31] Y. Rodriguez. Face detection and verification using local binary patterns. IDIAP-RR, École Polytechnique Fédérale de Lausanne, 2006. PhD Thesis #3681 at the École Polytechnique Fédérale de Lausanne. [ bib | .ps.gz | .pdf ]
This thesis proposes a robust Automatic Face Verification (AFV) system using Local Binary Patterns (LBP). AFV is mainly composed of two modules: Face Detection (FD) and Face Verification (FV). The purpose of FD is to determine whether there are any faces in an image, while FV involves confirming or denying the identity claimed by a person. The contributions of this thesis are the following: 1) a real-time multiview FD system which is robust to illumination and partial occlusion, 2) a FV system based on the adaptation of LBP features, 3) an extensive study of the performance evaluation of FD algorithms and in particular the effect of FD errors on FV performance. The first part of the thesis addresses the problem of frontal FD. We introduce the system of Viola and Jones, which is the first real-time frontal face detector. One of its limitations is its sensitivity to local lighting variations and partial occlusion of the face. In order to cope with these limitations, we propose to use LBP features. Special emphasis is given to the scanning process and to the merging of overlapped detections, because both have a significant impact on performance. We then extend our frontal FD module to multiview FD. In the second part, we present a novel generative approach for FV, based on an LBP description of the face. The main advantages compared to previous approaches are a very fast and simple training procedure and robustness to bad lighting conditions. In the third part, we address the problem of estimating the quality of FD. We first show the influence of FD errors on the FV task and then empirically demonstrate the limitations of current detection measures when applied to this task. In order to properly evaluate the performance of a face detection module, we propose to embed the FV into the performance measuring process. We show empirically that the proposed methodology better matches the final FV performance.

Keywords: Report_VI, IM2.VP, Face Detection and Verification, Boosting, Local Binary Patterns
[32] D. Hillard, Z. Huang, H. Ji, R. Grishman, D. Hakkani-Tur, M. Harper, M. Ostendorf, and W. Wang. Impact of automatic comma prediction on pos/name tagging of speech. Proc. IEEE/ACL Workshop on Spoken Language Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.AP
[33] M. Everingham, A. Zisserman, C. Williams, L. van Gool, M. Allan, C. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorko, S. Duffner, J. Eichhorn, J. Farquhar, M. Fritz, C. Garcia, T. Griffiths, F. Jurie, D. Keysers, M. Koskela, J. Laaksonen, D. Larlus, B. Leibe, H. Meng, H. Ney, B. Schiele, C. Schmid, E. Seemann, J. Shawe-Taylor, A. Storkey, S. Szedmak, B. Triggs, I. Ulusoy, V. Viitaniemi, and J. Zhang. The 2005 PASCAL visual object classes challenge. In Selected Proceedings of the 1st PASCAL Challenges Workshop, Lecture Notes in AI. Springer, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[34] P. Wey, B. Fischer, H. Bay, and J. M. Buhmann. Dense stereo by triangular meshing and cross validation. In DAGM-Symposium, pages 708–717, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[35] P. Müller, P. Wonka, S. Haegler, A. Ulmer, and L. van Gool. Procedural modeling of buildings. In Proceedings of ACM SIGGRAPH 2006 / ACM Transactions on Graphics, volume 25, pages 614–623. ACM Press, 2006. [ bib ]
Keywords: Report_VI, IM2.VP, Procedural Modeling, Architecture, Chomsky Grammars, L-systems, Computer-Aided Design
[36] D. Zhang. Probabilistic graphical models for human interaction analysis. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2006. thesis # (IDIAP-RR 06-78). [ bib | .ps.gz | .pdf ]
The objective of this thesis is to develop probabilistic graphical models for analyzing human interaction in meetings based on multimodal cues. We use meetings as a case study of human interaction, since research shows that high complexity information is mostly exchanged through face-to-face interactions. Modeling human interaction poses several challenging research issues for the machine learning community. In meetings, each participant generates a multimodal data stream. Modeling human interaction thus involves simultaneous recording and analysis of multiple multimodal streams. These streams may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. In this thesis, we developed three probabilistic graphical models for human interaction analysis. The proposed models use the “probabilistic graphical model” formalism, a formalism that exploits the conjoined capabilities of graph theory and probability theory to build complex models out of simpler pieces. We first introduce the multi-layer framework, in which the first layer models typical individual activity from low-level audio-visual features, and the second layer models the interactions. The two layers are linked by a set of posterior probability-based features. Next, we describe the team-player influence model, which learns the influence of interacting Markov chains within a team. The team-player influence model has a two-level structure: individual-level and group-level. The individual level models the actions of each player, and the group level models the actions of the team as a whole. The influence of each player on the team is jointly learned with the rest of the model parameters in a principled manner using the Expectation-Maximization (EM) algorithm. Finally, we describe the semi-supervised adapted HMMs for unusual event detection. 
Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised model-based approaches. We propose a semi-supervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner.

Keywords: Report_VI, IM2.MPR.HMI
[37] E. L. Torre, B. Caputo, and T. Tommasi. Melanoma recognition using kernel classifiers. IDIAP-RR 53, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Melanoma is the most deadly skin cancer. Early diagnosis is a current challenge for clinicians. Current algorithms for skin lesions classification focus mostly on segmentation and feature extraction. This paper instead puts the emphasis on the learning process, proposing two kernel-based classifiers: support vector machines, and spin glass-Markov random fields. We benchmarked these algorithms against a state-of-the-art method on melanoma recognition. We show with extensive experiments that the support vector machine approach outperforms the other methods, proving to be an effective classification algorithm for computer assisted diagnosis of melanoma.

Keywords: Report_VI, IM2.VP
[38] M. Keller. Machine learning approaches to text representation using unlabeled data. PhD thesis, École Polytechnique Fédérale de Lausanne, 2006. IDIAP-RR 06-76. [ bib | .ps.gz | .pdf ]
With the rapid expansion in the use of computers for producing digitized textual documents, the need for automatic systems for organizing and retrieving the information contained in large databases has become essential. In general, information retrieval systems rely on a formal description or representation of documents enabling their automatic processing. In the most common representation, the so-called bag-of-words, documents are represented by the words composing them, and two documents (or a user query and a document) are considered similar if they have a high number of co-occurring words. In this representation, documents with different, but semantically related terms will be considered as unrelated, and documents using the same terms but in different contexts will be seen as similar. It arises quite naturally that information retrieval systems can use the huge amount of existing textual documents in order to “learn”, as humans do, the different uses of words depending on the context. This information can be used to enrich documents' representations. In this thesis dissertation we develop several original machine learning approaches which attempt to fulfil this aim. As a first approach to document representation we propose a probabilistic model in which documents are assumed to be issued from a mixture of distributions over themes, modeled by a hidden variable conditioning a multinomial distribution over words. Simultaneously, words are assumed to be drawn from a mixture of distributions over topics, modeled by a second hidden variable dependent on the themes. As a second approach, we propose a neural network which is trained to give a score for the appropriateness of a word in a given context. Finally, we present a multi-task learning approach, which is jointly trained to solve an information retrieval task while learning from unlabeled data to improve its representation of documents.

Keywords: Report_VI, IM2.MPR.MCA
[39] P. Quelhas and J. M. Odobez. Natural scene image modeling using color and texture visterms. In Conference on Image and Video Retrieval CIVR, 2006. IDIAP-RR 06-17. [ bib | .ps.gz | .pdf ]
This paper presents a novel approach for visual scene representation, combining the use of quantized color and texture local invariant features (referred to here as visterms) computed over interest point regions. In particular we investigate the different ways to fuse together local information from texture and color in order to provide a better visterm representation. We develop and test our methods on the task of image classification using a 6-class natural scene database. We perform classification based on the bag-of-visterms (BOV) representation (histogram of quantized local descriptors), extracted from both texture and color features. We investigate two different fusion approaches at the feature level: fusing local descriptors together and creating one representation of joint texture-color visterms, or concatenating the histogram representation of both color and texture, obtained independently from each local feature. On our classification task we show that the appropriate use of color improves the results w.r.t. a texture-only representation.

Keywords: Report_VI, IM2.MCA
[40] B. Mesot and D. Barber. Switching linear dynamical systems for noise robust speech recognition. IDIAP-RR 08, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Real world applications such as hands-free speech recognition of isolated digits may have to deal with potentially very noisy environments. Existing state-of-the-art solutions to this problem use feature-based HMMs, with a preprocessing stage to clean the noisy signal. However, the effect that raw signal noise has on the induced HMM features is poorly understood, and limits the performance of the HMM system. An alternative to feature-based HMMs is to model the raw signal, which has the potential advantage that including an explicit noise model is straightforward. Here we jointly model the dynamics of both the raw speech signal and the noise, using a Switching Linear Dynamical System (SLDS). The new model was tested on isolated digit utterances corrupted by Gaussian noise. Contrary to the SAR-HMM, which provides a model of uncorrupted raw speech, the SLDS is comparatively noise robust and also significantly outperforms a state-of-the-art feature-based HMM. The computational complexity of the SLDS scales exponentially with the length of the time series. To counter this we use Expectation Correction which provides a stable and accurate linear-time approximation for this important class of models, aiding their further application in acoustic modelling.

Keywords: Report_VI, IM2.AP
[41] J. Vepa and S. King. Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Trans. on Audio, Speech and Language Processing, 14(5):1763–1771, 2006. IDIAP-RR 05-34. [ bib | .ps.gz | .pdf ]
In unit selection-based concatenative speech synthesis, join cost (also known as concatenation cost), which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from the inventory. Usually, some form of local parameter smoothing is also needed to disguise the remaining discontinuities. This paper presents a subjective evaluation of three join cost functions and three smoothing methods. We also describe the design and performance of a listening test. The three join cost functions were taken from our previous study, where we proposed join cost functions derived from spectral distances, which have good correlations with perceptual scores obtained for a range of concatenation discontinuities. This evaluation allows us to further validate their ability to predict concatenation discontinuities. The units for synthesis stimuli are obtained from a state-of-the-art unit selection text-to-speech system: rVoice from Rhetorical Systems Ltd. In this paper, we report listeners' preferences for each join cost in combination with each smoothing method.

Keywords: Report_VI, IM2.AP
[42] M. Müller, F. Evéquoz, and D. Lalanne. Tjass, a smart board for augmenting card game playing and learning (demo). In Symposium on User Interface Software and Technology (UIST 2006), pages 67–68, Montreux (Switzerland), 2006. [ bib ]
Keywords: Report_VI, IM2.HMI
[43] S. Voloshynovskiy, O. Koval, E. Topak, J. E. V. Forcen, and T. Pun. On reversibility of random binning based data-hiding techniques: security perspectives. In ACM Multimedia and Security Workshop 2006, Geneva, Switzerland, 2006. [ bib | .ps ]
Keywords: Report_VI, IM2.MPR
[44] N. Moënne-Loccoz, B. Janvier, S. Marchand-Maillet, and E. Bruno. Handling temporal heterogeneous data for content-based management of large video collections. Multimedia Tools and Applications, 31:309–325, 2006. [ bib ]
Keywords: Report_VI, IM2.MCA
[45] J. Richiardi and A. Drygajlo. Applying biometrics to identity documents: estimating and coping with errors. SNSF AMBAI project technical report, Swiss Federal Institute of Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.MPR
[46] T. Spindler, C. Wartmann, D. Roth, A. Steffen, L. Hovestadt, and L. van Gool. Privacy in video surveilled areas. In International Conference on Privacy, Security and Trust (PST 2006), 2006. [ bib ]
Keywords: Report_VI, IM2.VP, major publication, Best Paper Awards, Surveillance, Cryptography, Computer Vision, Tracking, Building Automation
[47] B. Leibe, K. Mikolajczyk, and B. Schiele. Efficient clustering and matching for object class recognition. In British Machine Vision Conference (BMVC), 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[48] B. Leibe, K. Mikolajczyk, and B. Schiele. Segmentation based multi-cue integration for object detection. In British Machine Vision Conference (BMVC), 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[49] G. Lathoud. Observations on multi-band asynchrony in distant speech recordings. IDIAP-RR 74, IDIAP, Martigny, Switzerland, 2006. [ bib | .ps.gz | .pdf ]
Whenever the speech signal is captured by a microphone distant from the user, the acoustic response of the room introduces significant distortions. To remove these distortions from the signal, solutions exist that greatly improve the ASR performance (what was said?), such as dereverberation or beamforming. It may seem natural to apply those signal-level methods in the context of speaker clustering (who spoke when?) with distant microphones, for example when annotating a meeting recording for enhanced browsing experience. Unfortunately, on a corpus of real meeting recordings, it appeared that neither dereverberation nor beamforming gave any improvement on the speaker clustering task. The present technical report constitutes a first attempt to explain this failure, through a cross-correlation analysis between close-talking and distant microphone signals. The various frequency bands of the speech spectrum appear to become desynchronized when the speaker is 1 or 2 meters away from the microphone. Further directions of research are suggested to model this desynchronization.

Keywords: Report_VI, IM2.AP
[50] G. Lathoud, M. Magimai-Doss, and H. Bourlard. Unsupervised spectral subtraction for noise-robust asr on unknown transmission channels. IDIAP-RR 09, IDIAP, Martigny, Switzerland, 2006. [ bib | .ps.gz | .pdf ]
This paper addresses several issues of classical spectral subtraction methods with respect to the automatic speech recognition task in noisy environments. The main contributions of this paper are twofold. First, a channel normalization method is proposed to extend spectral subtraction to the case of transmission channels such as cellphones. It equalizes the transmission channel and removes part of the additive noise. Second, a simple, computationally efficient 2-component probabilistic model is proposed to discriminate between speech and additive noise at the magnitude spectrogram level. Based on this model, an alternative to classical spectral subtraction is proposed, called “Unsupervised Spectral Subtraction” (USS). The main difference is that the proposed approach does not require any parameter tuning. Experimental studies on Aurora 2 show that channel normalization followed by USS compares advantageously to both classical spectral subtraction and the ETSI standard front-end (Wiener filtering). Compared to the ETSI standard front-end, a 21.3% relative improvement is obtained on 0 to 20 dB noise conditions, for an absolute loss of 0.1% in clean conditions. The computational cost of the proposed approach is very low, which makes it fit for real-time applications.

Keywords: Report_VI, IM2.AP
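As background to the abstract above, the classical spectral subtraction that USS is compared against can be sketched in a few lines; the spectrogram shape, noise-frame count and floor value below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spectral_subtract(mag, noise_frames=10, floor=0.02):
    # Classical magnitude-domain spectral subtraction (the baseline the
    # paper compares against, not USS itself). `mag` is a (frames, bins)
    # short-time magnitude spectrogram; the noise spectrum is estimated
    # from the first `noise_frames` frames, assumed speech-free.
    noise_est = mag[:noise_frames].mean(axis=0)
    cleaned = mag - noise_est
    # Spectral flooring avoids negative magnitudes ("musical noise").
    return np.maximum(cleaned, floor * mag)

# Toy spectrogram: a flat noise floor with a speech-like burst.
spec = np.full((20, 4), 1.0)
spec[12:15] += 5.0
out = spectral_subtract(spec)
```

Note that the noise-frame estimate and the floor are exactly the hand-tuned parameters that USS, as described above, is designed to avoid.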
[51] O. Cheng, J. Dines, and M. Magimai-Doss. A generalized dynamic composition algorithm of weighted finite state transducers for large vocabulary speech recognition. IDIAP-RR 62, IDIAP, 2006. Submitted for publication. [ bib | .ps.gz | .pdf ]
We propose a generalized dynamic composition algorithm for weighted finite state transducers (WFST), which avoids the creation of non-coaccessible paths, performs weight look-ahead and does not impose any constraints on the topology of the WFSTs. Experimental results on the Wall Street Journal (WSJ1) 20k-word trigram task show that at 17% WER (moderately-wide beam width), the decoding time of the proposed approach is about 48% and 65% of that of the other two dynamic composition approaches. In comparison with static composition, at the same level of 17% WER, we observe a reduction of about 60% in memory requirement, with an increase of about 60% in decoding time due to the extra overheads of dynamic composition.

Keywords: Report_VI, IM2.AP
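To illustrate the idea of dynamic (on-the-fly) composition discussed above, here is a minimal sketch over toy transducers; it omits epsilon transitions, weight look-ahead and coaccessibility handling, which are precisely the paper's contributions.

```python
from collections import deque

# Toy WFSTs as {state: [(in_label, out_label, weight, next_state)]},
# with tropical-semiring weights (added along a path).
A = {0: [("a", "x", 1.0, 1)], 1: [("b", "y", 2.0, 2)], 2: []}
B = {0: [("x", "X", 0.5, 1)], 1: [("y", "Y", 0.5, 2)], 2: []}

def compose_dynamic(A, B, start=(0, 0)):
    # On-the-fly composition: a pair state (sa, sb) is expanded only
    # when first reached, so inaccessible parts of the full product
    # A x B are never built (unlike static composition).
    arcs, seen, queue = {}, {start}, deque([start])
    while queue:
        sa, sb = queue.popleft()
        out = arcs.setdefault((sa, sb), [])
        for ia, oa, wa, na in A[sa]:
            for ib, ob, wb, nb in B[sb]:
                if oa == ib:  # A's output label must match B's input label
                    nxt = (na, nb)
                    out.append((ia, ob, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return arcs

C = compose_dynamic(A, B)
```

In decoding, `A` would typically be the context-dependent lexicon and `B` the language model, with the frontier driven by the active beam rather than a full traversal.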
[52] J. E. Vila-Forcén, S. Voloshynovskiy, O. Koval, and T. Pun. Facial image compression based on structured codebooks in overcomplete domain. EURASIP Journal on Applied Signal Processing, Frames and overcomplete representations in signal processing, communications, and information theory special issue, 2006(Article ID 69042):1–11, 2006. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[53] J. E. Vila-Forcén, S. Voloshynovskiy, O. Koval, and T. Pun. Costa problem under channel ambiguity. In Proceedings of 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, 2006. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[54] M. F. BenZeghiba and H. Bourlard. User-customized password speaker verification using multiple reference and background models. Speech Communication, 8:1200–1213, 2006. IDIAP-RR 04-41. [ bib | .ps.gz | .pdf ]
This paper discusses and optimizes an HMM/GMM based User-Customized Password Speaker Verification (UCP-SV) system. Unlike text-dependent speaker verification, in UCP-SV systems, customers can choose their own passwords with no lexical constraints. The password has to be pronounced a few times during the enrollment step to create a customer-dependent model. Although potentially more “user-friendly”, such systems are less well understood and actually exhibit several practical issues, including automatic HMM inference, speaker adaptation, and efficient likelihood normalization. In our case, HMM inference (HMM topology) is performed using hybrid HMM/MLP systems, while the parameters of the inferred model, as well as their adaptation, use GMMs. However, the evaluation of a UCP-SV baseline system shows that the background model used for likelihood normalization is the main difficulty. Therefore, to circumvent this problem, the main contribution of the paper is to investigate the use of multiple reference models for customer acoustic modeling and multiple background models for likelihood normalization. In this framework, several scoring techniques are investigated, such as Dynamic Model Selection (DMS) and fusion techniques. Results on two different experimental protocols show that appropriate selection criteria for customer and background models can significantly improve the UCP-SV performance, making the UCP-SV system quite competitive with a text-dependent SV system. Finally, as customers' passwords are short, a comparative experiment using the conventional GMM-UBM text-independent approach is also conducted.

Keywords: Report_VI, IM2.AP
[55] G. Lathoud. Spatio-temporal analysis of spontaneous speech with microphone arrays. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2006. PhD Thesis #3689 at the École Polytechnique Fédérale de Lausanne (IDIAP-RR 06-77). [ bib | .ps.gz | .pdf ]
Accurate detection, localization and tracking of multiple moving speakers permits a wide spectrum of applications. Techniques are required that are versatile, robust to environmental variations, and not constraining for non-technical end-users. Based on distant recordings of spontaneous multiparty conversations, this thesis focuses on the use of microphone arrays to address the question Who spoke where and when?. The speed, the versatility and the robustness of the proposed techniques are tested on a variety of real indoor recordings, including multiple moving speakers as well as seated speakers in meetings. Optimized implementations are provided in most cases. We propose to discretize the physical space into a few sectors, and for each time frame, to determine which sectors contain active acoustic sources (Where? When?). A topological interpretation of beamforming is proposed, which permits both to evaluate the average acoustic energy in a sector at negligible cost, and to precisely locate a speaker within an active sector. One additional contribution that goes beyond the field of microphone arrays is a generic, automatic threshold selection method, which does not require any training data. On the speaker detection task, the new approach is dramatically superior to the more classical approach where a threshold is set on training data. We integrate the new approach into a system for multispeaker detection-localization. Another generic contribution is a principled, threshold-free framework for short-term clustering of multispeaker location estimates, which also permits to detect where and when multiple trajectories intersect. On multi-party meeting recordings, using distant microphones only, short-term clustering yields a speaker segmentation performance similar to that of close-talking microphones.
The resulting short speech segments are then grouped into speaker clusters (Who?), through an extension of the Bayesian Information Criterion to merge multiple modalities. On meeting recordings, the speaker clustering performance is significantly improved by merging the classical mel-cepstrum information with the short-term speaker location information. Finally, a close analysis of the speaker clustering results suggests that future research should investigate the effect of human acoustic radiation characteristics on the overall transmission channel, when a speaker is a few meters away from a microphone.

Keywords: Report_VI, IM2.AP.VP, joint publication
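The thesis above extends the Bayesian Information Criterion to merge modalities; as a point of reference, the standard single-modality ΔBIC merge test for two Gaussian-modelled clusters can be sketched as follows (the λ=1 penalty weight and the synthetic data are illustrative, and the multimodal extension is not shown).

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    # Merge criterion for two clusters of d-dimensional feature vectors,
    # each modelled by a single full-covariance Gaussian: merge the two
    # clusters when the returned value is negative (one Gaussian explains
    # both clusters better than two, after the BIC complexity penalty).
    n1, n2, d = len(x1), len(x2), x1.shape[1]
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return (0.5 * (n1 + n2) * logdet(np.vstack([x1, x2]))
            - 0.5 * n1 * logdet(x1) - 0.5 * n2 * logdet(x2) - penalty)

rng = np.random.default_rng(0)
# Two draws from the same speaker-like distribution should merge...
same = delta_bic(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
# ...while two well-separated clusters should not.
far = delta_bic(rng.normal(0, 1, (200, 2)), rng.normal(20, 1, (200, 2)))
```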
[56] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun. Emotion assessment: arousal evaluation using eeg's and peripheral physiological signals. In B. Gunsel, A. K. Jain, A. M. Tekalp, and B. Sankur, editors, Proc. Int. Workshop Multimedia Content Representation, Classification and Security (MRCS), volume 4105, pages 530–537, Istanbul, Turkey, 2006. Lecture Notes in Computer Science, Springer. [ bib ]
Keywords: Report_VI, IM2.MPR
[57] M. Keller and S. Bengio. A multitask learning approach to document representation using unlabeled data. IDIAP-RR 44, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of the huge amounts of unlabeled text documents available in digital format to counterbalance the relatively small amount of available labeled text documents. A Siamese MLP is trained in a multi-task framework in order to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-words representation to a new feature space emphasizing similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance over the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance over the labeled task increases as the amount of unlabeled data increases.

Keywords: Report_VI, IM2.MPR.MCA, joint publication
[58] B. Mesot and D. Barber. A bayesian alternative to gain adaptation in autoregressive hidden markov models. IDIAP-RR 55, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Models dealing directly with the raw acoustic speech signal are an alternative to conventional feature-based HMMs. A popular way to model the raw speech signal is by means of an autoregressive (AR) process. Being too simple to cope with the nonlinearity of the speech signal, the AR process is generally embedded into a more elaborate model, such as the switching autoregressive HMM (SAR-HMM). A fundamental issue faced by models based on AR processes is that they are very sensitive to variations in the amplitude of the signal. One way to overcome this limitation is to use Gain Adaptation to adjust the amplitude by maximising the likelihood of the observed signal. However, adjusting model parameters by maximising test likelihoods is fundamentally outside the framework of standard statistical approaches to machine learning, since this may lead to overfitting when the models are sufficiently flexible. We propose a statistically principled alternative based on an exact Bayesian procedure in which priors are explicitly defined on the parameters of the AR process. Specifically, we present the Bayesian SAR-HMM and compare the performance of this model against the standard Gain-Adapted SAR-HMM on a single digit recognition task, showing the effectiveness of the approach and suggesting thereby a principled and straightforward solution to the issue of Gain Adaptation.

Keywords: Report_VI, IM2.MPR
[59] M. Melichar, P. Cenek, M. Ailomaa, A. Lisowska, and M. Rajman. From vocal to multimodal dialogue management. In Eighth International Conference on Multimodal Interfaces (ICMI'06), 2006. [ bib ]
Keywords: Report_VI, IM2.HMI
[60] N. Poh and S. Bengio. Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap. IDIAP-RR 25, IDIAP, 2006. Submitted for publication. [ bib | .ps.gz | .pdf ]
Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, its composition and the choice of samples. We propose to reduce the dependency of performance on these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test class-conditional (client and impostor) score distributions are unknown, and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them to the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance on an unseen database with twice as many users, with an average coverage of 95% (over all 24 systems). Coverage is the proportion of the unseen EPC covered by the derived confidence interval.

Keywords: Report_VI, IM2.MPR
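A rough sketch of the user-level ("joint") bootstrap idea described above: whole users, not individual scores, are resampled with replacement, so each replicate keeps a user's client and impostor scores together. The HTER metric, threshold and synthetic scores below are illustrative assumptions, not the paper's EPC-based procedure.

```python
import random

def user_bootstrap_ci(user_scores, metric, n_boot=1000, alpha=0.05, seed=0):
    # User-level bootstrap: resample users with replacement, keeping each
    # user's (client_scores, impostor_scores) pair intact, and recompute
    # the metric on every replicate to get an empirical (1-alpha) interval.
    rng = random.Random(seed)
    users = list(user_scores)
    stats = sorted(
        metric([user_scores[rng.choice(users)] for _ in users])
        for _ in range(n_boot))
    return (stats[int(alpha / 2 * n_boot)],
            stats[int((1 - alpha / 2) * n_boot) - 1])

def hter_at(threshold):
    # Half Total Error Rate at a fixed decision threshold.
    def metric(sample):
        fr = sum(s < threshold for cli, _ in sample for s in cli)
        fa = sum(s >= threshold for _, imp in sample for s in imp)
        n_cli = sum(len(cli) for cli, _ in sample)
        n_imp = sum(len(imp) for _, imp in sample)
        return 0.5 * (fr / n_cli + fa / n_imp)
    return metric

# Synthetic scores: (client_scores, impostor_scores) per user.
data = {u: ([0.5 + 0.1 * u, -0.2 + 0.1 * u], [-0.6 + 0.05 * u])
        for u in range(30)}
lo, hi = user_bootstrap_ci(data, hter_at(0.0))
```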
[61] M. Liwicki and H. Bunke. Hmm-based on-line recognition of handwritten whiteboard notes. In Proceedings 10th International Workshop Frontiers in Handwriting Recognition, pages 595–599, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[62] S. Marcel, J. Keomany, and Y. Rodriguez. Robust-to-illumination face localisation using active shape models and local binary patterns. IDIAP-RR 47, IDIAP, 2006. Submitted for publication. [ bib | .ps.gz | .pdf ]
This paper addresses the problem of locating facial features in images of frontal faces taken under different lighting conditions. The well-known Active Shape Model (ASM) method proposed by Cootes et al. is extended to improve its robustness to illumination changes. For that purpose, we introduce the use of Local Binary Patterns (LBP). Three different incremental approaches combining ASM with LBP are presented: profile-based LBP-ASM, square-based LBP-ASM and divided-square-based LBP-ASM. Experiments performed on the standard and darkened image sets of the XM2VTS database demonstrate that the divided-square-based LBP-ASM gives superior performance compared to the state-of-the-art ASM. It achieves more accurate results and fails less frequently.

Keywords: Report_VI, IM2.VP
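For readers unfamiliar with the LBP operator mentioned above, a minimal 3×3 version can be sketched as follows; the neighbour bit ordering is a common convention, not necessarily the one used in the paper.

```python
import numpy as np

def lbp_3x3(img):
    # Basic 3x3 Local Binary Pattern: each interior pixel receives an
    # 8-bit code built by thresholding its eight neighbours against the
    # centre value (1 if neighbour >= centre), one bit per neighbour.
    h, w = img.shape
    centre = img[1:h - 1, 1:w - 1]
    out = np.zeros_like(centre, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= (neigh >= centre).astype(np.uint8) << bit
    return out

# A flat patch maps to the all-ones code; a ramp sets only the bits of
# the neighbours lying "uphill" of the centre.
flat_codes = lbp_3x3(np.ones((5, 5), dtype=np.uint8))
ramp_codes = lbp_3x3(np.arange(25).reshape(5, 5))
```

The illumination robustness exploited above comes from this thresholding: adding a constant to all pixels, or scaling them, leaves the codes unchanged.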
[63] K. Smith, S. Schreiber, V. Beran, I. Potúcek, G. Rigoll, and D. Gatica-Perez. Multi-person tracking in meetings: a comparative study. In Multimodal Interaction and Related Machine Learning Algorithms (MLMI), 2006. IDIAP-RR 06-38. [ bib | .ps.gz | .pdf ]
In this paper, we present the findings of the Augmented Multiparty Interaction (AMI) project investigation on the localization and tracking of 2D head positions in meetings. The focus of the study was to test and evaluate various multi-person tracking methods developed in the project using a standardized data set and evaluation methodology.

Keywords: Report_VI, IM2.MPR
[64] K. Smith, S. Ba, J. M. Odobez, and D. Gatica-Perez. Tracking attention for multiple people: wandering visual focus of attention estimation. IDIAP-RR 40, IDIAP, 2006. Submitted for publication. [ bib | .ps.gz | .pdf ]
The problem of finding the visual focus of attention of multiple people free to move in an unconstrained manner is defined here as the wandering visual focus of attention (WVFOA) problem. Estimating the WVFOA for multiple unconstrained people is a new and important problem with implications for human behavior understanding and cognitive science, as well as real-world applications. One such application, which we present in this article, monitors the attention passers-by pay to an outdoor advertisement. In our approach to the WVFOA problem, we propose a multi-person tracking solution based on a hybrid Dynamic Bayesian Network that simultaneously infers the number of people in a scene, their body locations, their head locations, and their head pose. It is defined in a joint state-space formulation that allows for the modeling of interactions between people. For inference in the resulting high-dimensional state-space, we propose a trans-dimensional Markov Chain Monte Carlo (MCMC) sampling scheme, which not only handles a varying number of people, but also efficiently searches the state-space by allowing person-part state updates. Our model was rigorously evaluated for tracking quality and ability to recognize people looking at an outdoor advertisement, and the results indicate good performance for these tasks.

Keywords: Report_VI, IM2.VP
[65] S. Ba and J. M. Odobez. A study on visual focus of attention recognition from head pose in a meeting room. In 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI06), 2006. IDIAP-RR 06-10. [ bib | .ps.gz | .pdf ]
This paper presents a study on the recognition of the visual focus of attention (VFOA) of meeting participants based on their head pose. Contrary to previous studies on the topic, in our set-up, the potential VFOA of people is not restricted to other meeting participants only, but includes environmental targets (table, slide screen). This has two consequences. Firstly, this increases the number of possible ambiguities in identifying the VFOA from the head pose. Secondly, due to our particular set-up, the identification of the VFOA from head pose cannot rely on an incomplete representation of the pose (the pan alone), but requires knowledge of the full head pointing information (pan and tilt). In this paper, using a corpus of 8 meetings of 8 minutes on average, featuring 4 persons involved in the discussion of statements projected on a slide screen, we analyze the above issues by evaluating, through numerical performance measures, the recognition of the VFOA from head pose information obtained either using a magnetic sensor device (the ground truth) or a vision based tracking system (head pose estimates). The results clearly show that in complex but realistic situations, it is quite optimistic to believe that the recognition of the VFOA can solely be based on the head pose, as some previous studies had suggested.

Keywords: Report_VI, IM2.VP.MPR, Joint publication
[66] J. del R. Millán, F. Renkens, J. Mouriño, and W. Gerstner. Non-invasive brain-actuated control of a mobile robot by human eeg. In 2006 IMIA Yearbook of Medical Informatics. Schattauer Verlag, 2006. [ bib ]
Brain activity recorded non-invasively is sufficient to control a mobile robot if advanced robotics is used in combination with asynchronous EEG analysis and machine learning techniques. Until now brain-actuated control has mainly relied on implanted electrodes, since EEG-based systems have been considered too slow for controlling rapid and complex sequences of movements. We show that two human subjects successfully moved a robot between several rooms by mental control only, using an EEG-based brain-machine interface that recognized three mental states. Mental control was comparable to manual control on the same task with a performance ratio of 0.74.

Keywords: Report_VI, IM2.BMI, major
[67] A. Hannani, D. Toledano, D. Petrovska, A. Montero-Asenjo, and J. Hennebert. Using data-driven and phonetic units for speaker verification. In IEEE Speaker and Language Recognition Workshop (Odyssey 2006), Puerto Rico, 2006. [ bib ]
Keywords: Report_VI, IM2.MPR
[68] N. Poh and S. Bengio. Using chimeric users to construct fusion classifiers in biometric authentication tasks: an investigation. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2006. IDIAP-RR 05-59. [ bib | .ps.gz | .pdf ]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the lack of real multimodal biometric databases as well as an important privacy issue – the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question how such a chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database?, and ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database? Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets, up to 21×1000 chimeric data sets and two fusion operators), our previous study [Poh_05_chimeric] answers no to the first question. The current study aims to answer the second question. Having tested four classifiers on as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database with four different fusion operators, this study shows that generating multiple chimeric databases neither degrades nor improves the performance of a fusion operator when tested on a real-user database, with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting real-user multimodal data, our proposed approach is thus useful for constructing a trainable fusion classifier while at the same time overcoming the problem of small training data size.

Keywords: Report_VI, IM2.MPR
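The construction of chimeric users described above can be sketched as pairing modalities across distinct real users; the function and identifiers below are purely illustrative.

```python
import itertools
import random

def make_chimeric(real_users, n, seed=0):
    # A chimeric user pairs the face modality of one real user with the
    # speech modality of a *different* real user, sidestepping the need
    # for a true multimodal database while avoiding storing both
    # modalities of the same person together.
    rng = random.Random(seed)
    pairs = list(itertools.permutations(real_users, 2))
    return rng.sample(pairs, n)

# Five virtual users built from four real identities.
chimeras = make_chimeric(["u1", "u2", "u3", "u4"], n=5)
```

Sampling many such pairings is what produces the "21×1000 chimeric data sets" scale of experiment mentioned in the abstract.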
[69] P. C. Cattin, H. Bay, L. van Gool, and G. Székely. Retina mosaicing using local features. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 4191 of LNCS, pages 185–192, 2006. [ bib ]
Keywords: Report_VI, IM2.VP
[70] J. Mariéthoz. Discriminant models for text-independent speaker verification. IDIAP-RR 70, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
This thesis addresses text-independent speaker verification from a machine learning point of view. We use the machine learning framework to better define the problem and to develop new unbiased performance measures and statistical tests to compare new approaches objectively. We propose a new interpretation of state-of-the-art Gaussian Mixture Model based systems and show that they are discriminant and equivalent to a mixture of linear classifiers. A general framework for score normalization is also given for both probability and non-probability based models. With this new framework we better expose the hypotheses made by the well known Z- and T- score normalization techniques. Several uses of discriminant models are then proposed. In particular, we develop a new sequence kernel for Support Vector Machines that generalizes another sequence kernel found in the literature: while the latter is limited to a polynomial form, the former allows the use of kernels with infinite-dimensional feature spaces, such as Radial Basis Functions. A variant of this kernel that finds the best match for each frame of the sequence to be compared actually outperforms state-of-the-art systems. As our new sequence kernel is computationally costly for long sequences, a clustering technique is proposed to reduce the complexity. We also address in this thesis some problems specific to speaker verification, such as the fact that the classes are highly unbalanced; the use of a specific intra- and inter-class distance distribution is proposed, by modifying the kernel in order to assume a Gaussian noise distribution over negative examples. Even if this approach lacks some theoretical justification, it gives very good empirical results and opens a new research direction.

Keywords: Report_VI, IM2.AP
[71] T. Pun, T. I. Alecu, G. Chanel, J. Kronegg, and S. Voloshynovskiy. Brain-computer interaction research at the computer vision and multimedia laboratory, university of geneva. IEEE Trans. Neural Systems and Rehabilitation Engineering, Special Issue on Brain-Computer Interaction, 14(2):210–213, 2006. [ bib ]
Keywords: Report_VI, IM2.MPR
[72] C. Hemptinne. Master's thesis: integration of the harmonic plus noise model (hnm) into the hidden markov model-based speech synthesis system (hts). IDIAP-RR 69, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.AP
[73] S. Kosinov, S. Marchand-Maillet, I. Kozintsev, C. Dulong, and T. Pun. Dual diffusion model of spreading activation for content-based image retrieval. In 8th ACM SIGMM - International Workshop on Multimedia Information Retrieval, Santa Barbara, CA, USA, 2006. [ bib ]
Keywords: Report_VI, IM2.MCA
[74] H. K. Maganti, P. Motlicek, and D. Gatica-Perez. Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms. IDIAP-RR 57, IDIAP, Martigny, Switzerland, 2006. [ bib | .ps.gz | .pdf ]
The goal of this work is to provide robust and accurate speech detection for automatic speech recognition (ASR) in meeting room settings. The solution is based on computing the long-term modulation spectrum and examining a specific frequency range for dominant speech components, in order to classify speech and non-speech parts of a given audio signal. Manually segmented speech, short-term energy based segmentation, combined short-term energy and zero-crossing based segmentation, and a recently proposed Multi Layer Perceptron (MLP) classifier system are tested for comparison purposes. Speech recognition evaluations of the segmentation methods are performed on a standard database and tested in conditions where the signal-to-noise ratio (SNR) varies considerably, as in the cases of close-talking headset, lapel, distant microphone array output, and distant microphone. The results reveal that the proposed method is more reliable and less sensitive to the mode of signal acquisition and unforeseen conditions.

Keywords: Report_VI, IM2.AP
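One of the baseline segmentations compared above, combined short-term energy and zero-crossing rate, can be sketched as follows; the frame length and both thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def energy_zcr_labels(signal, frame_len=160, e_thresh=0.01, z_thresh=0.25):
    # Frame-wise speech/non-speech decisions: speech frames have high
    # short-term energy and (for voiced speech) a low zero-crossing
    # rate, whereas silence has low energy and broadband noise a high
    # zero-crossing rate.
    labels = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        labels.append(energy > e_thresh and zcr < z_thresh)
    return labels

# 8 kHz toy signal: 0.1 s of silence followed by 0.1 s of a 100 Hz tone.
sig = np.concatenate([np.zeros(800), np.sin(np.pi * np.arange(800) / 40)])
labels = energy_zcr_labels(sig)
```

The sensitivity of these fixed thresholds to the SNR is exactly why the paper's modulation-spectrum method fares better across microphone placements.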
[75] B. Janvier, E. Bruno, S. Marchand-Maillet, and T. Pun. Handling temporal heterogeneous data for content-based management of large video collections. Multimedia Tools and Applications, 30:273–288, 2006. [ bib ]
Keywords: Report_VI, IM2.MCA
[76] A. Peregoudov, A. Vinciarelli, and H. Bourlard. Assessing the effectiveness of slides as a means to improve the automatic transcription of oral presentations. IDIAP-RR 56, IDIAP, 2006. Submitted for publication. [ bib | .ps.gz | .pdf ]
This paper presents experiments aiming at improving the automatic transcription of oral presentations through the inclusion of the slides in the recognition process. The experiments are performed over a data set of around three hours of material (about 33 kwords and 270 slides) and are based on an approach trying to maximize the similarity between the recognizer output and the content of the slides. The results show that the upper bound to the Word Error Rate (WER) reduction is 1.7% (obtained by transcribing correctly all words co-occurring in both slides and speech), but that our approach does not produce statistically significant improvements. Analysis of the results suggests that this outcome does not depend on the similarity maximization approach, but on the statistical characteristics of the language.

Keywords: Report_VI, IM2.AP.MCA, joint publication
[77] S. Cuendet. Model adaptation for sentence unit segmentation from speech. IDIAP-RR 64, IDIAP, Martigny, Switzerland, 2006. [ bib | .ps.gz | .pdf ]
The sentence segmentation task is a classification task that aims at inserting sentence boundaries in a sequence of words. One of the applications of sentence segmentation is to detect the sentence boundaries in the sequence of words output by an automatic speech recognition (ASR) system. The purpose of correctly finding the sentence boundaries in ASR transcriptions is to enable further processing tasks, such as automatic summarization, machine translation, and information extraction. Being a classification task, sentence segmentation requires training data. To reduce the labor-intensive labeling task, already available labeled data can be used to train the classifier. However, the high variability of speech among the various speech styles makes it inefficient to use a classifier trained on one speech style (designated as out-of-domain) to detect sentence boundaries on another speech style (in-domain), and thus makes it necessary for the classifier to be adapted before it is used on another speech style. In this work, we first justify the need for adaptation among the broadcast news, conversational telephone and meeting speech styles. We then propose methods to adapt sentence segmentation models trained on conversational telephone speech to the meeting conversation style. Our results show that using the model adapted from telephone conversations, instead of the model trained only on the meeting conversation style, significantly improves the performance of sentence segmentation. Moreover, this improvement holds independently of the amount of in-domain data used. In addition, we also study the differences between speech styles, with statistical measures and by examining the performance of various subsets of features. Focusing on the broadcast news and meeting speech styles, we show that for the meeting speech style, lexical features are more correlated with the sentence boundaries than the prosodic features, whereas the contrary holds for broadcast news.
Furthermore, we observe that prosodic features are more independent from the speech style than lexical features.

Keywords: Report_VI, IM2.AP
[78] V. Ullal and P. Motlicek. Audio coding based on long temporal segments: experiments with quantization of excitation signal. IDIAP-RR 46, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
In this paper, we describe additional experiments based on a novel audio coding technique that uses an autoregressive model to approximate an audio signal's Hilbert envelope. This technique is performed over long segments (1000 ms) in critical-band-sized sub-bands. We have performed a series of experiments to find more efficient methods of quantizing the frequency components of the Hilbert carrier, which is the excitation found in the temporal audio signal. When using linear quantization, it was found that allocating 5 bits for transmitting the Hilbert carrier every 200 ms was sufficient. Other techniques, such as quantizing the first derivative of phase and using an iterative adaptive threshold, were examined.

Keywords: Report_VI, IM2.AP
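The Hilbert envelope that this codec models is obtained from the analytic signal. A minimal NumPy sketch of that standard construction (an illustration only, not the codec's actual implementation) is:

```python
import numpy as np

def hilbert_envelope(x):
    """Temporal (Hilbert) envelope of a real signal, via the analytic
    signal: zero the negative-frequency FFT bins, double the positive
    ones, inverse-transform, and take the magnitude."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    return np.abs(analytic)
```

For a pure tone of amplitude a spanning an integer number of periods, this returns a flat envelope of height a; the codec described above would then fit an autoregressive model to such envelopes over long (circa 1000 ms) sub-band segments.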
[79] N. Poh. Multi-system biometric authentication: optimal fusion and user-specific information. PhD thesis, École Polytechnique Fédérale de Lausanne, 2006. [ bib | .ps.gz | .pdf ]
Verifying a person's identity claim by combining multiple biometric systems (fusion) is a promising solution to identity theft and automatic access control. This thesis contributes to the state of the art of multimodal biometric fusion by improving the understanding of fusion and by enhancing fusion performance using information specific to a user. One problem to deal with in score-level fusion is combining system outputs of different types. Two statistically sound representations of scores are probability and log-likelihood ratio (LLR). While they are equivalent in theory, LLR is much more useful in practice because its distribution can be approximated by a Gaussian distribution, which makes it useful for analyzing the problem of fusion. Furthermore, its score statistics (mean and covariance) conditioned on the claimed user identity can be better exploited. Our first contribution is to estimate the fusion performance given the class-conditional score statistics and a particular fusion operator/classifier. Thanks to the score statistics, we can predict fusion performance with reasonable accuracy, identify conditions which favor a particular fusion operator, study the joint phenomenon of combining system outputs with different degrees of strength and correlation, and possibly correct the adverse effect of bias (due to the score-level mismatch between training and test sets) on fusion. While in practice the class-conditional Gaussian assumption is not always true, the estimated performance is found to be acceptable. Our second contribution is to exploit user-specific prior knowledge by limiting the class-conditional Gaussian assumption to each user. We exploit this hypothesis in two strategies. In the first strategy, we combine a user-specific fusion classifier with a user-independent fusion classifier by means of two LLR scores, which are then weighted to obtain a single output.
We show that combining the user-specific and user-independent LLR outputs always results in better performance than using the better of the two alone. In the second strategy, we propose a statistic called the user-specific F-ratio, which measures the discriminative power of a given user based on the Gaussian assumption. Although similar class separability measures exist, e.g., the Fisher ratio for a two-class problem and the d-prime statistic, the F-ratio is more suitable because it is related to the Equal Error Rate in closed form. The F-ratio is used in the following applications: a user-specific score normalization procedure, a user-specific criterion to rank users, and a user-specific fusion operator that selectively considers a subset of systems for fusion. The resulting fusion operator leads to a statistically significant performance increase with respect to state-of-the-art fusion approaches. Even though the applications are different, the proposed methods share the following common advantages. Firstly, they are robust to deviations from the Gaussian assumption. Secondly, they are robust to small numbers of training samples thanks to Bayesian adaptation. Finally, they consider both client and impostor information simultaneously.

Keywords: Report_VI, IM2.MPR, multiple classifier system, pattern recognition, user-specific processing
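The closed-form link between the F-ratio and the Equal Error Rate mentioned in this abstract is, in one commonly cited form from related work by the same author (shown here for illustration, under the class-conditional Gaussian assumption):

```latex
\text{F-ratio} = \frac{\mu_C - \mu_I}{\sigma_C + \sigma_I},
\qquad
\text{EER} = \frac{1}{2} - \frac{1}{2}\,
  \operatorname{erf}\!\left(\frac{\text{F-ratio}}{\sqrt{2}}\right),
```

where \(\mu_C, \sigma_C\) are the mean and standard deviation of the client (genuine) scores and \(\mu_I, \sigma_I\) those of the impostor scores, each modelled as a Gaussian.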
[80] F. Mendels, J. Ph. Thiran, and P. Vandergheynst. Matching pursuit-based shape representation and recognition using scale-space. International Journal of Imaging Systems and Technology, 6(15):162–180, 2006. [ bib | DOI | http ]
Keywords: Report_VI, matching pursuit; scale-space shape representation; shape recognition; sparse representation; LTS2; LTS5, IM2.VP
[81] A. Janin, A. Stolcke, X. Anguera, K. Boakye, O. Cetin, J. Frankel, and J. Zheng. The icsi-sri spring 2006 meeting evaluation system. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2006); Lecture Notes in Computer Science. Springer, 2006. [ bib ]
Keywords: Report_VI, IM2.AP
[82] A. Buttfield and J. del R. Millán. Online classifier adaptation in brain-computer interfaces. IDIAP-RR 16, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Brain-computer interfaces (BCIs) aim to provide a new channel of communication by enabling the subject to control an external system using purely mental commands. One method of doing this without invasive surgical procedures is to measure the electrical activity of the brain on the scalp through electroencephalography (EEG). A major obstacle to developing complex EEG-based BCI systems that provide a number of intuitive mental commands is the high variability of EEG signals. EEG signals from the same subject vary considerably within a single session and between sessions on the same or different days. To deal with this, we are investigating methods of adapting the classifier while it is being used by the subject. By keeping the classifier constantly tuned to the EEG signals of the current session, we hope to improve the performance of the classifier and allow the subject to learn to use the BCI more effectively. This paper discusses preliminary offline and online experiments towards this goal, focusing on the initial training period, when the task that the subject is trying to achieve is known and thus supervised adaptation methods can be used. In these experiments the subjects were asked to perform three mental commands (imagination of left and right hand movements, and a language task), and the EEG signals were classified with a Gaussian classifier.

Keywords: Report_VI, IM2.BMI
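The supervised online adaptation described in this abstract can be illustrated with a minimal sketch: a nearest-mean Gaussian classifier whose class means are nudged toward each incoming labelled feature vector. The class name and learning rate below are hypothetical illustrations, not taken from the report.

```python
import numpy as np

class OnlineGaussianClassifier:
    """Minimal Gaussian classifier with supervised online adaptation.
    Each class is a spherical Gaussian with equal variance, so the
    maximum-likelihood decision reduces to the nearest class mean.
    Illustrative sketch only."""

    def __init__(self, n_classes, n_dims, lr=0.05):
        self.means = np.zeros((n_classes, n_dims))
        self.lr = lr  # adaptation rate (hypothetical parameter)

    def predict(self, x):
        # Nearest-mean decision over the class means.
        d = np.linalg.norm(self.means - x, axis=1)
        return int(np.argmin(d))

    def adapt(self, x, label):
        # Supervised update: pull the true class mean toward the sample,
        # keeping the classifier tuned to the current session's signals.
        self.means[label] += self.lr * (x - self.means[label])
```

In the supervised initial-training setting the true label (the mental command the subject was asked to perform) is known, so every trial can drive such an update.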
[83] O. Koval, S. Voloshynovskiy, T. Holotyak, and T. Pun. Information-theoretic analysis of steganalysis in real images. In ACM Multimedia and Security Workshop 2006, Geneva, Switzerland, 2006. [ bib | .ps ]
Keywords: Report_VI, IM2.MPR
[84] H. Chiquet, F. Evéquoz, and D. Lalanne. Elcano, a tangible multimedia browser (demo). In Symposium on User Interface Software and Technology (UIST 2006), pages 51–52, Montreux (Switzerland), 2006. [ bib ]
Keywords: Report_VI, IM2.HMI
[85] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt. Incremental learning for place recognition in dynamic environments. IDIAP-RR 52, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Vision-based place recognition is a desirable feature for an autonomous mobile system. In order to work in realistic scenarios, a visual recognition algorithm should have two key properties: robustness and adaptability. This paper focuses on the latter and presents a discriminative incremental learning approach to place recognition. We use a recently introduced version of the fixed-partition incremental SVM, which makes it possible to control the memory requirements as the system updates its internal representation. At the same time, it preserves the recognition performance of the batch algorithm and runs online. In order to assess the method, we acquired a database capturing the intrinsic variability of places over time. Extensive experiments show the power and the potential of the approach.

Keywords: Report_VI, IM2.MPR.HMI, joint publication
[86] R. Rienks, D. Zhang, D. Gatica-Perez, and W. Post. Detection and application of influence rankings in small group meetings. In ICMI '06: Proceedings of the 8th international conference on Multimodal interfaces, pages 257–264, New York, NY, USA, 2006. ACM Press. [ bib | DOI ]
Keywords: Report_VI, IM2.MPR
[87] G. Tur, U. Guz, and D. Hakkani-Tur. Model adaptation for dialog act tagging. Proc. IEEE/ACL Workshop on Spoken Language Technology, 2006. [ bib ]
Keywords: Report_VI, IM2.AP
[88] M. Radgohar, F. Evéquoz, and D. Lalanne. Phong, augmenting virtual and real gaming experience (demo). In Symposium on User Interface Software and Technology (UIST 2006), pages 71–72, Montreux (Switzerland), 2006. [ bib ]
Keywords: Report_VI, IM2.HMI
[89] C. Dimitrakakis. Ensembles for sequence learning. PhD thesis, École Polytechnique Fédérale de Lausanne, 2006. [ bib | .ps.gz | .pdf ]
This thesis explores the application of ensemble methods to sequential learning tasks. The focus is on the development and the critical examination of new methods or novel applications of existing methods, with emphasis on supervised and reinforcement learning problems. In both types of problems, even after having observed a certain amount of data, we are often faced with uncertainty as to which hypothesis is correct among all the possible ones. However, in many methods for both supervised and for reinforcement learning problems this uncertainty is ignored, in the sense that there is a single solution selected out of the whole of the hypothesis space. Apart from the classical solution of analytical Bayesian formulations, ensemble methods offer an alternative approach to representing this uncertainty. This is done simply through maintaining a set of alternative hypotheses. The sequential supervised problem considered is that of automatic speech recognition using hidden Markov models. The application of ensemble methods to the problem represents a challenge in itself, since most such methods can not be readily adapted to sequential learning tasks. This thesis proposes a number of different approaches for applying ensemble methods to speech recognition and develops methods for effective training of phonetic mixtures with or without access to phonetic alignment data. Furthermore, the notion of expected loss is introduced for integrating probabilistic models with the boosting approach. In some cases substantial improvements over the baseline system are obtained. In reinforcement learning problems the goal is to act in such a way as to maximise future reward in a given environment. In such problems uncertainty becomes important since neither the environment nor the distribution of rewards that result from each action are known. This thesis presents novel algorithms for acting nearly optimally under uncertainty based on theoretical considerations. 
Some ensemble-based representations of uncertainty (including a fully Bayesian model) are developed and tested on a few simple tasks resulting in performance comparable with the state of the art. The thesis also draws some parallels between a proposed representation of uncertainty based on gradient-estimates and on “prioritised sweeping” and between the application of reinforcement learning to controlling an ensemble of classifiers and classical supervised ensemble learning methods.

Keywords: Report_VI, IM2.MPR, Ensembles, boosting, bagging, mixture of experts, speech recognition, reinforcement learning, exploration-exploitation, uncertainty, sequence learning, sequential decision making
[90] A. Buttfield, P. W. Ferrez, and J. del R. Millán. Towards a robust bci: error potentials and online learning. IEEE Trans. on Neural Systems and Rehabilitation Engineering, 14(2):164–168, 2006. [ bib | .pdf ]
Recent advances in the field of Brain-Computer Interfaces (BCIs) have shown that BCIs have the potential to provide a powerful new channel of communication, completely independent of muscular and nervous systems. However, while there have been successful laboratory demonstrations, there are still issues that need to be addressed before BCIs can be used by non-experts outside the laboratory. At IDIAP we have been investigating several areas that we believe will allow us to improve the robustness, flexibility and reliability of BCIs. One area is the recognition of cognitive error states, that is, identifying errors through the brain's reaction to mistakes. The production of these error potentials (ErrP) in reaction to an error made by the user is well established. We have extended this work by identifying a similar but distinct ErrP that is generated in response to an error made by the interface (a misinterpretation of a command that the user has given). This ErrP can be satisfactorily identified in single trials and can be demonstrated to improve the theoretical performance of a BCI. A second area of research is online adaptation of the classifier. BCI signals change over time, both between sessions and within a single session, due to a number of factors. This means that a classifier trained on data from a previous session will probably not be optimal for a new session. In this paper we present preliminary results from our investigations into supervised online learning that can be applied in the initial training phase. We also discuss the future direction of this research, including the combination of these two currently separate issues to create a potentially very powerful BCI.

Keywords: Report_VI, IM2.BMI
[91] T. I. Alecu, S. Voloshynovskiy, and T. Pun. The gaussian transform of distributions: definition, computation and application. IEEE Trans. on Signal Processing, 54(8):2976–2995, 2006. [ bib ]
Keywords: Report_VI, IM2.MPR
[92] D. Moore. The juicer lvcsr decoder - user manual for juicer version 0.5.0. IDIAP-COM 03, IDIAP, 2006. [ bib | .ps.gz | .pdf ]
Juicer is a decoder for HMM-based large vocabulary speech recognition that uses a weighted finite state transducer (WFST) representation of the search space. The package consists of a number of command line utilities: the Juicer decoder itself, along with a number of tools and scripts that are used to combine the various ASR knowledge sources (language model, pronunciation dictionary, acoustic models) into a single, optimised WFST that is input to the decoder.

Keywords: Report_VI, IM2.AP
[93] A. Pozdnoukhov. Prior knowledge in kernel methods. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2006. PhD Thesis #3606 at the École Polytechnique Fédérale de Lausanne (IDIAP-RR 06-66). [ bib | .ps.gz | .pdf ]
Machine Learning is a modern and actively developing field of computer science, devoted to extracting and estimating dependencies from empirical data. It combines such fields as statistics, optimization theory and artificial intelligence. In practical tasks, the general aim of Machine Learning is to construct algorithms able to generalize and predict in previously unseen situations based on some set of examples. Given some finite information, Machine Learning provides ways to extract knowledge, describe, explain and predict from data. Kernel Methods are one of the most successful branches of Machine Learning. They make it possible to apply linear algorithms, with well-founded properties such as generalization ability, to non-linear real-life problems. The Support Vector Machine is a well-known example of a kernel method, which has found a wide range of applications in data analysis nowadays. In many practical applications, some additional prior knowledge is often available. This can be knowledge about the data domain, invariant transformations, inner geometrical structures in the data, some properties of the underlying process, etc. If used smartly, this information can provide significant improvement to any data processing algorithm. Thus, it is important to develop methods for incorporating prior knowledge into data-dependent models. The main objective of this thesis is to investigate approaches towards learning with kernel methods using prior knowledge. Invariant learning with kernel methods is considered in more detail. In the first part of the thesis, kernels are developed which incorporate prior knowledge on invariant transformations. They apply when the desired transformations produce an object around every example, assuming that all points in the given object share the same class. Different types of objects, including hard geometrical objects and distributions, are considered. These kernels were then applied to image classification with Support Vector Machines.
Next, algorithms which specifically include prior knowledge are considered. An algorithm which linearly classifies distributions by their domain was developed. It is constructed such that kernels can be applied to solve non-linear tasks. Thus, it combines the discriminative power of support vector machines with the well-developed framework of generative models. It can be applied to a number of real-life tasks in which data are represented as distributions. In the last part of the thesis, the use of unlabelled data as a source of prior knowledge is considered. The technique of modelling the unlabelled data with a graph is taken as a baseline from semi-supervised manifold learning. For classification problems, we use this approach to build graph models of invariant manifolds. For regression problems, we use unlabelled data to take into account the inner geometry of the input space. To conclude, in this thesis we developed a number of approaches for incorporating prior knowledge into kernel methods. We proposed invariant kernels for existing algorithms, developed new algorithms and adapted a technique taken from semi-supervised learning to invariant learning. In all these cases, links with related state-of-the-art approaches were investigated. Several illustrative experiments were carried out on real data on optical character recognition, face image classification, brain-computer interfaces, and a number of benchmark and synthetic datasets.

Keywords: Report_VI, IM2.MPR
[94] G. Heusch and S. Marcel. A novel statistical generative model dedicated to face recognition. Idiap-RR-39-2007, IDIAP, 2007. [ bib ]
In this paper, a novel statistical generative model to describe a face is presented and applied to the face authentication task. Classical generative models used so far in face recognition, such as Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) for instance, make strong assumptions on the observations derived from a face image. Indeed, such models usually assume that local observations are independent, which is obviously not the case in a face. The presented model hence proposes to encode relationships between salient facial features by using a static Bayesian Network. Since robustness against imprecisely located faces is of great concern in a real-world scenario, authentication results are presented using automatically localised faces. Experiments conducted on the XM2VTS and BANCA databases showed that the proposed approach is suitable for this task, since it reaches state-of-the-art results. We compare our model to baseline appearance-based systems (Eigenfaces and Fisherfaces) but also to classical generative models, namely GMM, HMM and pseudo-2DHMM.

Keywords: IM2.VP, Report_VII
[95] P. Motlicek, S. Ganapathy, H. Hermansky, and H. Garudadri. Scalable wide-band audio codec based on frequency domain linear prediction. IDIAP-RR 16, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
This paper proposes a technique for wide-band audio applications based on the predictability of the temporal evolution of Quadrature Mirror Filter (QMF) sub-band signals. An input audio signal is first decomposed into 64 frequency sub-band signals using QMF decomposition. The temporal envelopes in critically sampled QMF sub-bands are approximated using frequency domain linear prediction applied over relatively long time segments (e.g. 1000 ms). Line Spectral Frequency parameters related to autoregressive models are computed and quantized in each frequency sub-band. The sub-band residual signals are quantized in the frequency domain using a split Vector Quantization (VQ) technique. In the decoder, the sub-band signal is reconstructed using the quantized residual and the corresponding quantized envelope. Finally, application of inverse QMF reconstructs the audio signal. Even with simple quantization techniques and without any psychoacoustic model, the proposed audio coder provides encouraging results on objective quality tests.

Keywords: Report_VI, IM2.AP
[96] H. Hung, D. Jayagopi, C. Yeo, G. Friedland, S. Ba, J. M. Odobez, K. Ramchandran, N. Mirghafori, and D. Gatica-Perez. Using audio and video features to classify the most dominant person in meetings. In Proceedings of ACM Multimedia 2007, pages 835–838, Augsburg, Germany, 2007. [ bib ]
Keywords: Report_VII, IM2.AP.VP, joint publication
[97] F. Orabona, C. Castellini, B. Caputo, J. Luo, and G. Sandini. Indoor place recognition using online independent support vector machines. In 18th British Machine Vision Conference (BMVC07), pages 1090–1099, 2007. [ bib ]
In the framework of indoor mobile robotics, place recognition is a challenging task, where it is crucial that self-localization be enforced precisely, notwithstanding the changing conditions of illumination, objects being shifted around and/or people affecting the appearance of the scene. In this scenario online learning seems the main way out, thanks to the possibility of adapting to changes in a smart and flexible way. Nevertheless, standard machine learning approaches usually suffer when confronted with massive amounts of data and when asked to work online. Online learning requires a high training and testing speed, all the more in place recognition, where a continuous flow of data comes from one or more cameras. In this paper we follow the Support Vector Machines-based approach of Pronobis et al., proposing an improvement that we call Online Independent Support Vector Machines. This technique exploits linear independence in the image feature space to incrementally keep the size of the learning machine remarkably small while retaining the accuracy of a standard machine. Since the training and testing time crucially depend on the size of the machine, this solves the above stated problems. Our experimental results prove the effectiveness of the approach.

Keywords: IM2.VP, Report_VII
[98] P. Bouillon, M. Rayner, B. Novellas Vall, M. Starlander, M. Santaholma, Y. Nakao, and N. Chatzichrisafis. Une grammaire partagée multi-tâche pour le traitement de la parole : application aux langues romanes. TAL (Traitement Automatique des Langues), 47(3), 2007. [ bib ]
Keywords: Report_VI, IM2.HMI
[99] H. Paugam-Moisy, R. Martinez, and S. Bengio. A supervised learning approach based on stdp and polychronization in spiking neuron networks. In European Symposium on Artificial Neural Networks, ESANN, 2007. IDIAP-RR 06-54. [ bib | .ps.gz | .pdf ]
We propose a network model of spiking neurons, without pre-imposed topology, driven by STDP (Spike-Time-Dependent Plasticity), a biologically observed form of temporal Hebbian unsupervised learning. The model is further driven by a supervised learning algorithm, based on a margin criterion, that acts on the synaptic delays linking the network to the output neurons, with classification as the goal task. The network processing and the resulting performance are completely explainable by the concept of polychronization, proposed by Izhikevich. The model emphasizes the computational capabilities of this concept.

Keywords: Report_VI, IM2.MPR
[100] A. Vinciarelli and S. Favre. Role recognition in radio programs using social affiliation networks and mixtures of discrete distributions: an approach inspired by social cognition. Idiap-RR-40-2007, IDIAP, 2007. Submitted for publication. [ bib ]
This paper presents an approach for the recognition of the roles played by speakers participating in radio programs. The approach is inspired by social cognition, i.e. by the way humans make sense of people they do not know, and it includes unsupervised speaker clustering performed with Hidden Markov Models, Social Network Analysis and Mixtures of Bernoulli and Multinomial Distributions. The experiments are performed over two corpora of radio programs for a total of around 45 hours of material. The results show that more than 80 percent of the data time can be labeled correctly in terms of role.

Keywords: IM2.MCA, Report_VII
[101] S. Marcel. Joint bi-modal face and speaker authentication using explicit polynomial expansion. IDIAP-RR 14, IDIAP, 2007. Submitted for publication. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.MPR
[102] E. Szekely, E. Bruno, and S. Marchand-Maillet. Clustered multidimensional scaling for exploration in information retrieval. In International Conference on the Theory of Information Retrieval, Budapest, Hungary, 2007. Submitted. [ bib ]
Keywords: Report_VI, IM2.MCA
[103] H. Bunke and T. Varga. Off-line roman cursive handwriting recognition. Digital Document Processing: Major Directions and Recent Advances, 20:165–173, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP
[104] M. Levit, D. Hakkani-Tur, G. Tur, and D. Gillick. Integrating several annotation layers for statistical information distillation. IEEE workshop on Automatic Speech Recognition and Understanding (ASRU 07), Kyoto, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[105] J. Kronegg, G. Chanel, S. Voloshynovskiy, and T. Pun. Eeg-based synchronized brain-computer interfaces: a model for optimizing the number of mental tasks. IEEE Trans. on Neural Systems and Rehabilitation Engineering, 15(1):50–58, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[106] E. Shriberg. Higher level features in speaker recognition. In C. Muller, editor, Speaker Classification I. Lecture Notes in Computer Science, Springer, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[107] A. Lisowska, M. Betrancourt, S. Armstrong, and M. Rajman. Minimizing modality bias when exploring input preference for multimodal systems in new domains: the archivus case study. In CHI' 07, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, joint publication, major, April 28 - May 3
[108] M. Huijbregts, C. Wooters, and R. Ordelman. Filtering the unknown: speech activity detection in heterogeneous video collections. In Proceedings of Interspeech, Antwerp, 2007. To appear. [ bib ]
Keywords: Report_VI, IM2.AP
[109] J. Dines and M. Magimai-Doss. A study of phoneme and grapheme based context-dependent asr systems. IDIAP-RR 12, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
In this paper we present a study of automatic speech recognition systems using context-dependent phonemes and graphemes as sub-word units based on the conventional HMM/GMM system as well as tandem system. Experimental studies conducted on three different continuous speech recognition tasks show that systems using only context-dependent graphemes can yield competitive performance on small to medium vocabulary tasks when compared to a context-dependent phoneme-based automatic speech recognition system. In particular, we demonstrate the utility of tandem features that use an MLP trained to estimate phoneme posterior probabilities in improving grapheme based recognition system performance by incorporating phonemic knowledge into the system without having to explicitly define a phonetically transcribed lexicon.

Keywords: Report_VI, IM2.AP, major
[110] J. Dines and J. Vepa. Direct optimisation of a multilayer perceptron for the estimation of cepstral mean and variance statistics. IDIAP-RR 13, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
We propose an alternative means of training a multilayer perceptron for the task of speech activity detection, based on a criterion that minimises the error in the estimation of mean and variance statistics for speech cepstrum-based features using the Kullback-Leibler divergence. We present our baseline and proposed speech activity detection approaches for multi-channel meeting room recordings, and demonstrate the effectiveness of the new criterion by comparing the two approaches when used to carry out cepstral mean and variance normalisation of features in our meeting ASR system.

Keywords: Report_VI, IM2.AP
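The Kullback-Leibler divergence between two univariate Gaussians, the quantity underlying the training criterion described above, has a standard closed form. The sketch below states that identity; it is a generic illustration, not code from the report:

```python
import math

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence KL( N(mu0, var0) || N(mu1, var1) ) in nats,
    using the closed form for univariate Gaussians:
    0.5 * ( log(var1/var0) + (var0 + (mu0-mu1)^2)/var1 - 1 )."""
    return 0.5 * (math.log(var1 / var0)
                  + (var0 + (mu0 - mu1) ** 2) / var1
                  - 1.0)
```

The divergence is zero only when the two distributions coincide, which is what makes it usable as an error measure on estimated mean and variance statistics.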
[111] C. Gaudard, G. Aradilla, and H. Bourlard. Speech recognition based on template matching and phone posterior probabilities. IDIAP-COM 02, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.AP
[112] F. Valente, J. Vepa, and H. Hermansky. Multi-stream features combination based on dempster-shafer rule for lvcsr system. In Interspeech 2007, 2007. IDIAP-RR 07-09. [ bib | .ps.gz | .pdf ]
This paper investigates the combination of two streams of acoustic features. Extending our previous work on a small-vocabulary task, we show that combination based on the Dempster-Shafer rule outperforms several classical rules, such as sum, product and inverse entropy weighting, even in LVCSR systems. We analyze the results in terms of Frame Error Rate and Cross Entropy measures. The experimental framework uses the meeting transcription task, and results are provided on RT05 evaluation data. The results are consistent with what has been previously observed on smaller databases.

Keywords: Report_VI, IM2.AP.MPR, joint publication
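Dempster's rule of combination, on which the stream combination above is based, can be sketched generically for two mass functions over a frame of discernment. This is an illustration of the rule itself, not of the authors' LVCSR system; the toy masses at the end are invented for the example:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass)
    with Dempster's rule: multiply masses of intersecting focal sets
    and renormalise by 1 - K, where K is the total conflict (mass
    assigned to empty intersections)."""
    combined = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc
    # Assumes the two sources are not totally conflicting (K < 1).
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Two "streams" voting over classes {p, q}: each assigns mass to the
# singletons and to the full frame (its residual uncertainty).
A, B = frozenset("p"), frozenset("q")
frame = A | B
fused = dempster_combine({A: 0.6, B: 0.1, frame: 0.3},
                         {A: 0.5, B: 0.2, frame: 0.3})
```

Unlike a plain sum or product rule, mass left on the full frame lets a stream abstain, which is one reason the rule can behave differently from the classical combinations cited above.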
[113] P. W. Ferrez and J. del R. Millán. Error-related eeg potentials in brain-computer interfaces. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K. R. Müller, editors, Towards Brain-Computer Interfacing. The MIT Press, 2007. [ bib ]
Brain-computer interfaces (BCI), like any other interaction modality based on physiological signals and body channels (e.g., muscular activity, speech and gestures), are prone to errors in the recognition of the subject's intent. An elegant approach to improving the accuracy of BCIs consists in a verification procedure directly based on the presence of error-related potentials (ErrP) in the EEG recorded right after the occurrence of an error. Most of these studies show the presence of ErrP in typical choice reaction tasks, where subjects respond to a stimulus and ErrP arise following errors due to the subject's incorrect motor action. However, in the context of a BCI, the central question is: "Are ErrP also elicited when the error is made by the interface during the recognition of the subject's intent?" We have thus explored whether ErrP also follow a feedback indicating incorrect responses of the interface, and no longer errors of the subject himself. Four healthy volunteer subjects participated in a simple human-robot interaction experiment (i.e., bringing the robot to either the left or right side of a room), which seems to reveal a new kind of ErrP. These "interaction ErrP" exhibit a first sharp negative peak followed by a broader positive peak and a second negative peak (270, 400 and 550 ms after the feedback, respectively). But in order to exploit these ErrP, we need to detect them in each single trial, using a short window following the feedback that shows the response of the classifier embedded in the BCI. We have achieved an average recognition rate of correct and erroneous single trials of 83.7% and 80.2%, respectively. We also show that the integration of these ErrP in a BCI, where the subject's intent is not executed if an ErrP is detected, significantly improves the performance of the BCI.

Keywords: IM2.BCI, Report_VII
[114] H. Bunke, P. Dickinson, A. Humm, C. Irniger, and M. Kraetzl. Graph sequence visualisation and its application to computer network monitoring and abnormal event detection. In A. Kandel, H. Bunke, and M. Last, editors, Applied Graph Theory in Computer Vision and Pattern Recognition, pages 227–245. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP, joint publication
[115] H. Hung, D. Jayagopi, C. Yeo, G. Friedland, S. Ba, J. M. Odobez, K. Ramchandran, N. Mirghafori, and D. Gatica-Perez. Using audio and video features to classify the most dominant person in a group meeting. 2007. IDIAP-RR 07-29. [ bib ]
The automated extraction of semantically meaningful information from multi-modal data is becoming increasingly necessary due to the escalation of captured data for archival. A novel area of multi-modal data labelling, which has received relatively little attention, is the automatic estimation of the most dominant person in a group meeting. In this paper, we provide a framework for detecting dominance in group meetings using different audio and video cues. We show that by using a simple model for dominance estimation we can obtain promising results.

Keywords: Report_VI, IM2.MPR, joint publication
[116] A. Lisowska, S. Armstrong, M. Melichar, M. Ailomaa, and M. Rajman. The wizard of oz meets multimodal language-enabled gui interfaces: new challenges. In Proceedings of CHI '07, Beyond Current User Research: Designing Methods for New Users, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, joint publication, April 28 - May 3
[117] W. Li and H. Bourlard. Non-linear spectral stretching for in-car speech recognition. In Interspeech, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[118] S. Renals, T. Hain, and H. Bourlard. Recognition and understanding of meetings: the ami and amida projects. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU'07, pages 238–247, 2007. IDIAP-RR 07-46. [ bib | DOI ]
The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty meetings. Within these projects we have: developed an infrastructure for recording meetings using multiple microphones and cameras; released a 100 hour annotated corpus of meetings; developed techniques for the recognition and interpretation of meetings based primarily on speech recognition and computer vision; and developed an evaluation framework at both component and system levels. In this paper we present an overview of these projects, with an emphasis on speech recognition and content extraction.

Keywords: IM2.MCA, Report_VII
[119] J. Kittler, N. Poh, O. Fatukasi, K. Messer, K. Kryszczuk, J. Richiardi, and A. Drygajlo. Quality dependent fusion of intramodal and multimodal biometric experts. In Proc. SPIE Defense and Security Symposium, Orlando, USA, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[120] D. Morrison, S. Marchand-Maillet, and E. Bruno. Hierarchical long-term learning for automatic image annotation. In Proceedings 2nd International Conference on Semantic and Digital Media Technologies, Genova, Italy, 2007. [ bib ]
Keywords: Report_VII, IM2.MCA
[121] F. Lüthy, T. Varga, and H. Bunke. Using hidden markov models as a tool for handwritten text line segmentation. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 8–12, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[122] R. Chavarriaga, P. W. Ferrez, and J. del R. Millán. To err is human: learning from error potentials in brain-computer interfaces. In 1st International Conference on Cognitive Neurodynamics (ICCN 2007), 2007. IDIAP-RR 07-37. [ bib | .ps.gz | .pdf ]
Several studies describe evoked EEG potentials elicited when a subject is aware of an erroneous decision, either taken by him or by an external interface. This paper studies Error-related potentials (ErrP) elicited when a human user monitors an external system upon which he has no control whatsoever. In addition, the possibility of using the ErrPs as learning signals to infer the user's intended strategy is also addressed. Experimental results show that single-trial recognition of correct and error trials can be achieved, allowing the fast learning of the user's strategy. These results may constitute the basis of a new kind of human-computer interaction where the former provides monitoring signals that can be used to modify the performance of the latter. This work has been supported by the Swiss National Science Foundation NCCR-IM2 and by the EC-contract number BACS FP6-IST-027140. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

Keywords: Report_VI, IM2.BMI
[123] A. Drygajlo. Multimodal biometrics for identity documents and smart cards european challenge. In Proc. 15th European Signal Processing Conf. (EUSIPCO), Poznan, Poland, 2007. (invited paper). [ bib ]
Keywords: Report_VI, IM2.MPR
[124] S. Marcel and J. del R. Millán. Person authentication using brainwaves (eeg) and maximum a posteriori model adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Biometrics, 2007. IDIAP-RR 05-81. [ bib | .ps.gz | .pdf ]
In this paper, we investigate the use of brain activity for person authentication. It has been shown in previous studies that the brain-wave pattern of every individual is unique and that the electroencephalogram (EEG) can be used for biometric identification. EEG-based biometry is an emerging research topic and we believe that it may open new research directions and applications in the future. However, very little work has been done in this area, and it has focused mainly on person identification rather than person authentication. Person authentication aims to accept or to reject a person claiming an identity, i.e., comparing biometric data to one template, while the goal of person identification is to match the biometric data against all the records in a database. We propose the use of a statistical framework based on Gaussian Mixture Models and Maximum A Posteriori model adaptation, successfully applied to speaker and face authentication, which can deal with only one training session. We perform intensive experimental simulations using several strict train/test protocols to show the potential of our method. We also show that there are some mental tasks that are more appropriate for person authentication than others.

Keywords: Report_VI, IM2.BMI
[125] E. Kokiopoulou and P. Frossard. Image alignment with rotation manifolds built on sparse geometric expansions. In IEEE International Workshop on Multimedia Signal Processing, 2007. [ bib | http ]
Keywords: Report_VI, IM2.VP
[126] M. Sorci, G. Antonini, and J. Ph. Thiran. Fisher's discriminant and relevant component analysis for static facial expression classification. In 15th European Signal Processing Conference (EUSIPCO), Poznan, Poland, 2007. ITS. [ bib | http ]
Keywords: Report_VI, LTS5; Facial expression recognition; Dimensionality reduction; IM2.VP
[127] U. Guz, S. Cuendet, D. Hakkani-Tur, and G. Tur. Co-training using prosodic and lexical information for sentence segmentation. to appear in Proceedings of Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[128] B. Noris, K. Benmachiche, J. Meynet, J. Ph. Thiran, and A. Billard. Analysis of head mounted wireless camera videos for early diagnosis of autism. In International Conference on Recognition Systems, 2007. [ bib | http ]
Keywords: Report_VI, LTS5, IM2.VP
[129] J. Richiardi, K. Kryszczuk, and A. Drygajlo. Quality measures in unimodal and multimodal biometric verification. In Proc. 15th European Signal Processing Conf. (EUSIPCO), Poznan, Poland, 2007. (invited paper). [ bib ]
Keywords: Report_VI, IM2.MPR
[130] M. Liwicki, E. Indermühle, and H. Bunke. On-line handwritten text line detection using dynamic programming. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 447–451, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[131] A. Popescu-Belis and S. Zufferey. Contrasting the automatic identification of two discourse markers in multiparty dialogues. In Proceedings of SIGDIAL 2007, 8th SIGdial Workshop on Discourse and Dialogue, page 10, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[132] S. Cuendet, D. Hakkani-Tur, E. Shriberg, J. Fung, and B. Favre. Cross-genre feature comparisons for spoken sentence segmentation. International Conference on Semantic Computing (ICSC), Irvine, CA, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[133] X. Perrin, R. Chavarriaga, R. Siegwart, and J. del R. Millán. Bayesian controller for a novel semi-autonomous navigation concept. In 3rd European Conference on Mobile Robots (ECMR 2007), 2007. IDIAP-RR 07-26. [ bib | .ps.gz | .pdf ]
This paper presents a novel concept of semi-autonomous navigation where a mobile robot evolves autonomously under the monitoring of a human user. The user provides corrective commands to the robot whenever he disagrees with the robot's navigational choices. These commands are not related to navigational values like directions or goals, but to the relevance of the robot's actions to the overall task. A binary error signal is used to correct the robot's decisions and to bring it to the desired goal location. This simple interface could easily be adapted to input systems designed for disabled people, offering them a convenient alternative to existing assistive systems. After a description of the whole concept, a special focus is given to the decisional process, which takes into account, in a Bayesian way, the environment perceived by the robot and the user-generated signals in order to propose a navigational strategy to the human user. The strength and advantages of the proposed semi-autonomous concept are illustrated with two experiments. Keywords: semi-autonomous navigation, error signal, probabilistic reasoning, human-machine interaction.

Keywords: Report_VI, IM2.BMI, major
[134] K. Livescu, O. Cetin, M. Hasegawa-Johnson, S. King, C. Bartels, N. Borges, A. Kantor, P. Lal, L. Yung, A. Bezman, S. Dawson-Haggerty, B. Woods, J. Frankel, M. Magimai-Doss, and K. Saenko. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 jhu summer workshop. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[135] W. Li, J. Dines, and M. Magimai-Doss. Robust overlapping speech recognition based on neural networks. Idiap-RR Idiap-RR-55-2007, IDIAP, 2007. [ bib ]
We address issues for improving hands-free speech recognition performance in the presence of multiple simultaneous speakers using multiple distant microphones. In this paper, a log spectral mapping is proposed to estimate the log mel-filterbank outputs of clean speech from multiple noisy speech channels using neural networks. Both the mapping of the far-field speech and the combination of the enhanced speech and the estimated interfering speech are investigated. Our neural network based feature enhancement method incorporates the noise information and can be viewed as a non-linear log spectral subtraction. Experimental studies on the MONC corpus showed that MLP-based mapping techniques yield an improvement in the recognition accuracy for the overlapping speech.

Keywords: IM2.AP, Report_VII
[136] M. Liwicki and H. Bunke. Feature selection for on-line handwriting recognition of whiteboard notes. In Proc. 13th Conf. of the Graphonomics Society, pages 101–105, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[137] R. Bertolami and H. Bunke. Multiple classifier methods for offline handwritten text line recognition. In M. Haindl, J. Kittler, and F. Roli, editors, Multiple Classifier Systems, volume 4472 of Lecture Notes in Computer Science, pages 72–81. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.VP
[138] A. Humm, J. Hennebert, and R. Ingold. Hidden markov models for spoken signature verification. 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[139] M. Huijbregts and C. Wooters. The blame game: Performance analysis of speaker diarization system components. to appear in Proc. Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[140] F. Cincotti, L. Kauhanen, and F. Aloise. Vibrotactile feedback for brain-computer interface operation. Computational Intelligence and Neuroscience, 2007:Article ID 48937, 2007. doi:10.1155/2007/48937. [ bib ]
Keywords: Report_VI, IM2.BMI
[141] P. Bouillon, G. Flores, M. Starlander, N. Chatzichrisafis, M. Santaholma, N. Tsourakis, M. Rayner, and B. A. Hockey. A bidirectional grammar-based medical speech translator. In Proceedings of workshop on Grammar-based approaches to spoken language processing, pages 41–48. ACL 2007, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, ACL 2007, June 29
[142] K. Ansari-Asl, G. Chanel, and T. Pun. A channel selection method for eeg classification in emotion assessment based on synchronization likelihood. In Eusipco 2007, 15th Eur. Signal Proc. Conf., Poznan, Poland, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[143] J. Kolar, Y. Liu, and E. Shriberg. Speaker adaptation of language models for automatic dialog act segmentation of meetings. to appear in Proceedings of Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[144] F. Einsele, J. Hennebert, and R. Ingold. Towards identification of very low resolution, anti-aliased characters. In IEEE International Symposium on Signal Processing and its Applications (ISSPA'07), Sharjah, United Arab Emirates, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[145] M. Neuhaus and H. Bunke. A quadratic programming approach to the graph edit distance problem. In F. Escolano and M. Vento, editors, Graph-Based Representations in Pattern Recognition, volume 4538 of Lecture Notes in Computer Science, pages 92–102. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP
[146] K. Livescu, A. Bezman, N. Borges, L. Yung, O. Cetin, J. Frankel, S. King, M. Magimai-Doss, X. Chi, and L. Lavoie. Manual transcription of conversational speech at the articulatory feature level. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[147] K. Kumatani, H. Mayer, T. Gehrig, E. Stoimenov, J. McDonough, and M. Wölfel. Minimum mutual information beamforming for simultaneous active speakers. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), number Idiap-RR-73-2007, pages 71–76, 2007. [ bib | DOI ]
In this work, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace, K0, and Gamma pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.

Keywords: IM2.AP, Report_VII
[148] K. Kumatani, H. Mayer, T. Gehrig, E. Stoimenov, J. McDonough, and M. Wölfel. Adaptive beamforming with a minimum mutual information criterion. volume 15, pages 2527–2541, 2007. [ bib | DOI ]
In this work, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace, K0, and Gamma pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.

Keywords: IM2.AP, Report_VII
[149] J. Richiardi and A. Drygajlo. Reliability-based voting schemes using modality-independent features in multi-classifier biometric authentication. In Proc. 7th Int. Workshop on Multiple Classifier Systems, Prague, Czech Republic, 2007. Springer. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[150] M. Germann, M. D. Breitenstein, I. K. Park, and H. Pfister. Automatic pose estimation for range images on the gpu. In Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007), pages 81–90. IEEE Computer Society, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[151] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. van Gool. Depth-from-recognition: inferring metadata by cognitive feedback. In ICCV'07 Workshop on 3D Representations for Recognition, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[152] A. Vinciarelli and S. Favre. Broadcast news story segmentation using social network analysis and hidden markov models. In ACM International Conference on Multimedia, pages 261–264, 2007. IDIAP-RR 07-30. [ bib ]
This paper presents an approach for the segmentation of broadcast news into stories. The main novelty of this work is that the segmentation process does not take into account the content of the news, i.e. what is said, but rather the structure of the social relationships between the persons involved in the news. The main rationale behind such an approach is that people interacting with each other are likely to talk about the same topics, thus social relationships are likely to be correlated with stories. The approach is based on Social Network Analysis (for the representation of social relationships) and Hidden Markov Models (for the mapping of social relationships into stories). The experiments are performed over 26 hours of radio news and the results show that a fully automatic process achieves a purity higher than 0.75.

Keywords: Report_VI, IM2.AP.MPR, joint publication
[153] P. Besson, V. Popovici, J. M. Vesin, J. Ph. Thiran, and M. Kunt. Extraction of audio features specific to speech production for multimodal speaker detection. IEEE Transactions on Multimedia, 2007. [ bib | DOI ]
Keywords: Report_VI, LTS1; LTS5; speaker detection; multimodal; feature extraction; besson p.; IM2.MPR
[154] J. P. Pinto, H. Bourlard, A. Graves, and H. Hermansky. Comparing different word lattice rescoring approaches towards keyword spotting. Idiap-RR-32-2007, IDIAP, 2007. Submitted for publication. [ bib ]
In this paper, we further investigate the large vocabulary continuous speech recognition approach to keyword spotting. Given a speech utterance, recognition is performed to obtain a word lattice. The posterior probability of keyword hypotheses in the lattice is computed and used to derive a confidence measure to accept/reject the keyword. We extend this framework and replace the acoustic likelihoods in the lattice obtained from a Gaussian mixture model (GMM) with likelihoods derived from a multilayered perceptron (MLP). We compare the two rescoring techniques on the conversational telephone speech database distributed by NIST for the spoken term detection evaluation. Experimental results show that GMM lattices still perform better than the rescored lattices for short and medium length keywords, but on longer keywords, the MLP rescored word lattices perform slightly better.

Keywords: IM2.AP, Report_VII
[155] D. Morrison, S. Marchand-Maillet, and E. Bruno. Hierarchical long-term learning for automatic image annotation. In International Conference on Semantics And digital Media Technologies (SAMT 2007), Genova, IT, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[156] I. Bogdanova, X. Bresson, J. Ph. Thiran, and P. Vandergheynst. Scale-space analysis and active contours for omnidirectional images. IEEE Transactions on Image Processing, 16(7):1888–1901, 2007. [ bib | DOI ]
Keywords: Report_VII, IM2.VP, joint publication, active contour; catadioptric camera; computer vision; LTS2; LTS5; omnidirection vision; scale-space; segmentation
[157] P. Müller, G. Zeng, P. Wonka, and L. van Gool. Image-based procedural modeling of facades. In Proceedings of ACM SIGGRAPH 2007 / ACM Transactions on Graphics, volume 26, New York, NY, USA, 2007. ACM Press. [ bib ]
Keywords: Report_VI, IM2.VP
[158] A. Vinciarelli, F. Fernàndez, and S. Favre. Semantic segmentation of radio programs using social network analysis and duration distribution modeling. In IEEE International Conference on Multimedia and Expo (ICME), 2007. IDIAP-RR 06-75. [ bib | .ps.gz | .pdf ]
This work presents and compares two approaches for the semantic segmentation of broadcast news: the first is based on Social Network Analysis, the second on Poisson Stochastic Processes. The experiments are performed over 27 hours of material: preliminary results are obtained by addressing the problem of splitting different episodes of the same program into two parts corresponding to a news bulletin and a talk-show, respectively. The results show that the transition point between the two parts can be detected with an average error of around three minutes, i.e. roughly 5 percent of each episode's duration.

Keywords: Report_VI, IM2.AP.MPR, joint publication
[159] F. Evéquoz and D. Lalanne. Indexing and visualizing digital memories through personal email archive. pages 21–24, 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[160] M. Liwicki, A. Schlapbach, P. Loretan, and H. Bunke. Automatic detection of gender and handedness from on-line handwriting. In Proc. 13th Conf. of the Graphonomics Society, pages 179–183, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[161] J. P. Pinto, P. R. M., B. Yegnanarayana, and H. Hermansky. Significance of contextual information in phoneme recognition. 2007. IDIAP-RR 07-28. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.AP
[162] M. Liwicki and H. Bunke. Combining on-line and off-line systems for handwriting recognition. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 372–376, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[163] K. Kryszczuk and A. Drygajlo. Q-stack: uni- and multimodal classifier stacking with quality measures. In Proc. 7th Int. Workshop on Multiple Classifier Systems, Prague, Czech Republic, 2007. Springer. [ bib ]
Keywords: Report_VI, IM2.MPR
[164] H. Bunke and M. Neuhaus. Graph matching – exact and error-tolerant methods and the automatic learning of edit costs. In D. J. Cook and L. B. Holder, editors, Mining Graph Data, pages 17–34. Wiley, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP
[165] S. Marcel, P. Abbet, and M. Guillemot. Google portrait. Idiap-Com Idiap-Com-07-2007, IDIAP, 2007. [ bib ]
This paper presents a system to retrieve and browse images from the Internet containing only one particular object of interest: the human face. This system, called Google Portrait, uses the Google Image search engine to retrieve images matching a text query and filters images containing faces using a face detector. Results are ranked by portraits, and a tagging module is provided to manually change the labels attached to faces.

Keywords: IM2.VP, Report_VI
[166] V. Pallotta, V. Seretan, and M. Ailomaa. User requirement analysis for meeting information retrieval based on query elicitation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 1008–1015, Prague, Czech Republic, 2007. Association for Computational Linguistics. [ bib | .pdf ]
Keywords: Report_VI, IM2.HMI
[167] D. Morrison, S. Marchand-Maillet, and E. Bruno. Automatic image annotation with relevance feedback and latent semantic analysis. In Workshop on Adaptive Multimedia Retrieval (AMR 2007), Paris, FR, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[168] A. Ess, A. Neubeck, and L. van Gool. Generalised linear pose estimation. In BMVC, 2007. in press. [ bib ]
Keywords: Report_VI, IM2.VP
[169] K. E. Ozden, K. Schindler, and L. van Gool. Simultaneous segmentation and 3d reconstruction of monocular image sequences. In International Conference on Computer Vision (ICCV'07), 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[170] B. Leibe, K. Schindler, and L. van Gool. Coupled detection and trajectory estimation for multi-object tracking. In International Conference on Computer Vision (ICCV'07), 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[171] A. Ess, B. Leibe, and L. van Gool. Depth and appearance for mobile scene analysis. In International Conference on Computer Vision (ICCV'07), 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[172] T. Quack, V. Ferrari, B. Leibe, and L. van Gool. Efficient mining of frequent and distinctive feature configurations. In International Conference on Computer Vision (ICCV'07), 2007. [ bib ]
Keywords: Report_VII, IM2.MCA
[173] M. Bray, E. Koller-Meier, and L. van Gool. Smart particle filtering for high-dimensional tracking. Computer Vision and Image Understanding, 2007. [ bib ]
Keywords: Report_VI, IM2.VP, stochastic meta-descent, importance sampling, smart particle filter, hand tracking
[174] K. Kryszczuk, J. Richiardi, and A. Drygajlo. Reliability estimation for multimodal error prediction and fusion. In Proc. 7th Int. Workshop on Pattern Recognition in Information Systems (PRIS 2007), Funchal, Portugual, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[175] J. del R. Millán, P. W. Ferrez, F. Galán, E. Lew, and R. Chavarriaga. Non-invasive brain-actuated interaction. In Proceedings of the 2nd International Symposium on Brain, Vision and Artificial Intelligence, volume 4729, Naples, Italy, 2007. [ bib | DOI ]
The promise of Brain-Computer Interfaces (BCI) technology is to augment human capabilities by enabling interaction with computers through a conscious and spontaneous modulation of the brainwaves after a short training period. Indeed, by analyzing brain electrical activity online, several groups have designed brain-actuated devices that provide alternative channels for communication, entertainment and control. Thus, a person can write messages using a virtual keyboard on a computer screen and also browse the internet. Alternatively, subjects can operate simple computer games, or brain games, and interact with educational software. Work with humans has shown that it is possible for them to move a cursor and even to drive a wheelchair. This paper briefly reviews the field of BCI, with a focus on non-invasive systems based on electroencephalogram (EEG) signals. It also describes three brain-actuated devices we have developed: a virtual keyboard, a brain game, and a wheelchair. Finally, it shortly discusses current research directions we are pursuing in order to improve the performance and robustness of our BCI system, especially for real-time control of brain-actuated robots.

Keywords: IM2.BMI, Report_VII
[176] F. Monay. Learning the structure of image collections with latent aspect models. 2007. IDIAP-RR 07-06. [ bib | .pdf ]
The approach to indexing an image collection depends on the type of data to organize. Satellite images are likely to be searched with latitude and longitude coordinates, medical images are often searched with an image example that serves as a visual query, and personal image collections are generally browsed by event. A more general retrieval scenario is based on the use of textual keywords to search for images containing a specific object, or representing a given scene type. This requires the manual annotation of each image in the collection to allow for the retrieval of relevant visual information based on a text query. This time-consuming and subjective process is the current price to pay for a reliable and convenient text-based image search. This dissertation investigates the use of probabilistic models to assist the automatic organization of image collections, attempting to link the visual content of digital images with a potential textual description. Relying on robust, patch-based image representations that have proven to capture a variety of visual content, our work proposes to model images as mixtures of latent aspects. These latent aspects are defined by multinomial distributions that capture patch co-occurrence information observed in the collection. An image is not represented by the direct count of its constituting elements, but as a mixture of latent aspects that can be estimated with principled, generative unsupervised learning methods. An aspect-based image representation therefore incorporates contextual information from the whole collection that can be exploited. This emerging concept is explored for several fundamental tasks related to image retrieval - namely classification, clustering, segmentation, and annotation - in what represents one of the first coherent and comprehensive studies of the subject.
We first investigate the possibility of classifying images based on their estimated aspect mixture weights, interpreting latent aspect modeling as an unsupervised feature extraction process. Several image categorization tasks are considered, where images are classified based on the present objects or according to their global scene type. We demonstrate that the concept of latent aspects allows us to take advantage of non-labeled data to infer a robust image representation that achieves a higher classification performance than the original patch-based representation. Secondly, further exploring the concept, we show that aspects can correspond to an interesting soft clustering of an image collection that can serve as a browsing structure. Images can be ranked given an aspect, illustrating the corresponding co-occurrence context visually. In the third place, we derive a principled method that relies on latent aspects to classify image patches into different categories. This produces an image segmentation based on the resulting spatial class-densities. We finally propose to model images and their captions with a single aspect model, merging the co-occurrence contexts of the visual and the textual modalities in different ways. Once a model has been learned, the distribution of words given an unseen image is inferred based on its visual representation, and serves as textual indexing. Overall, we demonstrate with extensive experiments that the co-occurrence context captured by latent aspects is suitable for the above mentioned tasks, making it a promising approach for multimedia indexing.

Keywords: Report_VI, IM2.MCA
[177] G. Aradilla and J. Ajmera. Detection and recognition of number sequences within spoken utterances. In 2nd Workshop on Speech in Mobile and Pervasive Environments, 2007. [ bib ]
In this paper we investigate the detection and recognition of sequences of numbers in spoken utterances. This is done in two steps: first, the entire utterance is decoded assuming that only numbers were spoken. In the second step, non-number segments (garbage) are detected based on word confidence measures. We compare this approach to conventional garbage models. Also, a comparison of several phone posterior based confidence measures is presented in this paper. The work is evaluated in terms of detection task (hit rate and false alarms) and recognition task (word accuracy) within detected number sequences. The proposed method is tested on German continuous spoken utterances where target content (numbers) is only 20%.

Keywords: Report_VII, IM2.AP
[178] F. Aloise, N. Caporusso, D. Mattia, F. Babiloni, L. Kauhanen, J. del R. Millán, M. Nuttin, M. G. Marciani, and F. Cincotti. Brain-machine interfaces through control of electroencephalographic signals and vibrotactile feedback. In Proceedings of the 12th International Conference on Human-Computer Interaction, volume 125, Beijing, China, 2007. [ bib ]
A Brain-Computer Interface (BCI) allows direct expression of its user's will by interpreting signals which directly reflect the brain's activity, thus bypassing the natural efferent channels (nerves and muscles). To be correctly mastered, this artificial efferent channel needs to be complemented by an artificial feedback, which continuously informs the user about the current state (in the same way as proprioceptors give feedback about joint angle and muscular tension). This feedback is usually delivered through the visual channel. We explored the benefits of vibrotactile feedback during users' training and control of EEG-based BCI applications. A protocol for delivering vibrotactile feedback, including specific hardware and software arrangements, was specified and implemented. Thirteen subjects participated in an experiment where the feedback of the BCI system was delivered either through a visual display or through a vibrotactile display, while they performed a virtual navigation task. Attention to the task was probed by presenting visual cues that the subjects had to describe afterwards. When compared with visual feedback, the use of tactile feedback did not decrease BCI control performance; on the other hand, it improved the capacity of subjects to concentrate on the requested (visual) task. During the experiments, vibrotactile feedback felt (after some training) more natural. This study indicated that the vibrotactile channel can function as a valuable feedback modality in the context of BCI applications. The advantages of using vibrotactile feedback emerged when the visual channel was highly loaded by a complex task.

Keywords: IM2.BCI, Report_VII
[179] K. Smith. Bayesian methods for visual multi-object tracking with applications to human activity recognition. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2007. Thèse sciences Ecole polytechnique fédérale de Lausanne EPFL, no 3745 (2007), Faculté des sciences et techniques de l'ingénieur STI, Section de génie électrique et électronique, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard, Daniel Gatica-Perez. [ bib ]
In recent years, we have seen a dramatic increase in the amount of video data recorded and stored around the world. Driven by the availability of low-cost video cameras, the ever-decreasing cost of digital media storage, and the explosion in popularity of video sharing across the Internet, there is a growing demand for sophisticated methods to automatically analyze and understand video content. One of the most fundamental processes to understanding video content is visual multi-object tracking, which is the process of locating, identifying, and determining the dynamic configuration of one or many moving (possibly deformable) objects in each frame of a video sequence. In this dissertation, we focus on a general probabilistic approach known as recursive state-space Bayesian estimation, which estimates the unknown probability distribution of the state of the objects recursively over time, using information extracted from video data. The central problem addressed in this dissertation is the development of novel probabilistic models using this framework to perform accurate, robust automatic visual multi-object tracking. In addressing this problem, we consider the following questions: What types of probabilistic models can we develop to improve the state-of-the-art, and where do the improvements come from? What benefits and drawbacks are associated with these models? How can we objectively evaluate the performance of a multi-object tracking model? How can a probabilistic multi-object tracking model be extended to perform human activity recognition tasks? Over the course of our work, we attempt to provide an answer to each of these questions, beginning with a proposal for a comprehensive set of measures and a formal evaluation protocol for evaluating multi-object tracking performance. 
We proceed by defining two new probabilistic tracking models: one which improves the efficiency of a state-of-the-art model, the Distributed Partitioned Sampling Particle Filter (DPS PF), and one which provides a formal framework for efficiently tracking a variable number of objects, the Reversible Jump Markov Chain Monte Carlo Particle Filter (RJMCMC PF). Using our proposed evaluation framework, we compare our proposed models with other state-of-the-art tracking methods in a meeting room head tracking task. Finally, we show how the RJMCMC PF can be applied to human activity recognition tasks such as detecting abandoned luggage items in a busy train terminal and determining if and when pedestrians look at an outdoor advertisement as they pass.

Keywords: IM2.VP, Report_VII
[180] J. Hennebert, A. Humm, and R. Ingold. Modelling spoken signatures with gaussian mixture model adaptation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 07), 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[181] A. Drygajlo. Man-machine voice communication, pages 433–461. EPFL Press, 2007. [ bib | DOI ]
Keywords: Report_VI, IM2.MPR
[182] O. Vinyals, G. Friedland, and N. Mirghafori. Revisiting a basic function on current cpus: a fast logarithm implementation with adjustable accuracy. ICSI Technical Report TR-07-002, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[183] M. Liwicki, A. Graves, H. Bunke, and J. Schmidhuber. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 367–371, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[184] F. Orabona, C. Castellini, B. Caputo, J. Luo, and G. Sandini. On-line independent support vector machines for cognitive systems. Idiap-RR Idiap-RR-63-2007, IDIAP, 2007. [ bib ]
Learning from experience and adapting to changing stimuli are fundamental capabilities for artificial cognitive systems. This calls for on-line learning methods able to achieve high accuracy while at the same time using limited computer power. Research on autonomous agents has been actively investigating these issues, mostly using probabilistic frameworks and within the context of navigation and learning by imitation. Still, recent results on robot localization have clearly pointed out the potential of discriminative classifiers for cognitive systems. In this paper we follow this approach and propose an on-line version of the Support Vector Machine (SVM) algorithm. Our method, which we call On-line Independent SVM, builds a solution on-line, achieving an excellent accuracy vs. compactness trade-off. In particular, the size of the obtained solution is always bounded, implying a bounded testing time. At the same time, the algorithm converges to the optimal solution at each incremental step, as opposed to similar approaches where optimality is achieved only in the limit of an infinite number of training data. These statements are supported by experiments on standard benchmark databases as well as on two real-world applications, namely (a) place recognition by a mobile robot in an indoor environment, and (b) human grasping posture classification.

Keywords: IM2.MPR, Report_VII
[185] P. W. Ferrez. Error-related eeg potentials in brain-computer interfaces. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2007. PhD Thesis #3928 at the École Polytechnique Fédérale de Lausanne. [ bib ]
People with severe motor disabilities (spinal cord injury (SCI), amyotrophic lateral sclerosis (ALS), etc.) but with intact brain functions are somehow prisoners of their own body. They need alternative ways of communication and control to interact with their environment in their everyday life. These new tools are supposed to increase their quality of life by giving these people the opportunity to recover part of their independence. Therefore, these alternative ways have to be reliable and ergonomic to be successfully used by disabled people. Over the past two decades, numerous studies proposed electroencephalogram (EEG) activity for direct brain-computer interaction. EEG-based brain-computer interfaces (BCIs) provide disabled people with new tools for control and communication and are promising alternatives to invasive methods. However, as any other interaction modality based on physiological signals and body channels (muscular activity, speech and gestures, etc.), BCIs are prone to errors in the recognition of the subject's intent, and those errors can be frequent. Indeed, even well-trained subjects rarely reach 100 percent success. In contrast to other interaction modalities, a unique feature of the brain channel is that it conveys both information from which we can derive mental control commands to operate a brain-actuated device and information about cognitive states that are crucial for a purposeful interaction, all this in the millisecond range. One of these states is the awareness of erroneous responses, which a number of groups have recently proposed as a way to improve the performance of BCIs. However, most of these studies propose the use of error-related potentials (ErrP) following an error made by the subject himself. This thesis first describes a new kind of ErrP, the so-called interaction ErrP, which are present in the ongoing EEG following an error of the interface rather than an error of the subject himself. 
More importantly, these ErrP are satisfactorily detected not in grand averages but at the level of single trials. Indeed, the classification rates of both error and correct single trials based on error-potential detection are on average 80 percent. At this level it becomes possible to introduce a kind of automatic verification procedure in the BCI: after translating the subject's intention into a control command, the BCI provides feedback of that command, but will not transfer it to the device if ErrP follow the feedback. Experimental results presented in this thesis confirm that this new protocol greatly increases the reliability of the BCI. Furthermore, this tool turns out to be of great benefit especially for beginners, who normally reach moderate performances. Indeed, filtering out wrong responses increases the user's confidence in the interface and thus accelerates mastering the control of the brain-actuated device. The second issue explored in this thesis is the practical integration of ErrP detection in a BCI. Indeed, providing a first feedback of the subject's intent, as recognized by the BCI, before eventually sending the command to the controlled device, gives the subject additional information to process and may considerably slow down the interaction, since the introduction of an automatic response rejection strongly interferes with the BCI. However, this study shows the feasibility of simultaneously and satisfactorily detecting erroneous responses of the interface and classifying motor imagination for device control at the level of single trials. The integration of an automatic error detection procedure leads to great improvements in BCI performance. Another aspect of this thesis is to investigate the potential benefit of using neurocognitive knowledge to increase the classification rate of ErrP, and more generally the performance of the BCI. 
Recent findings have uncovered that ErrP are most probably generated in a deep fronto-central brain area called anterior cingulate cortex (ACC). This hypothesis is verified using a well-known inverse model called sLORETA. Indeed, the localization provided for ErrP shows clear foci of activity both in the ACC and the pre-supplementary motor area (pre-SMA). The localization results using the cortical current density (CCD) model are very similar and more importantly, this model outperforms EEG for ErrP classification. Thanks to its stability, this model is likely to be successfully used in a BCI framework. The ELECTRA model for estimating local field potentials is also tested, but classification and localization results using this method are not so encouraging. More generally, the work described here suggests that it could be possible to recognize in real time high-level cognitive and emotional states from EEG (as opposed, and in addition, to motor commands) such as alarm, fatigue, frustration, confusion, or attention that are crucial for an effective and purposeful interaction. Indeed, the rapid recognition of these states will lead to truly adaptive interfaces that customize dynamically in response to changes of the cognitive and emotional/affective states of the user.

Keywords: IM2.BCI, Report_VII
[186] G. Aradilla and H. Bourlard. Posterior-based features and distances in template matching for speech recognition. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI), volume 4892, pages 204–214, 2007. IDIAP-RR 07-41. [ bib | DOI ]
The use of large speech corpora in example-based approaches for speech recognition has mainly focused on increasing the number of examples. This strategy presents some difficulties because databases may not provide enough examples for some rare words. In this paper we present a different method to incorporate the information contained in such corpora into these example-based systems. A multilayer perceptron is trained on these databases to estimate speaker- and task-independent phoneme posterior probabilities, which are used as speech features. By reducing the variability of the features, fewer examples are needed to properly characterize a word. In this way, performance can be greatly improved when a limited number of examples is available. Moreover, we also study posterior-based local distances, which prove more effective than the traditional Euclidean distance. Experiments on the Phonebook database support the idea that posterior features with a proper local distance can yield competitive results.

Keywords: Report_VII, IM2.AP
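Template matching over posterior features of this kind is typically done with dynamic time warping (DTW) and a divergence-based local distance. The sketch below assumes a symmetric KL divergence as the local distance, which is one plausible posterior-based choice rather than the paper's exact measure:

```python
import math

def sym_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two posterior vectors (eps avoids log 0)."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dtw(template, test, dist=sym_kl):
    """DTW alignment cost between two sequences of per-frame posterior vectors."""
    n, m = len(template), len(test)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(template[i - 1], test[j - 1])
            # standard step pattern: vertical, horizontal, diagonal
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Identical sequences align with zero cost, and the cost grows as the posterior distributions of the two utterances diverge.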
[187] B. Mesot and D. Barber. A bayesian switching linear dynamical system for scale-invariant robust speech extraction. Technical report, Idiap Research Institute, 2007. [ bib ]
Most state-of-the-art automatic speech recognition (ASR) systems deal with noise in the environment by extracting noise robust features which are subsequently modelled by a Hidden Markov Model (HMM). A limitation of this feature-based approach is that the influence of noise on the features is difficult to model explicitly and the HMM is typically over sensitive, dealing poorly with unexpected and severe noise environments. An alternative is to model the raw signal directly which has the potential advantage of allowing noise to be explicitly modelled. A popular way to model raw speech signals is to use an Autoregressive (AR) process. AR models are however very sensitive to variations in the amplitude of the signal. Our proposed Bayesian Autoregressive Switching Linear Dynamical System (BAR-SLDS) treats the observed noisy signal as a scaled, clean hidden signal plus noise. The variance of the noise and signal scaling factor are automatically adapted, enabling the robust identification of scale-invariant clean signals in the presence of noise.

Keywords: Report_VII, IM2.AP
[188] R. Villán, S. Voloshynovskiy, O. Koval, F. Deguillaume, and T. Pun. Tamper-proofing of electronic and printed text documents via robust hashing and data-hiding. In Proceedings of SPIE-IS&T Electronic Imaging 2007, Security, Steganography, and Watermarking of Multimedia Contents IX, San Jose, USA, 2007. [ bib | .pdf ]
Keywords: Report_VI, IM2.MPR
[189] K. Riesen, M. Neuhaus, and H. Bunke. Graph embedding in vector spaces by means of prototype selection. In F. Escolano and M. Vento, editors, Graph-Based Representations in Pattern Recognition, volume 4538 of Lecture Notes in Computer Science, pages 383–393. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP
[190] A. Stolcke, X. Anguera, K. Boakye, O. Cetin, A. Janin, M. Magimai-Doss, C. Wooters, and J. Zheng. The sri-icsi spring 2007 meeting and lecture recognition system. Lecture Notes in Computer Science, 2007. [ bib ]
Keywords: Report_VII, IM2.AP, joint publication
[191] A. Stolcke, S. Kajarekar, L. Ferrer, and E. Shriberg. Speaker recognition with session variability normalization based on mllr adaptation transforms. IEEE Transactions on Audio, Speech, and Language Processing, special issue on speaker and language recognition, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[192] H. Hung, D. Jayagopi, C. Yeo, G. Friedland, S. Ba, J. M. Odobez, K. Ramchandran, N. Mirghafori, and D. Gatica-Perez. Using audio and video features to classify the most dominant person in a group meeting. In Proc. ACM Multimedia, Augsburg, Germany, 2007. [ bib ]
Keywords: Report_VII, IM2.AP.VP, joint publication
[193] K. Kryszczuk, J. Richiardi, P. Prodanov, and A. Drygajlo. Reliability-based decision fusion in multimodal biometric verification systems. EURASIP Journal on Advances in Signal Processing, 2007. (in press). [ bib ]
Keywords: Report_VI, IM2.MPR
[194] M. Rigamonti, D. Lalanne, and R. Ingold. Faericworld: browsing multimedia events through static documents and links. In Proc. of INTERACT 2007, LNCS, Rio de Janeiro, Brazil, 2007. Springer-Verlag. To appear. [ bib ]
Keywords: Report_VI, IM2.HMI
[195] J. Hennebert, R. Loeffel, A. Humm, and R. Ingold. A new forgery scenario based on regaining dynamics of signature. In International Conference on Biometrics (ICB 2007), Seoul, Korea, 2007. Accepted for publication. [ bib ]
Keywords: Report_VI, IM2.MPR
[196] J. M. Pardo, X. Anguera, and C. Wooters. Speaker diarization for multiple-distant-microphone meetings using several sources of information. to appear in IEEE Transactions on Computers, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[197] A. Jaimes, D. Gatica-Perez, N. Sebe, and T. S. Huang. Human-centered computing: toward a human revolution. IEEE Computer, 40(5), 2007. IDIAP-RR 07-57. [ bib | DOI ]
Human-centered computing studies the design, development, and deployment of mixed-initiative human-computer systems. HCC is emerging from the convergence of multiple disciplines that are concerned both with understanding human beings and with the design of computational artifacts.

Keywords: IM2.HCI, Report_VI
[198] F. Galán, J. Palix, R. Chavarriaga, P. W. Ferrez, E. Lew, C. A. Hauert, and J. del R. Millán. Visuo-spatial attention frame recognition for brain-computer interfaces. In Proceedings of the 1st International Conference on Cognitive Neurodynamics, Shanghai, China, 2007. [ bib ]
Objective: to assess the feasibility of recognizing visual spatial attention frames for brain-computer interface (BCI) applications. Methods: EEG data were recorded with 64 electrodes from 2 subjects executing a visual spatial attention task indicating 2 target locations. Continuous Morlet wavelet coefficients were estimated on 18 frequency components and 16 preselected electrodes in trials of 600 ms. The spatial patterns of the 16 frequency component frames were simultaneously detected and classified (between the two targets). The classification accuracy was assessed using 20-fold cross-validation. Results: the maximum frame-averaged classification accuracies are 80.64% and 87.31% for subjects 1 and 2 respectively, both obtained with coefficients estimated at frequencies located in the gamma band.

Keywords: IM2.VP, Report_VII
[199] A. Schlapbach and H. Bunke. Fusing asynchronous feature streams for on-line writer identification. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 103–107, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[200] S. R. Mahadeva Prasanna, B. Yegnanarayana, J. P. Pinto, and H. Hermansky. Analysis of confusion matrix to combine evidence for phoneme recognition. Idiap-RR 27-2007, IDIAP, 2007. Submitted for publication. [ bib ]
In this work we analyze and combine evidence from different classifiers for phoneme recognition using information from the confusion matrices. Speech signals are processed to extract Perceptual Linear Prediction (PLP) and Multi-RASTA (MRASTA) features. Neural network classifiers with different architectures are built using these features. The classifiers are analyzed using their confusion matrices. The motivation behind this analysis is to come up with objective measures that indicate the complementary nature of the information in each of the classifiers. These measures are useful for combining a subset of classifiers. The classifiers can be combined using different combination schemes, such as the product, sum, minimum and maximum rules. The significance of the objective measures is demonstrated in terms of the results of the combination. Classifiers selected through the proposed objective measures appear to provide the best performance.

Keywords: IM2.AP, Report_VII
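The product, sum, minimum and maximum rules mentioned in the abstract are standard classifier combination schemes; a minimal sketch of how per-classifier phoneme posteriors could be combined (an illustration, not the authors' code, and the example posteriors are made up):

```python
import numpy as np

def combine(posteriors, rule="sum"):
    """Combine posteriors from several classifiers.

    posteriors: array-like of shape (n_classifiers, n_classes).
    Returns a renormalised class distribution.
    """
    p = np.asarray(posteriors, dtype=float)
    if rule == "sum":
        c = p.sum(axis=0)
    elif rule == "product":
        c = p.prod(axis=0)
    elif rule == "min":
        c = p.min(axis=0)
    elif rule == "max":
        c = p.max(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return c / c.sum()

# two classifiers over three phoneme classes (toy values)
p = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
print(combine(p, "product"))
```

The product rule sharpens agreement between classifiers, while the sum rule is more tolerant of a single classifier's error; which subset of classifiers to combine is what the paper's confusion-matrix measures are designed to select.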
[201] T. Quack, V. Ferrari, B. Leibe, and L. van Gool. Efficient mining of frequent and distinctive feature configurations. In ICCV'07, 2007. Accepted for publication. [ bib ]
Keywords: Report_VI, IM2.ISD, IM2.MCA, joint publication
[202] A. Humm, J. Hennebert, and R. Ingold. Modelling combined handwriting and speech modalities. In International Conference on Biometrics (ICB 2007), Seoul, Korea, 2007. Accepted for publication. [ bib ]
Keywords: Report_VI, IM2.MPR
[203] J. P. Pinto, A. Lovitt, and H. Hermansky. Exploiting phoneme similarities in hybrid hmm-ann keyword spotting. In Proceedings of Interspeech, 2007. IDIAP-RR 07-11. [ bib | .ps.gz | .pdf ]
We propose a technique for generating alternative models for keywords in a hybrid hidden Markov model - artificial neural network (HMM-ANN) keyword spotting paradigm. Given a base pronunciation for a keyword from the lookup dictionary, our algorithm generates a new model for the keyword which takes into account the systematic errors made by the neural network and avoids models that can be confused with other words in the language. The new keyword model improves the keyword detection rate while only minimally increasing the number of false alarms.

Keywords: Report_VI, IM2.AP
[204] E. Kokiopoulou and P. Frossard. Accelerating distributed consensus using extrapolation. IEEE Signal Processing Letters, 14(10):665–668, 2007. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint publication
[205] A. Schlapbach and H. Bunke. A writer identification and verification system using hmm based recognizers. Pattern Analysis and Applications, 10(1):33–43, 2007. [ bib ]
Keywords: Report_VI, IM2.VP
[206] F. Valente and H. Hermansky. Combination of acoustic classifiers based on dempster-shafer theory of evidence. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2007. IDIAP-RR 06-61. [ bib | .ps.gz | .pdf ]
In this paper we investigate the combination of neural-network-based classifiers using the Dempster-Shafer theory of evidence. Under some assumptions, the combination rule resembles the product-of-errors rule observed in human speech perception. Different combination rules are tested in ASR experiments, in both matched and mismatched conditions, and compared with more conventional probability combination rules. The proposed techniques are particularly effective in mismatched conditions.

Keywords: Report_VI, IM2.AP
[207] A. Vinciarelli. Mapping nonverbal communication into social status: automatic recognition of journalists and non-journalists in radio news. IDIAP-RR 33, IDIAP, 2007. Submitted for publication. [ bib | .ps.gz | .pdf ]
This work shows how features accounting for nonverbal speaking characteristics can be used to map people into predefined categories. In particular, the results of this paper show that the speakers participating in radio broadcast news can be classified into journalists and non-journalists with an accuracy higher than 80 percent. The results of the proposed approach are compared with the effectiveness of 16 human assessors performing the same task. The assessors do not understand the language of the data and are thus forced to rely mostly on nonverbal features. The results of the comparison suggest that the assessors and the automatic system have similar performance.

Keywords: Report_VI, IM2.MCA.MPR, joint publication
[208] M. Plauché, O. Cetin, and N. Uhdaykumar. How to build a spoken dialog system with limited (or no) resources. AI in ICT for Development Workshop of the Twentieth Intl. Joint Conf. on AI, Hyderabad, India, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[209] P. Quelhas, J. M. Odobez, D. Gatica-Perez, and T. Tuytelaars. A thousand words in a scene. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1575–1589, 2007. IDIAP-RR 05-40. [ bib | DOI ]
This paper presents a novel approach for visual scene modeling and classification, investigating the combined use of text modeling methods and local invariant features. Our work attempts to elucidate (1) whether a text-like bag-of-visterms representation (histogram of quantized local visual features) is suitable for scene (rather than object) classification, (2) whether some analogies between discrete scene representations and text documents exist, and (3) whether unsupervised, latent space models can be used both as feature extractors for the classification task and to discover patterns of visual co-occurrence. Using several data sets, we validate our approach, presenting and discussing experiments on each of these issues. We first show, with extensive experiments on binary and multi-class scene classification tasks using a 9500-image data set, that the bag-of-visterms representation consistently outperforms classical scene classification approaches. On other data sets we show that our approach competes with or outperforms other recent, more complex, methods. We also show that Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene representation, discriminative for accurate classification, and more robust than the bag-of-visterms representation when less labeled training data is available. Finally, through aspect-based image ranking experiments, we show the ability of PLSA to automatically extract visually meaningful scene patterns, making such representation useful for browsing image collections.

Keywords: IM2.VP, Report_VII
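The bag-of-visterms representation described in the abstract quantizes each local descriptor to its nearest visual word and histograms the assignments; a minimal sketch (the toy vocabulary and descriptors below are illustrative, not from the paper's data):

```python
import numpy as np

def bag_of_visterms(descriptors, vocabulary):
    """Histogram of nearest-visterm assignments for one image."""
    D = np.asarray(descriptors, dtype=float)   # n_patches x dim
    V = np.asarray(vocabulary, dtype=float)    # n_visterms x dim
    # squared Euclidean distance from every descriptor to every visterm
    d2 = ((D[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    return np.bincount(assignments, minlength=len(V))

# two visterms in a 2-D descriptor space, three patch descriptors
vocab = [[0.0, 0.0], [1.0, 1.0]]
descs = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]
print(bag_of_visterms(descs, vocab))   # → [1 2]
```

The resulting histogram plays the role of the word-count vector in text modeling, which is what makes methods such as PLSA directly applicable to images.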
[210] F. Valente, J. Vepa, C. Plahl, C. Gollan, H. Hermansky, and R. Schlüter. Hierarchical neural networks feature extraction for lvcsr system. In Interspeech 2007, 2007. IDIAP-RR 07-08. [ bib | .ps.gz | .pdf ]
This paper investigates the use of a hierarchy of neural networks for performing data-driven feature extraction. Two different hierarchical structures, based on long and short temporal context, are considered. Features are tested on two different LVCSR systems, for meeting data (RT05 evaluation data) and for Arabic Broadcast News (BNAT05 evaluation data). The hierarchical NN features consistently outperform single-NN features on different types of data and tasks and provide significant improvements over the respective baseline systems. The best result is obtained when different time resolutions are used at different levels of the hierarchy.

Keywords: Report_VI, IM2.AP
[211] T. Jaeggli, E. Koller-Meier, and L. van Gool. Learning generative models for monocular body pose estimation. In ACCV, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[212] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. International Journal of Computer Vision, 79(2):159–177, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[213] T. Jaeggli, E. Koller-Meier, and L. van Gool. Multi-activity tracking in lle body pose space. In 2nd Workshop on HUMAN MOTION Understanding, Modeling, Capture and Animation, ICCV, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[214] X. Bresson, S. Esedoglu, P. Vandergheynst, J. Ph. Thiran, and S. Osher. Fast global minimization of the active contour/snake model. Journal of Mathematical Imaging and Vision, 28(2):151–167, 2007. [ bib | DOI | http ]
Keywords: Report_VII, IM2.VP, LTS2; LTS5
[215] J. Kludas, E. Bruno, and S. Marchand-Maillet. Information fusion in multimedia information retrieval. In Workshop on Adaptive Multimedia Retrieval (AMR 2007), Paris, FR, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[216] A. Lovitt. Correcting confusion matrices for phone recognizers. IDIAP-COM 03, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
Modern speech recognition has many ways of quantifying the misrecognitions a speech recognizer makes. Error analysis in modern speech recognition makes extensive use of the Levenshtein algorithm to find the distance between the labeled target and the recognized hypothesis. This algorithm has problems properly aligning substitution confusions due to its lack of knowledge about the system. This work addresses a shortcoming of the alignment provided by speech recognition analysis systems (HTK specifically) and provides a more applicable algorithm for aligning the hypothesis with the target. The new procedure takes into account the systematic errors the recognizer will make and uses that knowledge to produce correct alignments.

Keywords: Report_VI, IM2.AP
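The Levenshtein alignment that the report builds on can be sketched as a standard edit-distance dynamic program with backtrace. This is the textbook algorithm with unit costs, not HTK's implementation or the corrected procedure the report proposes:

```python
def levenshtein_align(target, hyp):
    """Align two label sequences; return (edit distance, aligned pairs).

    Pairs are (target_label, hyp_label); None marks an insertion/deletion.
    """
    n, m = len(target), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if target[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    # backtrace to recover the aligned label pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if target[i - 1] == hyp[j - 1] else 1)):
            pairs.append((target[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((target[i - 1], None)); i -= 1    # deletion
        else:
            pairs.append((None, hyp[j - 1])); j -= 1       # insertion
    return d[n][m], list(reversed(pairs))
```

With unit costs the backtrace can prefer one of several equally cheap alignments, which is exactly the ambiguity the report's recognizer-aware costs are meant to resolve.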
[217] M. Levit, D. Hakkani-Tur, G. Tur, and D. Gillick. Integrating several annotation layers for statistical information distillation. In Workshop on Automatic Speech Recognition and Understanding, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[218] L. Piccardi, B. Noris, O. Barbey, G. Schiavone, F. Keller, C. Von Hofsten, and A. Billard. Wearcam: a head mounted wireless camera for monitoring gaze attention and for the diagnosis of developmental disorders in young children. In 16th IEEE International Symposium on Robot & Human Interactive Communication, RO-MAN, Special Session: Applications of Robotics and Intelligent System, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[219] A. Popescu-Belis and P. Estrella. Generating usable formats for metadata and annotations in a large meeting corpus. In ACL 2007, 45th Annual Meeting of the Association for Computational Linguistics, pages 93–96, 2007. [ bib ]
Keywords: Report_VI, IM2.DMA, major, Interactive Poster and Demonstration Sessions
[220] O. Koval, S. Voloshynovskiy, and T. Pun. Error exponent analysis of person identification based on fusion of dependent/independent modalities. In Proceedings of SPIE-IS&T Electronic Imaging 2007, Security, Steganography, and Watermarking of Multimedia Contents IX, San Jose, USA, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[221] R. Hérault and Y. Grandvalet. Sparse probabilistic classifiers. In International Conference on Machine Learning (ICML), 2007. IDIAP-RR 07-19. [ bib | .ps.gz | .pdf ]
The scores returned by support vector machines are often used as confidence measures in the classification of new examples. However, there is no theoretical argument sustaining this practice. Thus, when classification uncertainty has to be assessed, it is safer to resort to classifiers estimating conditional probabilities of class labels. Here, we focus on the ambiguity in the vicinity of the decision boundary. We propose an adaptation of maximum likelihood estimation, instantiated on logistic regression. The model outputs proper conditional probabilities within a user-defined interval and is less precise elsewhere. The model is also sparse, in the sense that few examples contribute to the solution. The computational efficiency is thus improved compared to logistic regression. Furthermore, preliminary experiments show improvements over standard logistic regression and performance similar to support vector machines.

Keywords: Report_VI, IM2.MPR
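The idea of reporting calibrated probabilities only within a user-defined interval can be illustrated with a plain sigmoid output clamped outside that interval. This is a conceptual illustration only; the interval bounds are arbitrary and this is not the authors' estimator:

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def interval_probability(z, lo=0.2, hi=0.8):
    """Return a calibrated probability inside [lo, hi].

    Outside the interval, only the bound is reported: the model commits
    to "p <= lo" or "p >= hi" without claiming a precise value.
    """
    p = sigmoid(z)
    if p < lo:
        return lo
    if p > hi:
        return hi
    return p
```

Near the decision boundary (scores close to 0) the output is a proper conditional probability; far from it, precision is deliberately given up, which is where the sparsity of the paper's model comes from.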
[222] D. Lalanne, F. Evéquoz, H. Chiquet, M. Müller, M. Radgohar, and R. Ingold. Going through digital versus physical augmented gaming. In Tangible Play: Research and Design for Tangible and Tabletop Games. Workshop at the 2007 Intelligent User Interfaces Conference (IUI'07), pages 41–44, Hawaii (USA), 2007. [ bib ]
Keywords: Report_VI, IM2.HMI
[223] E. Kokiopoulou and P. Frossard. Accelerating distributed consensus using extrapolation. IEEE Signal Processing Letters, 14(10), 2007. [ bib | DOI | http ]
Keywords: Report_VI, IM2.VP
[224] A. Humm, J. Hennebert, and R. Ingold. Spoken handwriting verification using statistical models. In Accepted for publication, International Conference on Document Analysis and Recognition (ICDAR 07), Curitiba, Brazil, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[225] M. Georgescul, A. Clark, and S. Armstrong. Exploiting structural meeting-specific features for topic segmentation. In Actes de la 14ème Conférence sur le Traitement Automatique des Langues Naturelles, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA, major
[226] F. Monay and D. Gatica-Perez. Modeling semantic aspects for cross-media image indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1802–1817, 2007. IDIAP-RR 05-56. [ bib | DOI ]
To go beyond the query-by-example paradigm in image retrieval, there is a need for semantic indexing of large image collections for intuitive text-based image search. Different models have been proposed to learn the dependencies between the visual content of an image set and the associated text captions, thus allowing for the automatic creation of semantic indices for unannotated images. The task, however, remains unsolved. In this paper, we present three alternatives to learn a Probabilistic Latent Semantic Analysis (PLSA) model for annotated images, and evaluate their respective performance for automatic image indexing. Under the PLSA assumptions, an image is modeled as a mixture of latent aspects that generates both image features and text captions, and we investigate three ways to learn the mixture of aspects. We also propose a more discriminative image representation than the traditional Blob histogram, concatenating quantized local color information and quantized local texture descriptors. The first learning procedure of a PLSA model for annotated images is a standard EM algorithm, which implicitly assumes that the visual and the textual modalities can be treated equivalently. The other two models are based on an asymmetric PLSA learning, allowing the definition of the latent space to be constrained to either the visual or the textual modality. We demonstrate that the textual modality is more appropriate for learning a semantically meaningful latent space, which translates into improved annotation performance. A comparison of our learning algorithms with recent methods on a standard dataset is presented, and a detailed evaluation of the performance shows the validity of our framework.

Keywords: IM2.MCA, Report_VII
[227] M. Broschart, C. de Negueruela, J. del R. Millán, and C. Menon. Augmenting astronaut's capabilities through brain-machine interfaces. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Workshop on Artificial Intelligence for Space Applications, Hyderabad, India, 2007. [ bib ]
Brain-Machine Interfaces (BMIs) transform the brain activity of a human operator into executable commands that can be sent to a machine, usually a computer or robot, to perform intended tasks. In addition to current biomedical applications, available technology could also make feasible augmenting devices for space applications that could be promising means to improve astronauts' efficiency and capabilities. The implementation of artificial intelligence algorithms into the software architecture of present BMIs will be of crucial importance to guarantee a proper functionality of the device in the highly dynamic and unpredictable space environment.

Keywords: IM2.BCI, Report_VI
[228] B. Fasel and L. van Gool. Interactive museum guide: accurate retrieval of object descriptions. In S. Marchand-Maillet, E. Bruno, A. Nürnberger, and M. Detyniecki, editors, Adaptive Multimedia Retrieval: User, Context, and Feedback, pages 179–191. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.VP
[229] G. Bologna, B. Deville, T. Pun, and M. Vinckenbosch. Identifying major components of pictures by audio encoding of colors. In IWINAC2007, 2nd. Int. Work-conf. on the Interplay between Natural and Artificial Computation, Murcia, Spain, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[230] S. Chiappa and D. Barber. Bayesian factorial linear gaussian state-space models for biosignal decomposition. IEEE Signal Processing Letters, 2007. IDIAP-RR 05-84. [ bib | .pdf ]
We discuss a method to extract independent dynamical systems underlying a single or multiple channels of observation. In particular, we search for one dimensional subsignals to aid the interpretability of the decomposition. The method uses an approximate Bayesian analysis to determine automatically the number and appropriate complexity of the underlying dynamics, with a preference for the simplest solution. We apply this method to unfiltered EEG signals to discover low complexity sources with preferential spectral properties, demonstrating improved interpretability of the extracted sources over related methods.

Keywords: Report_VI, IM2.BMI
[231] T. Kaufmann and B. Pfister. Applying licenser rules to a grammar with continuous constituents. In The Proceedings of the 14th International Conference on Head-Driven Phrase Structure Grammar, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[232] S. Bengio and J. Mariéthoz. Biometric person authentication is a multiple classifier problem. In 7th International Workshop on Multiple Classifier Systems, MCS, 2007. IDIAP-RR 07-03. [ bib | .ps.gz | .pdf ]
Several papers have already shown the interest of using multiple classifiers in order to enhance the performance of biometric person authentication systems. In this paper, we argue that the core task of Biometric Person Authentication is actually a multiple classifier problem as such: indeed, in order to reach state-of-the-art performance, all current systems, in one way or another, try to solve several tasks simultaneously, and without such joint training (or sharing), they would not succeed as well. We explain this perspective hereafter and, according to it, propose some ways to take advantage of it, ranging from more parameter sharing to similarity learning.

Keywords: Report_VI, IM2.MPR
[233] R. Grave de Peralta Menendez, S. L. González Andino, P. W. Ferrez, and J. del R. Millán. Non-invasive estimates of local field potentials for brain-computer interfaces. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K. R. Müller, editors, Towards Brain-Computer Interfacing. The MIT Press, 2007. [ bib ]
Recent experiments have shown the possibility to use the brain electrical activity to directly control the movement of robots or prosthetic devices in real time. Such neuroprostheses can be invasive or non-invasive, depending on how the brain signals are recorded. In principle, invasive approaches will provide a more natural and flexible control of neuroprostheses, but their use in humans is debatable given the inherent medical risks. Non-invasive approaches mainly use scalp electroencephalogram (EEG) signals and their main disadvantage is that these signals represent the noisy spatiotemporal overlapping of activity arising from very diverse brain regions; i.e., a single scalp electrode picks up and mixes the temporal activity of myriads of neurons at very different brain areas. In order to combine the benefits of both approaches, we propose to rely on the non-invasive estimation of local field potentials (eLFP) in the whole human brain from the scalp measured EEG data using a recently developed solution (ELECTRA) to the EEG inverse problem. The goal of a linear inverse procedure is to deconvolve or unmix the scalp signals, attributing to each brain area its own temporal activity. To illustrate the advantage of this approach we compare, using an identical set of spectral features, classification of rapid voluntary finger self-tapping with the left and right hands based on scalp EEG and eLFP on three subjects using different numbers of electrodes. It is shown that the eLFP-based Gaussian classifier outperforms the EEG-based Gaussian classifier for the three subjects.

Keywords: IM2.BCI, Report_VII
[234] L. van Gool, G. Zeng, F. van den Borre, and P. Müller. Towards mass-produced building models. In U. Stilla, H. Mayer, F. Rottensteiner, C. Heipke, and S. Hinz, editors, Photogrammetric Image Analysis, pages 209–220. Institute of Photogrammetry and Cartography, Technische Universitaet Muenchen, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[235] J. Meynet, V. Popovici, and J. Ph. Thiran. Mixtures of boosted classifiers for frontal face detection. Signal, Image and Video Processing, 1(1):29–38, 2007. [ bib | DOI | http ]
Keywords: Report_VII, IM2.VP, combination of classifiers; face detection; gaussian features; lts5
[236] D. G. Zacharie and J. P. Pinto. Keyword spotting on word lattices. IDIAP-RR 22, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.AP
[237] L. Uldry, P. W. Ferrez, and J. del R. Millán. Feature selection methods on distributed linear inverse solutions for a non-invasive brain-machine interface. IDIAP-COM 04, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
Keywords: Report_VI, IM2.BMI
[238] E. Bruno, J. Kludas, and S. Marchand-Maillet. Combining multimodal preferences for multimedia information retrieval. In Proc. of International Workshop on Multimedia Information Retrieval, Augsburg, Germany, 2007. [ bib ]
Keywords: Report_VII, IM2.MCA
[239] J. del R. Millán. Tapping the mind or resonating minds? In P. T. Kidd, editor, European Visions for the Knowledge Age. Cheshire Henbury, 2007. [ bib ]
Brains interfaced to machines, where thought is used to control and manipulate these machines: this is the vision examined in this chapter. First-generation brain-machine interfaces have already been developed, and technological developments must surely lead to increased capabilities in this field. The most obvious applications for these technologies are those that will assist disabled people. The technology can help restore mobility and communication capabilities, thus helping disabled people to increase their independence and facilitate their participation in society. But how should this technology be employed: just to manipulate the world or also to leverage self-knowledge? And what will the technology mean for the rest of the population? These are some of the questions that are addressed in this chapter.

Keywords: Report_VI, IM2.BMI
[240] E. Kron, M. Rayner, M. Santaholma, and P. Bouillon. A development environment for building grammar-based speech-enabled applications. In Proceedings of workshop on Grammar-based approaches to spoken language processing, pages 49–52. ACL 2007, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, ACL 2007, June 29
[241] A. Vinciarelli. Role recognition in broadcast news using social network analysis and duration distribution modeling. IEEE Transactions on Multimedia, 2007. IDIAP-RR 06-35. [ bib | .ps.gz | .pdf ]
This paper presents two approaches for speaker role recognition in multiparty audio recordings. The experiments are performed over a corpus of 96 radio bulletins corresponding to roughly 19 hours of material. Each recording involves, on average, eleven speakers playing one among six roles belonging to a predefined set. Both proposed approaches start by automatically segmenting the recordings into single-speaker segments, but perform role recognition using different techniques. The first approach is based on Social Network Analysis; the second relies on the intervention duration distribution across different speakers. The two approaches are used separately and combined, and the results show that around 85 percent of the recording time can be labeled correctly in terms of role.

Keywords: Report_VI, IM2.AP.MCA, joint publication
[242] L. Chen, D. Barber, and J. M. Odobez. Dynamical dirichlet mixture model. IDIAP-RR 02, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
In this report, we propose a statistical model to deal with discrete-distribution data varying over time. The proposed model – HMM DM – extends the Dirichlet mixture model to the dynamic case: a Hidden Markov Model with Dirichlet mixture output. Both the inference and parameter estimation procedures are proposed. Experiments on generated data verify the proposed algorithms. Finally, we discuss the potential applications of the current model.

Keywords: Report_VI, IM2.MPR
[243] J. Zheng, O. Cetin, M. Y. Hwang, X. Lei, A. Stolcke, and N. Morgan. Combining discriminative feature, transform, and model training for large vocabulary speech recognition. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[244] M. Gerber, T. Kaufmann, and B. Pfister. Perceptron-based class verification. In Proceedings of NOLISP (ISCA Workshop on non linear speech processing), Paris, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[245] P. Motlicek, H. Hermansky, S. Ganapathy, and H. Garudadri. Non-uniform speech/audio coding exploiting predictability of temporal evolution of spectral envelopes. In Tenth International Conference on TEXT, SPEECH and DIALOGUE (TSD) [13], pages 350–357. IDIAP-RR 06-30. [ bib ]
Unlike classical state-of-the-art coders that are based on short-term spectra, our approach uses relatively long temporal segments of the audio signal in critical-band-sized sub-bands. We apply an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands. Residual signals (Hilbert carriers) are demodulated and thresholding functions are applied in the spectral domain. The Hilbert envelopes and carriers are quantized and transmitted to the decoder. Our experiments focused on designing a speech/audio coder providing broadcast radio-like audio quality at around 15-25 kbps. The objective quality measures obtained on standard speech recordings were compared to the state-of-the-art 3GPP-AMR speech coding system.

Keywords: IM2.AP, Report_VII
[246] J. Philips, J. del R. Millán, G. Vanacker, E. Lew, F. Galán, P. W. Ferrez, H. van Brussel, and M. Nuttin. Adaptive shared control of a brain-actuated simulated wheelchair. In Proceedings of the 10th IEEE International Conference on Rehabilitation Robotics, pages 408–414, Noordwijk, The Netherlands, 2007. [ bib | DOI ]
The use of shared control techniques has a profound impact on the performance of a robotic assistant controlled by human brain signals. However, this shared control usually provides assistance to the user in a constant and identical manner each time. Creating an adaptive level of assistance, thereby complementing the user's capabilities at any moment, would be more appropriate. The better the user can do by himself, the less assistance he receives from the shared control system; and vice versa. In order to do this, we need to be able to detect when and in what way the user needs assistance. An appropriate assisting behaviour would then be activated for the time the user requires help, thereby adapting the level of assistance to the specific situation. This paper presents such a system, helping a brain-computer interface (BCI) subject perform goal-directed navigation of a simulated wheelchair in an adaptive manner. Whenever the subject has more difficulties in driving the wheelchair, more assistance will be given. Experimental results of two subjects show that this adaptive shared control increases the task performance. Also, it shows that a subject with a lower BCI performance has more need for extra assistance in difficult situations, such as manoeuvring in a narrow corridor.

Keywords: IM2.BCI, Report_VI
[247] G. Heusch and S. Marcel. Face authentication with salient local features and static bayesian network. In IEEE / IAPR Intl. Conf. On Biometrics (ICB), 2007. IDIAP-RR 07-04. [ bib | .ps.gz | .pdf ]
In this paper, the problem of face authentication using salient facial features together with statistical generative models is addressed. Classical generative models, and Gaussian Mixture Models in particular, make strong assumptions on the way observations derived from face images are generated. Indeed, systems proposed so far consider that local observations are independent, which is obviously not the case in a face. Hence, we propose a new generative model based on Bayesian Networks using only salient facial features. We compare it to Gaussian Mixture Models using the same set of observations. Experiments conducted on the BANCA database show that our model is suitable for the face authentication task, since it outperforms not only Gaussian Mixture Models, but also classical appearance-based methods, such as Eigenfaces and Fisherfaces.

Keywords: Report_VI, IM2.VP
[248] O. Cetin, A. Kantor, S. King, C. Bartels, M. Magimai-Doss, J. Frankel, and K. Livescu. An articulatory feature-based tandem approach and factored observation modeling. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[249] A. Behera, D. Lalanne, and R. Ingold. Docmir: an automatic document-based indexing system for meeting retrieval. Multimedia Tools and Applications, 37(2), 2007. [ bib ]
Keywords: Report_VI, IM2.DMA
[250] M. Gurban, A. Valles, and J. Ph. Thiran. Low-dimensional motion features for audio-visual speech recognition. In 15th European Signal Processing Conference (EUSIPCO), Poznan, Poland, 2007. [ bib | http ]
Keywords: Report_VI, LTS5; IM2.MPR
[251] T. Weise, B. Leibe, and L. van Gool. Fast 3d scanning with automatic motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007. [ bib ]
Keywords: Report_VI, IM2.VP
[252] B. Leibe, N. Cornelis, K. Cornelis, and L. van Gool. Dynamic 3d scene analysis from a moving vehicle. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007. [ bib ]
Keywords: Report_VI, IM2.VP, major publication, Best Paper Awards
[253] F. Evéquoz and D. Lalanne. Personal information management through interactive visualizations. pages 158–160, 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[254] T. Kaufmann and B. Pfister. An hpsg parser supporting discontinuous licenser rules. In International Conference on HPSG, 2007. (to appear). [ bib ]
Keywords: Report_VI, IM2.AP
[255] A. Jaimes, D. Gatica-Perez, N. Sebe, and T. S. Huang. Guest editors' introduction: Human-centered computing-toward a human revolution. Computer, 40(5):30–34, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, hci
[256] M. Magimai-Doss, D. Hakkani-Tur, O. Cetin, E. Shriberg, J. Fung, and N. Mirghafori. Entropy based classifier combination for sentence segmentation. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[257] J. Mariéthoz and S. Bengio. A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition, 40(8), 2007. IDIAP-RR 05-77. [ bib ]
This paper presents a principled SVM-based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM-based systems found in the literature on the Banca and Polyvar databases. The new system outperforms the other systems most of the time, with statistically significant differences. Finally, the new proposed framework clarifies previous SVM-based systems and suggests interesting future research directions.

Keywords: IM2.AP, Report_VI
[258] S. Ba. Joint head tracking and pose estimation for visual focus of attention recognition. PhD thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2007. Thèse sciences Ecole polytechnique fédérale de Lausanne EPFL, no 3764 (2007), Faculté des sciences et techniques de l'ingénieur STI, Section de génie électrique et électronique, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard, Jean-Marc Odobez. [ bib ]
During the last two decades, computer science has sought to provide machines with the ability to understand human behavior. An important key to understanding human behavior is the visual focus of attention (VFOA) of a person, which can be inferred from the gaze direction. The VFOA of a person gives insight into who or what the person is interested in, who is the target of the person's speech, and who the person is listening to. Our interest in this thesis is to study people's VFOA using computer vision techniques. Estimating the VFOA from a computer vision point of view requires tracking the person's gaze. Because tracking the eye gaze is impossible in low- or mid-resolution images, head orientation can be used as a surrogate for the gaze direction. Thus, in this thesis, we investigate in a first step the tracking of people's head orientation, and in a second step the recognition of their VFOA from their head orientation. For the head tracking, we consider probabilistic methods based on sequential Monte Carlo (SMC) techniques. The head pose space is discretized into a finite set of poses, and multi-dimensional Gaussian appearance models are learned for each discrete pose. The discrete head models are embedded into a mixed-state particle filter (MSPF) framework to jointly estimate the head location and pose. The evaluation shows that this approach works better than the traditional paradigm in which the head is first tracked and then the head pose is estimated. An important contribution of this thesis is the head pose tracking evaluation. As people usually evaluate their head pose tracking methods either qualitatively or with private data, we built a head pose video database using a magnetic field 3D location and orientation tracker. 
The database was used to evaluate our tracking methods, and was made publicly available to allow other researchers to evaluate and compare their algorithms. Once the head pose is available, the recognition of the VFOA can be done. Two environments are considered to study the VFOA: a meeting room environment and an outdoor environment. In the meeting room environment, people are static, and their VFOAs were studied depending on their locations in the meeting room. The set of VFOAs for a person is discretized into a finite set of targets: the other people attending the meeting, the table, the slide screen, and an additional target called un-focused, denoting that the person is focusing on none of the previously defined VFOA targets. The head poses are used as observations and the potential VFOA targets as hidden states in a Gaussian mixture model (GMM) or hidden Markov model (HMM) framework. The parameters of the emission probability distributions were learned in two ways: first using head pose training data, and second by exploiting the geometry of the room and the head and eye-in-head rotations. Maximum a posteriori (MAP) adaptation of the VFOA models was applied to the input test data to take into account people's personal ways of gazing at VFOA targets. In the outdoor environment, people are moving and there is a single VFOA target. The problem in this study is to track multiple passing people and estimate whether or not they are focusing on the advertisement. The VFOA is modeled as a GMM having as observations people's head location and pose.

Keywords: IM2.VP, Report_VI
[259] A. Graves, M. Liwicki, and H. Bunke. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing, volume 20 of NIPS, Vancouver, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[260] I. Laptev, B. Caputo, and T. Lindeberg. Local velocity-adapted motion events for spatio-temporal recognition. Computer Vision and Image Understanding, 108(3):207–229, 2007. [ bib ]
In this paper we address the problem of motion recognition using event-based local motion representations. We assume that similar patterns of motion contain similar events with consistent motion across image sequences. Using this assumption, we formulate the problem of motion recognition as a matching of corresponding events in image sequences. To enable the matching, we present and evaluate a set of motion descriptors exploiting the spatial and the temporal coherence of motion measurements between corresponding events in image sequences. As motion measurements may depend on the relative motion of the camera, we also present a mechanism for local velocity adaptation of events and evaluate its influence when recognizing image sequences subjected to different camera motions. When recognizing motion, we compare the performance of a nearest neighbor (NN) classifier with the performance of a support vector machine (SVM). We also compare event-based motion representations to motion representations by global histograms. An experimental evaluation on a large video database with human actions demonstrates the advantage of the proposed scheme for event-based motion representation in combination with SVM classification. The particular advantage of event-based representations and velocity adaptation is further emphasized when recognizing human actions in unconstrained scenes with complex and non-stationary backgrounds.

Keywords: IM2.VP, Report_VII
[261] G. Chanel, K. Ansari-Asl, and T. Pun. Valence-arousal evaluation using physiological signals in an emotion recall paradigm. In 2007 IEEE SMC, Int. Conf. on Systems, Man and Cybernetics, Smart cooperative systems and cybernetics: advancing knowledge and security for humanity, Montreal, Canada, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[262] C. Müller and F. Burkhardt. Combining short-term cepstral and long-term pitch features for automatic recognition of speaker age. to appear in Proceedings of Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[263] D. Lalanne, F. Evéquoz, M. Rigamonti, B. Dumas, and R. Ingold. An ego-centric and tangible approach to meeting indexing and browsing. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI'07), page to appear, Brno (Czech Republic), 2007. [ bib ]
Keywords: Report_VI, IM2.HMI
[264] S. Marcel, Y. Rodriguez, and G. Heusch. On the recent use of local binary patterns for face authentication. International Journal on Image and Video Processing Special Issue on Facial Image Processing, 2007. IDIAP-RR 06-34. [ bib ]
This paper presents a survey on the recent use of Local Binary Patterns (LBPs) for face recognition. LBP is becoming a popular technique for face representation. It is a non-parametric kernel which summarizes the local spatial structure of an image, and it is invariant to monotonic gray-scale transformations, a very interesting property in face recognition. This probably explains the recent success of Local Binary Patterns in face recognition. In this paper, we describe the LBP technique and the different approaches proposed in the literature to represent and to recognize faces. The most representative approaches are considered for experimental comparison on a common face authentication task. For that purpose, the XM2VTS and BANCA databases are used according to their respective experimental protocols.

Keywords: IM2.VP, Report_VII
[265] D. Dessimoz, J. Richiardi, C. Champod, and A. Drygajlo. Multimodal biometrics for identity documents (mbioid). Forensic Science International, 167:154–159, 2007. [ bib | DOI ]
Keywords: Report_VI, IM2.MPR
[266] M. Knox and N. Mirghafori. Automatic laughter detection using neural networks. to appear in Proceedings of Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[267] S. Cuendet, D. Hakkani-Tur, and E. Shriberg. Automatic labeling inconsistencies detection and correction for sentence unit segmentation in conversational speech. to appear in Proceedings of MLMI, Brno, Czech Republic, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[268] A. Popescu-Belis. Evaluation of nlg: some analogies and differences with mt and reference resolution. In MT Summit XI Workshop on Using Corpora for NLG and MT (UCNLG MT), pages 66–68, Copenhagen, Denmark, 2007. [ bib ]
Keywords: Report_VII, IM2.DMA
[269] M. Neuhaus and H. Bunke. Bridging the gap between graph edit distance and kernel machines, volume 68 of Machine Perception and Artificial Intelligence. World Scientific, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[270] J. M. Odobez and S. Ba. A cognitive and unsupervised map adaptation approach to the recognition of the focus of attention from head pose. In International Conference on Multi-Media & Expo (ICME07), 2007. IDIAP-RR 07-20. [ bib | .ps.gz | .pdf ]
In this paper, the recognition of the visual focus of attention (VFOA) of meeting participants (as defined by their eye gaze direction) from their head pose is addressed. To this end, the head pose observations are modeled using a Hidden Markov Model (HMM) whose hidden states correspond to the VFOA. The novelties are threefold. First, contrary to previous studies on the topic, in our set-up the potential VFOA of a person is not restricted to the other participants only, but includes environmental targets (a table and a projection screen), which increases the complexity of the task, with more VFOA targets spread across the pan and tilt gaze space. Second, the HMM parameters are set by exploiting results from cognitive science on saccadic eye motion, which allows predicting what the head pose should be given an actual gaze target. Third, an unsupervised parameter adaptation step is proposed which accounts for the specific gazing behaviour of each participant. Using a publicly available corpus of 8 meetings featuring 4 persons, we analyze the above methods by evaluating, through objective performance measures, the recognition of the VFOA from head pose information obtained either using a magnetic sensor device or a vision-based tracking system.

Keywords: Report_VI, IM2.VP
[271] M. Starlander. Using a wizard of oz as a baseline to determine which system architecture is the best for a spoken language translation system. In Proceedings of Nodalida 2007, 16th Nordic Conference of Computational Linguistics, pages 161–164, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, 24-26 may
[272] M. Y. Hwang, G. Peng, W. Wang, A. Faria, A. Heidel, and M. Ostendorf. Building a highly accurate mandarin speech recognizer. IEEE workshop on Automatic Speech Recognition and Understanding (ASRU 07), Kyoto, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[273] D. Grangier and S. Bengio. Learning the inter-frame distance for discriminative template-based keyword detection. In International Conference on Speech Communication and Technology (INTERSPEECH), 2007. [ bib | .ps.gz | .pdf ]
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective of producing a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system: the proposed approach reaches 93.8% averaged AUC, compared to 87.9% for the HMM.

Keywords: Report_VI, IM2.MPR
[274] D. Lalanne and E. van den Hoven. Supporting human memory with interactive systems. pages 215–216, 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[275] F. Cincotti, D. Mattia, F. Aloise, S. Bufalari, L. Astolfi, F. De Vico Fallani, A. Tocci, L. Bianchi, M. G. Marciani, S. Gao, J. del R. Millán, and F. Babiloni. High-resolution eeg techniques for brain-computer interface applications. Journal of Neuroscience Methods, 167:31–42, 2007. [ bib ]
High-resolution electroencephalographic (HREEG) techniques allow estimation of cortical activity based on non-invasive scalp potential measurements, using appropriate models of volume conduction and of neuroelectrical sources. In this study we propose an application of this body of technologies, originally developed to obtain functional images of the brain's electrical activity, in the context of brain-computer interfaces (BCI). Our working hypothesis predicted that, since HREEG pre-processing removes the spatial correlation introduced by current conduction in the head structures, by providing the BCI with waveforms that are mostly due to the unmixed activity of a small cortical region, a more reliable classification would be obtained, at least when the activity to detect has a limited generator, which is the case in motor related tasks. HREEG techniques employed in this study rely on (i) individual head models derived from anatomical magnetic resonance images, (ii) a distributed source model, composed of a layer of current dipoles geometrically constrained to the cortical mantle, (iii) a depth-weighted minimum L2-norm constraint and Tikhonov regularization for the linear inverse problem solution and (iv) estimation of electrical activity in cortical regions of interest corresponding to relevant Brodmann areas. Six subjects were trained to learn self-modulation of sensorimotor EEG rhythms, related to the imagination of limb movements. Off-line EEG data was used to estimate waveforms of cortical activity (cortical current density, CCD) on selected regions of interest. CCD waveforms were fed into the BCI computational pipeline as an alternative to raw EEG signals; spectral features were evaluated through statistical tests (r2 analysis) to quantify their reliability for BCI control. These results are compared, within subjects, to analogous results obtained without HREEG techniques.
The processing procedure was designed in such a way that computations could be split into a setup phase (which includes most of the computational burden) and the actual EEG processing phase, which was limited to a single matrix multiplication. This separation made the procedure suitable for on-line utilization, and a pilot experiment was performed. Results show that lateralization of electrical activity, which is expected to be contralateral to the imagined movement, is more evident in the estimated CCDs than in the scalp potentials. CCDs produce a pattern of relevant spectral features that is more spatially focused and has a higher statistical significance (EEG: 0.20±0.114 S.D.; CCD: 0.55±0.16 S.D.; p=10^-5). A pilot experiment showed that a trained subject could utilize voluntary modulation of estimated CCDs for accurate (eight targets) on-line control of a cursor. This study showed that it is practically feasible to utilize HREEG techniques for on-line operation of a BCI system; off-line analysis suggests that accuracy of BCI control is enhanced by the proposed method.

Keywords: IM2.BCI, Report_VII
[276] C. Wooters and M. Huijbregts. The icsi rt07s speaker diarization system. to appear in Lecture Notes in Computer Science, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[277] E. Bruno, J. Kludas, and S. Marchand-Maillet. Combining multimodal preferences for multimedia information retrieval. In ACM SIGMM - International Workshop on Multimedia Information Retrieval, Augsburg, DE, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[278] R. Chavarriaga, P. W. Ferrez, and J. del R. Millán. To err is human: Learning from error potentials in brain-computer interfaces. In 1st International Conference on Cognitive Neurodynamics (ICCN 2007), Shanghai, China, 2007. [ bib ]
Keywords: IM2.BMI, Report_VIII
[279] L. Stoll, J. Frankel, and N. Mirghafori. Speaker recognition via nonlinear discriminant features. Proceedings of NOLISP, Paris, France, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[280] J. del R. Millán, P. W. Ferrez, and A. Buttfield. The idiap brain-computer interface: an asynchronous multi-class approach. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K. R. Müller, editors, Towards Brain-Computer Interfacing. The MIT Press, 2007. [ bib ]
In this paper we give an overview of our work on a self-paced asynchronous BCI that responds every 0.5 seconds. A statistical Gaussian classifier tries to recognize three different mental tasks; it may also respond "unknown" for uncertain samples, as the classifier incorporates statistical rejection criteria. We report our experience with different subjects. We also describe three brain-actuated applications we have developed: a virtual keyboard, a brain game, and a mobile robot (emulating a motorized wheelchair). Finally, we discuss current research directions we are pursuing in order to improve the performance and robustness of our BCI system, especially for real-time control of brain-actuated robots.

Keywords: IM2.BCI, Report_VII
[281] X. Anguera, C. Wooters, and J. Hernando. Acoustic beamforming for speaker diarization of meetings. to appear in IEEE Transactions on Audio, Speech and Language Processing, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[282] F. Frapolli, B. Hirsbrunner, and D. Lalanne. Dynamic rules: towards interactive games intelligence. In Tangible Play: Research and Design for Tangible and Tabletop Games. Workshop at the 2007 Intelligent User Interfaces Conference (IUI'07), pages 29–32, Hawaii (USA), 2007. [ bib ]
Keywords: Report_VI, IM2.HMI
[283] G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K. R. Müller. Towards brain-computer interfacing. The MIT Press, 2007. [ bib ]
Keywords: Report_VI, IM2.BMI
[284] K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph matching for computing the edit distance of graphs. In F. Escolano and M. Vento, editors, Graph-Based Representations in Pattern Recognition, volume 4538 of Lecture Notes in Computer Science, pages 1–12. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.ACP
[285] T. Drugman, M. Gurban, and J. Ph. Thiran. Relevant feature selection for audio-visual speech recognition. In 9th International Workshop on Multimedia Signal Processing (MMSP), 2007. [ bib | http ]
We present a feature selection method based on information theoretic measures, targeted at multimodal signal processing, showing how we can quantitatively assess the relevance of features from different modalities. We are able to find the features with the highest amount of information relevant for the recognition task while at the same time having minimal redundancy. Our application is audio-visual speech recognition, and in particular selecting relevant visual features. Experimental results show that our method outperforms other feature selection algorithms from the literature by improving recognition accuracy even with a significantly reduced number of features.

Keywords: Report_VII, IM2.MPR, LTS5
[286] J. Keshet. Theoretical foundations for large-margin kernel-based continuous speech recognition. Idiap-RR 44, IDIAP, 2007. [ bib ]
Keywords: IM2.AP, Report_VII
[287] S. Cuendet, E. Shriberg, B. Favre, J. Fung, and D. Hakkani-Tur. An analysis of sentence segmentation features for broadcast news, broadcast conversations, and meetings. In SIGIR Workshop on Searching Conversational Spontaneous Speech, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[288] G. Vanacker, J. del R. Millán, E. Lew, P. W. Ferrez, F. Galán, J. Philips, H. van Brussel, and M. Nuttin. Context-based filtering for assisted brain-actuated wheelchair driving. Computational Intelligence and Neuroscience, 2007:3, 2007. [ bib ]
Controlling a robotic device by using human brain signals is an interesting and challenging task. The device may be complicated to control, and the non-stationary nature of the brain signals makes for a rather unstable input. With the use of intelligent processing algorithms adapted to the task at hand, however, the performance can be increased. This paper introduces a shared control system that helps the subject in driving an intelligent wheelchair with a non-invasive brain interface. The subject's steering intentions are estimated from electroencephalogram (EEG) signals and passed through to the shared control system before being sent to the wheelchair motors. Experimental results show a possibility for significant improvement in the overall driving performance when using the shared control system compared to driving without it. These results have been obtained with 2 healthy subjects during their first day of training with the brain-actuated wheelchair.

Keywords: IM2.BCI, Report_VI
[289] P. Bouillon, N. Chatzichrisafis, S. Halimi, B. A. Hockey, H. Isahara, K. Kanzaki, Y. Nakao, B. Novellas Vall, M. Rayner, M. Santaholma, and M. Starlander. Medslt: a multi-lingual grammar-based medical speech translator. In Proceedings of First International Workshop on Intercultural Collaboration. IWIC2007, 2007. [ bib ]
Keywords: Report_VI, IM2.HMI, January 25-26
[290] E. Kokiopoulou and P. Frossard. Dimensionality reduction with adaptive approximation. In IEEE Int. Conf. on Multimedia & Expo (ICME), 2007. [ bib | http ]
Keywords: Report_VI, IM2.VP
[291] A. Popescu-Belis. Le rôle des métriques d'évaluation dans le processus de recherche en tal. TAL (Traitement Automatique des Langues), 47(2), 2007. [ bib ]
Keywords: Report_VI, IM2.DMA
[292] J. Meynet and J. Ph. Thiran. Information theoretic combination of classifiers with application to adaboost. In 7th international Workshop on Multiple Classifier Systems (MCS), Prague, 2007. ITS. [ bib ]
Keywords: Report_VI, ITS, lts5, pattern recognition, classifier combination, information theory, adaboost, diversity; IM2.VP
[293] F. Valente, H. Bourlard, and V. Deepu. Agglomerative information bottleneck for speaker diarization of meetings data. IDIAP-RR 31, IDIAP, 2007. Submitted for publication. [ bib | .ps.gz | .pdf ]
In this paper, we investigate the use of agglomerative Information Bottleneck (aIB) clustering for the speaker diarization task on meeting data. In contrast to state-of-the-art diarization systems that model individual speakers with Gaussian Mixture Models, the proposed algorithm is completely non-parametric. Both the clustering and model selection issues of non-parametric models are addressed in this work. The proposed algorithm is evaluated on the RT06 meeting evaluation data set. The system is able to achieve Diarization Error Rates comparable to state-of-the-art systems at a much lower computational complexity.

Keywords: Report_VI, IM2.AP
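Agglomerative Information Bottleneck, as used in the entry above, repeatedly merges the pair of clusters whose merge loses the least information about a set of relevance variables; the merge cost is a weighted Jensen-Shannon divergence between the clusters' distributions. A toy sketch of that greedy loop, under the assumption that each speech segment is summarized by a distribution over relevance variables (e.g. posteriors over shared GMM components); identifiers are illustrative, not the paper's code:

```python
import numpy as np

def js_merge_cost(p_i, p_j, w_i, w_j):
    """aIB merge cost: prior-weighted Jensen-Shannon divergence between
    the two clusters' distributions over the relevance variables."""
    pi_i, pi_j = w_i / (w_i + w_j), w_j / (w_i + w_j)
    p_bar = pi_i * p_i + pi_j * p_j

    def kl(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    return (w_i + w_j) * (pi_i * kl(p_i, p_bar) + pi_j * kl(p_j, p_bar))

def aib_cluster(dists, weights, n_clusters):
    """Greedy agglomerative IB: merge the cheapest pair until only
    n_clusters (candidate speakers) remain."""
    clusters = [([i], dists[i].copy(), weights[i]) for i in range(len(dists))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = js_merge_cost(clusters[a][1], clusters[b][1],
                                     clusters[a][2], clusters[b][2])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ids_a, p_a, w_a = clusters[a]
        ids_b, p_b, w_b = clusters.pop(b)
        clusters[a] = (ids_a + ids_b,
                       (w_a * p_a + w_b * p_b) / (w_a + w_b), w_a + w_b)
    return [sorted(ids) for ids, _, _ in clusters]
```

The model selection step (how many speakers to keep) is not shown; the paper addresses it separately.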
[294] U. Hoffmann, J. M. Vesin, and T. Ebrahimi. Recent advances in brain-computer interfaces. In IEEE International Workshop on Multimedia Signal Processing, 2007. Invited Paper. [ bib | http ]
A brain-computer interface (BCI) is a communication system that translates brain activity into commands for a computer or other devices. In other words, a BCI allows users to act on their environment by using only brain activity, without using peripheral nerves and muscles. The major goal of BCI research is to develop systems that allow disabled users to communicate with other persons, to control artificial limbs, or to control their environment. An alternative application area for brain-computer interfaces (BCIs) lies in the field of multimedia communication. To develop systems for usage in the field of assistive technology or multimedia communication, many aspects of BCI systems are currently being investigated. Research areas include evaluation of invasive and noninvasive technologies to measure brain activity, evaluation of control signals (i.e. patterns of brain activity that can be used for communication), development of algorithms for translation of brain signals into computer commands, and the development of new BCI applications. In this paper we give an introduction to some of the aspects of BCI research mentioned above, present a concrete example of a BCI system, and highlight recent developments and open problems.

Keywords: Report_VII, IM2.BMI
[295] J. Meynet, V. Popovici, and J. Ph. Thiran. Face detection with boosted gaussian features. Pattern Recognition, 40(8):2283–2291, 2007. [ bib | DOI ]
Detecting faces in images is a key step in numerous computer vision applications, such as face recognition or facial expression analysis. Automatic face detection is a difficult task because of the large face intra-class variability which is due to the important influence of the environmental conditions on the face appearance. We propose new features based on anisotropic Gaussian filters for detecting frontal faces in complex images. The performances of our face detector based on these new features have been evaluated on reference test sets, and clearly show improvements compared to the state-of-the-art.

Keywords: Report_VI, Adaboost, face detection, Gaussian features, lts5, IM2.VP, joint publication
[296] G. Bologna, B. Deville, T. Pun, and M. Vinckenbosch. Transforming 3d coloured pixels into musical instrument notes for vision substitution applications. Eurasip J. of Image and Video Processing, Special Issue: Image and Video Processing for Disability, accepted for publication, 2007. (to appear). [ bib ]
Keywords: Report_VI, IM2.MPR
[297] Y. Huang, G. Friedland, C. Müller, and N. Mirghafori. Speeding up speaker diarization by using prosodic features. Technical Report TR-07-004, International Computer Science Institute, Berkeley, California, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[298] Y. Huang, O. Vinyals, G. Friedland, C. Müller, N. Mirghafori, and C. Wooters. A fast-match approach for robust, faster than real-time speaker diarization. IEEE workshop on Automatic Speech Recognition and Understanding (ASRU 07), Kyoto, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[299] Y. Huang. Robust and rapid speaker diarization. Master's thesis, University of California, Berkeley, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[300] J. del R. Millán, A. Buttfield, C. Vidaurre, M. Krauledat, A. Schlögl, P. Shenoy, B. Blankertz, R. P. N. Rao, R. Cabeza, G. Pfurtscheller, and K. R. Müller. Adaptation in brain-computer interfaces. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K. R. Müller, editors, Towards Brain-Computer Interfacing. The MIT Press, 2007. [ bib ]
One major challenge in Brain-Computer Interface (BCI) research is to cope with the inherent nonstationarity of the recorded brain signals, caused by changes in the subject's brain processes during an experiment. Online adaptation of the classifier embedded in the BCI is a possible way of tackling this issue. In this chapter we investigate the effect of adaptation on the performance of the classifier embedded in three different BCI systems, all of them based on non-invasive electroencephalogram (EEG) signals. Through this adaptation we aim to keep the classifier constantly tuned to the EEG signals it is receiving in the current session. Although the experimental results reported here show the benefits of online adaptation, some questions still need to be addressed. The chapter ends by discussing some of these open issues.

Keywords: IM2.BCI, Report_VII
[301] I. McCowan, H. K. Maganti, and D. Gatica-Perez. Speech enhancement and recognition in meetings with an audio-visual sensor array. IEEE Trans. on Audio, Speech, and Language Processing, 15(8):2257–2269, 2007. [ bib ]
Keywords: Report_VII, IM2.MPR
[302] G. Aradilla, J. Vepa, and H. Bourlard. An acoustic model based on kullback-leibler divergence for posterior features. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2007. IDIAP-RR 06-60. [ bib | .ps.gz | .pdf ]
This paper investigates the use of features based on posterior probabilities of subword units such as phonemes. These features are typically transformed when used as inputs for a hidden Markov model with mixtures of Gaussians as emission distributions (HMM/GMM). In this work, we introduce a novel acoustic model that avoids the Gaussian assumption and directly uses posterior features without any transformation. This model is described by a finite state machine where each state is characterized by a target distribution, and the cost function associated with each state is given by the Kullback-Leibler (KL) divergence between its target distribution and the posterior features. Furthermore, the hybrid HMM/ANN system can be seen as a particular case of this KL-based model where state target distributions are predefined. A training method is also presented that minimizes the KL-divergence between the state target distributions and the posterior features.

Keywords: Report_VI, IM2.AP
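The local score the abstract above describes, i.e. the cost of a phoneme-posterior feature vector under a state with a fixed target distribution, can be sketched as follows (an illustration with invented names, not the paper's code):

```python
import numpy as np

def kl_div(p, q, eps=1e-10):
    """KL(p || q) for discrete distributions, clipping zeros for safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def state_costs(posteriors, targets):
    """Local cost of each HMM state for each frame: the KL divergence
    between the state's target distribution and the frame's
    phoneme-posterior feature vector (lower cost = better match)."""
    return np.array([[kl_div(t, z) for t in targets] for z in posteriors])
```

Decoding then searches for the state sequence minimizing the accumulated cost; a hybrid HMM/ANN system corresponds to fixing each state's target distribution in advance, e.g. concentrated on the state's own phoneme.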
[303] A. Pronobis and B. Caputo. Confidence-based cue integration for visual place recognition. IDIAP-RR 17, IDIAP, 2007. [ bib | .ps.gz | .pdf ]
A distinctive feature of intelligent systems is their capability to analyze their level of expertise for a given task; in other words, they know what they know. As a step towards this ambitious goal, this paper presents a recognition algorithm able to measure its own level of confidence and, in case of uncertainty, to seek extra information so as to increase its own knowledge and ultimately achieve better performance. We focus on the visual place recognition problem for topological localization, and we take an SVM approach. We propose a new method for measuring the confidence level of the classification output, based on the distance of a test image and the average distance of training vectors. This method is combined with a discriminative accumulation scheme for cue integration. We show with extensive experiments that the resulting algorithm achieves better performance for two visual cues than the classic single-cue SVM on the same task, while minimising the computational load. More importantly, our method provides a reliable measure of the level of confidence of the decision.

Keywords: Report_VI, IM2.VP
[304] B. Mesot and D. Barber. A gaussian sum smoother for inference in switching linear dynamical systems. Technical report, Idiap Research Institute, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[305] J. Yao and J. M. Odobez. Multi-layer background subtraction based on color and texture. In CVPR 2007 Workshop on Visual Surveillance (VS2007), volume 17-22, pages 1–8, 2007. [ bib | DOI ]
In this paper, we propose a robust multi-layer background subtraction technique which takes advantage of local texture features represented by local binary patterns (LBP) and photometric invariant color measurements in RGB color space. LBP works robustly with respect to light variation on rich texture regions but not so efficiently on uniform regions. In the latter case, color information should overcome LBP's limitation. Due to the illumination invariance of both the LBP feature and the selected color feature, the method is able to handle local illumination changes such as cast shadows from moving objects. Due to the use of a simple layer-based strategy, the approach can model moving background pixels with quasi-periodic flickering as well as background scenes which may vary over time due to the addition and removal of long-term stationary objects. Finally, the use of a cross-bilateral filter allows detection results to be implicitly smoothed over regions of similar intensity while preserving object boundaries. Numerical and qualitative experimental results on both simulated and real data demonstrate the robustness of the proposed method.

Keywords: IM2.VP, Report_VI
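The LBP texture feature that the entry above builds on compares each pixel with its 8 neighbours and packs the comparison bits into an 8-bit code. A minimal sketch of that feature alone (illustrative only; the paper additionally maintains per-pixel background layers of LBP histograms and colour statistics):

```python
import numpy as np

def lbp_8(image):
    """8-neighbour local binary pattern for each interior pixel:
    threshold the neighbours at the centre value and pack the
    comparison bits into a code in [0, 255]."""
    img = image.astype(np.int32)
    c = img[1:-1, 1:-1]  # centre pixels (border excluded)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit
    return code
```

Because the code depends only on intensity orderings, not absolute values, it is invariant to monotonic illumination changes, which is what makes it attractive for shadow-robust background modelling.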
[306] P. Motlicek, H. Hermansky, S. Ganapathy, and H. Garudadri. Frequency domain linear prediction for qmf sub-bands and applications to audio coding. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI) [95], pages 248–258. IDIAP-RR 07-16. [ bib ]
This paper proposes an analysis technique for wide-band audio applications based on the predictability of the temporal evolution of Quadrature Mirror Filter (QMF) sub-band signals. The input audio signal is first decomposed into 64 sub-band signals using QMF decomposition. The temporal envelopes in critically sampled QMF sub-bands are approximated using frequency domain linear prediction applied over relatively long time segments (e.g. 1000 ms). Line Spectral Frequency parameters related to autoregressive models are computed and quantized in each frequency sub-band. The sub-band residuals are quantized in the frequency domain using a combination of split Vector Quantization (VQ) (for magnitudes) and uniform scalar quantization (for phases). In the decoder, the sub-band signal is reconstructed using the quantized residual and the corresponding quantized envelope. Finally, application of inverse QMF reconstructs the audio signal. Even with simple quantization techniques and without any sophisticated modules, the proposed audio coder provides encouraging results in objective quality tests. Also, the proposed coder is easily scalable across a wide range of bit-rates.

Keywords: IM2.AP, Report_VI
[307] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In International Conference on Machine Learning (ICML), 2007. IDIAP-RR 07-18. [ bib | .ps.gz | .pdf ]
An efficient and general multiple kernel learning (MKL) algorithm has recently been proposed by Sonnenburg et al. This approach has opened new perspectives since it makes the MKL approach tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs several iterations before converging towards a reasonable solution. In this paper, we address the MKL problem through an adaptive 2-norm regularization formulation. Weights on each kernel matrix are included in the standard SVM empirical risk minimization problem with an l1 constraint to encourage sparsity. We propose an algorithm for solving this problem and provide new insight into MKL algorithms based on block 1-norm regularization by showing that the two approaches are equivalent. Experimental results show that the resulting algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms.

Keywords: Report_VI, IM2.MPR
[308] A. Lovitt. Truncation confusion patterns in onset consonants. In Interspeech 2007, 2007. IDIAP-RR 07-05. [ bib | .ps.gz | .pdf ]
Confusion matrices and truncation experiments have long been a part of psychoacoustic experimentation; however, confusion matrices are seldom used to analyze truncation experiments. A truncation experiment was conducted and the confusion patterns were analyzed for 6 consonant-vowels (CVs). The confusion patterns show significant structure as the CV is truncated from the onset of the consonant. These confusions show correlations with articulatory and acoustic features, as well as with other related CVs. These confusion patterns are shown and explored as they relate to human speech recognition.

Keywords: Report_VI, IM2.AP
[309] H. Lei and N. Mirghafori. Word-conditioned hmm supervectors for speaker recognition. to appear in Proceedings of Interspeech, Antwerp, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[310] H. Lei and N. Mirghafori. Word-conditioned phone n-grams for speaker recognition. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[311] M. Gerber, R. Beutler, and B. Pfister. Quasi text-independent speaker verification based on pattern matching. In Proceedings of Interspeech. ISCA, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[312] R. Bertolami, S. Uchida, M. Zimmermann, and H. Bunke. Non-uniform slant correction for handwritten text line recognition. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages 18–22, 2007. [ bib ]
Keywords: Report_VII, IM2.VP
[313] D. Hakkani-Tur and G. Tur. Statistical sentence extraction for information distillation. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[314] D. Lalanne, E. Bertini, P. Hertzog, and P. Bados. Visual analysis of corporate network intelligence: abstracting and reasoning on yesterdays for acting today. 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[315] K. Kryszczuk and A. Drygajlo. Improving classification with class-independent quality measures: q-stack in face verification. In Proc. 2nd Int. Conference in Biometrics (ICB 2007), Seoul, South Korea, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[316] J. Hennebert. Please repeat: my voice is my password. from the basics to real-life implementations of speaker verification technologies. In Invited lecture at the Information Security Summit (IS2 2007), Prague, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[317] S. Marchand-Maillet, E. Bruno, A. Nürnberger, and M. Detyniecki. Adaptive multimedia retrieval: user, context and feedback. Springer, 2007. [ bib ]
Keywords: Report_VI, IM2.MCA
[318] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, H. van Brussel, and J. del R. Millán. An asynchronous and non-invasive brain-actuated wheelchair. In Proceedings of the 13th International Symposium on Robotics Research, volume 128, Hiroshima, Japan, 2007. [ bib ]
Objectives: To develop a robust asynchronous and non-invasive brain-computer interface (BCI) for brain-actuated wheelchair driving, and to assess the system's robustness over time and context. Methods: Two subjects were asked to mentally drive a simulated wheelchair from a starting point to a goal following a pre-specified path in a simulated environment. Each subject participated in 5 experimental sessions consisting of 10 trials each. The experimental sessions were carried out with different elapsed times between them (from one hour to two months) to assess the system's robustness over time. The path was divided into seven stretches to assess the robustness over context. Results: The two subjects were able to reach 90% (subject 1) and 80% (subject 2) of the final goals one day after the calibration of the BCI system, and 100% (subject 1) and 70% (subject 2) two months later. Different performances were obtained over the different path stretches.

Keywords: IM2.BCI, Report_VII
[319] A. Lovitt, J. P. Pinto, and H. Hermansky. On confusions in a phoneme recognizer. 2007. IDIAP-RR 07-10. [ bib | .ps.gz | .pdf ]
In this paper, we analyze the confusion patterns at three places in a hybrid phoneme recognition system. The confusions are analyzed at the pronunciation, posterior probability, and phoneme recognizer levels. The confusions show significant structure that is similar at all levels. Some confusions also correlate with human psychoacoustic experiments in white masking noise. These structures imply that not all errors should be counted equally and that some phoneme distinctions are arbitrary. Understanding these confusion patterns can improve the performance of a recognizer by eliminating problematic phoneme distinctions. These principles are applied to a phoneme recognition system and the results show a marked improvement in the phone error rate. Confusion pattern analysis leads to a better way of choosing phoneme sets for recognition.

Keywords: Report_VI, IM2.AP
[320] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. Speeded-up robust features (surf). Computer Vision and Image Understanding (CVIU), 2007. [ bib ]
Keywords: Report_VII, IM2.VP.MCA, joint
[321] O. Koval, S. Voloshynovskiy, and T. Pun. Analysis of multimodal binary detection systems based on dependent/independent modalities. In Proceedings of the IEEE 2007 International Workshop on Multimedia Signal Processing, Chania, Crete, Greece, 2007. [ bib ]
Keywords: Report_VII, IM2.MPR
[322] R. Rytsar and T. Pun. Computational aspects of the eeg forward problem solution for real head model using finite element. In 29th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society, Lyon, France, 2007. [ bib ]
Keywords: Report_VI, IM2.MPR
[323] E. Bertini, P. Hertzog, and D. Lalanne. Spiralview: a visual tool to improve monitoring and understanding of security data in corporate. In IEEE Symposium on Visual Analytics Science and Technology 2007 (VAST'07), page to appear, Sacramento, CA (USA), 2007. [ bib ]
Keywords: Report_VI, IM2.HMI
[324] H. Romsdorfer and B. Pfister. Text analysis and language identification for polyglot text-to-speech synthesis. Speech Communication (Elsevier), 2007. (to appear). [ bib ]
Keywords: Report_VI, IM2.AP
[325] X. Anguera, C. Wooters, J. M. Pardo, and J. Hernando. Automatic weighting for the combination of tdoa and acoustic features in speaker diarization for meetings. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[326] X. Anguera, T. Shinozaki, C. Wooters, and J. Hernando. Model complexity selection and cross-validation em training for robust speaker diarization. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[327] Y. Liu and E. Shriberg. Comparing evaluation metrics for sentence boundary detection. Proc. ICASSP, Honolulu, 2007. [ bib ]
Keywords: Report_VI, IM2.AP
[328] M. Liwicki and H. Bunke. Handwriting recognition of whiteboard notes – studying the influence of training set size and type. Int. Journal of Pattern Recognition and Art. Intelligence, 21(1):83–98, 2007. [ bib ]
Keywords: Report_VI, IM2.VP
[329] S. Ba and J. M. Odobez. Probabilistic head pose tracking evaluation in single and multiple camera setups. In Classification of Events, Activities and Relationship Evaluation and Workshop, 2007. IDIAP-RR 07-21. [ bib | .ps.gz | .pdf ]
This paper presents our participation in the two head pose estimation tasks of the CLEAR 07 evaluation workshop. The first task consisted of estimating head poses with respect to (w.r.t.) a single camera capturing people seated in a meeting room scenario. The second task consisted of estimating the head pose of people moving in a room from four cameras w.r.t. a global room coordinate system. To solve the first task, we used a probabilistic exemplar-based head pose tracking method using a mixed-state particle filter based on a representation in a joint state space of head localization and pose variables. This state space representation allows the combined search for both the optimal head location and pose. To solve the second task, we first applied the same head tracking framework to estimate the head pose w.r.t. each of the four cameras. Then, using the camera calibration parameters, the head poses w.r.t. individual cameras were transformed into head poses w.r.t. the global room coordinates, and the measures obtained from the four cameras were fused using reliability measures based on skin detection. Good head pose tracking performance was obtained for both tasks.

Keywords: Report_VI, IM2.VP
[330] A. Stolcke, S. Kajarekar, L. Ferrer, and E. Shriberg. Speaker recognition with session variability normalization based on mllr adaptation transforms. IEEE Transactions on Audio, Speech, and Language Processing, 15:1987–1998, 2007. [ bib ]
Keywords: Report_VII, IM2.AP
[331] F. Galán, P. W. Ferrez, F. Oliva, J. Guàrdia, and J. del R. Millán. Feature extraction for multi-class bci using canonical variates analysis. IDIAP-RR 23, IDIAP, 2007. Submitted for publication. [ bib | .ps.gz | .pdf ]
Objective: To propose a new feature extraction method with a canonical solution for multi-class Brain-Computer Interfaces (BCI). The proposed method should provide a reduced number of canonical discriminant spatial patterns (CDSP) and rank the channels by their discriminative power (DP) between classes. Methods: The feature extractor relies on Canonical Variates Analysis (CVA), which provides the CDSP between the classes. The number of CDSP is equal to the number of classes minus one. We analyze EEG data recorded with 64 electrodes from 4 subjects over 20 sessions. They were asked to execute twice in each session three different mental tasks (left hand movement imagination, rest, and word association) during 7 seconds. A ranking of electrodes sorted by discriminative power between classes and the CDSP were computed. After splitting the data into training and test sets, we compared the classification accuracy achieved by Linear Discriminant Analysis (LDA) in the frequency and temporal domains. Results: The average LDA classification accuracies over the four subjects using CVA in both domains are equivalent (57.89% in the frequency domain and 59.43% in the temporal domain). These results, in terms of classification accuracies, are also reflected in the similarity between the rankings of relevant channels in both domains. Conclusions: CVA is a simple feature extractor with a canonical solution, useful for multi-class BCI applications, that can work in the temporal or frequency domain.

Keywords: Report_VI, IM2.BMI
[332] G. Lathoud and J. M. Odobez. Short-term spatio-temporal clustering applied to multiple moving speakers. IEEE Transactions on Audio, Speech and Language Processing, 2007. [ bib ]
Distant microphones make it possible to process spontaneous multi-party speech with very few constraints on speakers, as opposed to close-talking microphones. Minimizing the constraints on speakers permits a large diversity of applications, including meeting summarization and browsing, surveillance, hearing aids, and more natural human-machine interaction. Such applications of distant microphones require determining where and when the speakers are talking. This is inherently a multisource problem, because of background noise sources, as well as the natural tendency of multiple speakers to talk over each other. Moreover, spontaneous speech utterances are highly discontinuous, which makes it difficult to track the multiple speakers with classical filtering approaches, such as Kalman Filtering or Particle Filters. As an alternative, this paper proposes a probabilistic framework to determine the trajectories of multiple moving speakers in the short term only, i.e. only while they speak. Instantaneous location estimates that are close in space and time are grouped into “short-term clusters” in a principled manner. Each short-term cluster determines the precise start and end times of an utterance, and a short-term spatial trajectory. Contrastive experiments clearly show the benefit of using short-term clustering, on real indoor recordings with seated speakers in meetings, as well as multiple moving speakers.

Keywords: Report_VI, IM2.AP.MPR, joint publication
[333] A. Humm, J. Hennebert, and R. Ingold. Database and evaluation protocols for user authentication using combined handwriting and speech modalities. Technical report, Department of Informatics, University of Fribourg, Switzerland, 2007. [ bib ]
Keywords: Report_VII, IM2.HMI
[334] S. Ba and J. M. Odobez. Multi-person visual focus of attention from head pose and meeting contextual cues. Idiap-RR 47-2008, Idiap, 2008. [ bib | .pdf ]
This paper presents a model for visual focus of attention (VFOA) and conversational event estimation in meetings from audio-visual perceptual cues. Rather than independently recognizing the VFOA of each participant from his own head pose, we propose to recognize participants' VFOA jointly, in order to introduce context-dependent interaction models that relate to group activity and the social dynamics of communication. To this end, we designed a dynamic Bayesian network (DBN), whose hidden states are the joint VFOA of all participants and the meeting conversational events. The observations used to infer the hidden states are the people's head poses and speaking status. Interaction models are introduced in the DBN through the conversational events and the projection screen activity, contextual cues that affect the temporal evolution of the joint VFOA sequence. This allows us to model group dynamics that account for people's tendency to share the same focus, or to have their VFOA driven by contextual cues such as projection screen activity or conversational events. The model is rigorously evaluated on a publicly available dataset of 4 real meetings with a total duration of 1 hour 30 minutes.

Keywords: IM2.VP, Report_VIII
[335] B. Favre, R. Grishman, D. Hillard, H. Ji, D. Hakkani-Tur, and M. Ostendorf. Punctuating speech for information extraction. IEEE ICASSP, Las Vegas, NV, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[336] H. Hung, Y. Huang, G. Friedland, and D. Gatica-Perez. Estimating the dominant person in multi-party conversations using speaker diarization strategies. IEEE ICASSP, Las Vegas, NV, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[337] M. Liwicki and H. Bunke. Recognition of whiteboard notes – online, offline and combination. World Scientific, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[338] P. W. Ferrez and J. del R. Millán. Eeg-based brain-computer interaction: improved accuracy by automatic single-trial error detection. In Advances in Neural Information Processing Systems 20, pages 441–448, 2008. [ bib ]
Brain-computer interfaces (BCIs), like any other interaction modality based on physiological signals and body channels (e.g., muscular activity, speech and gestures), are prone to errors in the recognition of the subject's intent. An elegant approach to improve the accuracy of BCIs consists in a verification procedure directly based on the presence of error-related potentials (ErrP) in the EEG recorded right after the occurrence of an error. Six healthy volunteer subjects with no prior BCI experience participated in a new human-robot interaction experiment where they were asked to mentally move a cursor towards a target that can be reached within a few steps using motor imagination. This experiment confirms the previously reported presence of a new kind of ErrP. These Interaction ErrP exhibit a first sharp negative peak followed by a positive peak and a second broader negative peak (290, 350 and 470 ms after the feedback, respectively). But in order to exploit these ErrP we need to detect them in each single trial, using a short window following the feedback associated with the response of the classifier embedded in the BCI. We have achieved an average recognition rate of correct and erroneous single trials of 81.8% and 76.2%, respectively. Furthermore, we have achieved an average recognition rate of the subject's intent while trying to mentally drive the cursor of 73.1%. These results show that it is possible to simultaneously extract useful information for mental control to operate a brain-actuated device, as well as cognitive states such as error potentials, to improve the quality of the brain-computer interaction. Finally, using a well-known inverse model (sLORETA), we show that the main foci of activity at the occurrence of the ErrP are, as expected, in the pre-supplementary motor area and in the anterior cingulate cortex.

Keywords: IM2.BCI, Report_VII
[339] S. H. K. Parthasarathi and H. Hermansky. A data-driven approach to speech/non-speech detection. Idiap-RR 23-2008, IDIAP, 2008. [ bib ]
We present a data-driven approach to weighting the temporal context of signal energy to be used in a simple speech/non-speech detector (SND). The optimal weights are obtained using linear discriminant analysis (LDA). Regularization is performed to handle numerical issues inherent to the usage of correlated features. The discriminant so obtained is interpreted as a filter in the modulation spectral domain. Experimental evaluations on the test data set, in terms of average frame-level error rate over different SNR levels, show that the proposed method yields an absolute performance gain of 10.9%, 17.5%, 7.9% and 8.3% over ITU's G.729B, ETSI's AMR1, AMR2 and a state-of-the-art multi-layer perceptron based system, respectively. This shows that even a simple feature such as full-band energy, when employed with a large-enough context, shows promise for applications.

Keywords: IM2.BMI, Report_VII
[340] S. Ba and J. M. Odobez. Recognizing visual focus of attention from head pose in natural meetings. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2008. Accepted for publication. [ bib ]
Keywords: Report_VII, IM2.VP
[341] E. Kokiopoulou and P. Frossard. Minimum distance between pattern transformation manifolds: algorithm and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint
[342] H. Bunke, P. Dickinson, M. Neuhaus, and M. Stettler. Matching of hypergraphs – algorithms, applications, and experiments. In H. Bunke, A. Kandel, and M. Last, editors, Applied Pattern Recognition, pages 131–154. Springer, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[343] A. Singla and D. Hakkani-Tur. Cross-lingual sentence extraction for information distillation. To appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[344] S. H. K. Parthasarathi, P. Motlicek, and H. Hermansky. Exploiting contextual information for speech/non-speech detection. In Text, Speech and Dialogue, volume 5246 of Series of Lecture Notes In Artificial Intelligence (LNAI), pages 451–459. Springer-Verlag Berlin, Heidelberg, 2008. [ bib | .pdf ]
In this paper, we investigate the effect of temporal context for speech/non-speech detection (SND). It is shown that even a simple feature such as full-band energy, when employed with a large-enough context, shows promise for further investigation. Experimental evaluations on the test data set, with a state-of-the-art multi-layer perceptron based SND system and a simple energy threshold based SND method, using the F-measure, show an absolute performance gain of 4.4% and 5.4% respectively. The optimal contextual length was found to be 1000 ms. Further numerical optimizations yield an improvement (3.37% absolute), resulting in an absolute gain of 7.77% and 8.77% over the MLP based and energy based methods respectively. ROC based performance evaluation also reveals promising performance for the proposed method, particularly in low SNR conditions.

Keywords: IM2.AP, Report_VIII
[345] E. Kokiopoulou, S. Pirillos, and P. Frossard. Graph-based classification for multiple observations of transformed patterns. IEEE Int. Conf. Pattern Recognition (ICPR), 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint publication
[346] H. Hung and D. Gatica-Perez. Identifying dominant people in meetings from audio-visual sensors. In Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG), Special Session on Multi-Sensor HCI for Smart Environments, Amsterdam, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[347] A. Schlapbach, H. Bunke, and F. Wettstein. Estimating the readability of handwritten text – a support vector regression based approach. In Proc. 19th Int. Conf. on Pattern Recognition. IEEE, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[348] A. Humm, J. Hennebert, and R. Ingold. Spoken signature for user authentication. SPIE Journal of Electronic Imaging, 17, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[349] K. Kryszczuk and A. Drygajlo. On quality of quality measures for classification. pages 19–28, Heidelberg, 2008. Springer. [ bib ]
Keywords: IM2.MPR, Report_VIII
[350] W. Li, K. Kumatani, J. Dines, M. Magimai-Doss, and H. Bourlard. A neural network based regression approach for recognizing simultaneous speech. In Joint Workshop on Machine Learning and Multimodal Interaction, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[351] R. Chavarriaga, F. Galán, and J. del R. Millán. Asynchronous detection and classification of oscillatory brain activity. In 16th European Signal Processing Conference (EUSIPCO 2008), 2008. IDIAP-RR 08-36. [ bib ]
The characterization and recognition of electrical signatures of brain activity constitutes a real challenge. Applications such as Brain-Computer Interfaces (BCI) are based on the accurate identification of mental processes in order to control external devices. Traditionally, classification of brain activity patterns relies on the assumption that the neurological phenomena that characterize mental states are continuously present in the signal. However, recent evidence shows that some mental processes are better characterized by episodic activity that is not necessarily synchronized with external stimuli. In this paper, we present a method for classification of mental states based on the detection of this episodic activity. Instead of performing classification on all available data, the proposed method identifies informative samples based on the class sample distribution in a projected canonical feature space. Classification results are compared to traditional methods using both artificial data and real EEG recordings.

Keywords: IM2.BMI, Report_VII
[352] N. Scaringella. Timbre and rhythmic trap-tandem features for music information retrieval. In Int. Conf. on Music Information Retrieval (ISMIR), 2008. [ bib | .pdf ]
The enormous growth of digital music databases has led to a comparable growth in the need for methods that help users organize and access such information. One area in particular that has seen much recent research activity is the use of automated techniques to describe audio content and to allow for its identification, browsing and retrieval. Conventional approaches to music content description rely on features characterizing the shape of the signal spectrum in relatively short-term frames. In the context of Automatic Speech Recognition (ASR), Hermansky described an interesting alternative to short-term spectrum features, the TRAP-TANDEM approach, which uses long-term band-limited features trained in a supervised fashion. We adapt this idea to the specific case of music signals and propose a generic system for the description of temporal patterns. The same system with different settings is able to extract features describing either timbre or rhythmic content. The quality of the generated features is demonstrated in a set of music retrieval experiments and compared to other state-of-the-art models.

Keywords: IM2.AP, Report_VIII
[353] G. S. V. S. Sivaram and H. Hermansky. Introducing temporal asymmetries in feature extraction for automatic speech recognition. In Interspeech 2008, 2008. IDIAP-RR 08-25. [ bib ]
We propose a new auditory-inspired feature extraction technique for automatic speech recognition (ASR). Features are extracted by filtering the temporal trajectory of spectral energies in each critical band of speech with a bank of finite impulse response (FIR) filters. Impulse responses of these filters are derived from a modified Gabor envelope in order to emulate asymmetries of the temporal receptive field (TRF) profiles observed in higher-level auditory neurons. We obtain an 11.4% relative improvement in word error rate on the OGI-Digits database and a 3.2% relative improvement in phoneme error rate on the TIMIT database over the MRASTA technique.

Keywords: IM2.AP, Report_VII
[354] G. Aradilla, H. Bourlard, and M. Magimai-Doss. Posterior features applied to speech recognition tasks with limited training data. Idiap-RR 15-2008, IDIAP, 2008. [ bib ]
This paper describes an approach where posterior-based features are applied in those recognition tasks where the amount of training data is insufficient to obtain a reliable estimate of the speech variability. A template matching approach is considered in this paper where posterior features are obtained from a MLP trained on an auxiliary database. Thus, the speech variability present in the features is reduced by applying the speech knowledge captured on the auxiliary database. When compared to state-of-the-art systems, this approach outperforms acoustic-based techniques and obtains comparable results to grapheme-based approaches. Moreover, the proposed method can be directly combined with other posterior-based HMM systems. This combination successfully exploits the complementarity between templates and parametric models.

Keywords: IM2.AP, Report_VII
[355] F. Galán, M. Nuttin, D. Vanhooydonck, E. Lew, P. W. Ferrez, J. Philips, and J. del R. Millán. Continuous brain-actuated control of an intelligent wheelchair by human eeg. In 4th International Brain-Computer Interface Workshop & Training Course, 2008. IDIAP-RR 08-53. [ bib ]
The objective of this study is to assess the feasibility of controlling an asynchronous and non-invasive brain-actuated wheelchair by human EEG. Three subjects were asked to mentally drive the wheelchair to 3 target locations using 3 mental commands. These mental commands were respectively associated with the three wheelchair steering behaviors: turn left, turn right, and move forward. The subjects participated in 30 randomized trials (10 trials per target). The performance was assessed in terms of percentage of reached targets, calculated as a function of the distance between the final wheelchair position and the target at each trial. To assess the brain-actuated control achieved by the subjects, their performances were compared with the performance achieved by a random BCI. The subjects drove the wheelchair closer than 1 meter from the target in 20%, 37%, and 7% of the trials, and closer than 2 meters in 37%, 53%, and 27% of the trials, respectively. The random BCI drove it closer than 1 and 2 meters in 0% and 13% of the trials, respectively. The results show that the subjects could achieve a significant level of mental control, even if far from optimal, to drive an intelligent wheelchair, thus demonstrating the feasibility of continuously controlling complex robotic devices using an asynchronous and non-invasive BCI.

Keywords: IM2.BMI, Report_VII
[356] S. Y. Zhao and N. Morgan. Multi-stream spectro-temporal features for robust speech recognition. In 9th International Conference of the ISCA (Interspeech 2008), Brisbane, Australia, pages 898–901, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[357] S. Ganapathy, P. Motlicek, H. Hermansky, and H. Garudadri. Temporal masking for bit-rate reduction in audio codec based on frequency domain linear prediction. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 4781–4784, 2008. IDIAP-RR 07-48. [ bib | DOI ]
Audio coding based on Frequency Domain Linear Prediction (FDLP) uses an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands for relatively long temporal segments. Although the basic technique achieves good quality of the reconstructed signal, there is a need to improve the coding efficiency. In this paper, we present a novel method for applying temporal masking to reduce the bit-rate in an FDLP-based codec. Temporal masking refers to the hearing phenomenon where exposure to a sound reduces the response to subsequent sounds for a certain period of time (up to 200 ms). In the proposed version of the codec, a first-order forward masking model of the human ear is implemented, and informal listening experiments using additive white noise are performed to obtain the exact noise masking thresholds. Subsequently, this masking model is employed in encoding the sub-band FDLP carrier signal. Application of temporal masking in the FDLP codec results in a bit-rate reduction of about 10% without degrading the quality. Performance evaluation is done with Perceptual Evaluation of Audio Quality (PEAQ) scores and with subjective listening tests.

Keywords: IM2.AP, Report_VII
[358] A. Faria and N. Morgan. When a mismatch can be good: large vocabulary speech recognition trained with idealized tandem features. Proceedings of the ACM Symposium on Applied Computing, Fortaleza, Brazil, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[359] A. Faria and N. Morgan. Corrected tandem features for acoustic model training. Accepted for IEEE ICASSP, Las Vegas, NV, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[360] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland. Social signal processing: state-of-the-art and future perspectives of an emerging domain. In Proceedings of the ACM International Conference on Multimedia, 2008. [ bib ]
The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence (the ability to recognize human social signals and social behaviours like politeness and disagreement) in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, the design and development of automated systems for Social Signal Processing (SSP) are rather difficult. This paper surveys the past efforts in solving these problems by a computer, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially-aware computing.

Keywords: IM2.MCA, Report_VII
[361] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):267–282, 2008. [ bib ]
Given two to four synchronized video streams taken at eye level and from different angles, we show that we can effectively combine a generative model with dynamic programming to accurately follow up to six individuals across thousands of frames in spite of significant occlusions and lighting changes. In addition, we also derive metrically accurate trajectories for each one of them. Our contribution is twofold. First, we demonstrate that our generative model can effectively handle occlusions in each time frame independently, even when the only data available comes from the output of a simple background subtraction algorithm and when the number of individuals is unknown a priori. Second, we show that multi-person tracking can be reliably achieved by processing individual trajectories separately over long sequences, provided that a reasonable heuristic is used to rank these individuals and avoid confusing them with one another.

Keywords: IM2.VP, IM2.MPR, Report_VIII
[362] G. S. V. S. Sivaram and H. Hermansky. Emulating temporal receptive fields of auditory mid-brain neurons for automatic speech recognition. In Proc. 16th European Signal Processing Conference (EUSIPCO), 2008. IDIAP-RR 08-24. [ bib ]
This paper proposes modifications to the Multi-resolution RASTA (MRASTA) feature extraction technique for the automatic speech recognition (ASR). By emulating asymmetries of the temporal receptive field (TRF) profiles of auditory mid-brain neurons, we obtain more than 13% relative improvement in word error rate on OGI-Digits database. Experiments on TIMIT database confirm that proposed modifications are indeed useful.

Keywords: IM2.AP, Report_VII
[363] R. Bertolami and H. Bunke. Integration of n-gram language models in multiple classifier systems for offline handwritten text line recognition. Int. Journal of Pattern Recognition and Art. Intelligence, 22(7):1301–1321, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[364] U. Hoffmann, A. Yazdani, J. M. Vesin, and T. Ebrahimi. Bayesian feature selection applied in a p300 brain-computer interface. In 16th European Signal Processing Conference, 2008. [ bib | http ]
Feature selection is a machine learning technique that has many interesting applications in the area of brain-computer interfaces (BCIs). Here we show how automatic relevance determination (ARD), which is a Bayesian feature selection technique, can be applied in a BCI system. We present a computationally efficient algorithm that uses ARD to compute sparse linear discriminants. The algorithm is tested with data recorded in a P300 BCI and with P300 data from the BCI competition 2004. The achieved classification accuracy is competitive with the accuracy achievable with a support vector machine (SVM). At the same time, the computational complexity of the presented algorithm is much lower than that of the SVM. Moreover, it is shown how visualization of the computed discriminant vectors makes it possible to gain insights about the neurophysiological mechanisms underlying the P300 paradigm.

Keywords: Report_VII, IM2.MCA.BMI
[365] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, and J. del R. Millán. A brain-actuated wheelchair: asynchronous and non-invasive brain-computer interfaces for continuous control of robots. Clinical Neurophysiology, (119):2159–2169, 2008. [ bib ]
Objective: To assess the feasibility and robustness of an asynchronous and non-invasive EEG-based Brain-Computer Interface (BCI) for continuous mental control of a wheelchair. Methods: In experiment 1, two subjects were asked to mentally drive both a real and a simulated wheelchair from a starting point to a goal along a pre-specified path. Here we only report experiments with the simulated wheelchair, for which we have extensive data in a complex environment that allows a sound analysis. Each subject participated in 5 experimental sessions, each consisting of 10 trials. The time elapsed between two consecutive experimental sessions was variable (from one hour to two months) to assess the system robustness over time. The pre-specified path was divided into 7 stretches to assess the system robustness in different contexts. To further assess the performance of the brain-actuated wheelchair, subject 1 participated in a second experiment consisting of 10 trials where he was asked to drive the simulated wheelchair following 10 different complex and random paths never tried before. Results: In experiment 1, the two subjects were able to reach 100% (subject 1) and 80% (subject 2) of the final goals along the pre-specified trajectory in their best sessions. Different performances were obtained over time and path stretches, which indicates that performance is time and context dependent. In experiment 2, subject 1 was able to reach the final goal in 80% of the trials. Conclusions: The results show that subjects can rapidly master our asynchronous EEG-based BCI to control a wheelchair. Also, they can autonomously operate the BCI over long periods of time without the need for adaptive algorithms externally tuned by a human operator to minimize the impact of EEG non-stationarities. This is possible because of two key components: first, the inclusion of a shared control system between the BCI system and the intelligent simulated wheelchair; second, the selection of stable user-specific EEG features that maximize the separability between the mental tasks. Significance: These results show the feasibility of continuously controlling complex robotic devices using an asynchronous and non-invasive BCI.

Keywords: IM2.BMI, Report_VII
[366] M. Rigamonti. A framework for structuring multimedia archives and for browsing efficiently through multimodal links. PhD thesis, University of Fribourg, Switzerland, 2008. [ bib ]
Keywords: IM2.HMI, Report_VIII
[367] F. Dufaux and T. Ebrahimi. H.264/avc video scrambling for privacy protection. In IEEE International Conference on Image Processing (ICIP2008), 2008. [ bib ]
In this paper, we address the problem of privacy in video surveillance systems. More specifically, we consider the case of H.264/AVC, which is the state-of-the-art in video coding. We assume that Regions of Interest (ROI), containing privacy-sensitive information, have been identified. The content of these regions is then concealed using scrambling. More specifically, we introduce two region-based scrambling techniques. The first one pseudo-randomly flips the sign of transform coefficients during encoding. The second one performs a pseudo-random permutation of transform coefficients in a block. The Flexible Macroblock Ordering (FMO) mechanism of H.264/AVC is exploited to discriminate between the ROI, which are scrambled, and the background, which remains clear. Experimental results show that both techniques are able to effectively hide private information in ROI, while the scene remains comprehensible. Furthermore, the loss in coding efficiency stays small, whereas the required additional computational complexity is negligible.

Keywords: Report_VII, IM2.MCA,video surveillance; privacy; scrambling
[368] T. Tommasi, F. Orabona, and B. Caputo. Discriminative cue integration for medical image annotation. Pattern Recognition Letters, 2008. Special Issue on Automatic Annotation of Medical Images (ImageCLEF 2007), in press. [ bib | .pdf ]
Automatic annotation of medical images is an increasingly important tool for physicians in their daily activity. Hospitals nowadays produce an increasing amount of data. Manual annotation is very costly and prone to human mistakes. This paper proposes a multi-cue approach to automatic medical image annotation. We represent images using global and local features. These cues are then combined using three alternative approaches, all based on the Support Vector Machine algorithm. We tested our methods on the IRMA database, and with two of the three approaches proposed here we participated in the 2007 ImageCLEFmed benchmark evaluation, in the medical image annotation track. These algorithms ranked first and fifth, respectively, among all submissions. Experiments using the third approach also confirm the power of cue integration for this task.

Keywords: IM2.VP, IM2.MPR, Report_VIII
[369] A. Schlapbach, M. Liwicki, and H. Bunke. A writer identification system for on-line whiteboard data. Pattern Recognition, 41:2381–2397, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[370] M. Rayner, N. Tsourakis, M. Georgescul, and P. Bouillon. Building mobile spoken dialogue applications using regulus. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008. [ bib ]
Regulus is an Open Source platform that supports construction of rule-based medium-vocabulary spoken dialogue applications. It has already been used to build several substantial speech-enabled applications, including NASA's Clarissa procedure navigator and Geneva University's MedSLT medical speech translator. Systems like these would be far more useful if they were available on a hand-held device, rather than, as with the present version, on a laptop. In this paper we describe the Open Source framework we have developed, which makes it possible to run Regulus applications on generally available mobile devices, using a distributed client-server architecture that offers transparent and reliable integration with different types of ASR systems. We describe the architecture, an implemented calendar application prototype hosted on a mobile device, and an evaluation. The evaluation shows that performance on the mobile device is as good as performance on a normal desktop PC.

Keywords: IM2.HMI, Report_VII
[371] I. Bogdanova, A. Bur, and H. Hügli. Visual attention on the sphere. IEEE Transactions on Image Processing, 2008. In press. [ bib ]
Keywords: Report_VII, IM2.VP
[372] S. Ba and J. M. Odobez. Multi-party focus of attention recognition in meetings from head pose and multimodal contextual cues. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Las-Vegas, 2008. [ bib ]
We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants from their head pose and contextual cues. The main contribution of the paper is the use of a head pose posterior distribution as a representation of the head pose information contained in the image data. This posterior encodes the probabilities of the different head poses given the image data, and therefore constitutes a richer representation of the data than the mean or the mode of this distribution, as used in all previous work. These observations are exploited in a joint interaction model of all meeting participants' pose observations, VFOAs, speaking status, and environmental contextual cues. Numerical experiments on a public database of 4 meetings of 22 min on average show that this change of representation allows for a 5.4% gain with respect to the standard approach using head pose as observation.

Keywords: Report_VII, IM2.VP
[373] A. Nijholt, D. Tan, B. Allison, J. del R. Millán, M. Moore, and B. Graimann. Brain-computer interfaces for hci and games. In Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems, Extended Abstracts, Florence, Italy, 2008. [ bib ]
In this workshop we study the research themes and the state-of-the-art of brain-computer interaction. Brain-computer interface research has seen much progress in the medical domain, for example for prosthesis control or as biofeedback therapy for the treatment of neurological disorders. Here, however, we look at brain-computer interaction especially as it applies to research in Human-Computer Interaction (HCI). Through this workshop and continuing discussions, we aim to define research approaches and applications that apply to disabled and able-bodied users across a variety of real-world usage scenarios. Entertainment and game design is one of the application areas that will be considered.

Keywords: IM2.BMI, Report_VII
[374] K. Riedhammer, D. Gillick, B. Favre, and D. Hakkani-Tur. Packing the meeting summarization knapsack. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[375] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2008. [ bib ]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

Keywords: IM2.MPR, Report_VII
[376] M. Sorci, G. Antonini, B. Cerretani, J. Cruz Mota, T. Rubin, M. Bierlaire, and J. Ph. Thiran. Modelling human perception of static facial expressions. In Face and Gesture Recognition 2008, 2008. [ bib | http ]
Data collected through a recent web-based survey show that the perception (i.e. labeling) of a human facial expression by a human observer is a subjective process, which results in a lack of a unique ground truth, as assumed in the standard classification framework. In this paper we propose the use of Discrete Choice Models (DCM) for modelling human perception of static facial expressions. Random utility functions are defined in order to capture the attractiveness, as perceived by the human observer, of an expression class when asked to assign a label to an actual expression image. The utilities represent a natural way for the modeler to formalize her prior knowledge on the process. Starting with a model based on the Facial Action Coding System (FACS), we subsequently define two other models by adding two new sets of explanatory variables. The model parameters are learned through maximum likelihood estimation and a cross-validation procedure is used for validation purposes.

Keywords: Report_VII, IM2.VP, LTS; LTS5; Facial Expressions modeling; discrete choice models
[377] J. Berclaz, F. Fleuret, and P. Fua. Principled detection-by-classification from multiple views. In Proceedings of the International Conference on Computer Vision Theory and Applications, volume 2, pages 375–382, 2008. [ bib ]
Machine-learning based classification techniques have been shown to be effective at detecting objects in complex scenes. However, the final results are often obtained from the alarms produced by the classifiers through a post-processing step which typically relies on ad hoc heuristics. Spatially close alarms are assumed to be triggered by the same target and grouped together. Here we replace those heuristics by a principled Bayesian approach, which uses knowledge about both the classifier response model and the scene geometry to combine multiple classification answers. We demonstrate its effectiveness for multi-view pedestrian detection. We estimate the marginal probabilities of presence of people at any location in a scene, given the responses of classifiers evaluated in each view. Our approach naturally takes into account both the occlusions and the very low metric accuracy of the classifiers due to their invariance to translation and scale. Results show our method produces one order of magnitude fewer false positives than a method that is representative of typical state-of-the-art approaches. Moreover, the framework we propose is generic and could be applied to any detection-by-classification task.

Keywords: IM2.MPR, IM2.VP, Report_VIII
[378] A. Thomas, S. Ganapathy, and H. Hermansky. Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech. In Interspeech 2008, 2008. IDIAP-RR 08-18. [ bib ]
In this paper, we present a spectro-temporal feature extraction technique using sub-band Hilbert envelopes of relatively long segments of speech signal. Hilbert envelopes of the sub-bands are estimated using Frequency Domain Linear Prediction (FDLP). Spectral features are derived by integrating the sub-band Hilbert envelopes in short-term frames and the temporal features are formed by converting the FDLP envelopes into modulation frequency components. These are then combined at the phoneme posterior level and are used as the input features for a phoneme recognition system. In order to improve the robustness of the proposed features to telephone speech, the sub-band temporal envelopes are gain normalized prior to feature extraction. Phoneme recognition experiments on telephone speech in the HTIMIT database show significant performance improvements for the proposed features when compared to other robust feature techniques (average relative reduction of 11% in phoneme error rate).

Keywords: IM2.AP, Report_VII
[379] M. Pronobis and M. Magimai-Doss. Integrating audio and vision for robust automatic gender recognition. Idiap-RR-73-2008, Idiap, 2008. [ bib | .pdf ]
We propose a multi-modal Automatic Gender Recognition (AGR) system based on audio-visual cues and present its thorough evaluation in realistic scenarios. First, we analyze robustness of different audio and visual features under varying conditions and create two uni-modal AGR systems. Then, we build an integrated audio-visual system by fusing information from each modality at the classifier level. Our extensive studies on the BANCA corpus comprising datasets of varying complexity show that: (a) the audio-based system is more robust than the vision-based system; (b) integration of audio-visual cues yields a resilient system and improves performance in noisy conditions.

Keywords: IM2.MPR, Report_VIII
[380] M. Rigamonti. A framework for structuring multimedia archives and for browsing efficiently through multimodal links. PhD thesis, University of Fribourg, Switzerland, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[381] O. Koval, S. Voloshynovskiy, F. Beekhof, and T. Pun. Security analysis of robust perceptual hashing. In E. J. Delp III, P. W. Wong, J. Dittmann, and N. D. Memon, editors, Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, volume 6819 of Proceedings of SPIE, (SPIE, Bellingham, WA 2008) 681906, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[382] M. Liwicki and H. Bunke. Combining on-line and off-line blstm networks for handwritten text line recognition. In Proc. 11th Int. Conf. on Frontiers in Handwriting Recognition, pages 31–36, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[383] B. Dumas, D. Lalanne, and R. Ingold. Demonstration: HephaisTK, a toolkit for prototyping multimodal interfaces. In Proceedings of the 20e Conférence sur l'Interaction Homme-Machine (IHM 08), pages 215–216, Metz, France, 2008. [ bib ]
Keywords: IM2.HMI, Report_VIII
[384] G. Bologna, B. Deville, M. Vinckenbosch, and T. Pun. Pairing colored socks and following a red serpentine with sounds of musical instruments. In ICAD 08, International Conference on Auditory Displays, Paris, France, June 24–27, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[385] C. Wooters and M. Huijbregts. The icsi rt07s speaker diarization system. In Multimodal Technologies for Perception of Humans. Lecture Notes in Computer Science, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[386] J. F. Paiement, S. Bengio, and D. Eck. Probabilistic models for melodic prediction. Idiap-RR-50-2008, IDIAP, 2008. Submitted for publication. [ bib | .ps.gz | .pdf ]
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs).

Keywords: IM2.AP, Report_VIII
[387] J. F. Paiement, Y. Grandvalet, and S. Bengio. Predictive models for music. Idiap-RR-51-2008, IDIAP, 2008. Submitted for publication. [ bib | .ps.gz | .pdf ]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce generative models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, sampling these models given appropriate musical contexts generates realistic melodies.

Keywords: IM2.AP, Report_VIII
[388] J. Berclaz, F. Fleuret, and P. Fua. Multi-camera tracking and atypical motion detection with behavioral maps. In Proceedings of the European Conference on Computer Vision (ECCV), pages 112–125, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[389] T. Varga and H. Bunke. Perturbation models for generating synthetic training data in handwriting recognition. In S. Marinai and H. Fujisawa, editors, Machine Learning in Document Analysis and Recognition, pages 333–360. Springer, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[390] J. Yao and J. M. Odobez. Fast human detection from videos using covariance features. In European Conference on Computer Vision, workshop on Visual Surveillance (ECCV-VS), 2008. [ bib | .pdf ]
In this paper, we present a fast method to detect humans in videos captured in surveillance applications. It is based on a cascade of LogitBoost classifiers relying on features mapped from the Riemannian manifold of region covariance matrices computed from input image features. The method was extended in several ways. First, as the mapping process is slow for high-dimensional feature spaces, we propose to select weak classifiers based on subsets of the complete image feature space. In addition, we propose to combine these sub-matrix covariance features with the means of the image features computed within the same subwindow, which are readily available from the covariance extraction process. Finally, in the context of video acquired with stationary cameras, we propose to fuse image features from the spatial and temporal domains in order to jointly learn the correlation between appearance and foreground information based on background subtraction. Our method was evaluated on a large set of videos from several databases (CAVIAR, PETS, etc.) and can process 5 to 20 frames/sec (for a 384x288 video) while achieving similar or better performance than existing methods.

Keywords: IM2.VP, Report_VIII
[391] D. Vergyri, A. Mandal, W. Wang, A. Stolcke, J. Zheng, M. Graciarena, D. Rybach, C. Gollan, R. Schlüter, K. Kirchhoff, A. Faria, and N. Morgan. Development of the sri/nightingale arabic asr system. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[392] J. Berclaz, F. Fleuret, and P. Fua. Multi-camera tracking and atypical motion detection with behavioral maps. In The 10th European Conference on Computer Vision (ECCV 2008), 2008. [ bib ]
We introduce a novel behavioral model to describe pedestrian motions, which is able to capture sophisticated motion patterns resulting from the mixture of different categories of random trajectories. Due to its simplicity, this model can be learned from video sequences in a totally unsupervised manner through an Expectation-Maximization procedure. When integrated into a complete multi-camera tracking system, it improves the tracking performance in ambiguous situations, compared to a standard ad-hoc isotropic Markovian motion model. Moreover, it can be used to compute a score which characterizes atypical individual motions. Experiments on outdoor video sequences demonstrate both the improvement of tracking performance when compared to a state-of-the-art tracking system and the reliability of the atypical motion detection.

Keywords: IM2.MPR, Report_VII
[393] G. Aradilla. Acoustic models for posterior features in speech recognition. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2008. PhD thesis no. 4164. [ bib ]
In this thesis, we investigate the use of posterior probabilities of sub-word units directly as input features for automatic speech recognition (ASR). These posteriors, estimated from data-driven methods, display some favourable properties such as increased speaker invariance, but unlike conventional speech features also hold some peculiarities, such as the fact that their components are non-negative and sum up to one. State-of-the-art acoustic models for ASR rely on general-purpose similarity measures like Euclidean-based distances or likelihoods computed from Gaussian mixture models (GMMs); hence, they do not explicitly take into account the particular properties of posterior-based speech features. We explore here the use of the Kullback-Leibler (KL) divergence as similarity measure in both non-parametric methods using templates and parametric models that rely on an architecture based on hidden Markov models (HMMs). Traditionally, template matching (TM)-based ASR uses cepstral features and requires a large number of templates to capture the natural variability of spoken language. Thus, TM-based approaches are generally oriented to speaker-dependent and small vocabulary recognition tasks. In our work, we use posterior features to represent the templates and test utterances. Given the discriminative nature of posterior features, we show that a limited number of templates can accurately characterize a word. Experiments on different databases show that using KL divergence as local similarity measure yields significantly better performance than traditional TM-based approaches. The entropy of posterior features can also be used to further improve the results. In the context of HMMs, we propose a novel acoustic model where each state is parameterized by a reference multinomial distribution and the state score is based on the KL divergence between the reference distribution and the posterior features.
Besides the fact that the KL divergence is a natural dissimilarity measure between posterior distributions, we further motivate the use of the KL divergence by showing that the proposed model can be interpreted in terms of maximum likelihood and information theoretic clustering. Furthermore, the KL-based acoustic model can be seen as a general case of other known acoustic models for posterior features such as hybrid HMM/MLP and discrete HMM. The presented approach has been extended to large vocabulary recognition tasks. When compared to state-of-the-art HMM/GMM, the KL-based acoustic model yields comparable results while using significantly fewer parameters.

Keywords: IM2.AP, Report_VII
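The KL-based state scoring described in the thesis above can be illustrated with a minimal sketch. The reference distribution and the posterior frames below are hypothetical values, not from the thesis, and the divergence direction is one plausible convention; the actual model additionally exploits entropy and large-vocabulary decoding:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over the same support;
    # eps guards against zero probabilities in either distribution
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# hypothetical reference multinomial for one HMM state, plus two posterior frames
state_ref = [0.7, 0.2, 0.1]
frame_a = [0.6, 0.3, 0.1]   # close to the reference -> small divergence
frame_b = [0.1, 0.1, 0.8]   # far from the reference -> large divergence

assert kl_divergence(state_ref, frame_a) < kl_divergence(state_ref, frame_b)
```

A frame is thus scored against each state's reference multinomial, and the divergences play the role that Gaussian log-likelihoods play in a conventional HMM/GMM system.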
[394] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland. Social signals, their function, and automatic analysis: a survey. In Proceedings of International Conference on Multimodal Interfaces (to appear), 2008. [ bib ]
Social Signal Processing (SSP) aims at the analysis of social behaviour in both Human-Human and Human-Computer interactions. SSP revolves around automatic sensing and interpretation of social signals, complex aggregates of nonverbal behaviours through which individuals express their attitudes towards other human (and virtual) participants in the current social context. As such, SSP integrates both engineering (speech analysis, computer vision, etc.) and human sciences (social psychology, anthropology, etc.) as it requires multimodal and multidisciplinary approaches. As of today, SSP is still in its early infancy, but the domain is quickly developing, and a growing number of works is appearing in the literature. This paper provides an introduction to nonverbal behaviour involved in social signals and a survey of the main results obtained so far in SSP. It also outlines possibilities and challenges that SSP is expected to face in the next years if it is to reach its full maturity.

Keywords: IM2.MCA, Report_VII
[395] J. F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck. A distance model for rhythms. In 25th International Conference on Machine Learning (ICML), 2008. IDIAP-RR 08-33. [ bib | .ps.gz | .pdf ]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

Keywords: IM2.AP, Report_VIII
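The Hamming distance over a simple binary rhythm representation, as mentioned in the abstract above, can be sketched as follows; the onset patterns are hypothetical examples, not data from the paper:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length rhythm patterns differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# hypothetical binary onset patterns (1 = note onset, 0 = rest), 8 subdivisions
rhythm_a = [1, 0, 1, 0, 1, 0, 1, 0]
rhythm_b = [1, 0, 0, 1, 1, 0, 1, 0]

print(hamming_distance(rhythm_a, rhythm_b))  # -> 2 (patterns differ at positions 2 and 3)
```

The model in [395] then works with the distribution of such distances between subsequences, rather than with the raw patterns themselves.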
[396] D. Grangier. Machine Learning for Information Retrieval. PhD thesis, École Polytechnique Fédérale de Lausanne, 2008. PhD thesis no. 4088 (2008), École Polytechnique Fédérale de Lausanne (EPFL), Faculté des sciences et techniques de l'ingénieur (STI), Section de génie électrique et électronique, Institut de génie électrique et électronique (IEL), Laboratoire de l'IDIAP (LIDIAP). Supervisors: Hervé Bourlard, Sami Bengio. [ bib | .pdf ]
In this thesis, we explore the use of machine learning techniques for information retrieval. More specifically, we focus on ad-hoc retrieval, which is concerned with searching large corpora to identify the documents relevant to user queries. This identification is performed through a ranking task. Given a user query, an ad-hoc retrieval system ranks the corpus documents, so that the documents relevant to the query ideally appear above the others. In a machine learning framework, we are interested in proposing learning algorithms that can benefit from limited training data in order to identify a ranker likely to achieve high retrieval performance over unseen documents and queries. This problem presents novel challenges compared to traditional learning tasks, such as regression or classification. First, our task is a ranking problem, which means that the loss for a given query cannot be measured as a sum of an individual loss suffered for each corpus document. Second, most retrieval queries present a highly unbalanced setup, with a set of relevant documents accounting only for a very small fraction of the corpus. Third, ad-hoc retrieval corresponds to a kind of "double" generalization problem, since the learned model should not only generalize to new documents but also to new queries. Finally, our task also presents challenging efficiency constraints, since ad-hoc retrieval is typically applied to large corpora. The main objective of this thesis is to investigate the discriminative learning of ad-hoc retrieval models. For that purpose, we propose different models based on kernel machines or neural networks adapted to different retrieval contexts. The proposed approaches rely on different online learning algorithms that allow efficient learning over large corpora. The first part of the thesis focuses on text retrieval. 
In this case, we adopt a classical approach to the retrieval ranking problem, and order the text documents according to their estimated similarity to the text query. The assessment of semantic similarity between text items plays a key role in that setup and we propose a learning approach to identify an effective measure of text similarity. This identification is not performed relying on a set of queries with their corresponding relevant document sets, since such data are especially expensive to label and hence rare. Instead, we propose to rely on hyperlink data, since hyperlinks convey semantic proximity information that is relevant to similarity learning. This setup is hence a transfer learning setup, where we benefit from the proximity information encoded by hyperlinks to improve the performance over the ad-hoc retrieval task. We then investigate another retrieval problem, i.e. the retrieval of images from text queries. Our approach introduces a learning procedure optimizing a criterion related to the ranking performance. This criterion adapts our previous learning objective for learning textual similarity to the image retrieval problem. This yields an image ranking model that addresses the retrieval problem directly. This approach contrasts with previous research that relies on an intermediate image annotation task. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. In the last part of the thesis, we show that the objective function used in the previous retrieval problems can be applied to the task of keyword spotting, i.e. the detection of given keywords in speech utterances. For that purpose, we formalize this problem as a ranking task: given a keyword, the keyword spotter should order the utterances so that the utterances containing the keyword appear above the others. 
Interestingly, this formulation yields an objective directly maximizing the area under the receiver operating curve, the most common keyword spotter evaluation measure. This objective is then used to train a model adapted to this intrinsically sequential problem. This model is then learned with a procedure derived from the algorithm previously introduced for the image retrieval task. To conclude, this thesis introduces machine learning approaches for ad-hoc retrieval. We propose learning models for various multi-modal retrieval setups, i.e. the retrieval of text documents from text queries, the retrieval of images from text queries and the retrieval of speech recordings from written keywords. Our approaches rely on discriminative learning and enjoy efficient training procedures, which yields effective and scalable models. In all cases, links with prior approaches were investigated and experimental comparisons were conducted.

Keywords: discriminative learning, image retrieval, Information Retrieval, learning to rank, machine learning, online learning, spoken keyword spotting, text retrieval, IM2.AP,Report_VIII
[397] D. Jayagopi, H. Hung, C. Yeo, and D. Gatica-Perez. Predicting the dominant clique in meetings through fusion of nonverbal cues. In ACM MM 2008, 2008. IDIAP-RR 08-08. [ bib ]
This paper addresses the problem of automatically predicting the dominant clique (i.e., the set of K-dominant people) in face-to-face small group meetings recorded by multiple audio and video sensors. For this goal, we present a framework that integrates automatically extracted nonverbal cues and dominance prediction models. Easily computable audio and visual activity cues are automatically extracted from cameras and microphones. Such nonverbal cues, correlated to human display and perception of dominance, are well documented in the social psychology literature. The effectiveness of the cues was systematically investigated as single cues as well as in unimodal and multimodal combinations using unsupervised and supervised learning approaches for dominant clique estimation. Our framework was evaluated on a five-hour public corpus of teamwork meetings with third-party manual annotation of perceived dominance. Our best approaches can exactly predict the dominant clique with 80.8% accuracy in four-person meetings in which multiple human annotators agree on their judgments of perceived dominance.

Keywords: IM2.MCA, Report_VII
[398] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[399] J. Kludas, S. Marchand-Maillet, and E. Bruno. Exploiting document feature interactions for efficient information fusion in high dimensional spaces. In Proceedings of the First International Workshops on Image Processing Theory, Tools and Applications (IPTA'2008), Sousse, Tunisia, 2008. (invited). [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[400] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008. [ bib | .pdf ]
Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems.

Keywords: IM2.MPR, Report_VIII
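The linear kernel combination at the heart of the MKL formulation above can be sketched in a few lines. The base kernels and the fixed weights here are hypothetical stand-ins: SimpleMKL would learn the weights (non-negative, summing to one) jointly with the SVM rather than take them as given:

```python
import math

def linear_kernel(x, y):
    # dot product between two feature vectors
    return sum(xi * yi for xi, yi in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian RBF kernel exp(-gamma * ||x - y||^2)
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq)

def combined_kernel(x, y, weights):
    # k(x, y) = sum_m d_m * k_m(x, y); in MKL the weights d_m satisfy
    # d_m >= 0 and sum(d_m) == 1, and are learned from data
    d_lin, d_rbf = weights
    return d_lin * linear_kernel(x, y) + d_rbf * rbf_kernel(x, y)

weights = [0.3, 0.7]          # hypothetical fixed weights for illustration
x, y = [1.0, 0.0], [0.5, 0.5]
k = combined_kernel(x, y, weights)
```

The combined kernel is then plugged into a standard SVM solver; only the outer loop that updates the weights is specific to MKL.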
[401] M. Soleymani, G. Chanel, J. Kierkels, and T. Pun. Affective ranking of movie scenes using physiological signals and content analysis. In 2nd ACM Workshop on the Many Faces of Multimedia Semantics, ACM MM08, Vancouver, Canada, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[402] A. Schlapbach. Writer identification and verification, volume 311. IOS Press, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[403] S. Ganapathy, S. Thomas, and H. Hermansky. Modulation frequency features for phoneme recognition in noisy speech. Journal of the Acoustical Society of America - Express Letters, 2008. [ bib | .pdf ]
In this letter, a new feature extraction technique based on modulation spectrum derived from syllable-length segments of sub-band temporal envelopes is proposed. These sub-band envelopes are derived from auto-regressive modelling of Hilbert envelopes of the signal in critical bands, processed by both a static (logarithmic) and a dynamic (adaptive loops) compression. These features are then used for machine recognition of phonemes in telephone speech. Without degrading the performance in clean conditions, the proposed features show significant improvements compared to other state-of-the-art speech analysis techniques. In addition to the overall phoneme recognition rates, the performance with broad phonetic classes is reported.

Keywords: IM2.AP, Report_VIII
[404] M. Soleymani, G. Chanel, J. Kierkels, and T. Pun. Affective characterization of movie scenes based on multimedia content analysis and user's physiological emotional responses. In IEEE International Symposium on Multimedia, Berkeley, US, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[405] H. Hung, Y. Huang, C. Yeo, and D. Gatica-Perez. Associating audio-visual activity cues in a dominance estimation framework. In CVPR Workshop on Human Communicative Behavior, Anchorage, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[406] K. Kryszczuk and A. Drygajlo. On quality of quality measures for classification. In Biometrics and Identity Management, Lecture Notes in Computer Science 5372, pages 19–28, Heidelberg, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[407] K. Kryszczuk and A. Drygajlo. What do quality measures predict in biometrics. In 16th European Signal Processing Conference, Lausanne, Switzerland, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[408] M. Soleymani, G. Chanel, J. Kierkels, and T. Pun. Valence-arousal representation of movie scenes based on multimedia content analysis and user's physiological emotional responses. In 5th Joint Workshop on Machine Learning and Multimodal Interaction, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[409] D. Gillick, D. Hakkani-Tur, and M. Levit. Unsupervised learning of edit parameters for matching name variants. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[410] A. Pronobis, O. Martinez Mozos, and B. Caputo. Svm-based discriminative accumulation scheme for place recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA08), 2008. [ bib | .pdf ]
Integrating information coming from different sensors is a fundamental capability for autonomous robots. For complex tasks like topological localization, it would be desirable to use multiple cues, possibly from different modalities, so as to achieve robust performance. This paper proposes a new method for integrating multiple cues. For each cue we train a large margin classifier which outputs a set of scores indicating the confidence of the decision. These scores are then used as input to a Support Vector Machine, that learns how to weight each cue, for each class, optimally during training. We call this algorithm SVM-based Discriminative Accumulation Scheme (SVM-DAS). We applied our method to the topological localization task, using vision and laser-based cues. Experimental results clearly show the value of our approach.

Keywords: IM2.MPR, Report_VIII
[411] A. Torii, M. Havlena, T. Pajdla, and B. Leibe. Measuring camera translation by the dominant apical angle. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[412] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[413] N. Cornelis, B. Leibe, K. Cornelis, and L. van Gool. 3d urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision, 78(2-3):121–141, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[414] T. Weise, B. Leibe, and L. van Gool. Accurate and robust registration for in-hand modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[415] K. Schindler, L. van Gool, and B. de Gelder. Recognizing emotions expressed by body pose: a biologically inspired neural model. Neural Networks, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[416] J. Kludas, E. Bruno, and S. Marchand-Maillet. Exploiting synergistic and redundant features for multimedia document classification. In 32nd Annual Conference of the German Classification Society - Advances in Data Analysis, Data Handling and Business Intelligence (GfKl 2008), Hamburg, Germany, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[417] R. A. Negoescu and D. Gatica-Perez. Analyzing flickr groups. In Proceedings of the 2008 international conference on Content-based image and video retrieval (CIVR '08), number Idiap-RR-03-2008, 2008. To appear in Proceedings of CIVR'08. [ bib ]
There is an explosion of community-generated multimedia content available online. In particular, Flickr constitutes a 200-million photo sharing system where users participate following a variety of social motivations and themes. Flickr groups are increasingly used to facilitate the explicit definition of communities sharing common interests, which translates into large amounts of content (e.g. pictures and associated tags) about specific subjects. However, to our knowledge, an in-depth analysis of user behavior in Flickr groups remains open, as does the existence of effective tools to find relevant groups. Using a sample of about 7 million user-photos and about 51000 Flickr groups, we present a novel statistical group analysis that highlights relevant patterns of photo-to-group sharing practices. Furthermore, we propose a novel topic-based representation model for groups, computed from aggregated group tags. Groups are represented as multinomial distributions over semantically meaningful latent topics learned via unsupervised probabilistic topic modeling. We show this representation to be useful for automatically discovering groups of groups and topic expert-groups, for designing new group-search strategies, and for obtaining new insights into the semantic structure of Flickr groups.

Keywords: IM2.VP, Report_VII
[418] G. Aradilla, H. Bourlard, and M. Magimai-Doss. Using kl-based acoustic models in a large vocabulary recognition task. Idiap-RR Idiap-RR-14-2008, IDIAP, 2008. [ bib ]
Posterior probabilities of sub-word units have been shown to be an effective front-end for ASR. However, attempts to model this type of features either do not benefit from modeling context-dependent phonemes, or use an inefficient distribution to estimate the state likelihood. This paper presents a novel acoustic model for posterior features that overcomes these limitations. The proposed model can be seen as a HMM where the score associated with each state is the KL divergence between a distribution characterizing the state and the posterior features from the test utterance. This KL-based acoustic model establishes a framework where other models for posterior features such as hybrid HMM/MLP and discrete HMM can be seen as particular cases. Experiments on the WSJ database show that the KL-based acoustic model can significantly outperform these latter approaches. Moreover, the proposed model can obtain comparable results to complex systems, such as HMM/GMM, using significantly fewer parameters.

Keywords: IM2.AP, Report_VII
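The state scoring described above can be sketched in a few lines: each HMM state is characterized by a multinomial over sub-word units, and a frame of posterior features is scored by its KL divergence from each state distribution (lower divergence means a better match). The phoneme labels and probabilities below are illustrative, not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Each state is characterized by a multinomial over sub-word (phoneme)
# posteriors; a frame of posterior features is scored against each state.
states = {
    "/a/": [0.8, 0.1, 0.1],
    "/b/": [0.1, 0.8, 0.1],
}

frame = [0.7, 0.2, 0.1]  # posterior features for one test frame

# Lower divergence = better match; pick the closest state.
best = min(states, key=lambda s: kl_divergence(states[s], frame))
print(best)
```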
[419] S. Voloshynovskiy, O. Koval, R. Villán, F. Beekhof, and T. Pun. Authentication of biometric identification documents via mobile devices. Journal of Electronic Imaging, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[420] G. Garipelli, R. Chavarriaga, and J. del R. Millan. Recognition of anticipatory behavior from human eeg. In 4th Intl. Brain-Computer Interface Workshop and Training Course. Graz University, Austria, 2008. IDIAP-RR 08-52. [ bib ]
Anticipation increases the efficiency of a daily task by partial advance activation of neural substrates involved in it. Single trial recognition of this activation can be exploited for a novel anticipation based Brain Computer Interface (BCI). In the current work we compare different methods for the recognition of Electroencephalogram (EEG) correlates of this activation on single trials as a first step towards building such a BCI. To do so, we recorded EEG from 9 subjects performing a classical Contingent Negative Variation (CNV) paradigm (usually reported for studying anticipatory behavior in neurophysiological experiments) with GO and NOGO conditions. We first compare classification accuracies with features such as Least Square fitting Line (LSFL) parameters and Least Square Fitting Polynomial (LSFP) coefficients using a Quadratic Discriminant Analysis (QDA) classifier. We then test the best features with complex classifiers such as Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs).

Keywords: IM2.BMI, Report_VII
[421] I. Bogdanova, A. Bur, and H. Hügli. The spherical approach to omnidirectional visual attention. In XVI European Signal Processing Conference (EUSIPCO 2008), Proc. EUSIPCO, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[422] P. Prodanov, A. Drygajlo, J. Richiardi, and A. Alexander. Low-level grounding in a multimodal mobile service robot conversational system using graphical models. Intelligent Service Robotics, 1:3–26, 2008. [ bib | DOI ]
Keywords: report_VII, IM2.MPR
[423] J. Mariéthoz, S. Bengio, and Y. Grandvalet. Kernel based text-independent speaker verification. Idiap-RR Idiap-RR-68-2008, Idiap, 2008. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[424] D. Weinshall, H. Hermansky, A. Zweig, J. Luo, H. Jimison, F. Ohl, and M. Pavel. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. In Advances in Neural Information Processing Systems 21, 2008. [ bib | .pdf ]
Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on 'incongruent events' - when 'general level' and 'specific level' classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.

Keywords: IM2.MPR, Report_VIII
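The core incongruence test described above reduces to comparing probabilities across adjacent levels of the label hierarchy: an event is flagged when some general-level probability greatly exceeds the corresponding specific-level one. A minimal sketch, with an arbitrary ratio threshold and synthetic probabilities:

```python
def is_incongruent(p_general, p_specific, ratio=10.0):
    """Flag an event whose specific-level probability is much smaller
    than its general-level probability (conflicting predictions)."""
    return p_general > ratio * p_specific

# E.g. a 'face' model (general level) fires strongly while every known
# individual's face model (specific level) assigns low probability:
# the event is incongruent -- likely a new, unknown face.
print(is_incongruent(p_general=0.9, p_specific=0.01))
print(is_incongruent(p_general=0.9, p_specific=0.6))
```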
[425] S. Ganapathy, P. Motlicek, and H. Hermansky. Modified discrete cosine transform for encoding residual signals in frequency domain linear prediction. Idiap-RR Idiap-RR-74-2008, Idiap, 2008. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[426] J. P. Pinto, I. Szoke, S. R. Mahadeva Prasanna, and H. Hermansky. Fast approximate spoken term detection from sequence of phonemes. In 31st Annual International ACM SIGIR Conference, 20-24 July 2008, pages 28–33, 2008. IDIAP-RR 08-45. [ bib ]
We investigate the detection of spoken terms in conversational speech using phoneme recognition, with the objective of achieving a smaller index size as well as faster search speed. Speech is processed and indexed as the one-best phoneme sequence. We propose the use of a probabilistic pronunciation model for the search term to compensate for errors in the recognition of phonemes. This model is derived from the pronunciation of the word and the phoneme confusion matrix. Experiments are performed on the conversational telephone speech database distributed by NIST for the 2006 spoken term detection evaluation. We achieve an about 1500 times smaller index size and 14 times faster search speed compared to the state-of-the-art system using phoneme lattices, at the cost of relatively lower detection performance.

Keywords: IM2.AP, Report_VII
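The pronunciation-model idea above can be sketched as scoring a recognized phoneme string against a term's pronunciation using per-phoneme confusion probabilities. This toy version ignores insertions and deletions and uses a made-up confusion matrix:

```python
def match_score(pronunciation, recognized, confusion):
    """Probability that `recognized` is a (possibly misrecognized)
    rendering of `pronunciation`, using per-phoneme confusion
    probabilities P(heard | spoken). Insertions and deletions are
    ignored in this sketch."""
    if len(pronunciation) != len(recognized):
        return 0.0
    score = 1.0
    for spoken, heard in zip(pronunciation, recognized):
        score *= confusion.get(spoken, {}).get(heard, 0.0)
    return score

# Toy confusion matrix: /t/ is sometimes recognized as /d/, /e/ as /i/.
confusion = {
    "t": {"t": 0.8, "d": 0.2},
    "e": {"e": 0.9, "i": 0.1},
    "s": {"s": 1.0},
}

exact = match_score(["t", "e", "s", "t"], ["t", "e", "s", "t"], confusion)
fuzzy = match_score(["t", "e", "s", "t"], ["d", "e", "s", "t"], confusion)
print(round(exact, 3), round(fuzzy, 3))
```

The exact match scores higher, but the misrecognized sequence still gets non-zero probability, which is what lets the search recover terms despite phoneme recognition errors.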
[427] R. Sala Llonch, E. Kokiopoulou, I. Tosic, and P. Frossard. 3d face recognition using sparse spherical representations. IEEE Int. Conf. Pattern Recognition (ICPR), 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint publication
[428] S. Stoyanchev, G. Tur, and D. Hakkani-Tur. Name-aware speech recognition for interactive question answering. IEEE ICASSP, Las Vegas, NV, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[429] B. Deville, G. Bologna, M. Vinckenbosch, and T. Pun. Guiding the focus of attention of blind people with visual saliency. In Workshop on Computer Vision Applications for the Visually Impaired (CVAVI 08), 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[430] T. Tommasi, F. Orabona, and B. Caputo. Clef2008 image annotation task: an svm confidence-based approach. Idiap-RR Idiap-RR-77-2008, Idiap, 2008. CLEF 2008 Working Notes. [ bib | .pdf ]
This paper presents the algorithms and results of our participation in the medical image annotation task of ImageCLEFmed 2008. Our previous experience in the same task in 2007 suggests that combining multiple cues with different SVM-based approaches is very effective in this domain. Moreover, it points out that local features are the most discriminative cues for the problem at hand. On this basis we decided to integrate two different local structural and textural descriptors. Cues are combined through simple concatenation of the feature vectors and through the Multi-Cue Kernel. The trickiest part of the challenge this year was annotating images coming mainly from classes with only a few examples in the training set. We tackled the problem on two fronts: (1) we introduced a further integration strategy using SVM as an opinion maker. It consists in combining the first two opinions on the basis of a technique to evaluate the confidence of the classifier's decisions. This approach produces class labels with "don't know" wildcards opportunely placed; (2) we enriched the poorly populated training classes by adding virtual examples generated by slightly modifying the original images. We submitted several runs considering different combinations of the proposed techniques. Our team was called "idiap". The run jointly using the low cue-integration technique, the confidence-based opinion fusion and the virtual examples scored 74.92, ranking first among all submissions.

Keywords: IM2.VP, Report_VIII
[431] A. Popescu-Belis, E. Boertjes, J. Kilgour, P. Poller, S. Castronovo, T. Wilson, A. Jaimes, and J. Carletta. The amida automatic content linking device: Just-in-time document retrieval in meetings. In A. Popescu-Belis and R. Stiefelhagen, editors, Machine Learning for Multimodal Interaction V, volume 5237 of LNCS, pages 272–283. Springer-Verlag, 2008. [ bib | DOI | .pdf ]
The AMIDA Automatic Content Linking Device (ACLD) is a just-in-time document retrieval system for meeting environments. The ACLD listens to a meeting and displays information about the documents from the group's history that are most relevant to what is being said. Participants can view an outline or the entire content of the documents, if they feel that these documents are potentially useful at that moment of the meeting. The ACLD proof-of-concept prototype places meeting-related documents and segments of previously recorded meetings in a repository and indexes them. During a meeting, the ACLD continually retrieves the documents that are most relevant to keywords found automatically using the current meeting speech. The current prototype simulates the real-time speech recognition that will be available in the near future. The software components required to achieve these functions communicate using the Hub, a client/server architecture for annotation exchange and storage in real-time. Results and feedback for the first ACLD prototype are outlined, together with plans for its future development within the AMIDA EU integrated project. Potential users of the ACLD supported the overall concept, and provided feedback to improve the user interface and to access documents beyond the group's own history.

Keywords: IM2.MCA, IM2.HMI, Report_VIII
[432] E. Kokiopoulou, P. Frossard, and D. Gkorou. Optimal polynomial filtering for accelerating distributed consensus. IEEE Int. Symp. on Information Theory (ISIT), 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint
[433] H. Hung, Y. Huang, G. Friedland, and D. Gatica-Perez. Estimating the dominant person in multi-party conversations using speaker diarization strategies. In ICASSP 08, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[434] A. Popescu-Belis, H. Bourlard, and S. Renals. Machine learning for multimodal interaction iv (revised selected papers from mlmi 2007, brno, 28-30 june 2007). LNCS 4892. Springer-Verlag, Berlin/Heidelberg, 2008. [ bib ]
Keywords: Report_VII, IM2.DMA
[435] A. Popescu-Belis and R. Stiefelhagen. Machine learning for multimodal interaction v (proceedings of mlmi 2008, utrecht, 8-10 september 2008). LNCS 5237. Springer-Verlag, Berlin/Heidelberg, 2008. [ bib ]
Keywords: Report_VII, IM2.DMA
[436] D. Gatica-Perez and K. Farrahi. What did you do today? discovering daily routines from large-scale mobile data. In ACM International Conference on Multimedia (ACMMM), 2008. IDIAP-RR 08-49. [ bib ]
We present a framework built from two Hierarchical Bayesian topic models to discover human location-driven routines from mobile phones. The framework uses location-driven bag representations of people's daily activities obtained from cell tower connections. Using 68 000 hours of real-life human data from the Reality Mining dataset, we successfully discover various types of routines. The first studied model, Latent Dirichlet Allocation (LDA), automatically discovers characteristic routines for all individuals in the study, including “going to work at 10am", “leaving work at night", or “staying home for the entire evening". In contrast, the second methodology, based on the Author Topic model (ATM), finds routines characteristic of selected groups of users, such as “being at home in the mornings and evenings while being out in the afternoon", and ranks users by their probability of conforming to certain daily routines.

Keywords: IM2.MCA, Report_VII
[437] F. Valente and H. Hermansky. Hierarchical and parallel processing of modulation spectrum for asr applications. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 4165–4168, 2008. IDIAP-RR 07-45. [ bib | DOI ]
The modulation spectrum is an efficient representation for describing dynamic information in signals. In this work we investigate how to exploit different elements of the modulation spectrum for extraction of information in automatic recognition of speech (ASR). Parallel and hierarchical (sequential) approaches are investigated. Parallel processing combines outputs of independent classifiers applied to different modulation frequency channels. Hierarchical processing uses different modulation frequency channels sequentially. Experiments are run on a LVCSR task for meetings transcription and results are reported on the RT05 evaluation data. Processing modulation frequencies channels with different classifiers provides a consistent reduction in WER (2% absolute w.r.t. PLP baseline). Hierarchical processing outperforms parallel processing. The largest WER reduction is obtained trough sequential processing moving from high to low modulation frequencies. This model is consistent with several perceptual and physiological studies on auditory processing.

Keywords: IM2.AP, Report_VII
[438] B. Dumas, D. Lalanne, D. Guinard, R. Koenig, and R. Ingold. Strengths and weaknesses of software architectures for the rapid creation of tangible and multimodal interfaces. pages 47–54, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[439] H. Bourlard and S. Renals. Recognition and understanding of meetings overview of the european ami and amida projects. In LangTech 2008, 2008. IDIAP-RR 08-27. [ bib ]
The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty (face-to-face and remote) meetings. Within these projects we have developed the following: (1) an infrastructure for recording meetings using multiple microphones and cameras; (2) a one hundred hour, manually annotated meeting corpus; (3) a number of techniques for indexing and summarizing meeting videos using automatic speech recognition and computer vision; and (4) an extensible framework for browsing and searching meeting videos. We give an overview of the various techniques developed in AMI (mainly involving face-to-face meetings), their integration into our meeting browser framework, and future plans for AMIDA (Augmented Multiparty Interaction with Distant Access), the follow-up project to AMI. Technical and business information related to these two projects can be found at www.amiproject.org, on the Scientific and Business portals respectively.

Keywords: IM2.AP, Report_VII
[440] B. Dumas, D. Lalanne, and R. Ingold. Prototyping multimodal interfaces with smuiml modeling language. pages 63–66, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[441] J. Kludas, E. Bruno, and S. Marchand-Maillet. Can feature information interaction help for information fusion in multimedia problems? To appear in Multimedia Tools and Applications Journal special issue on "Metadata Mining for Image Understanding", 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[442] N. Tsourakis, A. Lisowska, P. Bouillon, and M. Rayner. From desktop to mobile: adapting a successful voice interaction platform for use in mobile devices. In Third ACM MobileHCI Workshop on Speech in Mobile and Pervasive Environments (SiMPE), September 2-5, 2008. [ bib ]
Keywords: IM2.HMI, Report_VII
[443] A. Popescu-Belis, P. Baudrion, M. Flynn, and P. Wellner. Towards an objective test for meeting browsers: the bet4tqb pilot experiment. In A. Popescu-Belis, H. Bourlard, and S. Renals, editors, Machine Learning for Multimodal Interaction IV, LNCS 4892, pages 108–119. Springer-Verlag, Berlin/Heidelberg, 2008. [ bib | DOI ]
Keywords: Report_VII, IM2.HMI
[444] A. Schlapbach, F. Wettstein, and H. Bunke. Automatic estimation of the readability of handwritten text. In Proc. 16th European Signal Processing Conference, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[445] A. Popescu-Belis, E. Boertjes, J. Kilgour, P. Poller, S. Castronovo, T. Wilson, A. Jaimes, and J. Carletta. The amida automatic content linking device: just-in-time document retrieval in meetings. In A. Popescu-Belis and R. Stiefelhagen, editors, Machine Learning for Multimodal Interaction V (Proceedings of MLMI 2008, Utrecht, 8-10 September 2008), LNCS 5237, pages 273–284. Springer-Verlag, Berlin/Heidelberg, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[446] L. Dollé, M. Khamassi, B. Girard, A. Guillot, and R. Chavarriaga. Analyzing interactions between navigation strategies using a computational model of action selection. In Spatial Cognition 2008 (SC '08), Lecture Notes in Computer Science, pages 71–86, 2008. IDIAP-RR 08-48. [ bib ]
For animals as well as for humans, the hypothesis of multiple memory systems involved in different navigation strategies is supported by several biological experiments. However, due to technical limitations, it remains difficult for experimentalists to elucidate how these neural systems interact. We present how a computational model of selection between navigation strategies can be used to analyse phenomena that cannot be directly observed in biological experiments. We reproduce an experiment where the rat's behaviour is assumed to be ruled by two different navigation strategies (a cue-guided and a map-based one). Using a modelling approach, we can explain the experimental results in terms of interactions between these systems, either competing or cooperating at specific moments of the experiment. Modelling such systems can help biological investigations to explain and predict the animal behaviour.

Keywords: IM2.MCA, Report_VII
[447] G. Gonzalez, F. Fleuret, and P. Fua. Automated delineation of dendritic networks in noisy image stacks. In The 10th European Conference on Computer Vision, 2008. [ bib ]
Keywords: IM2.VP, Report_VII
[448] K. Kumatani, J. McDonough, S. Schacht, D. Klakow, P. N. Garner, and W. Li. Filter bank design for subband adaptive beamforming and application to speech recognition. Idiap-RR Idiap-RR-02-2008, IDIAP, 2008. [ bib | .ps.gz | .pdf ]
We present a new filter bank design method for subband adaptive beamforming. Filter bank design for adaptive filtering poses many problems not encountered in more traditional applications such as subband coding of speech or music. The popular class of perfect reconstruction filter banks is not well-suited for applications involving adaptive filtering because perfect reconstruction is achieved through alias cancellation, which functions correctly only if the outputs of individual subbands are not subject to arbitrary magnitude scaling and phase shifts. In this work, we design analysis and synthesis prototypes for modulated filter banks so as to minimize each aliasing term individually. We then show that the total response error can be driven to zero by constraining the analysis and synthesis prototypes to be Nyquist(M) filters. We show that the proposed filter banks are more robust to aliasing caused by adaptive beamforming than conventional methods. Furthermore, we demonstrate the effectiveness of our design technique through a set of automatic speech recognition experiments on the multi-channel, far-field speech data from the PASCAL Speech Separation Challenge. In our system, speech signals are first transformed into the subband domain with the proposed filter banks, and thereafter the subband components are processed with a beamforming algorithm. Following beamforming, post-filtering and binary masking are performed to further enhance the speech by removing residual noise and undesired speech. The experimental results show that our beamforming system with the proposed filter banks achieves the best recognition performance, a 39.6% word error rate (WER), with half the computation of the conventional filter banks, while the perfect reconstruction filter banks yielded a 44.4% WER.

Keywords: IM2.AP, Report_VIII
[449] J. Anemuller, J. H. Back, B. Caputo, M. Havlena, J. Luo, H. Kayser, B. Leibe, P. Motlicek, T. Pajdla, M. Pavel, A. Torii, L. van Gool, A. Zweig, and H. Hermansky. The dirac awear audio-visual platform for detection of unexpected and incongruent events. In Proceedings of the International Conference on Multimodal Interfaces, 2008. [ bib | .pdf ]
It is of prime importance in everyday human life to cope with and respond appropriately to events that are not foreseen by prior experience. Machines to a large extent lack the ability to respond appropriately to such inputs. An important class of unexpected events is defined by incongruent combinations of inputs from different modalities, and therefore multimodal information provides a crucial cue for the identification of such events, e.g., the sound of a voice is being heard while the person in the field of view does not move her lips. In the project DIRAC ("Detection and Identification of Rare Audio-visual Cues") we have been developing algorithmic approaches to the detection of such events, as well as an experimental hardware platform to test them. An audio-visual platform ("AWEAR" - audio-visual wearable device) has been constructed with the goal of helping users with disabilities or a high cognitive load to deal with unexpected events. Key hardware components include stereo panoramic vision sensors and 6-channel worn-behind-the-ear (hearing aid) microphone arrays. Data have been recorded to study audio-visual tracking, a/v scene/object classification and a/v detection of incongruencies.

Keywords: IM2.DMA, Report_VIII
[450] K. Kumatani, J. McDonough, D. Klakow, P. N. Garner, and W. Li. Maximum negentropy beamforming. Idiap-RR Idiap-RR-07-2008, IDIAP, 2008. [ bib | .ps.gz | .pdf ]
Keywords: Report_VII, IM2.AP
[451] G. Friedland and O. Vinyals. Live speaker identification in conversations. In ACM Multimedia 2008, Vancouver, Canada, pages 1017–1018, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[452] H. Hung and G. Friedland. Towards audio-visual on-line diarization of participants in group meetings. In European Conference on Computer Vision (ECCV) 2008, Marseille, France, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[453] O. Vinyals and G. Friedland. A hardware-independent fast logarithm approximation with adjustable accuracy. In 10th IEEE International Symposium on Multimedia, Berkeley, CA, USA, pages 61–65, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[454] O. Vinyals and G. Friedland. Modulation spectrogram features for speaker diarization. In Interspeech 2008, Brisbane, Australia, pages 630–633, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[455] K. Boakye, O. Vinyals, and G. Friedland. Two's a crowd: improving speaker diarization by automatically identifying and excluding overlapped speech. In Interspeech 2008, Brisbane, Australia, pages 32–35, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[456] F. Fleuret and D. Geman. Stationary features and cat detection. Journal of Machine Learning Research (JMLR), 9:2549–2578, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[457] A. Thomas, S. Ganapathy, and H. Hermansky. Hilbert envelope based features for far-field speech recognition. In MLMI 2008. Utrecht, The Netherlands, 2008. IDIAP-RR 08-42. [ bib ]
Automatic speech recognition (ASR) systems, trained on speech signals from close-talking microphones, generally fail in recognizing far-field speech. In this paper, we present a Hilbert Envelope based feature extraction technique to alleviate the artifacts introduced by room reverberations. The proposed technique is based on modeling temporal envelopes of the speech signal in narrow sub-bands using Frequency Domain Linear Prediction (FDLP). ASR experiments on far-field speech using the proposed FDLP features show significant performance improvements when compared to other robust feature extraction techniques (average relative improvement of 43 % in word error rate).

Keywords: IM2.AP, Report_VII
[458] M. Soleymani, J. Kierkels, G. Chanel, E. Bruno, S. Marchand-Maillet, and T. Pun. Estimating emotions and tracking interest during movie watching based on multimedia content and physiological responses. In Joint (IM)2-Interactive Multimodal Information Management and Affective Sciences NCCRs meeting, Riederalp, Switzerland, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[459] P. W. Ferrez and J. del R. Millán. Simultaneous real-time detection of motor imagery and error-related potentials for improved bci accuracy. In Proceedings of the 4th International Brain-Computer Interface Workshop and Training Course, 2008. [ bib ]
Brain-computer interfaces (BCIs), as any other interaction modality based on physiological signals and body channels (e.g., muscular activity, speech and gestures), are prone to errors in the recognition of the subject's intent. An elegant approach to improve the accuracy of BCIs consists of a verification procedure directly based on the presence of error-related potentials (ErrP) in the EEG recorded right after the occurrence of an error. Two healthy volunteer subjects with little prior BCI experience participated in a real-time human-robot interaction experiment where they were asked to mentally move a cursor towards a target that can be reached within a few steps using motor imagery. These experiments confirm the previously reported presence of a new kind of ErrP. These Interaction ErrP exhibit a first sharp negative peak followed by a positive peak and a second broader negative peak (about 270, 330 and 430 ms after the feedback, respectively). The objective of the present study was to simultaneously detect erroneous responses of the interface and classify motor imagery at the level of single trials in a real-time system. We have achieved online an average recognition rate of correct and erroneous single trials of 84.7% and 78.8%, respectively. The off-line post-analysis showed that the BCI error rate without the integration of ErrP detection is around 30% for both subjects. However, when integrating ErrP detection, the average online error rate drops to 7%, multiplying the bit rate by more than 3. These results show that it is possible to simultaneously extract in real time useful information for mental control to operate a brain-actuated device as well as correlates of cognitive states such as error-related potentials to improve the quality of the brain-computer interaction.

Keywords: IM2.BCI, Report_VII
[460] B. Leibe, A. Ettlin, and B. Schiele. Learning semantic object parts for object categorization. Image and Vision Computing, 26(1):15–26, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[461] D. Vijayasenan, F. Valente, and H. Bourlard. Integration of tdoa features in information bottleneck framework for fast speaker diarization. In Interspeech 2008, 2008. IDIAP-RR 08-26. [ bib | .ps.gz | .pdf ]
In this paper we address the combination of multiple feature streams in a fast speaker diarization system for meeting recordings. Whenever Multiple Distant Microphones (MDM) are used, it is possible to estimate the Time Delay of Arrival (TDOA) for different channels.

Keywords: IM2.AP, Report_VIII
[462] A. Popescu-Belis, M. Flynn, P. Wellner, and P. Baudrion. Task-based evaluation of meeting browsers: from bet task elicitation to user behavior analysis. In LREC 2008 (6th International Conference on Language Resources and Evaluation), Marrakech, Morocco, 2008. [ bib ]
This paper presents recent results of the application of the task-based Browser Evaluation Test (BET) to meeting browsers, that is, interfaces to multimodal databases of meeting recordings. The tasks were defined by browser-neutral BET observers. Two groups of human subjects used the Transcript-based Query and Browsing interface (TQB), and attempted to solve as many BET tasks (pairs of true/false statements to disambiguate) as possible in a fixed amount of time. Their performance was measured in terms of precision and speed. Results indicate that the browser's annotation-based search functionality is frequently used, in particular the keyword search. A more detailed analysis of each test question for each participant confirms that, despite considerable variation across strategies, the use of queries is correlated with successful performance.

Keywords: Report_VII, IM2.HMI
[463] P. Estrella, A. Popescu-Belis, and M. King. Improving contextual quality models for mt evaluation based on evaluators' feedback. In LREC 2008 (6th International Conference on Language Resources and Evaluation), Marrakech, Morocco, 2008. [ bib ]
The Framework for Machine Translation Evaluation (FEMTI), introduced by the ISLE Evaluation Working Group, contains guidelines for defining a quality model used to evaluate an MT system, in relation to the purpose and context of use of the system. In this paper, we report results from a recent experiment aimed at transferring knowledge from MT evaluation experts into the FEMTI guidelines, in particular, to populate relations denoting the influence of the context of use of a system on its evaluation. The results of this hands-on exercise carried out as part of a tutorial, are publicly available at http://www.issco.unige.ch/femti/.

Keywords: Report_VII, IM2.DMA
[464] F. De Simone, M. Ansorge, and T. Ebrahimi. A multi-channel objective model for the full-reference assessment of color pictures. In 2nd K-space Jamboree Workshop, 2008. [ bib | http ]
This paper presents a new approach for the design of a full reference objective quality metric for the assessment of color pictures. Our goal is to build a multi-channel metric based on the perceptual weighting of single-channel metrics. A psycho-visual experiment is thus designed in order to determine the values of the weighting factors. This metric is expected to provide a new useful tool for the quality assessment of compressed pictures in the framework of codec performance evaluation.

Keywords: IM2.MCA, Report_VIII
[465] R. Bertolami, C. Gutmann, L. Spitz, and H. Bunke. Shape code based lexicon reduction for offline handwriting recognition. In Proc. 8th IAPR Int. Workshop on Document Analysis Systems, pages 158–163, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[466] D. Vijayasenan, F. Valente, and H. Bourlard. Combination of agglomerative and sequential clustering for speaker diarization. In International Conference on Acoustics, Speech and Signal Processing, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[467] F. Orabona, J. Keshet, and B. Caputo. The projectron: a bounded kernel-based perceptron. In Int. Conf. on Machine Learning, 2008. IDIAP-RR 08-30. [ bib | .ps.gz | .pdf ]
We present a discriminative online algorithm with a bounded memory growth, which is based on the kernel-based Perceptron. Generally, the required memory of the kernel-based Perceptron for storing the online hypothesis is not bounded. Previous work has focused on discarding part of the instances in order to keep the memory bounded. In the proposed algorithm the instances are not discarded, but projected onto the space spanned by the previous online hypothesis. We derive a relative mistake bound and compare our algorithm both analytically and empirically to the state-of-the-art Forgetron algorithm (Dekel et al., 2007). The first variant of our algorithm, called Projectron, outperforms the Forgetron. The second variant, called Projectron++, outperforms even the Perceptron.

Keywords: IM2.MPR, Report_VIII
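The projection step that keeps the Projectron's memory bounded [467] can be sketched in a few lines. The class below is an illustrative reimplementation, not the authors' code: the names (`Projectron`, `eta`) and the RBF kernel are choices made here for clarity, and the sparsity threshold plays the role of the bound on projection error described in the abstract.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian RBF kernel."""
    d = x - y
    return np.exp(-gamma * np.dot(d, d))

class Projectron:
    """Bounded kernel perceptron in the spirit of [467] (illustrative only).

    On a mistake, the new instance is first projected onto the span of the
    current support set; it is added as a new support vector only if the
    projection error exceeds the sparsity threshold eta."""

    def __init__(self, kernel=rbf, eta=0.1):
        self.kernel, self.eta = kernel, eta
        self.support, self.alpha = [], []

    def _score(self, x):
        return sum(a * self.kernel(s, x) for s, a in zip(self.support, self.alpha))

    def update(self, x, y):
        if y * self._score(x) > 0:
            return  # no mistake, no update
        if not self.support:
            self.support.append(x)
            self.alpha.append(float(y))
            return
        # Projection of phi(x) onto span of the support set: solve K d = k_x
        K = np.array([[self.kernel(a, b) for b in self.support] for a in self.support])
        k_x = np.array([self.kernel(s, x) for s in self.support])
        d = np.linalg.solve(K + 1e-8 * np.eye(len(K)), k_x)
        # squared projection error ||phi(x) - proj||^2 = k(x, x) - k_x . d
        err2 = self.kernel(x, x) - k_x @ d
        if err2 <= self.eta ** 2:
            # absorb the update into existing coefficients: memory stays bounded
            self.alpha = [a + y * di for a, di in zip(self.alpha, d)]
        else:
            self.support.append(x)
            self.alpha.append(float(y))
```

On a toy 1D problem the support set stays small while the sign of `_score` separates the two classes after a few passes over the data.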
[468] A. Popescu-Belis, H. Bourlard, and S. Renals. Machine learning for multimodal interaction iv, volume 4892 of LNCS. Springer-Verlag, Berlin/Heidelberg, 2008. http://www.springeronline.com/978-3-540-78154-7. [ bib ]
This book constitutes the thoroughly refereed post-proceedings of the 4th International Workshop on Machine Learning for Multimodal Interaction, MLMI 2007, held in Brno, Czech Republic, in June 2007. The 25 revised full papers presented together with 1 invited paper were carefully selected during two rounds of reviewing and revision from 60 workshop presentations. The papers are organized in topical sections on multimodal processing, HCI, user studies and applications, image and video processing, discourse and dialogue processing, speech and audio processing, as well as the PASCAL speech separation challenge.

Keywords: IM2.MPR, Report_VII
[469] T. Dutoit, L. Couvreur, and H. Bourlard. How does a dictation machine recognize speech? In Applied Signal Processing–A MATLAB approach, chapter 4, pages 104–148. Springer MA, 2008. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[470] D. Gatica-Perez and K. Farrahi. Discovering human routines from cell phone data with topic models. In IEEE International Symposium on Wearable Computers (ISWC), 2008. IDIAP-RR 08-32. [ bib ]
We present a framework to automatically discover people's routines from information extracted by cell phones. The framework is built from a probabilistic topic model learned on novel bag-type representations of activity-related cues (location, proximity and their temporal variations over a day) of people's daily routines. Using real-life data from the Reality Mining dataset, covering 68 000 hours of human activities, we can successfully discover location-driven (from cell tower connections) and proximity-driven (from Bluetooth information) routines in an unsupervised manner. The resulting topics meaningfully characterize some of the underlying co-occurrence structure of the activities in the dataset, including “going to work early/late", “being home all day", “working constantly", “working sporadically" and “meeting at lunch time".

Keywords: IM2.MCA, Report_VII
[471] K. Schindler and D. Suter. Object detection by global contour shape. Pattern Recognition, 2008. [ bib ]
Keywords: Report_VII, IM2.VP.MCA, joint
[472] G. Garipelli, R. Chavarriaga, and J. del R. Millán. Fast recognition of anticipation related potentials. IEEE Transactions on Biomedical Engineering, 2008. In press. [ bib ]
Anticipation increases the efficiency of daily tasks by partial advance activation of the neural substrates involved in them. Here we develop a method for the recognition of electroencephalogram (EEG) correlates of this activation as early as possible on single trials, which is essential for Brain-Computer Interaction (BCI). We explore various features from the EEG recorded in a Contingent Negative Variation (CNV) paradigm. We also develop a novel technique called Time Aggregation of Classification (TAC) for fast and reliable decisions that combines the posterior probabilities of several classifiers trained with features computed from temporal blocks of EEG until a certainty threshold is reached. Experiments with 9 naive subjects performing the CNV experiment with GO and NOGO conditions with an inter-stimulus interval of 4 s show that the performance of the TAC method is above 70% for four subjects, around 60% for two other subjects, and random for the remaining subjects. On average over all subjects, more than 50% of the correct decisions are made at 2 s, without needing to wait until 4 s.

Keywords: IM2.BMI, Report_VII
[473] H. Bourlard, R. Chavarriaga, F. Galán, and J. del R. Millán. Characterizing the eeg correlates of exploratory behavior. IEEE Transactions on Neural Systems & Rehabilitation Engineering, 2008. IDIAP-RR 08-28. [ bib ]
This study aims to characterize the EEG correlates of exploratory behavior. Decision making in an uncertain environment raises a conflict between two opposing needs: gathering information about the environment and exploiting this knowledge in order to optimize the decision. Exploratory behavior has already been studied using fMRI. Based on a usual paradigm in reinforcement learning, this study has shown bilateral activation in the frontal and parietal cortex. To our knowledge, no previous study has addressed it using EEG. The study of exploratory behavior using EEG signals raises two difficulties. First, the label of a trial as exploitation or exploration cannot be directly derived from the subject's action. In order to access this information, a model of how the subject makes decisions must be built; the exploration-related information can then be derived from it. Second, because of the complexity of the task, its EEG correlates are not necessarily time-locked with the action, so the EEG processing methods used should be designed to handle signals that shift in time across trials. Using the same experimental protocol as the fMRI study, results show that the bilateral frontal and parietal areas are also the most discriminant. This strongly suggests that the EEG signal also conveys information about exploratory behavior.

Keywords: IM2.BMI, Report_VII
[474] W. Li. Effective post-processing for single-channel frequency-domain speech enhancement. Number Idiap-RR-71-2007, pages 149–152, 2008. Submitted for publication. [ bib | DOI ]
Conventional frequency-domain speech enhancement filters improve signal-to-noise ratio (SNR), but also produce speech distortions. This paper describes a novel post-processing algorithm devised to improve the quality of speech processed by a conventional filter. In the proposed algorithm, the speech distortion is first compensated by adding the original noisy speech, and then the noise is reduced by a post-filter. Experimental results on speech quality show the effectiveness of the proposed algorithm in lowering speech distortion. Based on our isolated word recognition experiments conducted in 15 real car environments, a relative word error rate (WER) reduction of 10.5% is obtained compared to the conventional filter.

Keywords: IM2.AP, Report_VII
[475] S. Ba and J. M. Odobez. Multi-person visual focus of attention from head pose and meeting contextual cues. IDIAP Research Report 47, submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence, second revision, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[476] K. Kryszczuk and A. Drygajlo. What do quality measures predict in biometrics. Lausanne, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[477] A. Shahrokni, T. Drummond, F. Fleuret, and P. Fua. Classification-based probabilistic modeling of texture transition for fast line search tracking and delineation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. [ bib ]
Keywords: IM2.BMI, Report_VIII
[478] R. Bertolami and H. Bunke. Ensemble methods to improve the performance of an english handwritten text line recognizer. In D. Doerman and S. Jaeger, editors, Arabic and Chinese Handwriting Recognition, LNCS 4768, pages 265–277. Springer, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[479] J. Keshet and S. Bengio. Automatic speech and speaker recognition: large margin and kernel methods. John Wiley & Sons, 2008. [ bib ]
This is the first book dedicated to uniting research related to speech and speaker recognition based on the recent advances in large margin and kernel methods. The first part of the book presents theoretical and practical foundations of large margin and kernel methods, from support vector machines to large margin methods for structured learning. The second part of the book is dedicated to acoustic modeling of continuous speech recognizers, where the grounds for practical large margin sequence learning are set. The third part introduces large margin methods for discriminative language modeling. The last part of the book is dedicated to the applications of keyword spotting, speaker verification and spectral clustering. The book is an important reference for researchers and practitioners in the field of modern speech and speaker recognition. The purpose of the book is twofold: first, to set the theoretical foundation of large margin and kernel methods relevant to the speech recognition domain; second, to propose a practical guide on the implementation of these methods in the speech recognition domain. The reader is presumed to have basic knowledge of large margin and kernel methods and of basic algorithms in speech and speaker recognition.

Keywords: IM2.AP, Report_VII
[480] K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland. Overlapped speech detection for improved speaker diarization in multiparty meetings. In International Conference on Acoustics, Speech, and Signal Processing, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[481] D. Jayagopi, S. Ba, J. M. Odobez, and D. Gatica-Perez. Predicting two facets of social verticality in meetings from five-minute time slices and nonverbal cues. In Proc. Int. Conf. on Multimodal Interfaces (ICMI), Special Session on Social Signal Processing, Chania, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[482] J. del R. Millán, P. W. Ferrez, F. Galán, E. Lew, and R. Chavarriaga. Non-invasive brain-machine interaction. International Journal of Pattern Recognition and Artificial Intelligence, 2008. [ bib ]
The promise of Brain-Computer Interfaces (BCI) technology is to augment human capabilities by enabling interaction with computers through a conscious and spontaneous modulation of the brainwaves after a short training period. Indeed, by analyzing brain electrical activity online, several groups have designed brain-actuated devices that provide alternative channels for communication, entertainment and control. Thus, a person can write messages using a virtual keyboard on a computer screen and also browse the internet. Alternatively, subjects can operate simple computer games, or brain games, and interact with educational software. Work with humans has shown that it is possible for them to move a cursor and even to drive a wheelchair. This paper briefly reviews the field of BCI, with a focus on non-invasive systems based on electroencephalogram (EEG) signals. It also describes three brain-actuated devices we have developed: a virtual keyboard, a brain game, and a wheelchair. Finally, it briefly discusses current research directions we are pursuing in order to improve the performance and robustness of our BCI system, especially for real-time control of brain-actuated robots.

Keywords: IM2.BMI, Report_VII
[483] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. van Gool. Using recognition to guide a robot's attention. In Robotics Science and Systems, 2008. in press. [ bib ]
Keywords: Report_VII, IM2.VP
[484] K. Schindler and L. van Gool. Action snippets: how many frames does human action recognition require? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08). IEEE Press, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[485] E. Indermühle, M. Liwicki, and H. Bunke. Recognition of handwritten historical documents: hmm -adaptation vs. writer specific training. In Proc. 11th Int. Conf. on Frontiers in Handwriting Recognition, pages 186–191, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[486] D. Jayagopi, B. Raducanu, and D. Gatica-Perez. Characterizing conversational group dynamics using nonverbal behavior. In Proc. IEEE Int. Conf. on Multimedia (ICME), New York, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[487] B. Caputo. Class specific object recognition using kernel gibbs distributions. Electronic Letters on Computer Vision and Image Analysis, 7(2):96–109, 2008. Special Issue on Computational Modelling of Objects Represented in Images. [ bib | .pdf ]
Keywords: IM2.BMI, Report_VIII
[488] D. Morrison, S. Marchand-Maillet, and E. Bruno. Semantic clustering of images using patterns of relevance feedback. In Proceedings of the 6th International Workshop on Content-based Multimedia Indexing (CBMI'2008), London, UK, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[489] D. Grandjean and T. Pun, editors. Multimodality in emotions and for their assessment, 2008. Workshop at Joint (IM)2-Interactive Multimodal Information Management and Affective Sciences NCCRs meeting. [ bib ]
Keywords: Report_VII, IM2.MCA
[490] D. Lalanne, M. Rigamonti, R. Ingold, F. Evéquoz, and B. Dumas. An ego-centric and tangible approach to meeting indexing and browsing, volume 4892 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, computer science edition, 2008. [ bib | DOI ]
This article presents an ego-centric approach for indexing and browsing meetings. The method considers two concepts: meeting data alignment with personal information to enable ego-centric browsing, and live intentional annotation of meetings through tangible actions to enable ego-centric indexing. The article first motivates and introduces these concepts, and further presents brief states of the art of the domains of tangible user interaction, of document-centric multimedia browsing (documents being a traditional tangible object for transporting information), and of personal information management. The article then presents our approach in the context of meetings and details our methods to bridge the gap between meeting data and personal information. Finally, the article reports the progress of the integration of this approach within Fribourg's meeting room.

Keywords: IM2.DMA, Report_VII
[491] J. Luo, B. Caputo, A. Zweig, J. H. Back, and J. Anemuller. Object category detection using audio-visual cues. In International Conference on Computer Vision Systems (ICVS08), Santorini, Greece, 2008. [ bib ]
Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities in order to achieve robust categorization. In this paper we propose a multimodal approach to object category detection, using audio and visual information. The auditory channel is modeled on biologically motivated spectral features via a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high level and the other low level. Experiments on six different object categories, under increasingly difficult conditions, show strengths and weaknesses of the two approaches, and clearly underline the open challenges for multimodal category detection.

Keywords: IM2.AP, Report_VII
[492] K. Kumatani, J. McDonough, B. Rauch, D. Klakow, P. N. Garner, and W. Li. Beamforming with a maximum negentropy criterion. IEEE Transactions on Audio Speech and Language Processing, 17(5):994–1008, 2008. [ bib | .pdf ]
In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Γ and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski post-filtering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in the conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with post-filtering achieved a word error rate (WER) of 16.5%. MN beamforming with the Γ pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%. To the best of our knowledge, no lower error rates at present have been reported in the literature on this ASR task.

Keywords: IM2.AP, Report_VIII
[493] G. Zeng and L. van Gool. Multi-label image segmentation via point-wise repetition. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[494] S. Gammeter, A. Ess, T. Jaeggli, B. Leibe, K. Schindler, and L. van Gool. Articulated multibody tracking under egomotion. In European Conference on Computer Vision (ECCV'08), LNCS. Springer, 2008. in press. [ bib ]
Keywords: Report_VII, IM2.VP
[495] T. Quack, B. Leibe, and L. van Gool. World-scale mining of objects and events from community photo collections. In Conference on Image and Video Retrieval (CIVR'08). ACM, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[496] B. Leibe, K. Schindler, N. Cornelis, and L. van Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[497] S. Pellegrini, K. Schindler, and D. Nardi. A generalization of the icp algorithm for articulated bodies. In M. Everingham and C. Needham, editors, British Machine Vision Conference (BMVC'08), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[498] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 1040–1047. Omnipress, 2008. IDIAP-RR 08-59. [ bib | .pdf ]
The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel, from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.

Keywords: IM2.MPR, Report_VIII
[499] S. Voloshynovskiy, O. Koval, F. Beekhof, and T. Pun. Multimodal authentication based on random projections and distributed coding. In MM&Sec 2008, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[500] S. Ganapathy, P. Motlicek, H. Hermansky, and H. Garudadri. Autoregressive modelling of hilbert envelopes for wide-band audio coding. In AES 124th Convention, Audio Engineering Society, 2008. IDIAP-RR 08-40. [ bib ]
Frequency Domain Linear Prediction (FDLP) is a technique for approximating the temporal envelopes of a signal using autoregressive models. In this paper, we propose a wide-band audio coding system exploiting FDLP. Specifically, FDLP is applied on critically sampled sub-bands to model the Hilbert envelopes. The residual of the linear prediction forms the Hilbert carrier, which is transmitted along with the envelope parameters. This process is reversed at the decoder to reconstruct the signal. In the objective and subjective quality evaluations, the FDLP-based audio codec at 66 kbps provides competitive results compared to state-of-the-art codecs at similar bit-rates.

Keywords: IM2.AP, Report_VII
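The core FDLP idea in [500], autoregressive modelling applied to the cosine transform of a signal so that the AR power spectrum approximates the squared Hilbert (temporal) envelope, can be illustrated with a minimal numpy sketch. The model order, DCT construction, and ridge regularization below are arbitrary choices made for this illustration, not the paper's codec settings.

```python
import numpy as np

def fdlp_envelope(x, order=20):
    """Approximate the squared temporal envelope of x by linear prediction in
    the frequency domain: fit an AR model to the DCT of x, then evaluate the
    AR power spectrum, which lives on the time axis (illustrative sketch)."""
    n = len(x)
    # DCT-II via an explicit cosine matrix (fine for short signals)
    t = np.arange(n)
    C = np.cos(np.pi * np.outer(t, t + 0.5) / n)
    c = C @ x
    # autocorrelation of the DCT coefficients
    r = np.correlate(c, c, mode="full")[n - 1:n + order] / n
    # Yule-Walker normal equations with a tiny ridge for numerical stability
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-8 * r[0] * np.eye(order), r[1:order + 1])
    # the AR spectrum over [0, pi] maps back onto the time axis 0..n-1
    w = np.pi * t / n
    A = 1.0 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return 1.0 / np.abs(A) ** 2
```

Feeding it amplitude-modulated noise, the estimated envelope peaks near the peak of the modulating function, which is the duality FDLP exploits.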
[501] J. Kludas, E. Bruno, and S. Marchand-Maillet. Can feature information interaction help for information fusion in multimedia problems? In First International Workshop on Metadata Mining for Image Understanding, pages 23–33, Funchal, Madeira, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[502] N. Garg and D. Hakkani-Tur. Speaker role detection in meetings using lexical information and social network analysis. Technical Report TR-08-004, International Computer Science Institute, Berkeley, CA, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[503] A. Humm, J. Hennebert, and R. Ingold. Combined handwriting and speech modalities for user authentication. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 38, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[504] A. Humm, J. Hennebert, and R. Ingold. Spoken signature for user authentication. SPIE Journal of Electronic Imaging, 17, 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[505] K. Kumatani, J. McDonough, D. Klakow, P. N. Garner, and W. Li. Adaptive beamforming with a maximum negentropy criterion. In The Joint Workshop on Hands-free Speech Communication and Microphone Arrays, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[506] K. Kumatani, J. McDonough, S. Schacht, D. Klakow, P. N. Garner, and W. Li. Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming. In International Conference on Acoustics Speech and Signal Processing, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[507] M. van den Berg, E. Koller-Meier, and L. van Gool. Fast body posture estimation using volumetric features. In IEEE Visual Motion Computing (MOTION), 2008. [ bib ]
Keywords: Report_VII, IM2.VP, Haarlets, LDA, pose estimation, 3D, hull reconstruction
[508] T. Spindler, C. Wartmann, L. Hovestadt, D. Roth, L. van Gool, and A. Steffen. Privacy in video surveilled spaces. Journal of Computer Security, 16(2):199–222, 2008. [ bib ]
Keywords: Report_VII, IM2.VP, Surveillance, cryptography, computer vision, building automation
[509] T. Quack, H. Bay, and L. van Gool. Object recognition for the internet of things. In Internet of Things 2008, 2008. in press. [ bib ]
Keywords: Report_VII, IM2.MCA
[510] P. N. Garner. A weighted finite state transducer tutorial. Idiap-Com Idiap-Com-03-2008, IDIAP, 2008. [ bib ]
The concepts of WFSTs are summarised, including structural and stochastic optimisations. A typical composition process for ASR is described. Some experiments show that care should be taken with silence models.

Keywords: IM2.AP, Report_VII
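A toy version of the composition operation summarized in [510] can be written in a few lines. This sketch works in the tropical semiring (weights add along paths) and ignores epsilon transitions, which a real ASR composition cascade must handle with dedicated filters; the data layout is an assumption made here for brevity.

```python
def compose(t1, t2):
    """Naive WFST composition in the tropical semiring (weights add).
    A transducer is (arcs, start, finals), with each arc a tuple
    (src, in_label, out_label, weight, dst). Epsilon-free sketch."""
    arcs, finals = [], set()
    start = (t1[1], t2[1])
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in t1[2] and q2 in t2[2]:
            finals.add((q1, q2))
        for (s1, i1, o1, w1, d1) in t1[0]:
            if s1 != q1:
                continue
            for (s2, i2, o2, w2, d2) in t2[0]:
                # match the output label of t1 against the input label of t2
                if s2 != q2 or i2 != o1:
                    continue
                dst = (d1, d2)
                arcs.append(((q1, q2), i1, o2, w1 + w2, dst))
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return arcs, start, finals
```

Composing a transducer that maps 'a' to 'b' (weight 0.5) with one that maps 'b' to 'c' (weight 0.3) yields a single-arc machine mapping 'a' to 'c' with weight 0.8, mirroring how a pronunciation lexicon and a grammar are cascaded in ASR.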
[511] J. del R. Millán. Brain-controlled robots. IEEE Intelligent Systems, 2008. [ bib | .pdf ]
The idea of moving robots or prosthetic devices not by manual control, but by mere “thinking” (i.e., the brain activity of human subjects) has fascinated researchers for the last 30 years, but it is only now that first experiments have shown the possibility to do so. How can brainwaves be used to directly control robots? Most of the hope for brain-controlled robots comes from invasive approaches that provide detailed single-neuron activity recorded from microelectrodes implanted in the brain [1]. The motivation for these invasive approaches is that it has been widely shown that motor parameters related to hand and arm movements are encoded in a distributed and redundant way by ensembles of neurons in the motor system of the brain: motor, premotor and posterior parietal cortex. For humans, however, it is preferable to use non-invasive approaches to avoid health risks and the associated ethical concerns. Most non-invasive brain-computer interfaces (BCI) use electroencephalogram (EEG) signals; i.e., the electrical brain activity recorded from electrodes placed on the scalp. The main source of the EEG is the synchronous activity of thousands of cortical neurons. Thus, EEG signals suffer from a reduced spatial resolution and increased noise due to measurements on the scalp. As a consequence, current EEG-based brain-actuated devices are limited by a low channel capacity and are considered too slow for controlling rapid and complex sequences of robot movements. But, recently, we have shown for the first time that online analysis of EEG signals, if used in combination with advanced robotics and machine learning techniques, is sufficient for humans to continuously control a mobile robot [2] and a wheelchair [3]. In this article we will review our work on non-invasive brain-controlled robots and discuss some of the challenges ahead.

Keywords: IM2.BMI, Report_VIII
[512] J. Richiardi, A. Drygajlo, and L. Todesco. Promoting diversity in gaussian mixture ensembles: an application to signature verification. In Biometrics and Identity Management, Lecture Notes in Computer Science 5372, pages 140–149, Heidelberg, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[513] D. Gatica-Perez and K. Farrahi. Daily routine classification from mobile phone data. In Workshop on Machine Learning and Multimodal Interaction (MLMI08), 2008. IDIAP-RR 07-62. [ bib ]
The automatic analysis of real-life, long-term behavior and dynamics of individuals and groups from mobile sensor data constitutes an emerging and challenging domain. We present a framework to classify people's daily routines (defined by day type, and by group affiliation type) from real-life data collected with mobile phones, which include physical location information (derived from cell tower connectivity), and social context (given by person proximity information derived from Bluetooth). We propose and compare single- and multi-modal routine representations at multiple time scales, each capable of highlighting different features from the data, to determine which best characterized the underlying structure of the daily routines. Using a massive data set of 87000 hours spanning four months of the life of 30 university students, we show that the integration of location and social context and the use of multiple time scales in our method is effective, producing accuracies of over 80% for the two daily routine classification tasks investigated, with significant performance differences with respect to the single-modal cues.

Keywords: IM2.MCA, Report_VII
[514] K. Kamangar, D. Hakkani-Tur, G. Tur, and M. Levit. An iterative unsupervised learning method for information distillation. accepted for IEEE ICASSP, Las Vegas, NV, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[515] F. Camastra and A. Vinciarelli. Machine learning for audio, image and video analysis, volume XVI of Advanced Information and Knowledge Processing. Springer Verlag, theory and applications edition, 2008. [ bib ]
Machine Learning involves several scientific domains including mathematics, computer science, statistics and biology, and is an approach that enables computers to automatically learn from data. Focusing on complex media and how to convert raw data into useful information, this book offers both introductory and advanced material in the combined fields of machine learning and image/video processing. The machine learning techniques presented enable readers to address many real world problems involving complex data. Examples covering areas such as automatic speech and handwriting transcription, automatic face recognition, and semantic video segmentation are included, along with detailed introductions to algorithms and examples of their applications. The book is organized in four parts: The first focuses on technical aspects, basic mathematical notions and elementary machine learning techniques. The second provides an extensive survey of most relevant machine learning techniques for media processing, while the third part focuses on applications and shows how techniques are applied in actual problems. The fourth part contains detailed appendices that provide notions about the main mathematical instruments used throughout the text. Students and researchers needing a solid foundation or reference, and practitioners interested in discovering more about the state-of-the-art will find this book invaluable. Examples and problems are based on data and software packages publicly available on the web.

Keywords: IM2.MCA, Report_VII
[516] F. Fleuret and D. Geman. Stationary features and cat detection. Journal of Machine Learning Research, 2008. [ bib ]
Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, i.e., for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: i) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; ii) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-fine search, this allows for efficient processing. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes.

Keywords: IM2.VP, Report_VII
[517] H. Ketabdar and H. Bourlard. Enhanced phone posteriors for improving speech recognition systems. Idiap-RR Idiap-RR-39-2008, IDIAP, 2008. [ bib ]
Using phone posterior probabilities has been increasingly explored for improving automatic speech recognition (ASR) systems. In this paper, we propose two approaches for hierarchically enhancing these phone posteriors, by integrating long acoustic context, as well as prior phonetic and lexical knowledge. In the first approach, phone posteriors estimated with a Multi-Layer Perceptron (MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge) and context. In the second approach, the phone posteriors are post-processed by a secondary MLP, in order to learn inter- and intra-dependencies between the phone posteriors; these dependencies constitute prior phonetic knowledge. The learned knowledge is integrated in the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced phone posteriors. We investigate the use of the enhanced posteriors in hybrid HMM/ANN and Tandem configurations. We propose using the enhanced posteriors as replacement, or as complementary evidences to the regular MLP posteriors. The proposed method has been tested on different small and large vocabulary databases, always resulting in consistent improvements in frame, phone and word recognition rates.

Keywords: IM2.AP, Report_VII
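The first enhancement approach above (forward-backward recursions over MLP posteriors) can be sketched as follows; the transition matrix, priors and posterior values are toy numbers, not the paper's models:

```python
# Illustrative sketch (not the authors' code): smoothing per-frame phone
# posteriors with HMM forward-backward recursions. `obs` holds the MLP
# posteriors used as (scaled) emission probabilities; `trans` encodes the
# topological constraints; all values are toy numbers.
def forward_backward(obs, trans, init):
    T, N = len(obs), len(init)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for j in range(N):                         # forward pass, t = 0
        alpha[0][j] = init[j] * obs[0][j]
    for t in range(1, T):                      # forward pass, t > 0
        for j in range(N):
            alpha[t][j] = obs[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(N))
    for t in range(T - 2, -1, -1):             # backward pass
        for i in range(N):
            beta[t][i] = sum(trans[i][j] * obs[t + 1][j] * beta[t + 1][j]
                             for j in range(N))
    enhanced = []                              # normalized state posteriors
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(N)]
        z = sum(g)
        enhanced.append([x / z for x in g])
    return enhanced

# A frame whose raw posterior disagrees with its sticky neighbours is
# pulled back towards them:
smoothed = forward_backward(
    [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]],      # raw MLP posteriors
    [[0.9, 0.1], [0.1, 0.9]],                  # sticky transitions
    [0.5, 0.5])
```

With these toy numbers the middle frame's posterior for the first phone rises well above its raw value of 0.4, illustrating how the topological constraints re-shape the estimates.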
[518] J. Yao and J. M. Odobez. Multi-camera 3d person tracking with particle filter in a surveillance environment. In 16th European Signal processing Conference (EUSIPCO), 2008. [ bib | .pdf ]
In this work we present and evaluate a novel 3D approach to track single people in surveillance scenarios, using multiple cameras. The problem is formulated in a Bayesian filtering framework, and solved through sampling approximations (i.e. using a particle filter). Rather than relying on a 2D state to represent people, as is most commonly done, we directly exploit 3D knowledge by tracking people in the 3D world. A novel dynamical model is presented that accurately models the coupling between people's orientation and motion direction. In addition, people are represented by three 3D elliptic cylinders, which allow the introduction of a spatial color layout useful to discriminate the tracked person from potential distractors. Thanks to the particle filter approach, integrating background subtraction and color observations from multiple cameras is straightforward. Altogether, the approach is quite robust to occlusion and large variations in people's appearance, even when using a single camera, as demonstrated by numerical performance evaluation on real and challenging data from an underground station.

Keywords: IM2.VP, Report_VIII
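The predict/weight/resample loop behind the sampling approximation can be sketched in one dimension; the dynamical and observation models here are illustrative stand-ins for the paper's 3D state and multi-camera likelihoods:

```python
import math
import random

# Minimal 1-D particle-filter sketch (the paper tracks a 3-D state with
# background-subtraction and color observations; everything here is a
# toy stand-in).
def pf_step(particles, motion, obs, obs_noise=1.0, jitter=0.5):
    # predict: propagate each particle through the dynamical model
    particles = [p + motion + random.gauss(0.0, jitter) for p in particles]
    # update: weight particles by a Gaussian observation likelihood
    weights = [math.exp(-0.5 * ((p - obs) / obs_noise) ** 2)
               for p in particles]
    z = sum(weights)
    weights = [w / z for w in weights]
    # resample: draw a new particle set proportionally to the weights
    return random.choices(particles, weights=weights, k=len(particles))

random.seed(0)
particles = [random.uniform(-10.0, 10.0) for _ in range(500)]
true_pos = 0.0
for _ in range(10):                      # target moves +1 per step
    true_pos += 1.0
    particles = pf_step(particles, motion=1.0, obs=true_pos)
estimate = sum(particles) / len(particles)
```

After a few steps the particle cloud concentrates around the true trajectory, which is the property the tracker relies on.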
[519] A. Carreras, G. Cordara, J. Delgado, F. Dufaux, G. Francini, T. M. Ha, E. Rodriguez, and R. Tous. A search and retrieval framework for the management of copyrighted audiovisual content. In 50th International Symposium ELMAR 2008, 2008. [ bib | http ]
This paper presents a search and retrieval framework that enables the management of Intellectual Property in the World Wide Web. This twofold framework helps users to detect digital rights infringements of their copyrighted content. In order to detect possible copyright infringements, the system first crawls the Web to search for replicas of users' images, and then evaluates whether the copies respect the terms stated by the owner. On the other hand, this framework also helps users find something interesting on the Web. It provides copyrighted content to users according to their preferences and to intellectual property rights, integrating search and retrieval with digital rights management tools.

Keywords: Report_VII, IM2.MCA
[520] L. Goldmann, T. Adamek, P. Vajda, M. Karaman, R. Mörzinger, E. Galmar, T. Sikora, N. O'Connor, T. Ha-Minh, T. Ebrahimi, P. Schallauer, and B. Huet. Towards fully automatic image segmentation evaluation. In Advanced Concepts for Intelligent Vision Systems (ACIVS), Lecture Notes in Computer Science, Berlin, 2008. Springer. [ bib | http ]
Keywords: Report_VII, IM2.MCA
[521] F. De Simone, D. Ticca, F. Dufaux, M. Ansorge, and T. Ebrahimi. A comparative study of color image compression standards using perceptually driven quality metrics. In SPIE Optics and Photonics, 2008. [ bib ]
The task of comparing the performance of different codecs is strictly related to the research in the field of objective quality metrics. Even if several objective quality metrics have been proposed in literature, the lack of standardization in the field of objective quality assessment and the lack of extensive and reliable comparisons of the performance of the different state-of-the-art metrics often make the results obtained using objective metrics not very reliable. In this paper we aim at comparing the performance of three of the existing alternatives for compression of digital pictures, i.e. JPEG, JPEG 2000, and JPEG XR compression, by using different objective Full Reference metrics and considering also perceptual quality metrics which take into account the color information of the data under analysis.

Keywords: Report_VII, IM2.MCA, Image compression; codec performance; Full-Reference quality assessment; perceptual quality metrics
[522] U. Hoffmann, J. Naruniec, A. Yazdani, and T. Ebrahimi. Face detection using discrete gabor jets and color information. In SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications, 2008. [ bib | http ]
Face detection aims to detect human faces and provide information about their location in a given image. Many applications such as biometrics, face recognition, and video surveillance employ face detection as one of their main modules. Therefore, improvements in the performance of existing face detection systems and new achievements in this field of research are of significant importance. In this paper a hierarchical classification approach for face detection is presented. In the first step, discrete Gabor jets (DGJ) are used for extracting features related to the brightness information of images, and a preliminary classification is made. Afterwards, a skin detection algorithm, based on modeling of colored image patches, is employed as a post-processing of the results of the DGJ-based classification. It is shown that the use of color efficiently reduces the number of false positives while maintaining a high true positive rate. Finally, a comparison is made with the OpenCV implementation of the Viola and Jones face detector, and it is concluded that higher correct classification rates can be attained using the proposed face detector.

Keywords: Report_VII, IM2.VP, Face Detection; Colored Image Patch Model; Discrete Gabor Jets; Linear Discriminant Analysis
[523] J. Kludas, E. Bruno, and S. Marchand-Maillet. Exploiting synergistic and redundant features for multimedia document classification. In 32nd Annual Conference of the German Classification Society - Advances in Data Analysis, Data Handling and Business Intelligence (GfKl 2008), Hamburg, Germany, 2008. [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[524] B. Dumas, D. Lalanne, and R. Ingold. Démonstration : hephaistk, une boîte à outils pour le prototypage d'interfaces multimodales. 2008. [ bib ]
Keywords: Report_VII, IM2.HMI
[525] E. Kokiopoulou and P. Frossard. Semantic coding by supervised dimensionality reduction. IEEE Transactions on Multimedia, 10(2), 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint
[526] M. Knox, N. Morgan, and N. Mirghafori. Getting the last laugh: automatic laughter segmentation in meetings. In 9th International Conference of the ISCA (Interspeech 2008), Brisbane, Australia, pages 797–800, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[527] S. Ganapathy, A. Thomas, and H. Hermansky. Front-end for far-field speech recognition based on frequency domain linear prediction. In Interspeech 2008, 2008. IDIAP-RR 08-17. [ bib ]
Automatic Speech Recognition (ASR) systems usually fail when they encounter speech from far-field microphones in reverberant environments. This is due to the application of short-term feature extraction techniques which do not compensate for the artifacts introduced by long room impulse responses. In this paper, we propose a front-end, based on Frequency Domain Linear Prediction (FDLP), that tries to remove reverberation artifacts present in far-field speech. Long temporal segments of far-field speech are analyzed in narrow frequency sub-bands to extract FDLP envelopes and residual signals. Filtering the residual signals with gain-normalized inverse FDLP filters results in a set of sub-band signals which are synthesized to reconstruct the signal. ASR experiments on far-field speech data processed by the proposed front-end show significant improvements (relative reduction of 30% in word error rate) compared to other robust feature extraction techniques.

Keywords: IM2.AP, Report_VII
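The core FDLP step, linear prediction applied to a cosine transform of the signal so that the all-pole fit describes the temporal envelope, can be sketched as follows; the segment length, model order and the test signal are arbitrary illustrative choices, not the codec's settings:

```python
import math

# Rough FDLP sketch: DCT the segment, fit an autoregressive (linear
# prediction) model to the transform coefficients via Levinson-Durbin,
# and evaluate the all-pole model along the dual axis, which is time.
def dct_ii(x):
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def levinson(r, order):
    a, e = [1.0], r[0]
    for i in range(1, order + 1):
        k = -sum(a[j] * r[i - j] for j in range(i)) / e
        a = [a[j] + k * a[i - j] if 0 < j < i else a[j]
             for j in range(i)] + [k]
        e *= 1.0 - k * k
    return a, e

def fdlp_envelope(x, order=6):
    c = dct_ii(x)
    N = len(c)
    r = [sum(c[n] * c[n + lag] for n in range(N - lag)) / N
         for lag in range(order + 1)]
    a, e = levinson(r, order)
    env = []
    for t in range(N):                # dual axis of the DCT is time
        w = math.pi * t / N
        re = sum(a[k] * math.cos(k * w) for k in range(len(a)))
        im = sum(a[k] * math.sin(k * w) for k in range(len(a)))
        env.append(e / (re * re + im * im))
    return env

# Modulated test tone: the envelope should hump around sample 64
sig = [math.exp(-((n - 64) / 12.0) ** 2) * math.cos(0.7 * n)
       for n in range(128)]
env = fdlp_envelope(sig)
```

The recovered envelope concentrates where the signal's energy is, which is what lets the front-end model (and then invert) reverberant smearing of sub-band envelopes.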
[528] X. Perrin, R. Chavarriaga, C. Ray, R. Siegwart, and J. del R. Millán. A comparative psychophysical and eeg study of different feedback modalities for hri. In Human-Robot Interaction (HRI08), Amsterdam, 2008. [ bib ]
This paper presents a comparison between six different ways to convey navigational information provided by a robot to a human. Visual, auditory, and tactile feedback modalities were selected and designed to suggest a direction of travel to a human user, who can then decide whether or not he agrees with the robot's proposition. This work builds upon previous research on a novel semi-autonomous navigation system in which the human supervises an autonomous system, providing corrective monitoring signals whenever necessary. We recorded both qualitative (user impressions based on selected criteria and ranking of their feelings) and quantitative (response time and accuracy) information regarding the different types of feedback. In addition, a preliminary analysis of the influence of the different types of feedback on brain activity is also shown. The results of this study may provide guidelines for the design of such a human-robot interaction system, depending on both the task and the human user.

Keywords: Report_VII, IM2.BMI, joint publication
[529] A. Popescu-Belis. Dimensionality of dialogue act tagsets: an empirical analysis of large corpora. Language Resources and Evaluation, 42(1):99–107, 2008. [ bib | DOI ]
This article compares one-dimensional and multi-dimensional dialogue act tagsets used for automatic labeling of utterances. The influence of tagset dimensionality on tagging accuracy is first discussed theoretically, then based on empirical data from human and automatic annotations of large scale resources, using four existing tagsets: DAMSL, SWBD-DAMSL, ICSI-MRDA and MALTUS. The Dominant Function Approximation proposes that automatic dialogue act taggers could focus initially on finding the main dialogue function of each utterance, which is empirically acceptable and has significant practical relevance.

Keywords: Report_VII, IM2.DMA
[530] M. Knox, N. Morgan, and N. Mirghafori. Getting the last laugh: automatic laughter segmentation in meetings. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[531] B. Noris, K. Benmachiche, and A. Billard. Calibration-free eye gaze direction detection with gaussian processes. In International Conference on Computer Vision Theory and Applications (VISAPP 2008), 2008. [ bib ]
In this paper we present a solution for eye gaze detection from a wireless head-mounted camera designed for children aged between 6 months and 18 months. Due to the constraints of working with very young children, the system does not seek to be as accurate as other state-of-the-art eye trackers; however, it requires no calibration process from the wearer. Gaussian Process Regression and Support Vector Machines are used to analyse the raw pixel data from the video input and return an estimate of the child's gaze direction. A confidence map is used to determine the accuracy the system can expect for each coordinate on the image. The best accuracy so far obtained by the system is 2.34° on adult subjects; tests with children remain to be done.

Keywords: IM2.MPR, Report_VII
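The Gaussian Process Regression component can be illustrated with a minimal posterior-mean predictor; the squared-exponential kernel and its hyper-parameters are assumptions made for this sketch, and a 1-D function stands in for the raw pixel features:

```python
import math
import numpy as np

# Toy GP regression sketch: kernel matrix on the training inputs,
# solve for the weights, and predict the posterior mean at test inputs.
# The real system maps raw pixels to gaze direction; here we fit sin(x).
def gp_predict(X_train, y_train, X_test, length=1.0, noise=1e-2):
    def rbf(A, B):                      # squared-exponential kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)  # K^{-1} y
    return rbf(X_test, X_train) @ alpha  # posterior mean

X = np.linspace(0.0, 3.0, 20)[:, None]
y = np.sin(X[:, 0])
pred = gp_predict(X, y, np.array([[1.5]]))
```

The noise term plays the same role as the confidence-weighting in the paper: it keeps the interpolation from trusting any single training pixel pattern too much.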
[532] J. F. Paiement. Probabilistic models for music. PhD thesis, École Polytechnique Fédérale de Lausanne, 2008. Thèse Ecole polytechnique fédérale de Lausanne EPFL, no 4148 (2008), Faculté des sciences et techniques de l'ingénieur STI, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard, Samy Bengio. [ bib | .pdf ]
This thesis proposes to analyse symbolic musical data under a statistical viewpoint, using state-of-the-art machine learning techniques. Our main argument is to show that it is possible to design generative models that are able to predict and to generate music given arbitrary contexts in a genre similar to a training corpus, using a minimal amount of data. For instance, a carefully designed generative model could guess what would be a good accompaniment for a given melody. Conversely, we propose generative models in this thesis that can be sampled to generate realistic melodies given harmonic context. Most computer music research has been devoted so far to the direct modeling of audio data. However, most of the music models today do not consider the musical structure at all. We argue that reliable symbolic music models such as the ones presented in this thesis could dramatically improve the performance of audio algorithms applied in more general contexts. Hence, our main contributions in this thesis are three-fold: We have shown empirically that long-term dependencies are present in music data and we provide quantitative measures of such dependencies; We have shown empirically that using domain knowledge allows us to capture long-term dependencies in the music signal better than standard statistical models for temporal data, and we describe many probabilistic models aimed at capturing various aspects of symbolic polyphonic music, which can be used for music prediction and sampled to generate realistic music sequences; We designed various representations for music that could be used as observations by the proposed probabilistic models.

Keywords: chord progressions, generative models, machine learning, melodies, music, probabilistic models, IM2.AP, Report_VIII
[533] A. Schlapbach and H. Bunke. Off-line writer identification and verification using gaussian mixture models. In S. Marinai, editor, Machine Learning in Document Analysis and Recognition, pages 409–428. Springer, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[534] H. Hung, D. Jayagopi, S. Ba, J. M. Odobez, and D. Gatica-Perez. Investigating automatic dominance estimation in groups from visual attention and speaking activity. In Proc. ICMI, Chania, Greece, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[535] T. Tommasi, F. Orabona, and B. Caputo. Cue integration for medical image annotation. In Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers, LNCS. Springer-Verlag, 2008. [ bib | .pdf ]
This paper presents the algorithms and results of our participation in the image annotation task of ImageCLEFmed 2007. We proposed a multi-cue approach where images are represented both by global and local descriptors. These cues are combined following two SVM-based strategies. The first algorithm, called Discriminative Accumulation Scheme (DAS), trains an SVM for each feature, and considers as output of each classifier the distance from the separating hyperplane. The final decision is taken on a linear combination of these distances. The second algorithm, which we call Multi Cue Kernel (MCK), uses a new Mercer kernel which can accept as input different features while keeping them separated. The DAS algorithm obtained a score of 29.9, which ranked fifth among all submissions. The MCK algorithm with the one-vs-all and with the one-vs-one multiclass extensions of SVM scored respectively 26.85 and 27.54. These runs ranked first and second among all submissions.

Keywords: IM2.VP, IM2.MPR, Report_VIII
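The DAS combination rule, a linear combination of per-cue hyperplane distances, reduces to a few lines; the margin values and cue weights below are invented for illustration, not taken from the paper:

```python
# Sketch of the Discriminative Accumulation Scheme idea: each cue's
# classifier outputs per-class margins (signed distances from the
# separating hyperplane); the final decision is taken on their weighted
# linear combination. All numbers are illustrative.
def das_decision(margins_per_cue, weights):
    n_classes = len(margins_per_cue[0])
    combined = [sum(w * m[c] for w, m in zip(weights, margins_per_cue))
                for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: combined[c])
    return best, combined

# Global cue mildly prefers class 0, local cue strongly prefers class 2:
label, scores = das_decision([[0.6, 0.1, 0.4],     # global-descriptor SVM
                              [-0.2, 0.0, 1.2]],   # local-descriptor SVM
                             weights=[0.5, 0.5])
```

A confident cue can thus overrule an uncertain one, which is the point of accumulating distances rather than hard labels.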
[536] S. Voloshynovskiy, O. Koval, and T. Pun. Multimodal authentication based on random projections and distributed coding. In Proceedings of the 10th ACM Workshop on Multimedia & Security, Oxford, UK, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[537] H. Ketabdar and H. Bourlard. Hierarchical integration of phonetic and lexical knowledge in phone posterior estimation. In International Conference on Acoustics, Speech, and Signal Processing, 2008. [ bib ]
Phone posteriors have recently quite often been used (as additional features or as local scores) to improve state-of-the-art automatic speech recognition (ASR) systems. Usually, better phone posterior estimates yield better ASR performance. In the present paper we present some initial, yet promising, work towards hierarchically improving these phone posteriors, by implicitly integrating phonetic and lexical knowledge. In the approach investigated here, phone posteriors estimated with a multilayer perceptron (MLP) and short (9 frames) temporal context are used as input to a second MLP, spanning a longer temporal context (e.g. 19 frames of posteriors) and trained to refine the phone posterior estimates. The rationale behind this is that at the output of every MLP, the information stream is getting simpler (converging to a sequence of binary posterior vectors), and can thus be further processed (using a simpler classifier) by looking at a larger temporal window. Longer-term dependencies can be interpreted as phonetic, sub-lexical and lexical knowledge. The resulting enhanced posteriors can then be used for phone and word recognition, in the same way as regular phone posteriors, in hybrid HMM/ANN or Tandem systems. The proposed method has been tested on the TIMIT, OGI Numbers and Conversational Telephone Speech (CTS) databases, always resulting in consistent and significant improvements in both phone and word recognition rates.

Keywords: Report_VII, IM2.AP
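Feeding a temporal context of first-stage posteriors to the second classifier amounts to stacking a symmetric window of posterior vectors per frame, sketched here with a 5-frame window for brevity (the paper uses e.g. 19 frames):

```python
# Sketch of the hierarchical setup's input construction: per-frame
# posterior vectors from the first MLP are concatenated over a symmetric
# temporal window to form the second MLP's input. Boundary frames are
# repeated as padding (one simple choice among several).
def stack_context(posteriors, width=5):
    half = width // 2
    padded = ([posteriors[0]] * half + list(posteriors)
              + [posteriors[-1]] * half)
    return [sum((padded[t + d] for d in range(width)), [])
            for t in range(len(posteriors))]

frames = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]   # 3 frames, 2 phones
stacked = stack_context(frames)
```

Each stacked vector is width x n_phones long, with the current frame's posteriors in the centre slot.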
[538] B. Schouten, N. Juul, A. Drygajlo, and M. Tistarelli. Biometrics and identity management. Springer, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[539] S. Favre, H. Salamin, A. Vinciarelli, D. Hakkani-Tur, and N. Garg. Role recognition for meeting participants: an approach based on lexical information and social network analysis. In ACM International Conference on Multimedia, 2008. [ bib ]
This paper presents experiments on the automatic recognition of roles in meetings. The proposed approach combines two sources of information: the lexical choices made by people playing different roles on one hand, and the Social Networks describing the interactions between the meeting participants on the other hand. Both sources lead to role recognition results significantly higher than chance when used separately, but the best results are obtained with their combination. Preliminary experiments obtained over a corpus of 138 meeting recordings (over 45 hours of material) show that around 70% of the time is labeled correctly in terms of role.

Keywords: IM2.MCA, Report_VII
[540] G. Chanel, C. Rebetez, M. Betrancourt, and T. Pun. Boredom, engagement and anxiety as indicators for adaptation to difficulty in games. In ACM Mindtrek conference, Tampere, Finland, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[541] H. Hung, D. Jayagopi, S. Ba, J. M. Odobez, and D. Gatica-Perez. Investigating automatic dominance estimation in groups from visual attention and speaking activity. In International Conference on Multimodal Interfaces (ICMI), 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[542] R. Tous, A. Carreras, J. Delgado, G. Cordara, F. Gianluca, E. Peig, F. Dufaux, and G. Galinski. An architecture for tv content distributed search and retrieval using the mpeg query format (mpqf). In International Workshop on Ambient Media Delivery and Interactive Television (AMDIT 2008), 2008. [ bib | http ]
Traditional broadcasting of TV contents begins to coexist with new models of user aware content delivery. The definition of interoperable interfaces for precise content search and retrieval between the different involved parties is a requirement for the deployment of the new audiovisual distribution services. This paper presents the design of an architecture based on the MPEG Query Format (MPQF) for providing the necessary interoperability to deploy distributed audiovisual content search and retrieval networks between content producers, distributors, aggregators and consumer devices. A service-oriented architecture based on Web Services technology is defined. This paper also presents how the architecture can be applied to a real scenario, the XAC (Xarxa IP Audiovisual de Catalunya, Audiovisual IP Network of Catalonia). As far as we know, this is the first paper to apply MPQF to TV Content Distributed Search and Retrieval.

Keywords: Report_VII, IM2.MCA, information retrieval; multimedia retrieval; distributed information retrieval; MPQF; MPEG Query Format; TV
[543] B. Deville, G. Bologna, M. Vinckenbosch, and T. Pun. Guiding the focus of attention of blind people with visual saliency. In Workshop on Computer Vision Applications for the Visually Impaired (CVAVI 08), Satellite Workshop of the European Conference on Computer Vision (ECCV 2008), Marseille, France, October 18, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[544] D. Jayagopi. Predicting two facets of social verticality in meetings from five-minute time slices and nonverbal cues. In Proc. ICMI, Chania, Greece, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[545] E. Shriberg. Higher level features in speaker recognition. in C. Muller (Ed.) Speaker Classification I. Springer-Verlag, New York, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[546] B. Dumas, D. Lalanne, and R. Ingold. Prototyping multimodal interfaces with smuiml modeling language. In Proceedings of CHI 2008 Workshop on UIDLs for Next Generation User Interfaces (CHI 2008 workshop), pages 63–66, Florence (Italy), 2008. [ bib ]
Keywords: IM2.HMI, Report_VIII
[547] B. Dumas, D. Lalanne, D. Guinard, R. Koenig, and R. Ingold. Strengths and weaknesses of software architectures for the rapid creation of tangible and multimodal interfaces. In Proceedings of 2nd international conference on Tangible and Embedded Interaction (TEI 2008), pages 47–54, Bonn (Germany), 2008. [ bib ]
Keywords: IM2.HMI, Report_VIII
[548] K. Smith, S. Ba, D. Gatica-Perez, and J. M. Odobez. Tracking the visual focus of attention for a varying number of wandering people. IEEE Trans. on Pattern Analysis and Machine Intelligence,, 30(7):1212–1229, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[549] C. Carincotte, X. Naturel, M. Hick, J. M. Odobez, J. Yao, A. Bastide, and B. Corbucci. Understanding metro station usage using closed circuit television cameras analysis. In 11th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2008. [ bib | .pdf ]
In this paper, we propose to show how video data available in standard CCTV transportation systems can represent a useful source of information for transportation infrastructure management, optimization and planning if adequately analyzed (e.g. to facilitate equipment usage understanding, to ease diagnostic and planning for system managers). More precisely, we present two algorithms that estimate the number of people in a camera view and measure the platform time-occupancy by trains. A statistical analysis of the results of each algorithm provides interesting insights regarding station usage. It is also shown that combining information from the algorithms in different views provides a finer understanding of the station usage. An end-user point of view confirms the interest of the proposed analysis.

Keywords: IM2.VP, Report_VIII
[550] S. Zhao and N. Morgan. Multi-stream spectro-temporal features for robust speech recognition. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[551] M. Soleymani, G. Chanel, J. Kierkels, and T. Pun. Valence-arousal representation of movie scenes based on multimedia content analysis and user's physiological emotional responses. In MLMI 2008, 5th Joint Workshop on Machine Learning and Multimodal Interaction, Utrecht, The Netherlands, 2008. (PhD student poster session, with extended abstract). [ bib ]
Keywords: Report_VII, IM2.MCA
[552] E. Bruno, N. Moënne-Loccoz, and S. Marchand-Maillet. Design of multimodal dissimilarity spaces for retrieval of multimedia documents. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[553] H. Ketabdar. Enhancing posterior based speech recognition systems. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2008. Thèse Ecole polytechnique fédérale de Lausanne EPFL, no 4218 (2008), Faculté des sciences et techniques de l'ingénieur STI, Section de génie électrique et électronique, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard. [ bib | .pdf ]
The use of local phoneme posterior probabilities has been increasingly explored for improving speech recognition systems. Hybrid hidden Markov model / artificial neural network (HMM/ANN) and Tandem are the most successful examples of such systems. In this thesis, we present a principled framework for enhancing the estimation of local posteriors, by integrating phonetic and lexical knowledge, as well as long contextual information. This framework allows for hierarchical estimation, integration and use of local posteriors from the phoneme up to the word level. We propose two approaches for enhancing the posteriors. In the first approach, phoneme posteriors estimated with an ANN (particularly multi-layer Perceptron - MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge), and long context. In the second approach, a temporal context of the regular MLP posteriors is post-processed by a secondary MLP, in order to learn inter and intra dependencies among the phoneme posteriors. The learned knowledge is integrated in the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced posteriors. The use of resulting local enhanced posteriors is investigated in a wide range of posterior based speech recognition systems (e.g. Tandem and hybrid HMM/ANN), as a replacement or in combination with the regular MLP posteriors. The enhanced posteriors consistently outperform the regular posteriors in different applications over small and large vocabulary databases.

Keywords: IM2.AP, Report_VIII
[554] O. Vinyals and G. Friedland. Modulation spectrogram features for speaker diarization. to appear in Proceedings of Interspeech 2008, Brisbane, Australia, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[555] O. Vinyals and G. Friedland. Towards semantic analysis of conversations: a system for the live identification of speakers in meetings. to appear in Proceedings of IEEE International Conference on Semantic Computing, Santa Clara, CA, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[556] O. Vinyals and G. Friedland. Live speaker identification in meetings: "who is speaking now?". Technical Report TR-08-001, International Computer Science Institute, Berkeley, CA, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[557] J. Richiardi, A. Drygajlo, and L. Todesco. Promoting diversity in gaussian mixture ensembles: an application to signature verification. pages 140–149, Heidelberg, 2008. Springer. [ bib ]
Keywords: IM2.MPR, Report_VIII
[558] S. Ba and J. M. Odobez. Visual focus of attention estimation from head pose posterior probability distributions. In IEEE Proc. Int. Conf. on Multimedia and Expo (ICME), Hannover, 2008. [ bib ]
We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants from their head pose and contextual cues. The main contribution of the paper is the use of a head pose posterior distribution as a representation of the head pose information contained in the image data. This posterior encodes the probabilities of the different head poses given the image data, and therefore constitutes a richer representation of the data than the mean or the mode of this distribution, as done in all previous work. These observations are exploited in a joint interaction model of all meeting participants' pose observations, VFOAs, speaking status and of environmental contextual cues. Numerical experiments on a public database of 4 meetings of 22 minutes on average show that this change of representation allows for a 5.4% gain with respect to the standard approach using the head pose as observation.

Keywords: Report_VII, IM2.VP
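Marginalizing a head-pose posterior against per-target pose models, rather than committing to the pose mode, can be sketched as follows; the pose bins, likelihoods and priors are invented for illustration, not the paper's models:

```python
# Sketch of using the full head-pose posterior: each VFOA target's
# score marginalizes the pose posterior against a per-target pose
# likelihood, then scores are normalized. All numbers are toy values.
def vfoa_from_pose_posterior(pose_post, pose_lik_per_target, prior):
    scores = {t: prior[t] * sum(pose_post[p] * lik[p] for p in pose_post)
              for t, lik in pose_lik_per_target.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

pose_post = {"left": 0.55, "front": 0.35, "right": 0.10}
pose_lik = {"screen":  {"left": 0.8, "front": 0.15, "right": 0.05},
            "speaker": {"left": 0.1, "front": 0.60, "right": 0.30}}
prior = {"screen": 0.5, "speaker": 0.5}
vfoa = vfoa_from_pose_posterior(pose_post, pose_lik, prior)
```

Even though "front" has substantial mass, the overall posterior mass around "left" lets the screen target win, which a mode-only representation could miss when the pose estimate is uncertain.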
[559] E. Grossmann, J. A. Gaspar, and F. Orabona. Calibration from statistical properties of the visual world. In European Conf. on Computer Vision, 2008. IDIAP-RR 08-63. [ bib | .ps.gz | .pdf ]
What does a blind entity need in order to determine the geometry of the set of photocells that it carries through a changing lightfield? In this paper, we show that very crude knowledge of some statistical properties of the environment is sufficient for this task. We show that some dissimilarity measures between pairs of signals produced by photocells are strongly related to the angular separation between the photocells. Based on real-world data, we model this relation quantitatively, using dissimilarity measures based on correlation and conditional entropy. We show that this model allows the angular separation to be estimated from the dissimilarity. Although the resulting estimators are not very accurate, they maintain their performance throughout different visual environments, suggesting that the model encodes a very general property of our visual world. Finally, leveraging this method to estimate angles from signal pairs, we show how distance geometry techniques allow the complete sensor geometry to be recovered.

Keywords: IM2.VP, Report_VIII
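The final distance-geometry step can be illustrated with classical multidimensional scaling, which recovers a point configuration from pairwise distances; the planar toy points below stand in for the paper's estimated angular separations:

```python
import numpy as np

# Classical MDS sketch: double-center the squared distance matrix to get
# a Gram matrix, then embed via its top eigenpairs. In the paper the
# distances would be angular separations estimated from dissimilarities.
def classical_mds(D, dim=2):
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]          # top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
rec = classical_mds(D)
D_rec = np.linalg.norm(rec[:, None, :] - rec[None, :, :], axis=-1)
```

For exact Euclidean distances the embedding reproduces the pairwise geometry up to rotation and reflection; with noisy angle estimates the same machinery gives a least-squares-flavoured reconstruction.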
[560] A. Popescu-Belis and R. Stiefelhagen. Machine learning for multimodal interaction v, volume 5237 of LNCS. Springer-Verlag, Berlin/Heidelberg, 2008. [ bib ]
Keywords: IM2.MPR, Report_VII
[561] M. M. Ullah, A. Pronobis, B. Caputo, J. Luo, P. Jensfelt, and H. I. Christensen. Towards robust place recognition for robot localization. In IEEE International Conference on Robotics and Automation, 2008. [ bib | .pdf ]
Localization and context interpretation are two key competences for mobile robot systems. Visual place recognition, as opposed to purely geometrical models, holds promise of higher flexibility and association of semantics to the model. Ideally, a place recognition algorithm should be robust to dynamic changes and it should perform consistently when recognizing a room (for instance a corridor) in different geographical locations. Also, it should be able to categorize places, a crucial capability for transfer of knowledge and continuous learning. In order to test the suitability of visual recognition algorithms for these tasks, this paper presents a new database, acquired in three different labs across Europe. It contains image sequences of several rooms under dynamic changes, acquired at the same time with a perspective and omnidirectional camera, mounted on a socket. We assess this new database with an appearance-based algorithm that combines local features with support vector machines through an ad-hoc kernel. Results show the effectiveness of the approach and the value of the database.

Keywords: IM2.VP, Report_VIII
[562] J. P. Pinto, H. Hermansky, B. Yegnanarayana, and M. Magimai-Doss. Exploiting contextual information for improved phoneme recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP 2008), pages 4449–4452, 2008. IDIAP-RR 07-65. [ bib | DOI ]
In this paper, we investigate the significance of contextual information in a phoneme recognition system using the hidden Markov model / artificial neural network paradigm. Contextual information is probed at the feature level as well as at the output of the multilayered perceptron. At the feature level, we analyse and compare different methods to model sub-phonemic classes. To exploit the contextual information at the output of the multilayered perceptron, we propose the hierarchical estimation of phoneme posterior probabilities. The best phoneme (excluding silence) recognition accuracy of 73.4% on the TIMIT database is comparable to that of state-of-the-art systems, but the emphasis here is more on the analysis of the contextual information.

Keywords: IM2.AP, Report_VII
[563] P. Motlicek, S. Ganapathy, and H. Hermansky. Entropy coding of quantized spectral components in fdlp audio codec. Idiap-RR Idiap-RR-71-2008, Idiap, 2008. [ bib | .pdf ]
The audio codec based on Frequency Domain Linear Prediction (FDLP) exploits auto-regressive modeling to approximate the instantaneous energy in critical frequency sub-bands of relatively long input segments. The current version of the FDLP codec, operating at 66 kbps, has been shown to provide subjective listening quality comparable to state-of-the-art codecs at similar bit-rates, even without employing strategic blocks such as entropy coding or simultaneous masking. This paper describes experimental work to increase the compression efficiency of the FDLP codec by employing entropy coding. Unlike the Huffman coding traditionally used in current audio coding systems, we describe an efficient way to exploit arithmetic coding to entropy-compress the quantized magnitude spectral components of the sub-band FDLP residuals. This approach outperforms the Huffman coding algorithm and provides a bit-rate reduction of more than 3 kbps.

Keywords: IM2.AP, Report_VIII
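The bit-rate gain from replacing Huffman with arithmetic coding, as reported in [563], comes from arithmetic coding's ability to approach the source entropy below Huffman's one-bit-per-symbol floor. A toy illustration of that bound (the skewed symbol distribution is made up for the example, not taken from the paper):

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Bits per symbol of the empirical distribution: the lower
    bound on the average rate of any lossless code, which
    arithmetic coding can approach arbitrarily closely."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy skewed source: a symbol-by-symbol Huffman code cannot assign
# fewer than 1 bit to any symbol, so its rate stays at 1.0
# bit/symbol here, while the entropy (the arithmetic-coding
# target) is far lower.
data = ['a'] * 90 + ['b'] * 10
h = empirical_entropy(data)   # about 0.47 bits/symbol
```

For the quantized spectral components in the codec the alphabets are much larger, but the same gap between integer-length codewords and the entropy is what the reported 3 kbps reduction exploits.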
[564] A. Popescu-Belis. Reference-based vs. task-based evaluation of human language technology. In LREC 2008 ELRA Workshop on Evaluation: "Looking into the Future of Evaluation: When automatic metrics meet task-based and performance-based approaches", pages 12–16, Marrakech, Morocco, 2008. ELRA. [ bib ]
This paper starts from the ISO distinction of three types of evaluation procedures (internal, external and in use) and proposes to match these types to the three types of human language technology (HLT) systems: analysis, generation, and interactive. The paper explains why internal evaluation is not suitable to measure the qualities of HLT systems, and shows that reference-based external evaluation is best adapted to "analysis" systems and task-based evaluation to "interactive" systems, while "generation" systems can be subject to both types of evaluation. In particular, some limits of reference-based external evaluation are shown in the case of generation systems. Finally, the paper shows that contextual evaluation, as illustrated by the FEMTI framework for MT evaluation, is an effective method for bringing reference-based evaluation closer to the users of a system.

Keywords: Report_VII, IM2.DMA
[565] A. Thomas, S. Ganapathy, and H. Hermansky. Recognition of reverberant speech using frequency domain linear prediction. IEEE Signal Processing Letters, 2008. IDIAP-RR 08-41. [ bib ]
The performance of a typical automatic speech recognition (ASR) system degrades severely when it encounters speech from reverberant environments. Part of the reason for this degradation is that feature extraction techniques use analysis windows which are much shorter than typical room impulse responses. We present a feature extraction technique based on modeling the temporal envelopes of the speech signal in narrow sub-bands using Frequency Domain Linear Prediction (FDLP). FDLP provides an all-pole approximation of the Hilbert envelope of the signal, obtained by linear prediction on the cosine transform of the signal. ASR experiments on speech data degraded with a number of room impulse responses (with varying degrees of distortion) show significant performance improvements for the proposed FDLP features compared to other robust feature extraction techniques (an average relative reduction of 24% in word error rate). Similar improvements are also obtained for far-field data containing natural reverberation and background noise. These results are achieved without any noticeable degradation in performance for clean speech.

Keywords: IM2.AP, Report_VII
[566] P. W. Ferrez and J. del R. Millán. Error-related eeg potentials generated during simulated brain-computer interaction. IEEE Trans. on Biomedical Engineering, 55(3):923–929, 2008. [ bib ]
Keywords: IM2.BMI, Report_VIII
[567] J. P. Pinto and H. Hermansky. Combining evidence from a generative and a discriminative model in phoneme recognition. In Proceedings of Interspeech 2008, 2008. IDIAP-RR 08-20. [ bib ]
We investigate the use of the log-likelihood of the features obtained from a generative Gaussian mixture model, and the posterior probability of phonemes from a discriminative multilayered perceptron, in multi-stream combination for phoneme recognition. Multi-stream combination techniques, namely early integration and late integration, are used to combine the evidence from these models. By using multi-stream combination, we obtain a phoneme recognition accuracy of 74% on the standard TIMIT database, an absolute improvement of 2.5% over the single best stream.

Keywords: IM2.AP, Report_VII
[568] J. Meynet and J. Ph. Thiran. Ensembles of svms using an information theoretic criterion. Pattern Recognition Letters, 2008. ITS. [ bib ]
Training Support Vector Machines (SVMs) can become very challenging on large-scale datasets. The problem can be addressed by training several lower-complexity SVMs on local subsets of the training set. Combining the resulting SVMs in parallel can significantly reduce the training complexity and also improve classification performance. In order to obtain effective classifier ensembles, the classifiers need to be both diverse and individually accurate. In this paper we propose an algorithm for training ensembles of SVMs that takes into account the diversity between the parallel classifiers. For this, we use an information theoretic criterion that expresses a trade-off between individual accuracy and diversity. The parallel SVMs are trained jointly using an adaptation of the Kernel-Adatron algorithm for learning multiple SVMs on-line. The results are compared to standard multiple-SVM techniques on reference large-scale datasets.

Keywords: Report_VII, IM2.VP, support vector machines; combination of classifiers; ensembles; information theory; diversity; lts5
[569] U. Hoffmann, J. M. Vesin, T. Ebrahimi, and K. Diserens. An efficient p300-based brain-computer interface for disabled subjects. Journal of Neuroscience Methods, 167(1):115–125, 2008. Datasets and MATLAB-Code are available at http://bci.epfl.ch. [ bib | DOI ]
A brain-computer interface (BCI) is a communication system that translates brain activity into commands for a computer or other devices. In other words, a BCI allows users to act on their environment by using only brain activity, without using peripheral nerves and muscles. In this paper, we present a BCI that achieves high classification accuracy and high bitrates for both disabled and able-bodied subjects. The system is based on the P300 evoked potential and is tested with five severely disabled and four able-bodied subjects. For four of the disabled subjects, classification accuracies of 100% are obtained. The bitrates obtained for the disabled subjects range between 10 and 25 bits/min. The effect of different electrode configurations and machine learning algorithms on classification accuracy is tested. Further factors that are possibly important for obtaining good classification accuracy in P300-based BCI systems for disabled subjects are discussed.

Keywords: Report_VII, IM2.BMI,LTS1
[570] J. Meynet and J. Ph. Thiran. Information theoretic combination of classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. ITS. [ bib | DOI ]
Combining several classifiers has proved to be an effective machine learning technique. Two concepts clearly influence the performance of an ensemble of classifiers: the diversity between the classifiers and their individual accuracies. In this paper we propose an information theoretic framework to establish a link between these quantities. As they appear to be contradictory, we propose an information theoretic score (ITS) that expresses a trade-off between individual accuracy and diversity. This technique can be directly used, for example, for selecting an optimal ensemble from a pool of classifiers. We perform experiments in the context of overproduction and selection of classifiers, and show that selection based on the ITS outperforms state-of-the-art diversity-based selection techniques.

Keywords: Report_VII, IM2.VP, combination of classifiers, information theory, support ; vector machines, diversity, majority voting, ensembles, lts5
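The accuracy/diversity trade-off behind the ITS of [570] can be sketched as below. The linear combination of mean individual accuracy and mean pairwise disagreement is an illustrative stand-in for the paper's actual information theoretic score, and the `alpha` parameter is an assumption of this sketch:

```python
def accuracy(preds, labels):
    """Fraction of samples a single classifier labels correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def disagreement(preds_a, preds_b):
    """Fraction of samples where two classifiers disagree:
    a simple proxy for the diversity term."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def ensemble_score(all_preds, labels, alpha=0.5):
    """Trade-off between mean individual accuracy and mean
    pairwise diversity; alpha balances the two terms."""
    n = len(all_preds)
    acc = sum(accuracy(p, labels) for p in all_preds) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    div = sum(disagreement(all_preds[i], all_preds[j])
              for i, j in pairs) / len(pairs)
    return alpha * acc + (1 - alpha) * div

# Selecting the subset of a classifier pool that maximizes such a
# score prefers ensembles that are accurate yet do not all make
# the same mistakes.
labels = [0, 1, 0, 1]
score = ensemble_score([[0, 1, 0, 1], [0, 1, 0, 0], [1, 1, 0, 1]], labels)
```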
[571] L. Gui, J. Ph. Thiran, and N. Paragios. Cooperative object segmentation and behavior inference in image sequences. International Journal of Computer Vision, 2008. [ bib | DOI ]
Keywords: Report_VII, IM2.VP, image segmentation; behavior inference; gesture recognition; LTS5
[572] R. Bertolami and H. Bunke. Including language model information in the combination of handwritten text line recognizers. In Proc. 11th Int. Conf. on Frontiers in Handwriting Recognition, pages 25–30, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[573] R. Bertolami and H. Bunke. Hidden markov model based ensemble methods for offline handwritten text line recognition. Pattern Recognition, 41(11):3452–3460, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[574] K. Kryszczuk and A. Drygajlo. Credence estimation and error prediction in biometric identity verification. Signal Processing, 88(4):916–925, 2008. [ bib | http ]
Keywords: Report_VII, IM2.MPR, Biometric identity verification
[575] E. Kokiopoulou, P. Frossard, and O. Verscheure. Fast keyword detection with sparse time-frequency models. In IEEE Int. Conf. on Multimedia & Expo (ICME), 2008. [ bib ]
Keywords: Report_VII, IM2.DMA.VP, joint publication
[576] K. Kryszczuk and A. Drygajlo. Impact of feature correlations on separation between bivariate normal distributions. In 19th International Conference on Pattern Recognition, Tampa, Florida, USA, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[577] J. P. Pinto, G. S. V. S. Sivaram, and H. Hermansky. Reverse correlation for analyzing mlp posterior features in asr. In 11th International Conference on Text, Speech and Dialogue (TSD), pages 469–476, 2008. IDIAP-RR 08-13. [ bib | DOI ]
In this work, we investigate the reverse correlation technique for analyzing posterior feature extraction using an multilayered perceptron trained on multi-resolution RASTA (MRASTA) features. The filter bank in MRASTA feature extraction is motivated by human auditory modeling. The MLP is trained based on an error criterion and is purely data driven. In this work, we analyze the functionality of the combined system using reverse correlation analysis.

Keywords: IM2.AP, Report_VII
[578] X. Naturel and J. M. Odobez. Detecting queues at vending machines: a statistical layered approach. In Proc. Int. Conf. on Pattern Recognition (ICPR), 2008. [ bib | .pdf ]
This paper presents a method for monitoring activities at a ticket vending machine in a video-surveillance context. Rather than relying on the output of a tracking module, which is prone to errors, the events are directly recognized from image measurements; in particular, no tracking is required. A statistical layered approach is proposed: in the first layer, several sub-events are defined and detected using a discriminative approach; the second layer uses the result of the first and models the temporal relationships of the high-level event using a Hidden Markov Model (HMM). Results are assessed on 3.5 hours of real video footage recorded in a Turin metro station.

Keywords: IM2.MPR, Report_VIII
[579] T. Tommasi, F. Orabona, and B. Caputo. An svm confidence-based approach to medical image annotation. In C. Peters, D. Giampiccolo, and N. Ferro, editors, Evaluating Systems for Multilingual and Multimodal Information Access – 9th Workshop of the Cross-Language Evaluation Forum, LNCS, 2008. [ bib | .pdf ]
This paper presents the algorithms and results of the "idiap" team participation in the ImageCLEFmed annotation task in 2008. On the basis of our successful experience in 2007, we decided to integrate two different local structural and textural descriptors. Cues are combined through concatenation of feature vectors and through the Multi-Cue Kernel. The challenge this year was to annotate images coming mainly from classes with only few training examples. We tackled the problem on two fronts: (1) we introduced a further integration strategy using SVM as an opinion maker; (2) we enriched the poorly populated classes by adding virtual examples. We submitted several runs considering different combinations of the proposed techniques. The run jointly using feature concatenation, confidence-based opinion fusion and virtual examples ranked first among all submissions.

Keywords: IM2.MPR, Report_VIII
[580] B. Mesot. Switching linear dynamical systems for noise robust speech recognition of isolated digits. PhD thesis, STI School of Engineering, EPFL, Lausanne, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[581] G. Gonzalez, F. Fleuret, and P. Fua. Automated delineation of dendritic networks in noisy image stacks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 214–227, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[582] J. del R. Millán. Brain-controlled robots. In IEEE International Conference on Robotics and Automation (ICRA 2008), ATR Computational Neuroscience Laboratories, 2008. [ bib | DOI ]
The idea of moving robots or prosthetic devices not by manual control, but by mere "thinking" (i.e., the brain activity of human subjects) has fascinated researchers for the last 30 years, but it is only now that first experiments have shown it to be possible. How can brainwaves be used to directly control robots? Most of the hope for brain-controlled robots comes from invasive approaches that provide detailed single-neuron activity recorded from microelectrodes implanted in the brain [1]. The motivation for these invasive approaches is that it has been widely shown that motor parameters related to hand and arm movements are encoded in a distributed and redundant way by ensembles of neurons in the motor system of the brain: motor, premotor and posterior parietal cortex. For humans, however, it is preferable to use non-invasive approaches to avoid health risks and the associated ethical concerns. Most non-invasive brain-computer interfaces (BCI) use electroencephalogram (EEG) signals, i.e., the electrical brain activity recorded from electrodes placed on the scalp. The main source of the EEG is the synchronous activity of thousands of cortical neurons. Thus, EEG signals suffer from reduced spatial resolution and increased noise due to measurement on the scalp. As a consequence, current EEG-based brain-actuated devices are limited by a low channel capacity and are considered too slow for controlling rapid and complex sequences of robot movements. But, recently, we have shown for the first time that online analysis of EEG signals, if used in combination with advanced robotics and machine learning techniques, is sufficient for humans to continuously control a mobile robot [2] and a wheelchair [3]. In this article we review our work on non-invasive brain-controlled robots and discuss some of the challenges ahead.

Keywords: IM2.BMI, Report_VII
[583] D. Vergyri, A. Mandal, W. Wang, A. Stolcke, J. Zheng, M. Graciarena, D. Rybach, C. Gollan, R. Schlüter, K. Kirchhoff, A. Faria, and N. Morgan. Development of the sri/nightingale arabic asr system. In 9th International Conference of the ISCA (Interspeech 2008), Brisbane, Australia, pages 1437–1440, 2008. [ bib ]
Keywords: IM2.AP, Report_VIII
[584] S. Favre, H. Salamin, and A. Vinciarelli. Role recognition in multiparty recordings using social affiliation networks and discrete distributions. In The Tenth International Conference on Multimodal Interfaces (ICMI 2008), number Idiap-RR-64-2008, 2008. [ bib ]
This paper presents an approach for the recognition of roles in multiparty recordings. The approach includes two major stages: extraction of Social Affiliation Networks (speaker diarization and representation of people in terms of their social interactions), and role recognition (application of discrete probability distributions to map people into roles). The experiments are performed over several corpora, including broadcast data and meeting recordings, for a total of roughly 90 hours of material. The results are satisfactory for the broadcast data (around 80 percent of the data time correctly labeled in terms of role), while they still must be improved in the case of the meeting recordings (around 45 percent of the data time correctly labeled). In both cases, the approach significantly outperforms chance.

Keywords: IM2.MCA, Report_VII
[585] A. Thomas, S. Ganapathy, and H. Hermansky. Spectro-temporal features for automatic speech recognition using linear prediction in spectral domain. In 16th European Signal Processing Conference (EUSIPCO 2008), 2008. IDIAP-RR 08-05. [ bib ]
Frequency Domain Linear Prediction (FDLP) provides an efficient way to represent the temporal envelopes of a signal using auto-regressive models. For the input speech signal, we use FDLP to estimate temporal trajectories of sub-band energy by applying linear prediction on the cosine transform of sub-band signals. The sub-band FDLP envelopes are used to extract spectral and temporal features for speech recognition. The spectral features are derived by integrating the temporal envelopes in short-term frames, and the temporal features are formed by converting these envelopes into modulation frequency components. These features are then combined at the phoneme posterior level and used as input features for a hybrid HMM-ANN based phoneme recognizer. The proposed spectro-temporal features provide a phoneme recognition accuracy of 69.1% (an improvement of 4.8% over the Perceptual Linear Prediction (PLP) baseline) on the TIMIT database.

Keywords: IM2.AP, Report_VII
[586] A. Faria and N. Morgan. Corrected tandem features for acoustic model training. In International Conference on Acoustics, Speech, and Signal Processing, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[587] L. Matena, A. Jaimes, and A. Popescu-Belis. Graphical representation of meetings on mobile devices. In MobileHCI 2008 Demonstrations (10th ACM International Conference on Human-Computer Interaction with Mobile Devices and Services), Amsterdam, 2008. [ bib ]
The AMIDA Mobile Meeting Assistant is a system that allows remote participants to attend a meeting through a mobile device. The system improves the engagement of remote participants in the meeting, with respect to voice-only solutions, thanks to the use of visual annotations and the capture of slides. The visual focus of attention of meeting participants and other annotations serve to reconstruct a 2D or 3D representation of the meeting on a mobile device (smart phone). A first version of the system has been implemented; feedback from a user study and from industrial partners shows that the Mobile Meeting Assistant's functionalities are positively appreciated, and sets priorities for future developments.

Keywords: Report_VII, IM2.HMI
[588] P. Besson, V. Popovici, J. M. Vesin, J. Ph. Thiran, and M. Kunt. Extraction of audio features specific to speech production for multimodal speaker detection. IEEE Transactions on Multimedia, 10(1):63–73, 2008. [ bib | DOI ]
Keywords: Report_VII, IM2.MPR, joint publication, LTS1; LTS5; speaker detection; multimodal; feature extraction; besson p.
[589] O. Koval, S. Voloshynovskiy, F. Caire, and P. Bas. Privacy-preserving multimodal person and object identification. In MM&Sec 2008, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[590] G. Bologna, B. Deville, M. Vinckenbosch, and T. Pun. a perceptual interface for vision substitution in a color matching experiment. In Proceeding on IEEE IJCNN, IEEE World congress on computational intelligence, 2008. [ bib ]
Keywords: IM2.MCA, Report_VIII
[591] J. Meynet, T. Arsan, J. Cruz Mota, and J. Ph. Thiran. Fast multi-view face tracking with pose estimation. In 16th European Signal Processing Conference, 2008. [ bib | http ]
In this paper, a fast and effective multi-view face tracking algorithm with head pose estimation is introduced. For modeling the face pose we employ a tree of boosted classifiers built using either Haar-like filters or Gauss filters. A first classifier extracts faces of any pose from the background. Then more specific classifiers discriminate between different poses. The tree of classifiers is trained by hierarchically sub-sampling the pose space. Finally, the Condensation algorithm is used for tracking the faces. Experiments show large improvements in terms of detection rate and processing speed compared to state-of-the-art algorithms.

Keywords: Report_VII, IM2.VP, lts5; lts; face detection; face tracking; head pose; condensation
[592] F. Valente and H. Hermansky. On the combination of auditory and modulation frequency channels for asr applications. In Interspeech 2008, 2008. IDIAP-RR 08-12. [ bib ]
This paper investigates the combination of evidence coming from different frequency channels, obtained by filtering the speech signal at different auditory and modulation frequencies, building on our previous work (ICASSP 2008).

Keywords: IM2.AP, Report_VII
[593] M. Gurban, J. Ph. Thiran, T. Drugman, and T. Dutoit. Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition. In 10th International Conference on Multimodal Interfaces, 2008. [ bib | http ]
Merging decisions from different modalities is a crucial problem in audio-visual speech recognition. To address it, state-synchronous multi-stream HMMs have been proposed, with the important advantage of incorporating stream reliability into their fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environment noise, as can be encountered in realistic applications, for which adaptive methods are best suited. Stream reliability is assessed directly through classifier outputs, since these are not specific to either noise type or level. The influence of constraining the weights to sum to one is also discussed.

Keywords: Report_VII, IM2.MPR, LTS5
[594] M. Gurban and J. Ph. Thiran. Using entropy as a stream reliability estimate for audio-visual speech recognition. In 16th European Signal Processing Conference, Lausanne, Switzerland, 2008. [ bib | http ]
We present a method for dynamically integrating audio-visual information for speech recognition, based on the estimated reliability of the audio and visual streams. Our method uses an information theoretic measure, the entropy derived from the state probability distribution for each stream, as an estimate of reliability. The two modalities, audio and video, are weighted at each time instant according to their reliability. In this way, the weights vary dynamically and are able to adapt to any type of noise in each modality, and more importantly, to unexpected variations in the level of noise.

Keywords: Report_VII, IM2.MPR, LTS5
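The entropy-based reliability weighting described in [594] can be sketched as follows. The inverse-entropy mapping and the sum-to-one normalization below are illustrative assumptions of this sketch, not the authors' exact scheme:

```python
import math

def entropy(posteriors):
    """Shannon entropy (nats) of a state posterior distribution:
    low entropy means a confident, hence reliable, stream."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)

def stream_weights(audio_post, video_post, eps=1e-6):
    """Weight each stream inversely to its posterior entropy,
    normalized so the weights sum to one (the constraint
    discussed in the related papers above)."""
    inv_a = 1.0 / (entropy(audio_post) + eps)
    inv_v = 1.0 / (entropy(video_post) + eps)
    z = inv_a + inv_v
    return inv_a / z, inv_v / z

# A peaked (low-entropy) audio posterior receives the larger
# weight at this time instant; the weights adapt frame by frame.
w_a, w_v = stream_weights([0.9, 0.05, 0.05], [0.4, 0.3, 0.3])
```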
[595] P. Motlicek, S. Ganapathy, H. Hermansky, H. Garudadri, and M. Athineos. Perceptually motivated sub-band decomposition for fdlp audio coding. In Text, Speech and Dialogue, volume 5246 of Series of Lecture Notes in Artificial Intelligence (LNAI), pages 435–442. Springer-Verlag Berlin, Heidelberg, 2008. [ bib | .pdf ]
This paper describes the employment of non-uniform QMF decomposition to increase the efficiency of a generic wide-band audio coding system based on Frequency Domain Linear Prediction (FDLP). The baseline FDLP codec, operating at high bit-rates (~136 kbps), exploits a uniform QMF decomposition into 64 sub-bands followed by sub-band processing based on FDLP. Here, we propose a non-uniform QMF decomposition into 32 frequency sub-bands obtained by merging 64 uniform QMF bands. The merging operation is performed in such a way that the bandwidths of the resulting critically sampled sub-bands emulate the characteristics of the critical band filters in the human auditory system. Such a frequency decomposition, when employed in the FDLP audio codec, results in a bit-rate reduction of 40% over the baseline. We also describe the complete audio codec, which provides high-fidelity audio compression at 66 kbps. In subjective listening tests, the FDLP codec outperforms MPEG-1 Layer 3 (MP3) and achieves similar quality to the MPEG-4 HE-AAC codec.

Keywords: Audio Coding, Frequency Domain Linear Prediction (FDLP), speech coding, IM2.VP,Report_VIII
[596] D. Roth, E. Koller-Meier, D. Rowe, T. B. Moeslund, and L. van Gool. Event-based tracking evaluation metric. In IEEE Workshop on Motion and Video Computing (WMVC), 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[597] K. Schindler and L. van Gool. Combining densely sampled form and motion for human action recognition. In DAGM Annual Pattern Recognition Symposium. Springer, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[598] S. Kosinov and T. Pun. Distance-based discriminant analysis method and its applications. Pattern Analysis and Applications, 11(3-4):227–246, 2008. (DOI: 10.1007/s10044-007-0082-x). [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[599] M. D. Breitenstein, D. Kuettel, T. Weise, L. van Gool, and H. Pfister. Real-time face pose estimation from single range images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08). IEEE Press, 2008. [ bib ]
Keywords: Report_VII, IM2.VP
[600] P. W. Ferrez and J. del R. Millán. Error-related eeg potentials generated during simulated brain-computer interaction. IEEE Transactions on Biomedical Engineering, 55(3):923–929, 2008. [ bib | DOI ]
Brain-computer interfaces (BCIs) are prone to errors in the recognition of the subject's intent. An elegant approach to improving the accuracy of BCIs consists in a verification procedure directly based on the presence of error-related potentials (ErrP) in the EEG recorded right after the occurrence of an error. Several studies show the presence of ErrP in typical choice reaction tasks. However, in the context of a BCI, the central question is: "Are ErrP also elicited when the error is made by the interface during the recognition of the subject's intent?" We have thus explored whether ErrP also follow a feedback indicating incorrect responses of the simulated BCI interface. Five healthy volunteer subjects participated in a new human-robot interaction experiment, which seems to confirm the previously reported presence of a new kind of ErrP. In order to exploit these ErrP, we need to detect them in each single trial using a short window following the feedback associated with the response of the BCI. We have achieved average recognition rates of correct and erroneous single trials of 83.5% and 79.2%, respectively, using a classifier built with data recorded up to three months earlier.

Keywords: IM2.BMI, Report_VII
[601] O. Koval, S. Voloshynovskiy, and T. Pun. Privacy-preserving multimodal person and object identification. In Proceedings of the 10th ACM Workshop on Multimedia & Security, Oxford, UK, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[602] W. Li, M. M. Doss, J. Dines, and H. Bourlard. Mlp-based log spectral energy mapping for robust overlapping speech recognition. In European Signal Processing Conference, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[603] W. Li. Effective post-processing of single-channel frequency-domain speech enhancement. In IEEE conference on multimedia and expo, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[604] W. Li, J. Dines, M. Magimai-Doss, and H. Bourlard. Neural network based regression for robust overlapping speech recognition using microphone arrays. In Interspeech, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[605] H. Hung and D. Gatica-Perez. Identifying dominant people in meetings from audio-visual sensors. In Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, Special Session on Multimodal HCI for Smart Environments, Amsterdam, The Netherlands, 2008. [ bib ]
This paper provides an overview of the area of automated dominance estimation in group meetings. We describe research in social psychology and use this to explain the motivations behind suggested automated systems. With the growth in availability of conversational data captured in meeting rooms, it is possible to investigate how multi-sensor data allows us to characterize non-verbal behaviors that contribute towards dominance. We use an overview of our own work to address the challenges and opportunities in this area of research.

Keywords: Report_VII, IM2.MPR
[606] A. Humm. Modelling combined handwriting and speech modalities for user authentication. PhD thesis, University of Fribourg, Switzerland, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[607] S. Kosinov, E. Bruno, and S. Marchand-Maillet. Spatially-consistent partial matching for intra- and inter-image prototype selection. To appear in Signal Processing: Image Communication special issue on "Semantic Analysis for Interactive Multimedia Services", 2008. [ bib ]
Keywords: Report_VII, IM2.MCA
[608] J. Anemuller, J. H. Back, B. Caputo, J. Luo, F. Ohl, F. Orabona, R. Vogels, D. Weinshall, and A. Zweig. Biologically motivated audio-visual cue integration for object. In Proceedings of the first Internatinal Conference on Cognitive Systems, 2008. [ bib | .pdf ]
Auditory and visual cues are important sensory inputs for biological and artificial systems. They provide crucial information for navigating environments and recognizing categories, animals and people. How to effectively combine these two sensory channels is still an open issue. As a step towards this goal, this paper presents a comparison between three different multi-modal integration strategies for audio-visual object category detection. We consider a high-level and a low-level cue integration approach, both biologically motivated, and we compare them with a mid-level cue integration scheme. All three integration methods are based on the least squares support vector machine algorithm and state-of-the-art audio and visual feature representations. We conducted experiments on two audio-visual object categories, dogs and guitars, presenting different visual and auditory characteristics. Results show that the high-level integration scheme consistently performs better than the single-cue methods and than the other two integration schemes. These findings confirm results from neuroscience, and suggest that the high-level integration scheme is the most suitable approach to multi-modal cue integration for artificial cognitive systems.

Keywords: IM2.MPR, Report_VIII
[609] D. Jayagopi, H. Hung, C. Yeo, and D. Gatica-Perez. Modeling dominance in group conversations from nonverbal activity cues. IEEE Trans. on Audio, Speech and Language Processing, Special Issue on Multimodal Processing for Speech-based Interactions, accepted for publication, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[610] H. Ketabdar and H. Bourlard. In-context phone posteriors as complementary features for tandem asr. In ICSLP'08, 2008. [ bib ]
In this paper, we present a method for integrating possible prior knowledge (such as phonetic and lexical knowledge), as well as acoustic context (e.g., the whole utterance), in the phone posterior estimation, and we propose to use the obtained posteriors as complementary posterior features in a Tandem ASR configuration. These posteriors are estimated based on the HMM state posterior probability definition (typically used in standard HMM training). In this way, by integrating the appropriate prior knowledge and context, we enhance the estimation of phone posteriors. These new posteriors are called "in-context" or HMM posteriors. We combine these posteriors as complementary evidence with the posteriors estimated from a Multi Layer Perceptron (MLP), and use the combined evidence as features for training and inference in the Tandem configuration. This approach has improved performance, compared to using only MLP-estimated posteriors as features in Tandem, on the OGI Numbers, Conversational Telephone Speech (CTS), and Wall Street Journal (WSJ) databases.

Keywords: IM2.AP, Report_VII
[611] B. Schouten, N. Juul, A. Drygajlo, and M. Tistarelli. Biometrics and identity management. Springer, Heidelberg, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[612] K. Kryszczuk and A. Drygajlo. Impact of feature correlations on separation between bivariate normal distributions. In 19th International Conference on Pattern Recognition, Tampa, Florida, USA, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[613] M. Ouaret, F. Dufaux, and T. Ebrahimi. Enabling privacy for distributed video coding by transform domain scrambling. In 2008 SPIE Visual Communications and Image Processing, 2008. [ bib | http ]
In this paper, a novel scheme for video scrambling is introduced for Distributed Video Coding (DVC). The goal is to conceal video information in applications such as video surveillance and anonymous video communications, in order to preserve privacy. This is achieved by performing transform domain scrambling on both Key and Wyner-Ziv frames. More specifically, the sign of each scrambled transform coefficient is inverted at the encoder side. The scrambling pattern is defined by a secret key, which is required at the decoder for descrambling. The scheme is shown to provide a good level of security in addition to a flexible scrambling level (i.e., the amount of distortion introduced). Finally, it is shown that the original DVC scheme and the one with scrambling have similar rate-distortion performance; in other words, the compression efficiency of DVC is not negatively impacted by the introduction of scrambling.

Keywords: Report_VII, IM2.VP, Media Security; Privacy; Scrambling; Transform Domain; Distributed Video Coding
[614] F. Beekhof, S. Voloshynovskiy, O. Koval, and R. Villán. Secure surface identification codes. In E. J. Delp III, P. W. Wong, J. Dittmann, and N. D. Memon, editors, Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, volume 6819 of Proceedings of SPIE, (SPIE, Bellingham, WA 2008) 68190D, 2008. [ bib | DOI ]
Keywords: Report_VII, IM2.MPR
[615] A. Stolcke, X. Anguera, K. Boakye, O. Cetin, A. Janin, M. Magimai-Doss, C. Wooters, and J. Zheng. The sri-icsi spring 2007 meeting and lecture recognition system. In Multimodal Technologies for Perception of Humans. Lecture Notes in Computer Science, 2008. [ bib ]
Keywords: Report_VII, IM2.AP, joint publication
[616] P. N. Garner. Silence models in weighted finite-state transducers. In Interspeech, 2008. IDIAP-RR 08-19. [ bib ]
We investigate the effects of different silence modelling strategies in Weighted Finite-State Transducers for Automatic Speech Recognition. We show that the choice of silence models, and the way they are included in the transducer, can have a significant effect on the size of the resulting transducer; we present a means to prevent particularly large silence overheads. Our conclusions include that context-free silence modelling fits well with transducer-based grammars, whereas modelling silence as a monophone and a context has larger overheads.

Keywords: IM2.AP, Report_VII
[617] K. Boakye, O. Vinyals, and G. Friedland. Two's a crowd: improving speaker diarization by automatically identifying and excluding overlapped speech. In Interspeech, 2008. [ bib ]
Keywords: Report_VII, IM2.AP
[618] M. Liwicki, A. Schlapbach, and H. Bunke. Writer-dependent recognition of handwritten whiteboard notes in smart meeting room environments. In Proc. 8th IAPR Int. Workshop on Document Analysis Systems, pages 151–157, 2008. [ bib ]
Keywords: IM2.VP, Report_VIII
[619] S. Ganapathy, P. Motlicek, H. Hermansky, and H. Garudadri. Spectral noise shaping: improvements in speech/audio codec based on linear prediction in spectral domain. In INTERSPEECH 2008, 2008. IDIAP-RR 08-16. [ bib ]
Audio coding based on Frequency Domain Linear Prediction (FDLP) uses auto-regressive models to approximate Hilbert envelopes in frequency sub-bands. Although the basic technique achieves good coding efficiency, there is a need to improve the reconstructed signal quality for tonal signals with impulsive spectral content. For such signals, the quantization noise in the FDLP codec appears as frequency components not present in the input signal. In this paper, we propose a technique of Spectral Noise Shaping (SNS) for improving the quality of tonal signals by applying a Time Domain Linear Prediction (TDLP) filter prior to the FDLP processing. The inverse TDLP filter at the decoder shapes the quantization noise to reduce the artifacts. Application of the SNS technique to the FDLP codec improves the quality of the tonal signals without affecting the bit-rate. Performance evaluation is done with Perceptual Evaluation of Audio Quality (PEAQ) scores and with subjective listening tests.

Keywords: IM2.AP, Report_VII
[620] D. Jayagopi. Predicting the dominant clique in meetings through fusion of nonverbal cues. In Proc. ACM, Vancouver, Canada, 2008. [ bib ]
Keywords: IM2.MPR, Report_VIII
[621] O. Koval, S. Voloshynovskiy, F. Beekhof, and T. Pun. Analysis of physical unclonable identification based on reference list decoding. In E. J. Delp III, P. W. Wong, J. Dittmann, and N. D. Memon, editors, Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, volume 6819 of Proceedings of SPIE, (SPIE, Bellingham, WA 2008) 68190B, 2008. [ bib ]
Keywords: Report_VII, IM2.MPR
[622] L. Perruchoud. The anterior cingulate cortex. Idiap-Com Idiap-Com-02-2008, IDIAP, April 2008. [ bib | .pdf ]
Keywords: IM2.MPR, Report_VIII
[623] B. Mesot. Inference in switching linear dynamical systems applied to noise robust speech recognition of isolated digits. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, May 2008. Thèse Ecole polytechnique fédérale de Lausanne EPFL, no 4059 (2008), Faculté des sciences et techniques de l'ingénieur STI, Section de génie électrique et électronique, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard. [ bib | .pdf ]
Real world applications such as hands-free dialling in cars may have to perform recognition of spoken digits in potentially very noisy environments. Existing state-of-the-art solutions to this problem use feature-based Hidden Markov Models (HMMs), with a preprocessing stage to clean the noisy signal. However, the effect that the noise has on the induced HMM features is difficult to model exactly and limits the performance of the HMM system. An alternative to feature-based HMMs is to model the clean speech waveform directly, which has the potential advantage that including an explicit model of additive noise is straightforward. One of the simplest models of the clean speech waveform is the autoregressive (AR) process. Being too simple to cope with the nonlinearity of the speech signal, the AR process is generally embedded into a more elaborate model, such as the Switching Autoregressive HMM (SAR-HMM). In this thesis, we extend the SAR-HMM to jointly model the clean speech waveform and additive Gaussian white noise. This is achieved by using a Switching Linear Dynamical System (SLDS) whose internal dynamics is autoregressive. On an isolated digit recognition task where utterances have been corrupted by additive Gaussian white noise, the proposed SLDS outperforms a state-of-the-art HMM system. For more natural noise sources, at low signal to noise ratios (SNRs), it is also significantly more accurate than a feature-based HMM system. Inferring the clean waveform from the observed noisy signal with a SLDS is formally intractable, resulting in many approximation strategies in the literature. In this thesis, we present the Expectation Correction (EC) approximation. The algorithm has excellent numerical performance compared to a wide range of competing techniques, and provides a stable and accurate linear-time approximation which scales well to long time series such as those found in acoustic modelling.
A fundamental issue faced by models based on AR processes is that they are sensitive to variations in the amplitude of the signal. One way to overcome this limitation is to use Gain Adaptation (GA) to adjust the amplitude by maximising the likelihood of the observed signal. However, adjusting model parameters without constraint may lead to overfitting when the models are sufficiently flexible. In this thesis, we propose a statistically principled alternative based on an exact Bayesian procedure in which priors are explicitly defined on the parameters of the underlying AR process. Compared to GA, the Bayesian approach enhances recognition accuracy at high SNRs, but is slightly less accurate at low SNRs.

Keywords: Report_VII, IM2.AP
[624] F. Galán. Methods for Asynchronous and Non-Invasive EEG-Based Brain-Computer Interfaces. Towards Intelligent Brain-Actuated Wheelchairs. PhD thesis, University of Barcelona, June 2008. [ bib | .pdf ]
Keywords: IM2.BMI, Report_VII
[625] S. H. K. Parthasarathi, P. Motlicek, and H. Hermansky. Exploiting temporal context for speech/non-speech detection. Idiap-RR Idiap-RR-21-2008, IDIAP, September 2008. [ bib | .ps.gz | .pdf ]
In this paper, we investigate the effect of temporal context for speech/non-speech detection (SND). It is shown that even a simple feature such as full-band energy, when employed with a large-enough context, shows promise for further investigation. Experimental evaluations on the test data set, with a state-of-the-art multi-layer perceptron based SND system and a simple energy threshold based SND method, using the F-measure, show an absolute performance gain of 4.4% and 5.4% respectively, when used with a context of 1000 ms. ROC based performance evaluation also reveals promising performance for the proposed method, particularly in low SNR conditions.

Keywords: IM2.AP, Report_VII
[626] W. Li, K. Kumatani, J. Dines, M. Magimai-Doss, and H. Bourlard. A neural network based regression approach for recognizing simultaneous speech. Idiap-RR Idiap-RR-10-2008, IDIAP, September 2008. Submitted for publication. [ bib | .ps.gz | .pdf ]
This paper presents our approach for automatic speech recognition (ASR) of overlapping speech. Our system consists of two principal components: a speech separation component and a feature estimation component. In the speech separation phase, we first estimate the speaker's position, and then use the speaker location information in a GSC-configured beamformer with a minimum mutual information (MMI) criterion, followed by a Zelinski and binary-masking post-filter, to separate the speech of different speakers. In the feature estimation phase, neural networks are trained to learn the mapping from the features extracted from the pre-separated speech to those extracted from the close-talking microphone speech signal. The outputs of the neural networks are then used to generate acoustic features, which are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through ASR experiments on the PASCAL Speech Separation Challenge II (SSC2) corpus. We demonstrate that our system provides large improvements in recognition accuracy compared with a single distant microphone case, and that the performance of the ASR system can be significantly improved through the use of both MMI beamforming and feature mapping approaches.

Keywords: Report_VII, IM2.AP
[627] S. Ganapathy, P. Motlicek, and H. Hermansky. Low-delay error resilient speech coding using sub-band hilbert envelopes. Idiap-RR Idiap-RR-75-2008, Idiap, September 2008. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[628] R. A. Negoescu and D. Gatica-Perez. Topickr: Flickr groups and users reloaded. In MM '08: Proc. of the 16th ACM Intl. Conf. on Multimedia, Vancouver, Canada, October 2008. ACM. [ bib ]
With the increased presence of digital imaging devices there also came an explosion in the amount of multimedia content available online. Users have transformed from passive consumers of media into content creators. Flickr.com is one such example of an online community, with over 2 billion photos (and more recently, videos as well), most of which are publicly available. The user interaction with the system also provides a plethora of metadata associated with this content, in particular tags. One very important aspect of Flickr is the ability of users to organize in self-managed communities called groups. Although users and groups are conceptually different, in practice they can be represented in the same way: as a bag-of-tags, which is amenable to probabilistic topic modeling. We present a topic-based approach to represent Flickr users and groups and demonstrate it with a web application, Topickr, that allows similarity-based exploration of Flickr entities using their topic-based representation, learned in an unsupervised manner.

Keywords: IM2.MPR, Report_VIII
[629] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, December 2008. [ bib | .pdf ]
We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow's rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions.

Keywords: IM2.AP, Report_VIII
[630] P. Motlicek. Automatic out-of-language detection based on confidence measures derived from lvcsr word and phone lattices. In 10th Annual Conference of the International Speech Communication Association. ISCA, 2009. [ bib ]
Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, which is the target language, with a medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). The results indicate that employing specific temporal context (integrated at the word or phone level) significantly increases detection accuracy. Furthermore, we show that combining several CMs can also improve the efficiency of detection.

Keywords: IM2.AP, Report_VIII
[631] G. Heusch and S. Marcel. Bayesian networks to combine intensity and color information in face recognition. Idiap-RR Idiap-RR-27-2009, Idiap, 2009. [ bib | .pdf ]
Keywords: IM2.VP, Report_VIII
[632] S. Ganapathy, P. Motlicek, and H. Hermansky. Error resilient speech coding using sub-band hilbert envelopes. In 12th International Conference on Text, Speech and Dialogue (TSD 2009), LNAI 5729. Springer-Verlag, Berlin Heidelberg, 2009. [ bib ]
Frequency Domain Linear Prediction (FDLP) represents a technique for auto-regressive modelling of Hilbert envelopes of a signal. In this paper, we propose a speech coding technique that uses FDLP in Quadrature Mirror Filter (QMF) sub-bands of short segments of the speech signal (25 ms). Line Spectral Frequency parameters related to autoregressive models and the spectral components of the residual signals are transmitted. For simulating the effects of lossy transmission channels, bit-packets are dropped randomly. In the objective and subjective quality evaluations, the proposed FDLP speech codec is judged to be more resilient to bit-packet losses compared to the state-of-the-art Adaptive Multi-Rate Wide-Band (AMR-WB) codec at 12 kbps.

Keywords: IM2.VP, Report_VIII
[633] T. Weyand, T. Deselaers, and H. Ney. Log-linear mixtures for object recognition. In British Machine Vision Conference, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[634] T. Weise, T. Wismer, B. Leibe, and L. Van Gool. In-hand scanning with online loop closure. In IEEE International Workshop on 3-D Digital Imaging and Modeling, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[635] T. Gass, T. Deselaers, and H. Ney. Deformation-aware log-linear models. In Deutsche Arbeitsgemeinschaft für Mustererkennung Symposium, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[636] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[637] S. Stalder, H. Grabner, and L. Van Gool. Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition. In OLCV 09: 3rd On-line learning for Computer Vision Workshop, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[638] S. Gammeter, L. Bossard, T. Quack, and L. Van Gool. I know what you did last summer: object-level auto-annotation of holiday snaps. In International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[639] Peter Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In Proceedings of the Twelfth IEEE International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[640] Peter Gehler and Sebastian Nowozin. Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[641] N. Hasler, B. Rosenhahn, T. Thormählen, M. Wand, J. Gall, and H. P. Seidel. Markerless motion capture with unsynchronized moving cameras. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[642] N. Bellotto, E. Sommerlade, B. Benfold, C. Bibby, I. Reid, D. Roth, L. Van Gool, C. Fernandez, and J. Gonzalez. A distributed camera system for multi-resolution surveillance. In Third ACM/IEEE International Conference on Distributed Smart Cameras, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[643] M. D. Breitenstein, Helmut Grabner, and Luc Van Gool. Hunting nessie – real-time abnormality detection from webcams. In IEEE International Workshop on Visual Surveillance, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[644] M. D. Breitenstein, Fabian Reichlin, B. Leibe, E. Koller-Meier, and Luc Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In IEEE International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[645] M. Van den Bergh, J. Halatsch, A. Kunze, F. Bosche, L. Van Gool, and G. Schmitt. Towards collaborative interaction with large nd models for effective project management. In 9th International Conference on Construction Applications of Virtual Reality, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[646] M. Van den Bergh, F. Bosche, E. Koller-Meier, and L. Van Gool. Haarlet-based hand gesture recognition for 3d interaction. In Proceedings of the IEEE Workshop on Applications of Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[647] M. Shaheen, J. Gall, R. Strzodka, L. Van Gool, and H. P. Seidel. A comparison of 3d model-based tracking approaches for human motion capture in uncontrolled environments. In IEEE Workshop on Applications of Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[648] L. Wu, S. C. Hoi, R. Jin, J. Zhu, and N. Yu. Distance metric learning from uncertain side information with application to automated photo tagging. In ACM Multimedia 2009, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[649] L. Jie, Barbara Caputo, and V. Ferrari. Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[650] K. Dylla, P. Müller, A. Ulmer, S. Haegler, and B. Frischer. Rome reborn 2.0: A framework for virtual city reconstruction using procedural modeling techniques. In Proceedings of Computer Applications and Quantitative Methods in Archaeology, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[651] J. Zhu, L. Van Gool, and S. C. Hoi. Unsupervised face alignment by robust nonrigid mapping. In ICCV2009, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[652] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H. P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[653] J. Gall and V. Lempitsky. Class-specific hough forests for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[654] Henning Hamer, K. Schindler, E. Koller-Meier, and Luc Van Gool. Tracking a hand manipulating an object. In IEEE International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[655] Gideon Aschwanden, S. Haegler, Jan Halatsch, Raphael Jecker, Gerhard Schmitt, and Luc Van Gool. Evaluation of 3d city models using automatic placed urban agents. In CONVR, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[656] G. Fanelli, J. Gall, and L. Van Gool. Hough transform-based mouth localization for audio-visual speech recognition. In British Machine Vision Conference, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[657] Fabian Nater, Helmut Grabner, T. Jaeggli, and Luc Van Gool. Tracker trees for unusual event detection. In IEEE International Workshop on Visual Surveillance, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[658] Marcin Eichner and V. Ferrari. Better appearance models for pictorial structures. In British Machine Vision Conference, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[659] Alain Lehmann, B. Leibe, and Luc Van Gool. Prism: Principled implicit shape model. In British Machine Vision Conference, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[660] Alain Lehmann, B. Leibe, and Luc Van Gool. Feature-centric efficient subwindow search. In IEEE International Conference on Computer Vision, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[661] A. Ess, K. Schindler, B. Leibe, and L. van Gool. Improved multi-person tracking with active occlusion handling. In ICRA Workshop on People Detection and Tracking, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[662] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Moving obstacle detection in highly dynamic scenes. In IEEE International Conference on Robotics and Automation, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[663] V. Ferrari, M. Marin, and A. Zisserman. 2d human pose estimation in tv shows. In D. Cremers, B. Rosenhahn, A. Yuille, and F. Schmidt, editors, Statistical and Geometrical Approaches to Visual Motion Analysis, pages 128–147. Springer, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, incollection
[664] Peter Gehler and Bernhard Schölkopf. An introduction to kernel learning algorithms. In Gustavo Camps-Valls and Lorenzo Bruzzone, editors, Kernel Methods for Remote Sensing Data Analysis, pages 39–60. Wiley, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, incollection
[665] Michael Van den Bergh, Roland Kehl, E. Koller-Meier, and Luc Van Gool. Real-time 3d body pose estimation. In Hamid Aghajan and Andrea Cavallaro, editors, Multi-Camera Networks: Concepts and Applications, pages 335–360. Elsevier, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, incollection
[666] S. Haegler, P. Müller, and Luc Van Gool. Procedural modeling for digital cultural heritage. EURASIP Journal on Image and Video Processing, 2009, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[667] Michael Van den Bergh, E. Koller-Meier, and Luc Van Gool. Real-time body pose recognition using 2d or 3d haarlets. International Journal of Computer Vision, 83:72–84, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[668] I. K. Park, M. Germann, M. D. Breitenstein, and H. Pfister. Fast and automatic object pose estimation for range images on the gpu. Machine Vision and Applications, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[669] F. Bosché, C. T. Haas, and B. Akinci. Automated recognition of 3d cad objects in site laser scans for project 3d status visualization and performance control. ASCE Journal of Computing in Civil Engineering, 23(6):311–318, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[670] D. Roth, E. Koller-Meier, and Luc Van Gool. Multi-object tracking evaluated on sparse events. Multimedia Tools and Applications, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[671] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. Van Gool. Using multi-view recognition to guide a robot. International Journal of Robotics Research, 28(8):976–998, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[672] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. Van Gool. Shape-from-recognition: Recognition enables meta-data transfer. Computer Vision and Image Understanding, 113(12):1222–1234, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[673] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. Robust multi-person tracking from a mobile platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1831–1846, 2009. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, article
[674] Virginia Estellers, M. Gurban, and J. Ph. Thiran. Selecting relevant visual features for speechreading. In Proc. of the IEEE International Conference on Image Processing, Cairo, Egypt, 2009. [ bib | http | http ]
A quantitative measure of relevance is proposed for the task of constructing visual feature sets which are at the same time relevant and compact. A feature's relevance is given by the amount of information that it contains about the problem, while compactness is achieved by preventing the replication of information between features in the set. To achieve these goals, we use mutual information both for assessing relevance and measuring the redundancy between features. Our application is speechreading, that is, speech recognition performed on the video of the speaker. This is justified by the fact that the performance of audio speech recognition can be improved by augmenting the audio features with visual ones, especially when there is noise in the audio channel. We report significant improvements compared to the most commonly used method of dimensionality reduction for speechreading, linear discriminant analysis.

Keywords: LTS5; Feature extraction; image processing; speech recognition, Report_IX, IM2.IP1, Group Thiran, inproceedings
[675] L. Gui, J. Ph. Thiran, and N. Paragios. Cooperative Object Segmentation and Behavior Inference in Image Sequences. International Journal of Computer Vision, 84(2):146–162, 2009. [ bib | http ]
In this paper, we propose a general framework for fusing bottom-up segmentation with top-down object behavior inference over an image sequence. This approach is beneficial for both tasks, since it enables them to cooperate so that knowledge relevant to each can aid in the resolution of the other, thus enhancing the final result. In particular, the behavior inference process offers dynamic probabilistic priors to guide segmentation. At the same time, segmentation supplies its results to the inference process, ensuring that they are consistent both with prior knowledge and with new image information. The prior models are learned from training data and they adapt dynamically, based on newly analyzed images. We demonstrate the effectiveness of our framework via particular implementations that we have employed in the resolution of two hand gesture recognition applications. Our experimental results illustrate the robustness of our joint approach to segmentation and behavior inference in challenging conditions involving complex backgrounds and occlusions of the target object.

Keywords: image segmentation; behavior inference; gesture recognition; LTS5, Report_IX, IM2.IP1, Group Thiran, article
[676] J. Kierkels, M. Soleymani, and T. Pun. Queries and tags in affect-based multimedia retrieval. In International Conference on Multimedia and Expo, Special Session on Implicit Tagging, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[677] J. Kierkels and T. Pun. Simultaneous exploitation of explicit and implicit tags in affect-based multimedia retrieval. In International Conference on Affective Computing and Intelligent Interaction, pages 274–279, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[678] M. Soleymani, J. Kierkels, G. Chanel, and T. Pun. A bayesian framework for video affective representation. In International Conference on Affective Computing and Intelligent Interaction, pages 267–273, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[679] M. Soleymani, J. Davis, and T. Pun. A collaborative personalized affective video retrieval system. In International Conference on Affective Computing and Intelligent Interaction, pages 588–589, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[680] J. Kierkels, M. Soleymani, and T. Pun. Identification of narrative peaks in clips: text features perform best. In VideoCLEF 2009, Cross Language Evaluation Forum (CLEF) Workshop, ECDL 2009, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[681] B. Deville, G. Bologna, M. Vinckenbosch, and T. Pun. See color: Seeing colours with an orchestra. In D. Lalanne and J. Kohlas, editors, Human Machine Interaction, Research Results of the MMI Program, pages 251–279. Springer LNCS, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, incollection
[682] B. Dumas, D. Lalanne, and R. Ingold. Hephaistk: A toolkit for rapid prototyping of multimodal interfaces. In Proceedings of International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), pages 231–232, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[683] B. Dumas, D. Lalanne, and R. Ingold. Description languages for multimodal interaction: a set of guidelines. Journal on Multimodal User Interfaces, 3, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, article
[684] F. Evéquoz. An ethnographically-inspired survey of pim strategies. Technical report, Department of Informatics, University of Fribourg, Switzerland, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, techreport
[685] F. Evéquoz and D. Lalanne. "i thought you would show me how to do it" – studying and supporting pim strategy changes. In Proceedings of ASIS&T PIM Workshop (ASIS&T 2009), 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[686] E. Bertini and D. Lalanne. Investigating and reflecting on the integration of automatic data analysis and visualization in knowledge discovery. ACM SIGKDD Explorations, 22, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, article
[687] Pascal Bruegger, D. Lalanne, A. Lisowska, and B. Hirsbrunner. A method and tools for designing and prototyping activity-based pervasive applications. In Proceedings of 7th International Conference on Advances in Mobile Computing & Multimedia (ACM MoMM 2009), pages 129–136, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[688] Dalila Mekhaldi and D. Lalanne. Joining meeting documents to strengthen multimodal thematic alignment. In Proceedings of 5th International Conference on Signal Image Technology and Internet Based Systems (SITIS 2009), pages 88–96, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[689] F. De Simone, L. Goldmann, V. Baroncini, and T. Ebrahimi. Subjective evaluation of JPEG XR image compression. In Proceedings of SPIE, volume 7443, San Diego, California, USA, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[690] J. S. Lee and T. Ebrahimi. Efficient video coding in H.264/AVC by using audio-visual information. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing, Rio de Janeiro, Brazil, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[691] A. Yazdani, J. S. Lee, and T. Ebrahimi. Implicit emotional tagging of multimedia using EEG signals and brain computer interface. In Proceedings of the International Workshop on Social Media, pages 81–88, Beijing, China, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[692] P. Vajda, L. Goldmann, and T. Ebrahimi. Analysis of the limits of graph-based object duplicate detection. In Proceedings of the IEEE International Symposium on Multimedia, pages 600–605, San Diego, California, USA, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[693] F. Kaplan, S. Do-Lenh, K. Bachour, G. Y. Kao, C. Gault, and P. Dillenbourg. Interpersonal Computers for Higher Education. In P. Dillenbourg, J. Huang, and M. Cherubini, editors, Interactive Artifacts and Furniture Supporting Collaborative Work and Learning, Computer-Supported Collaborative Learning Series, pages 129–145. Springer US, 2009. [ bib | http ]
Keywords: Report_IX, IM2.IP2, Group Dillenbourg, incollection
[694] Marc-Antoine Nuessli, Patrick Jermann, Mirweis Sangin, and Pierre Dillenbourg. Collaboration and abstract representations: towards predictive models based on raw speech and eye-tracking data. In CSCL '09: Proceedings of the 2009 conference on Computer support for collaborative learning. International Society of the Learning Sciences, 2009. Invited Paper. [ bib | http | http ]
This study aims to explore the possibility of using machine learning techniques to build predictive models of performance in collaborative induction tasks. More specifically, we explored how signal-level data, like eye-gaze data and raw speech, may be used to build such models. The results show that such low-level features indeed have some potential to predict performance in such tasks. Implications for future applications design are briefly discussed.

Keywords: Report_IX, IM2.IP2, Group Dillenbourg, inproceedings
[695] Guillaume Zufferey, Patrick Jermann, Son Do Lenh, and Pierre Dillenbourg. Using Augmentations as Bridges from Concrete to Abstract Representations. In Proceedings of the 23rd British HCI Group Annual Conference on HCI 2009: Celebrating People and Technology, pages 130–139, Swinton, 2009. British Computer Society. [ bib | http | http ]
We describe a pedagogical approach supporting the acquisition of abstraction skills by apprentices in logistics. Apprentices start with a concrete representation in the form of a small-scale model which aims at engaging them in learning activities. Multiple External Representations are used to progressively introduce more abstract representations displayed on paper-based forms called TinkerSheets. We present the implementation of this approach on the TinkerTable, a tabletop learning environment which is used in two professional schools by four different teachers. We report observations of the use of the environment at different stages of the curriculum with first- and second-year apprentices.

Keywords: Tangible User Interfaces; Paper-based Interaction; Multiple External Representations; Augmented Reality; Vocational Training, Report_IX, IM2.IP2, Group Dillenbourg, inproceedings
[696] Alexander Sproewitz, A. Billard, Pierre Dillenbourg, and Auke Jan Ijspeert. Roombots: Mechanical Design of Self-Reconfiguring Modular Robots for Adaptive Furniture. In Proceedings of 2009 IEEE International Conference on Robotics and Automation, pages 4259–4264, 2009. [ bib | DOI | .pdf ]
We aim at merging technologies from information technology, roomware, and robotics in order to design adaptive and intelligent furniture. This paper presents design principles for our modular robots, called Roombots, as future building blocks for furniture that moves and self-reconfigures. The reconfiguration is done using dynamic connection and disconnection of modules and rotations of the degrees of freedom. We are furthermore interested in applying Roombots towards adaptive behaviour, such as online learning of locomotion patterns. To create coordinated and efficient gait patterns, we use a Central Pattern Generator (CPG) approach, which can easily be optimized by any gradient-free optimization algorithm. To provide a hardware framework we present the mechanical design of the Roombots modules and an active connection mechanism based on physical latches. Further we discuss the application of our Roombots modules as pieces of a homogenic or heterogenic mix of building blocks for static structures.

Keywords: self reconfiguring modular robots; active connection mechanism; furniture; mechanical design; quadruped robotics; biorob_roombots, Report_IX, IM2.IP2, Group Dillenbourg, inproceedings
[697] P. Estrella, A. Popescu-Belis, and M. King. The FEMTI guidelines for contextual MT evaluation: principles and tools. Linguistica Antverpiensia New Series, 8, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[698] S. Favre. Social network analysis in multimedia indexing: Making sense of people in multiparty recordings. In Proceedings of the Doctoral Consortium of the International Conference on Affective Computing & Intelligent Interaction (ACII), pages 25–32, 2009. [ bib | .pdf ]
This paper presents an automatic approach to analyze the human interactions appearing in multiparty data, aiming at understanding the data content and at extracting social information such as Which role do people play?, What is their attitude?, or Can people be split into meaningful groups?. To extract such information, we use a set of mathematical techniques, namely Social Networks Analysis (SNA), developed by sociologists to analyze social interactions. This paper shows that a strong connection can be established between the content of broadcast data and the social interactions of the individuals involved in the recordings. Experiments aiming at assigning each individual to a social group corresponding to a specific topic in broadcast news, and experiments aiming at recognizing the role played by each individual in multiparty data are presented in this paper. The results achieved are satisfactory, which suggests on one side that the application of SNA to similar problems could lead to useful contributions in the domain of multimedia content analysis, and on the other side, that the presented analysis of social interactions could be a significant breakthrough for affective computing.

Keywords: Report_IX, IM2.IP3, Group Bourlard, inproceedings
[699] J. Galbally, C. McCool, J. Fierrez, S. Marcel, and J. Ortega-Garcia. On the vulnerability of face verification systems to hill-climbing attacks. Pattern Recognition, 2009. [ bib ]
In this paper, we use a hill-climbing attack algorithm based on Bayesian adaptation to test the vulnerability of two face recognition systems to indirect attacks. The attacking technique uses the scores provided by the matcher to adapt a global distribution computed from an independent set of users to the local specificities of the client being attacked. The proposed attack is evaluated on an eigenface-based and a parts-based face verification system using the XM2VTS database. Experimental results demonstrate that the hill-climbing algorithm is very efficient and is able to bypass over 85% of the attacked accounts (for both face recognition systems). The security flaws of the analyzed systems are pointed out and possible countermeasures to avoid them are also proposed.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
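The score-guided search driving such attacks can be illustrated with a small self-contained sketch. The quadratic toy matcher, dimensions, and step parameters below are invented for illustration; this is not the Bayesian-adaptation method of the paper, only the generic hill-climbing loop it builds on:

```python
import random

def matcher_score(candidate, template):
    # Black-box similarity score: higher is better (negative squared distance).
    return -sum((c - t) ** 2 for c, t in zip(candidate, template))

def hill_climb(score_fn, dim, steps=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    candidate = [0.0] * dim
    best = score_fn(candidate)
    for _ in range(steps):
        # Randomly perturb the current candidate.
        trial = [c + rng.uniform(-step_size, step_size) for c in candidate]
        s = score_fn(trial)
        if s > best:  # keep only score-improving perturbations
            candidate, best = trial, s
    return candidate, best

template = [0.3, -1.2, 0.7, 2.0]  # secret enrolled model (unknown to attacker)
found, score = hill_climb(lambda c: matcher_score(c, template), dim=4)
```

Because the matcher exposes only a score, each perturbation is kept or discarded purely on whether it raises that score; countermeasures therefore typically quantize or hide the matching score.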
[700] J. Ph. Thiran, H. Bourlard, and F. Marques. Multimodal Signal Processing: Methods and Techniques to Build Multimodal Interactive Systems. Academic Press, 2009. [ bib ]
Multimodal signal processing is an important new field that processes signals from a variety of modalities - speech, vision, language, text - derived from one source, which aids human-computer and human-human interaction. The overarching theme of this book is the application of signal processing and statistical machine learning techniques to problems arising in this field. It gives an overview of the field, the capabilities and limitations of current technology, and the technical challenges that must be overcome to realize multimodal interactive systems. As well as state-of-the-art methods in multimodal signal and image modeling and processing, the book gives numerous examples and applications of multimodal interactive systems, including human-computer and human-human interaction. This is the definitive reference in multimodal signal processing, edited and contributed by the leading experts, for signal processing researchers and graduates, R&D engineers and computer engineers.

Keywords: Report_IX, IM2.IP1, Group Bourlard, book
[701] D. Gatica-Perez. Modeling interest in face-to-face conversations from multimodal nonverbal behavior. In J.-P. Thiran, H. Bourlard, and F. Marques, editors, Multimodal Signal Processing. Academic Press, 2009. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP3, Group Bourlard, incollection
[702] A. Popescu-Belis. Managing multimodal data, metadata and annotations: Challenges and solutions. In J. Ph. Thiran, F. Marques, and H. Bourlard, editors, Multimodal Signal Processing for Human-Computer Interaction, pages 183–203. Elsevier / Academic Press, London, 2009. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, incollection
[703] G. Friedland, H. Hung, and Chuohao Yeo. Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009. [ bib | .pdf ]
Speaker diarization is originally defined as the task of determining "who spoke when" given an audio track and no other prior knowledge of any kind. The following article shows a multi-modal approach where we improve a state-of-the-art speaker diarization system by combining standard acoustic features (MFCCs) with compressed-domain video features. The approach is evaluated on over 4.5 hours of the publicly available AMI meetings dataset, which contains challenges such as people standing up and walking out of the room. We show a consistent improvement of about 34% relative in speaker error rate (21% DER) compared to a state-of-the-art audio-only baseline.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[704] K. Farrahi and D. Gatica-Perez. Learning and predicting multimodal daily life patterns from cell phones. In ICMI-MLMI, 2009. [ bib | .pdf ]
In this paper, we investigate the multimodal nature of cell phone data in terms of discovering recurrent and rich patterns in people's lives. We present a method that can discover routines from multiple modalities (location and proximity) jointly modeled, and that uses these informative routines to predict unlabeled or missing data. Using a joint representation of location and proximity data over approximately 10 months of 97 individuals' lives, Latent Dirichlet Allocation is applied for the unsupervised learning of topics describing people's most common locations jointly with the most common types of interactions at these locations. We further successfully predict where and with how many other individuals users will be, for people with both highly and lowly varying lifestyles.

Keywords: Report_IX, IM2.IP3, Group Bourlard, inproceedings
[705] J. Berclaz, A. Shahrokni, F. Fleuret, James Ferryman, and P. Fua. Evaluation of probabilistic occupancy map people detection for surveillance systems. In Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009. [ bib ]
In this paper, we evaluate the Probabilistic Occupancy Map (POM) pedestrian detection algorithm on the PETS 2009 benchmark dataset. POM is a multi-camera generative detection method, which estimates ground plane occupancy from multiple background subtraction views. Occupancy probabilities are iteratively estimated by fitting a synthetic model of the background subtraction to the binary foreground motion. Furthermore, we test the integration of this algorithm into a larger framework designed for understanding human activities in real environments. We demonstrate accurate detection and localization on the PETS dataset, despite suboptimal calibration and foreground motion segmentation input.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[706] G. Heusch and S. Marcel. Bayesian networks to combine intensity and color information in face recognition. In International Conference on Biometrics [631], pages 414–423. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[707] G. Heusch. Bayesian Networks as Generative Models for Face Recognition. PhD thesis, EPFL, 2009. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, phdthesis
[708] G. Heusch and S. Marcel. A novel statistical generative model dedicated to face recognition. Image & Vision Computing [94]. In press. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[709] E. Indermühle, M. Liwicki, and H. Bunke. Combining alignment results for historical handwritten document analysis. In Proc. 10th Int. Conf. on Document Analysis and Recognition, volume 3, pages 1186–1190, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[710] G. Gonzalez, F. Aguet, F. Fleuret, M. Unser, and P. Fua. Steerable features for statistical 3D dendrite detection. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2009. To appear. [ bib ]
Keywords: IM2.VP, Report_VIII
[711] D. Imseng and G. Friedland. Robust speaker diarization for short speech recordings. In Proceedings of the IEEE workshop on Automatic Speech Recognition and Understanding, 2009. [ bib | .pdf ]
We investigate a state-of-the-art Speaker Diarization system regarding its behavior on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in Speaker Diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose an approach that consists of a novel initialization parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the relationship between the optimal value of the duration of speech data per Gaussian and the duration of the speech data, which is verified experimentally for the first time in this article. As a result, the Diarization Error Rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative.

Keywords: IM2.AP, IM2.MCA, Report_VIII
[712] T. Tommasi and B. Caputo. The more you know, the less you learn: from knowledge transfer to one-shot learning of object categories. In BMVC, 2009. [ bib | .pdf ]
Learning a category from few examples is a challenging task for vision algorithms, while psychological studies have shown that humans are able to generalise correctly even from a single instance (one-shot learning). The most accredited hypothesis is that humans are able to exploit prior knowledge when learning a new related category. This paper presents an SVM-based model adaptation algorithm able to perform knowledge transfer for a new category when very limited examples are available. Using a leave-one-out estimate of the weighted error-rate the algorithm automatically decides from where to transfer (on which known category to rely), how much to transfer (the degree of adaptation) and if it is worth transferring something at all. Moreover a weighted least-squares loss function takes optimally care of data unbalance between negative and positive examples. Experiments presented on two different object category databases show that the proposed method is able to exploit previous knowledge avoiding negative transfer. The overall classification performance is increased compared to what would be achieved by starting from scratch. Furthermore as the number of already learned categories grows, the algorithm is able to learn a new category from one sample with increasing precision, i.e. it is able to perform one-shot learning.

Keywords: IM2.MPR, Report_VIII
[713] Q. A. Le and A. Popescu-Belis. Automatic vs. human question answering over multimedia meeting recordings. In Interspeech 2009 (10th Annual Conference of the International Speech Communication Association), Brighton, UK, 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[714] S. Ba and J. M. Odobez. Recognizing human visual focus of attention from head pose in meetings. IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):16–34, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[715] J. Keshet and D. Chazan. A kernel wrapper for phoneme sequence recognition. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. John Wiley and Sons, 2009. [ bib ]
We describe a kernel wrapper, a Mercer kernel for the task of phoneme sequence recognition which is based on operations with the Gaussian kernel, and suitable for any sequence kernel classifier. We start by presenting a kernel-based algorithm for phoneme sequence recognition, which aims at minimizing the Levenshtein distance (edit distance) between the predicted phoneme sequence and the true phoneme sequence. Motivated by the good results of frame-based phoneme classification using SVMs with Gaussian kernel, we devised a kernel for speech utterances and phoneme sequences, which generalizes the kernel function for phoneme frame-based classification and adds timing constraints in the form of transitions and durations constraints. The kernel function has three parts corresponding to phoneme acoustic model, phoneme duration model and phoneme transition model. We present initial encouraging experimental results with the TIMIT corpus.

Keywords: IM2.AP, Report_VIII
[716] S. Xie, B. Favre, D. Hakkani-Tur, and Y. Liu. Leveraging sentence weights in a concept-based optimization framework for extractive meeting summarization. In 10th International Conference of the International Speech Communication Association, Brighton, UK, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[717] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. PAMI, 31(5):855–869, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[718] J. Luo, F. Orabona, and B. Caputo. An online framework for learning novel concepts over multiple cues. In Proceeding of The 9th Asian Conference on Computer Vision, 2009. [ bib | .pdf ]
We propose an online learning algorithm to tackle the problem of learning under limited computational resources in a teacher-student scenario, over multiple visual cues. For each separate cue, we train an online learning algorithm that sacrifices performance in favor of bounded memory growth and fast update of the solution. We then recover back performance by using multiple cues in the online setting. To this end, we use a two-layer structure. In the first layer, we use a budget online learning algorithm for each single cue. Thus, each classifier provides confidence interpretations for target categories. On top of these classifiers, a linear online learning algorithm is added to learn the combination of these cues. As in standard online learning setups, the learning takes place in rounds. On each round, a new hypothesis is estimated as a function of the previous one. We test our algorithm on two student-teacher experimental scenarios and in both cases results show that the algorithm learns the new concepts in real time and generalizes well.

Keywords: IM2.MPR, Report_VIII
[719] A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 2009. to appear. [ bib | .pdf ]
The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement - in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, design and development of automated systems for Social Signal Processing (SSP) are rather difficult. This paper surveys the past efforts in solving these problems by a computer, it summarizes the relevant findings in social psychology, and it proposes a set of recommendations for enabling the development of the next generation of socially-aware computing.

Keywords: Computer Vision, human behaviour analysis, Social Interactions, Social signals, speech processing, IM2.MCA, Report_VIII
[720] G. Bologna, B. Deville, and T. Pun. On the use of the auditory pathway to represent image scenes in real-time. Neurocomputing, 72:839–849, 2009. [ bib ]
Keywords: IM2.MCA, Report_VIII
[721] D. Vijayasenan, F. Valente, and H. Bourlard. An information theoretic approach to speaker diarization of meeting data. IEEE Transactions on Audio Speech and Language Processing, 17(7):1382–1393, 2009. [ bib | DOI | .pdf ]
A speaker diarization system based on an information theoretic framework is described. The problem is formulated according to the Information Bottleneck (IB) principle. Unlike other approaches where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. This solves the problem of choosing the distance between speech segments, which becomes the Jensen-Shannon divergence as it arises from the IB objective function optimization. We discuss issues related to speaker diarization using this information theoretic framework such as the criteria for inferring the number of speakers, the trade-off between quality and compression achieved by the diarization system, and the algorithms for optimizing the objective function. Furthermore we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (Rich Transcription) data set for speaker diarization of meetings. The IB based system achieves a Diarization Error Rate of 23.2% compared to 23.6% for the baseline system. This approach being mainly based on nonparametric clustering, it runs significantly faster than the baseline HMM/GMM based system, resulting in faster-than-real-time diarization.

Keywords: IM2.AP, Report_VIII
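The Jensen-Shannon divergence that the IB objective induces as a distance between speech segments is easy to state concretely. A minimal sketch (the three-component toy distributions stand in for segment-level posteriors over the relevance variables and are invented for illustration):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q); p, q are lists summing to 1.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Jensen-Shannon divergence: symmetrised KL against the mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

seg_a = [0.7, 0.2, 0.1]  # toy posterior over relevance components, segment A
seg_b = [0.1, 0.3, 0.6]  # same for segment B
d = js(seg_a, seg_b)
```

Unlike an arbitrarily chosen segment distance, this quantity is symmetric, bounded by log 2 (in nats), and falls directly out of the IB objective, which is the point the abstract makes.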
[722] A. Popescu-Belis. Comparing meeting browsers using a task-based evaluation method. Idiap Research Report Idiap-RR-11-2009, Idiap, 2009. [ bib | .pdf ]
Information access within meeting recordings, potentially transcribed and augmented with other media, is facilitated by the use of meeting browsers. To evaluate their performance through a shared benchmark task, users are asked to discriminate between true and false parallel statements about facts in meetings, using different browsers. This paper offers a review of the results obtained so far with five types of meeting browsers, using similar sets of statements over the same meeting recordings. The results indicate that state-of-the-art speed for true/false question answering is 1.5-2 minutes per question, and precision is 70%-80% (vs. 50% random guess). The use of ASR compared to manual transcripts, or the use of audio signals only, lead to a perceptible though not dramatic decrease in performance scores.

Keywords: IM2.HMI, Report_VIII
[723] M. Wuthrich, M. Liwicki, A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser, and M. Stolz. Language model integration for the recognition of handwritten medieval documents. In Proc. 10th Int. Conf. on Document Analysis and Recognition, volume 1, pages 211–215, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[724] D. Morrison, E. Bruno, and S. Marchand-Maillet. Capturing the semantics of user interaction: a review and case study. In Emergent Web Intelligence. Springer, 2009. [ bib ]
Keywords: IM2.MCA, Report_VIII
[725] L. Gottlieb and G. Friedland. On the use of artificial conversation data for speaker recognition in cars. In IEEE International Conference on Semantic Computing, Berkeley, USA, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[726] D. Imseng. Novel initialization methods for speaker diarization. Idiap Research Report Idiap-RR-07-2009, Idiap, 2009. Master's thesis. [ bib | .pdf ]
Speaker Diarization is the process of partitioning an audio input into homogeneous segments according to speaker identity where the number of speakers in a given audio input is not known a priori. This master's thesis presents a novel initialization method for Speaker Diarization that requires less manual parameter tuning than most current GMM/HMM based agglomerative clustering techniques and is more accurate at the same time. The thesis reports on empirical research to estimate the importance of each of the parameters of an agglomerative-hierarchical-clustering-based Speaker Diarization system and evaluates methods to estimate these parameters completely unsupervised. The parameter estimation combined with a novel non-uniform initialization method results in a system that performs better than the current ICSI baseline engine on datasets of the National Institute of Standards and Technology (NIST) Rich Transcription evaluations of the years 2006 and 2007 (17% overall relative improvement).

Keywords: IM2.AP, Report_VIII
[727] S. Favre, A. Dielmann, and A. Vinciarelli. Automatic role recognition in multiparty recordings using social networks and probabilistic sequential models. In ACM International Conference on Multimedia, 2009. To appear. [ bib | .pdf ]
The automatic analysis of social interactions is attracting significant interest in the multimedia community. This work addresses one of the most important aspects of the problem, namely the recognition of roles in social exchanges. The proposed approach is based on Social Network Analysis, for the representation of individuals in terms of their interactions with others, and probabilistic sequential models, for the recognition of role sequences underlying the sequence of speakers in conversations. The experiments are performed over different kinds of data (around 90 hours of broadcast data and meetings), and show that the performance depends on how formal the roles are, i.e. on how much they constrain people behavior.

Keywords: IM2.MCA, Report_VIII
[728] G. Bologna, B. Deville, and T. Pun. Blind navigation along a sinuous path by means of the See ColOr interface. In IWINAC 2009, 3rd International Work-conference on the Interplay between Natural and Artificial Computation, Santiago de Compostela, Spain, June 22–27, 2009. [ bib ]
Keywords: IM2.MCA, Report_VIII
[729] D. Hakkani-Tur. Towards automatic argument diagramming of multiparty meetings. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[730] F. Orabona, C. Castellini, B. Caputo, A. E. Fiorilla, and G. Sandini. Model adaptation with least-square SVM for adaptive hand prosthetics. In IEEE International Conference on Robotics and Automation, 2009. [ bib | .pdf ]
Keywords: IM2.MPR, Report_VIII
[731] P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and L. Zhang. Real-time asr from meetings. In Proceedings of Interspeech, 2009. [ bib | .pdf ]
The AMI(DA) system is a meeting room speech recognition system that has been developed and evaluated in the context of the NIST Rich Transcription (RT) evaluations. Recently, the “Distant Access” requirements of the AMIDA project have necessitated that the system operate in real-time. Another more difficult requirement is that the system fit into a live meeting transcription scenario. We describe an infrastructure that has allowed the AMI(DA) system to evolve into one that fulfils these extra requirements. We emphasise the components that address the live and real-time aspects.

Keywords: IM2.AP, Report_VIII
[732] X. Perrin, F. Colas, C. Pradalier, and R. Siegwart. Learning human habits and reactions to external events with a dynamic bayesian network. Technical report, Autonomous Systems Lab, ETHZ, 2009. [ bib ]
Keywords: IM2.BMI, Report_VIII
[733] G. Friedland, O. Vinyals, Y. Huang, and C. Muller. Prosodic and other long-term features for speaker diarization. IEEE Transactions on Audio, Speech and Language Processing, 17(5):985–993, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[734] D. Morrison, S. Marchand-Maillet, and E. Bruno. Modelling long-term relevance feedback. In Proceedings of the ECIR Workshop on Information Retrieval over Social Networks, Toulouse, FR, 2009. [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[735] B. Deville, G. Bologna, M. Vinckenbosch, and T. Pun. See color: seeing colours with an orchestra. In D. Lalanne and J. Kohlas, editors, Human Machine Interaction: Research Results of the MMI Program, volume 5440 of Lecture Notes in Computer Science, pages 251–279. Springer, 2009. Subseries: Programming and Software Engineering. [ bib ]
Keywords: IM2.MCA, Report_VIII
[736] M. M. Ullah, F. Orabona, and B. Caputo. You live, you learn, you forget: continuous learning of visual places with a forgetting mechanism. In International Conference on Intelligent Robots and Systems, 2009. [ bib ]
Keywords: IM2.VP, IM2.MPR, Report_VIII
[737] F. Monay, P. Quelhas, J. M. Odobez, and D. Gatica-Perez. Contextual classification of image patches with latent aspect models. EURASIP Journal on Image and Video Processing, Special Issue on Patches in Vision, 2009. to appear. [ bib | .pdf ]
We present a novel approach for contextual classification of image patches in complex visual scenes, based on the use of histograms of quantized features and probabilistic aspect models. Our approach uses context in two ways: (1) by using the fact that specific learned aspects correlate with the semantic classes, which resolves some cases of visual polysemy often present in patch-based representations, and (2) by formalizing the notion that scene context is image-specific -what an individual patch represents depends on what the rest of the patches in the same image are-. We demonstrate the validity of our approach on a man-made vs. natural patch classification problem. Experiments on an image collection of complex scenes show that the proposed approach improves region discrimination, producing satisfactory results, and outperforming two non-contextual methods. Furthermore, we also show that co-occurrence and traditional (Markov Random Field) spatial contextual information can be conveniently integrated for further improved patch classification.

Keywords: IM2.VP, Report_VIII
[738] M. Magimai-Doss, G. Aradilla, and H. Bourlard. On joint modelling of grapheme and phoneme information using KL-HMM for ASR. Idiap Research Report Idiap-RR-24-2009, Idiap, 2009. [ bib | .pdf ]
In this paper, we propose a simple approach to jointly model both grapheme and phoneme information using a Kullback-Leibler divergence based HMM (KL-HMM) system. More specifically, graphemes are used as subword units and phoneme posterior probabilities estimated at the output of a multilayer perceptron are used as the observation feature vector. Through preliminary studies on the DARPA Resource Management corpus it is shown that although the proposed approach yields lower performance compared to the KL-HMM system using phonemes as subword units, this gap in performance can be bridged via temporal modelling at the observation feature vector level and contextual modelling of early tagged contextual graphemes.

Keywords: IM2.AP, Report_VIII
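As a side note for readers unfamiliar with the technique, the KL-HMM local score described in the abstract above can be sketched as follows; the three-phoneme distributions below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def kl_divergence(y, z, eps=1e-10):
    """Discrete Kullback-Leibler divergence D(y || z) in nats."""
    y = np.clip(y, eps, None)
    z = np.clip(z, eps, None)
    return float(np.sum(y * np.log(y / z)))

# A KL-HMM state is parameterized by a reference multinomial distribution
# over phonemes; the local score of an observed MLP posterior vector is
# its KL divergence with respect to that reference distribution.
state_dist = np.array([0.7, 0.2, 0.1])       # state's multinomial parameters
mlp_posteriors = np.array([0.6, 0.3, 0.1])   # observation feature vector
score = kl_divergence(state_dist, mlp_posteriors)
```

Decoding then proceeds as in a standard HMM, with this divergence replacing the usual Gaussian log-likelihood as the state-level cost.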
[739] F. Orabona, C. Castellini, B. Caputo, J. Luo, and G. Sandini. Towards life-long learning for cognitive systems: Online independent support vector machine. Pattern Recognition, 2009. Accepted for publication. [ bib ]
Keywords: IM2.MPR, Report_VIII
[740] A. Vinciarelli. Capturing order in social interactions. IEEE Signal Processing Magazine, 2009. [ bib | .pdf ]
As humans appear to be literally wired for social interaction, it is not surprising to observe that social aspects of human behavior and psychology attract interest in the computing community as well. The gap between social animal and unsocial machine was tolerable when computers were nothing else than improved versions of old tools (e.g., word processors replacing typewriters), but nowadays computers go far beyond that simple role. Today, computers are the natural means for a wide spectrum of new, inherently social, activities like remote communication, distance learning, online gaming, social networking, information seeking and sharing, training in virtual worlds, etc. In this new context, computers must integrate human-human interaction as seamlessly as possible and deal effectively with spontaneous social behaviors of their users. In concise terms, computers need to become socially intelligent.

Keywords: IM2.MCA, Report_VIII
[741] G. Friedland, H. Hung, and C. Yeo. Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, pages 4069–4072, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[742] G. Friedland, O. Vinyals, Y. Huang, and C. Muller. Fusion of short-term and long-term features for improved speaker diarization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, pages 4077–4080, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[743] G. Friedland and D. van Leeuwen. Speaker diarization and identification. IEEE Press/Wiley, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[744] G. Friedland, C. Yeo, and H. Hung. Visual speaker localization aided by acoustic models (full paper). In Proceedings of ACM Multimedia, Beijing, China, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[745] F. Orabona, B. Caputo, A. Fillbrandt, and F. Ohl. A theoretical framework for transfer of knowledge across modalities in artificial and cognitive systems. In International Conference on Developmental Learning, 2009. [ bib | .pdf ]
Keywords: IM2.MPR, Report_VIII
[746] D. Gillick, K. Riedhammer, B. Favre, and D. Hakkani-Tur. A global optimization framework for meeting summarization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[747] A. Vinciarelli, N. Suditu, and M. Pantic. Implicit human centered tagging. In Proceedings of IEEE Conference on Multimedia and Expo, pages 1428–1431, 2009. [ bib | .pdf ]
This paper provides a general introduction to the concept of Implicit Human-Centered Tagging (IHCT) - the automatic extraction of tags from nonverbal behavioral feedback of media users. The main idea behind IHCT is that nonverbal behaviors displayed when interacting with multimedia data (e.g., facial expressions, head nods, etc.) provide information useful for improving the tag sets associated with the data. As such behaviors are displayed naturally and spontaneously, no effort is required from the users, and this is why the resulting tagging process is said to be "implicit". Tags obtained through IHCT are expected to be more robust than tags associated with the data explicitly, at least in terms of: generality (they make sense to everybody) and statistical reliability (all tags will be sufficiently represented). The paper discusses these issues in detail and provides an overview of pioneering efforts in the field.

Keywords: IM2.MCA, Report_VIII
[748] D. Lalanne and J. Kohlas. Human machine interaction. 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[749] S. Voloshynovskiy, O. Koval, F. Beekhof, and T. Holotyak. Binary robust hashing based on probabilistic bit reliability. In IEEE Workshop on Statistical Signal Processing 2009, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[750] V. Frinken and H. Bunke. Evaluating retraining rules for semi-supervised learning in neural network based cursive word recognition. In Proc. 10th Int. Conf. on Document Analysis and Recognition, volume 1, pages 31–35, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[751] A. Pronobis and B. Caputo. COLD: the CoSy localization database. International Journal of Robotics Research, 28(5):588–594, 2009. [ bib | .pdf ]
Keywords: IM2.DMA, Report_VIII
[752] G. Chanel, J. Kierkels, M. Soleymani, and T. Pun. Short-term emotion assessment in a recall paradigm. International Journal of Human-Computer Studies, 67(8):607–627, 2009. DOI: http://dx.doi.org/10.1016/j.ijhcs.2009.03.005. [ bib | http ]
Keywords: IM2.MCA, Report_VIII
[753] O. Koval, S. Voloshynovskiy, F. Caire, and P. Bas. On security threats for robust perceptual hashing. In Electronic Imaging 2009, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[754] S. H. K. Parthasarathi, M. Magimai-Doss, D. Gatica-Perez, and H. Bourlard. Speaker change detection with privacy-preserving audio cues. In Proceedings of ICMI-MLMI 2009, 2009. [ bib | .pdf ]
In this paper we investigate a set of privacy-sensitive audio features for speaker change detection (SCD) in multiparty conversations. These features are based on three different principles: characterizing the excitation source information using the linear prediction residual, characterizing subband spectral information shown to contain speaker information, and characterizing the general shape of the spectrum. Experiments show that the performance of the privacy-sensitive features is comparable to or better than that of the state-of-the-art full-band spectral-based features, namely, mel frequency cepstral coefficients, which suggests that socially acceptable ways of recording conversations in real life are feasible.

Keywords: IM2.AP, Report_VIII
[755] J. Ortega-Garcia, J. Fierrez, F. Alonso-Fernandez, J. Galbally, M. R. Freire, J. Gonzalez-Rodriguez, C. Garcia-Mateo, J. L. Alba-Castro, E. Gonzalez-Agulla, E. Otero-Muras, S. Garcia-Salicetti, L. Allano, B. Ly-Van, B. Dorizzi, J. Kittler, T. Bourlai, N. Poh, F. Deravi, M. W. R. Ng, M. Fairhurst, J. Hennebert, A. Humm, M. Tistarelli, L. Brodo, J. Richiardi, A. Drygajlo, H. Ganster, F. M. Sukno, S. K. Pavani, A. Frangi, L. Akarun, and A. Savran. The multi-scenario multi-environment BioSecure multimodal database (BMDB). IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009. To appear. [ bib ]
Keywords: IM2.MPR, Report_VII
[756] K. Kryszczuk and A. Drygajlo. Improving biometric verification with class-independent quality information. IET Signal Processing, Special Issue on Biometric Recognition, 3(4):310–321, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[757] P. N. Garner. SNR features for automatic speech recognition. In Proceedings of the IEEE workshop on Automatic Speech Recognition and Understanding [848]. [ bib | .pdf ]
When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal to Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectral based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further show anecdotal evidence that SNR based features lend themselves well to noise estimates based on low-energy envelope tracking.

Keywords: IM2.AP, Report_VIII
[758] S. Ganapathy, P. Motlicek, and H. Hermansky. Error resilient speech coding using sub-band Hilbert envelopes. In 12th International Conference on Text, Speech and Dialogue, TSD 2009 [632], pages 355–362. [ bib | .pdf ]
Frequency Domain Linear Prediction (FDLP) represents a technique for auto-regressive modelling of Hilbert envelopes of a signal. In this paper, we propose a speech coding technique that uses FDLP in Quadrature Mirror Filter (QMF) sub-bands of short segments of the speech signal (25 ms). Line Spectral Frequency parameters related to autoregressive models and the spectral components of the residual signals are transmitted. For simulating the effects of lossy transmission channels, bit-packets are dropped randomly. In the objective and subjective quality evaluations, the proposed FDLP speech codec is judged to be more resilient to bit-packet losses compared to the state-of-the-art Adaptive Multi-Rate Wide-Band (AMR-WB) codec at 12 kbps.

Keywords: IM2.VP, Report_VIII
[759] J. Baker, L. Deng, J. Glass, S. Khudanpur, C. H. Lee, N. Morgan, and D. O'Shaughnessy. Research developments and directions in speech recognition and understanding. IEEE Signal Processing Magazine, 26(3):75–80, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[760] G. Garau, S. Ba, H. Bourlard, and J. M. Odobez. Investigating the use of visual focus of attention for audio-visual speaker diarisation. In Proceedings of the ACM International Conference on Multimedia, 2009. [ bib | .pdf ]
Audio-visual speaker diarisation is the task of estimating “who spoke when” using audio and visual cues. In this paper we propose the combination of an audio diarisation system with psychology inspired visual features, reporting experiments on multiparty meetings, a challenging domain characterised by unconstrained interaction and participant movements. More precisely, the role of gaze in coordinating speaker turns was exploited by the use of Visual Focus of Attention features. Experiments were performed with both the reference VFoA and three automatic VFoA estimation systems of increasing complexity, based on head pose and visual activity cues. VFoA features yielded consistent speaker diarisation improvements in combination with audio features using a multi-stream approach.

Keywords: IM2.MPR, Report_VIII
[761] J. Richiardi, K. Kryszczuk, and A. Drygajlo. Static models of derivative-coordinates phase spaces for multivariate time series classification: an application to signature verification. In Advances in Biometrics, Lecture Notes in Computer Science 5558, pages 1200–1208, Heidelberg, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[762] A. Humm, R. Ingold, and J. Hennebert. Spoken handwriting for user authentication using joint modelling systems. In Proceedings of 6th International Symposium on Image and Signal Processing and Analysis (ISPA'09), Salzburg (Austria), 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[763] A. Humm, J. Hennebert, and R. Ingold. Combined handwriting and speech modalities for user authentication. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 39, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[764] V. Frinken, K. Riesen, and H. Bunke. Improving graph classification by isomap. In A. Torsello, F. Escolano, and L. Brun, editors, Graph-Based Representations in Pattern Recognition, LNCS 5534, pages 205–214. Springer, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[765] D. Gelbart, N. Morgan, and A. Tsymbal. Hill-climbing feature selection for multi-stream ASR. In 10th International Conference of the International Speech Communication Association, Brighton, UK, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[766] H. Hung and S. Ba. Speech/non-speech detection in meetings from automatically extracted low resolution visual features. Idiap-RR-20-2009, Idiap, 2009. Submitted to ICMI-MLMI. [ bib | .pdf ]
In this paper we address the problem of estimating who is speaking from automatically extracted low resolution visual cues from group meetings. Traditionally, the task of speech/non-speech detection or speaker diarization tries to find who speaks and when from audio features only. Recent work has addressed the problem audio-visually but often with less emphasis on the visual component. Due to the high probability of losing the audio stream during video conferences, this work proposes methods for estimating speech using just low resolution visual cues. We carry out experiments to compare how context through the observation of group behaviour and task-oriented activities can help improve estimates of speaking status. We test on 105 minutes of natural meeting data with unconstrained conversations.

Keywords: IM2.MCA, IM2.MPR, Report_VIII
[767] S. Y. Zhao, R. Ravuri, and N. Morgan. Multi-stream to many-stream: using spectro-temporal features for ASR. In 10th International Conference of the International Speech Communication Association, Brighton, UK, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[768] A. Drygajlo, W. Li, and K. Zhu. Q-stack aging model for face verification. In 17th European Signal Processing Conference, Glasgow, UK, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[769] M. Soleymani, G. Chanel, J. Kierkels, and T. Pun. Affective characterization of movie scenes based on content analysis and physiological changes. International Journal of Semantic Computing, 2009. To appear. [ bib ]
Keywords: IM2.MCA, Report_VIII
[770] J. P. Pinto, G. S. V. S. Sivaram, H. Hermansky, and M. Magimai-Doss. Volterra series for analyzing MLP-based phoneme posterior probability estimators. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009. [ bib | .pdf ]
We present a framework to apply Volterra series to analyze multilayered perceptrons trained to estimate the posterior probabilities of phonemes in automatic speech recognition. The identified Volterra kernels reveal the spectro-temporal patterns that are learned by the trained system for each phoneme. To demonstrate the applicability of Volterra series, we analyze a multilayered perceptron trained on Mel filter bank energy features and examine its first-order Volterra kernels.

Keywords: IM2.AP, Report_VIII
[771] A. Popescu-Belis and A. Vinciarelli. Multimedia meeting processing and retrieval at the idiap research institute. Informer (Newsletter of the BCS Information Retrieval Specialist Group), 29:14–16, 2009. [ bib ]
Keywords: IM2.DMA, Report_VIII
[772] D. Gatica-Perez. Automatic nonverbal analysis of social interaction in small groups: a review. Image and Vision Computing, Special Issue on Human Naturalistic Behavior, 2009. In press. [ bib ]
Keywords: IM2.MPR, Report_VIII
[773] J. Dines, L. Saheer, and H. Liang. Speech recognition with speech synthesis models by marginalising over decision tree leaves. In Proceedings of Interspeech, 2009. [ bib | .pdf ]
There has been increasing interest in the use of unsupervised adaptation for the personalisation of text-to-speech (TTS) voices, particularly in the context of speech-to-speech translation. This requires that we are able to generate adaptation transforms from the output of an automatic speech recognition (ASR) system. An approach that utilises unified ASR and TTS models would seem to offer an ideal mechanism for the application of unsupervised adaptation to TTS since transforms could be shared between ASR and TTS. Such unified models should use a common set of parameters. A major barrier to such parameter sharing is the use of differing contexts in ASR and TTS. In this paper we propose a simple approach that generates ASR models from a trained set of TTS models by marginalising over the TTS contexts that are not used by ASR. We present preliminary results of our proposed method on a large vocabulary speech recognition task and provide insights into future directions of this work.

Keywords: decision trees, speech recognition, speech synthesis, unified models, IM2.AP, Report_VIII
[774] C. McCool and S. Marcel. Parts-based face verification using local frequency bands. In Proceedings of IEEE/IAPR International Conference on Biometrics, 2009. [ bib | .pdf ]
Keywords: IM2.VP, Report_VIII
[775] S. H. K. Parthasarathi, M. Magimai-Doss, H. Bourlard, and D. Gatica-Perez. Investigating privacy-sensitive features for speech detection in multiparty conversations. In Proceedings of Interspeech 2009, 2009. [ bib | .pdf ]
We investigate four different privacy-sensitive features, namely energy, zero crossing rate, spectral flatness, and kurtosis, for speech detection in multiparty conversations. We liken this scenario to a meeting room and define our datasets and annotations accordingly. The temporal context of these features is modeled. With no temporal context, energy is the best performing single feature. But by modeling temporal context, kurtosis emerges as the most effective feature. Also, we combine the features. Besides yielding a gain in performance, certain combinations of features also reveal that a shorter temporal context is sufficient. We then benchmark other privacy-sensitive features utilized in previous studies. Our experiments show that the performance of all the privacy-sensitive features modeled with context is close to that of state-of-the-art spectral-based features, without extracting and using any features that can be used to reconstruct the speech signal.

Keywords: IM2.AP, IM2.MCA, Report_VIII
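The four privacy-sensitive features named in the abstract above are standard per-frame signal measures; a minimal sketch is given below, where the frame length, sampling rate, and synthetic test tone are arbitrary illustrative choices rather than the paper's experimental settings:

```python
import numpy as np

def frame_features(frame, eps=1e-10):
    """Energy, zero-crossing rate, spectral flatness, and kurtosis of one
    audio frame; none of these suffice to reconstruct the waveform."""
    energy = float(np.sum(frame ** 2))
    # Fraction of adjacent sample pairs whose sign differs
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    # Spectral flatness: geometric mean over arithmetic mean of the power spectrum
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    # Kurtosis as the normalised fourth central moment
    centered = frame - frame.mean()
    kurtosis = float(np.mean(centered ** 4) /
                     (np.mean(centered ** 2) ** 2 + eps))
    return energy, zcr, flatness, kurtosis

# 25 ms frame of a 100 Hz tone at 8 kHz sampling rate (synthetic input)
frame = np.sin(2 * np.pi * 100 * np.arange(200) / 8000)
energy, zcr, flatness, kurtosis = frame_features(frame)
```

For a pure tone, the spectral flatness stays well below 1 (its value for white noise), and the kurtosis is near 1.5 rather than the Gaussian value of 3.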
[776] J. Yao and J. M. Odobez. Fast human detection in videos using joint appearance and foreground learning from covariances of image feature subsets. Idiap-RR-19-2009, Idiap, 2009. [ bib | .pdf ]
We present a fast method to detect humans from stationary surveillance videos. Traditional approaches exploit background subtraction as an attentive filter, applying still image detectors only on foreground regions. This does not take into account that foreground observations contain human shape information which can be used for detection. To address this issue, we propose a method that learns the correlation between appearance and foreground information. It is based on a cascade of LogitBoost classifiers which uses covariance matrices computed from appearance and foreground features as object descriptors. We account for the fact that covariance matrices lie in a Riemannian space, introduce several novelties (such as exploiting only covariance sub-matrices) to reduce the induced computation load, as well as an image rectification scheme to remove the slant of people in images when dealing with wide angle cameras. Evaluation on a large set of videos shows that our approach performs better than the attentive filter paradigm while processing 5 to 20 frames/sec. In addition, on the INRIA human (static image) benchmark database, our sub-matrix approach performs better than the full covariance case while reducing the computation cost by more than one order of magnitude.

Keywords: IM2.VP, Report_VIII
[777] V. Frinken and H. Bunke. Self-training strategies for handwriting word recognition. In Proc. Industrial Conf. Advances in Data Mining. Applications and Theoretical Aspects, LNCS 5633, pages 291–300. Springer, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[778] F. De Simone, F. Dufaux, T. Ebrahimi, C. Delogu, and V. Baroncini. A subjective study of the influence of color information on visual quality assessment of high resolution pictures. In Fourth International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM-09), 2009. [ bib | http ]
This paper presents the design and the results of a psychovisual experiment which aims at understanding how the color information affects the perceived quality of a high resolution still picture. The results of this experiment shed light on the importance of color for human observers and could be used to improve the performance of objective quality metrics.

Keywords: IM2.MCA, Report_VIII
[779] J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan. A large margin algorithm for forced alignment. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. John Wiley and Sons, 2009. [ bib ]
We describe and analyze a discriminative algorithm for learning to align a phoneme sequence of a speech utterance with its acoustical signal counterpart by predicting a timing sequence representing the phoneme start times. In contrast to common HMM-based approaches, our method employs a discriminative learning procedure in which the learning phase is tightly coupled with the forced alignment task. The alignment function we devise is based on mapping the input acoustic-symbolic representations of the speech utterance along with the target timing sequence into an abstract vector space. We suggest a specific mapping into the abstract vector-space which utilizes standard speech features (e.g. spectral distances) as well as confidence outputs of a frame-based phoneme classifier. Generalizing the notion of separation with a margin used in support vector machines (SVM) for binary classification, we cast the learning task as the problem of finding a vector in an abstract inner-product space. We set the prediction vector to be the solution of a minimization problem with a large set of constraints. Each constraint enforces a gap between the projection of the correct target timing sequence and the projection of an alternative, incorrect, timing sequence onto the vector. Though the number of constraints is very large, we describe a simple iterative algorithm for efficiently learning the vector and analyze the formal properties of the resulting learning algorithm. We report experimental results comparing the proposed algorithm to previous studies on forced alignment, which use hidden Markov models (HMM). The results obtained in our experiments using the discriminative alignment algorithm outperform the state-of-the-art systems on the TIMIT corpus.

Keywords: IM2.AP, Report_VIII
[780] G. Aradilla, H. Bourlard, and M. Magimai-Doss. Posterior features applied to speech recognition tasks with user-defined vocabulary. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[781] D. Grangier, J. Keshet, and S. Bengio. Discriminative keyword spotting. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. John Wiley and Sons, 2009. [ bib ]
This chapter introduces a discriminative method for detecting and spotting keywords in spoken utterances. Given a word represented as a sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time span of the phoneme sequence in the spoken utterance along with a confidence. If the prediction confidence is above a certain level, the keyword is declared to be spoken in the utterance within the predicted time span; otherwise the keyword is declared as not spoken. The training problem is formulated as a discriminative task where the model parameters are chosen so that an utterance in which the keyword is spoken receives a higher confidence than any utterance in which the keyword is not spoken. It is shown theoretically and empirically that the proposed training method results in a high area under the receiver operating characteristic (ROC) curve, the most common measure used to evaluate keyword spotters. We present an iterative algorithm to train the keyword spotter efficiently. The proposed approach contrasts with standard spotting strategies based on HMMs, for which the training procedure does not maximize a loss directly related to the spotting performance. Several experiments performed on the TIMIT and WSJ corpora show the advantage of our approach over HMM-based alternatives.

Keywords: IM2.AP, Report_VIII
[782] B. Dumas, D. Lalanne, and R. Ingold. Benchmarking fusion engines of multimodal interactive systems. In Proceedings of International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), Cambridge (MA) (USA), 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[783] S. Thomas, S. Ganapathy, and H. Hermansky. Phoneme recognition using spectral envelope and modulation frequency features. Idiap-RR-04-2009, Idiap, 2009. [ bib | .pdf ]
We present a new feature extraction technique for phoneme recognition that uses short-term spectral envelope and modulation frequency features. These features are derived from sub-band temporal envelopes of speech estimated using Frequency Domain Linear Prediction (FDLP). While spectral envelope features are obtained by the short-term integration of the sub-band envelopes, the modulation frequency components are derived from the long-term evolution of the sub-band envelopes. These features are combined at the phoneme posterior level and used as features for a hybrid HMM-ANN phoneme recognizer. For the phoneme recognition task on the TIMIT database, the proposed features show an improvement of 4.7% over the other feature extraction techniques.

Keywords: IM2.AP, Report_VIII
[784] J. Keshet, D. Grangier, and S. Bengio. Discriminative keyword spotting. Speech Communication, 51(4):317–329, 2009. [ bib | .pdf ]
Keywords: IM2.AP, Report_VIII
[785] X. Perrin, F. Colas, C. Pradalier, and R. Siegwart. Learning to identify users and predict their destination in a robotic guidance application. In Field and Service Robotics (FSR), Cambridge, MA, 2009. [ bib ]
Keywords: IM2.BMI, Report_VIII
[786] K. Kryszczuk and A. Drygajlo. Improving biometric verification with class-independent quality information. IET Signal Processing, Special Issue on Biometric Recognition, 3(4):310–321, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[787] I. Ivanov, F. Dufaux, T. M. Ha, and T. Ebrahimi. Towards generic detection of unusual events in video surveillance. In 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS'09), 2009. [ bib | http ]
In this paper, we consider the challenging problem of unusual event detection in video surveillance systems. The proposed approach takes a step toward generic and automatic detection of unusual events in terms of velocity and acceleration. First, the moving objects in the scene are detected and tracked. A better representation of moving object trajectories is then achieved by means of appropriate pre-processing techniques. A supervised Support Vector Machine method is then used to train the system with one or more typical sequences, and the resulting model is then used to test the proposed method on other typical sequences (different scenes and scenarios). Experimental results are promising. The presented approach is capable of detecting unusual events similar to those in the training sequences.

Keywords: Unusual event; Trajectory representation; Feature extraction; Support Vector Machine classifier, IM2.MCA, Report_VIII
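The velocity and acceleration cues mentioned in the abstract above can be extracted from a tracked object trajectory as sketched below; the SVM training stage of the paper is omitted, and the sample track is hypothetical:

```python
import numpy as np

def trajectory_features(points, dt=1.0):
    """Speed and acceleration magnitudes along a tracked 2-D trajectory
    sampled at a fixed time step dt (finite-difference estimates)."""
    points = np.asarray(points, dtype=float)
    velocity = np.diff(points, axis=0) / dt          # shape (N-1, 2)
    acceleration = np.diff(velocity, axis=0) / dt    # shape (N-2, 2)
    speed = np.linalg.norm(velocity, axis=1)
    accel = np.linalg.norm(acceleration, axis=1)
    return speed, accel

# A constant-velocity track yields uniform speed and zero acceleration,
# the kind of "usual" pattern a classifier would contrast with abrupt motion.
track = [(0, 0), (1, 1), (2, 2), (3, 3)]
speed, accel = trajectory_features(track)
```

Such per-point feature vectors could then feed a standard SVM classifier, in the spirit of the pipeline the abstract describes.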
[788] S. Voloshynovskiy, O. Koval, F. Beekhof, and T. Pun. Random projections based item authentication. In Electronic Imaging 2009, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[789] J. Galbally, C. McCool, J. Fierrez, S. Marcel, and J. Ortega-Garcia. Hill-climbing attack to an eigenface-based face verification system. In Proceedings of the First IEEE International Conference on Biometrics, Identity and Security (BIdS), 2009. [ bib | .pdf ]
We use a general hill-climbing attack algorithm based on Bayesian adaptation to test the vulnerability of an Eigenface-based approach to face recognition against indirect attacks. The attacking technique uses the scores provided by the matcher to adapt a global distribution, computed from a development set of users, to the local specificities of the client being attacked. The proposed attack is evaluated on an Eigenface-based verification system using the XM2VTS database. The results show a very high efficiency of the hill-climbing algorithm, which successfully bypassed the system for over 85% of the attacked accounts.

Keywords: IM2.VP, Report_VIII
[790] S. Lefèvre and J. M. Odobez. Structure and appearance features for robust 3d facial actions tracking. In International Conference on Multimedia and Expo (ICME), 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[791] K. Ali, F. Fleuret, D. Hasler, and P. Fua. Joint learning of pose estimators and features for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009. To appear. [ bib ]
Keywords: IM2.VP, Report_VIII
[792] F. Fleuret. Multi-layer boosting for pattern recognition. Pattern Recognition Letters (PRL), 30:237–241, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[793] D. Vijayasenan, F. Valente, and H. Bourlard. Mutual information based channel selection for speaker diarization of meetings data. In Proceedings of International Conference on Acoustics, Speech and Signal Processing, 2009. [ bib | .pdf ]
In the meeting case scenario, audio is often recorded using Multiple Distance Microphones (MDM) in a non-intrusive manner. Typically, beamforming is performed in order to obtain a single enhanced signal out of the multiple channels. This paper investigates the use of mutual information for selecting the channel subset that produces the lowest error in a diarization system. Conventional systems perform channel selection on the basis of signal properties such as SNR or cross-correlation. In this paper, we propose the use of a mutual information measure that is directly related to the objective function of the diarization system. The proposed algorithms are evaluated on the NIST RT 06 eval dataset. Channel selection improves the speaker error by 1.1% absolute (6.5% relative) w.r.t. the use of all channels.

Keywords: IM2.AP, Report_VIII
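For illustration, a generic plug-in histogram estimator of mutual information between two channels is sketched below; note that the paper's measure is tied to the diarization objective function, which this generic estimator does not capture, and the synthetic channels are invented for the example:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in histogram estimate of I(X;Y) in nats (always >= 0)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# A lightly corrupted copy of a signal shares far more information with it
# than an unrelated noise channel does.
rng = np.random.default_rng(0)
clean = rng.normal(size=5000)
good_channel = clean + 0.1 * rng.normal(size=5000)   # low-noise channel
bad_channel = rng.normal(size=5000)                  # unrelated channel
```

A channel-selection scheme in this spirit would rank or greedily grow the channel subset by such an information measure instead of SNR.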
[794] E. Bruno and S. Marchand-Maillet. Multimodal preference aggregation for multimedia information retrieval. Journal of Multimedia, 2009. To appear. [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[795] B. Raducanu and D. Gatica-Perez. You are fired! Nonverbal role analysis in competitive meetings. In Proc. ICASSP, Taiwan, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[796] J. L. Bloechle, D. Lalanne, and R. Ingold. OCD: an optimized and canonical document format. In Proceedings of 10th IEEE International Conference on Document Analysis and Recognition (ICDAR 2009), pages 236–240, Barcelona (Spain), 2009. [ bib ]
Keywords: IM2.DMA, Report_VIII
[797] P. Motlicek. Automatic out-of-language detection based on confidence measures derived from LVCSR word and phone lattices. In 10th Annual Conference of the International Speech Communication Association [630], pages 1215–1218. [ bib | .pdf ]
Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, the target language, with a medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). The results indicate that employing specific temporal context (integrated at the word or phone level) significantly increases the detection accuracy. Furthermore, we show that combining several CMs can also improve the efficiency of detection.

Keywords: IM2.AP, Report_VIII
[798] D. Jayagopi and D. Gatica-Perez. Discovering group nonverbal conversational patterns with topics. In Proc. ICMI-MLMI, Boston, USA, 2009. Accepted for publication. [ bib ]
Keywords: IM2.MPR, Report_VIII
[799] J. Berclaz, F. Fleuret, and P. Fua. Multiple object tracking using flow linear programming. Technical Report 10-2009, IDIAP Research Institute, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[800] N. Garg, B. Favre, K. Riedhammer, and D. Hakkani-Tur. Clusterrank: a graph based method for meeting summarization. In 10th International Conference of the International Speech Communication Association, Brighton, UK, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[801] F. Beekhof, S. Voloshynovskiy, O. Koval, and T. Holotyak. Multi-class classifiers based on binary classifiers: performance, efficiency, and minimum coding matrix distances. In MLSP 2009, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[802] P. N. Garner. A map approach to noise compensation of speech. Idiap-RR Idiap-RR-08-2009, Idiap, 2009. [ bib | .pdf ]
We show that estimation of parameters for the popular Gaussian model of speech in noise can be regularised in a Bayesian sense by use of simple prior distributions. For two example prior distributions, we show that the marginal distribution of the uncorrupted speech is non-Gaussian, but the parameter estimates themselves have tractable solutions. Speech recognition experiments serve to suggest values for hyper-parameters, and demonstrate that the theory is practically applicable.

Keywords: IM2.AP, Report_VIII
[803] A. Popescu-Belis, J. Carletta, J. Kilgour, and P. Poller. Accessing a large multimodal corpus using an automatic content linking device. In M. Kipp, J. C. Martin, P. Paggio, and D. Heylen, editors, Multimodal Corpora, LNAI. Springer-Verlag, Berlin/Heidelberg, 2009. [ bib ]
Keywords: IM2.MPR, IM2.DMA, IM2.HMI, Report_VIII
[804] E. Ricci and J. M. Odobez. Real-time simultaneous head tracking and pose estimation. In IEEE International Conference on Image Processing (ICIP), 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[805] S. Duffner, J. M. Odobez, and E. Ricci. Dynamic partitioned sampling for tracking with discriminative features. In Proceedings of the British Machine Vision Conference, 2009. [ bib | .pdf ]
We present a multi-cue fusion method for tracking with particle filters which relies on a novel hierarchical sampling strategy. Similarly to previous works, it tackles the problem of tracking in a relatively high-dimensional state space by dividing such a space into partitions, each one corresponding to a single cue, and sampling from them in a hierarchical manner. However, unlike other approaches, the order of partitions is not fixed a priori but changes dynamically depending on the reliability of each cue, i.e. more reliable cues are sampled first. We call this approach Dynamic Partitioned Sampling (DPS). The reliability of each cue is measured in terms of its ability to discriminate the object with respect to the background, where the background is not described by a fixed model or by random patches but is represented by a set of informative "background particles" which are tracked in order to be as similar as possible to the object. The effectiveness of this general framework is demonstrated on the specific problem of head tracking with three different cues: colour, edge and contours. Experimental results prove the robustness of our algorithm in several challenging video sequences.

Keywords: IM2.VP, Report_VIII
[806] M. Wöllmer, F. Eyben, J. Keshet, A. Graves, B. Schuller, and G. Rigoll. Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional lstm networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan, 2009. [ bib | .pdf ]
In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, our approach is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.

Keywords: IM2.AP, Report_VIII
[807] G. Friedland, C. Yeo, and H. Hung. Visual speaker localization aided by acoustic models. In ACM Multimedia, 2009. [ bib ]
The following paper presents a novel audio-visual approach for unsupervised speaker locationing. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when"; then, in a second step, the visual models are used to infer the location of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system is able to exploit audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization, but also to add visual speaker locationing at little incremental engineering and computation cost.

Keywords: IM2.MPR, Report_VIII
[808] J. Keshet. A proposal for a kernel-based algorithm for large vocabulary continuous speech recognition. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. John Wiley and Sons, 2009. [ bib ]
We present a proposal for a kernel-based model for large vocabulary continuous speech recognition. Continuous speech recognition is described as the problem of finding the best phoneme sequence and its best time span, where the phonemes are generated from all permissible word sequences. A non-probabilistic score is assigned to every phoneme sequence and time span sequence, according to a kernel-based acoustic model and a kernel-based language model. The acoustic model is described in terms of segments, where each segment corresponds to a whole phoneme, and it generalizes Segmental Models to the non-probabilistic setup. The language model is based on the discriminative language model recently proposed by Roark et al. (2007). We devise a loss function based on the word error rate and present a large margin training procedure for the kernel models, which aims at minimizing this loss function. Finally, we discuss the practical issues of implementing a kernel-based continuous speech recognition model by presenting an efficient iterative algorithm and considering the decoding process. We conclude the chapter with a brief discussion of the model's limitations and future work. This chapter does not introduce any experimental results.

Keywords: IM2.AP, Report_VIII
[809] F. Valente, M. Magimai-Doss, C. Plahl, and R. Suman. Hierarchical processing of the modulation spectrum for gale mandarin lvcsr system. In Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech), 2009. [ bib | .pdf ]
This paper aims at investigating the use of TANDEM features based on hierarchical processing of the modulation spectrum. The study is done in the framework of the GALE project for recognition of Mandarin broadcast data. We describe the improvements obtained using the hierarchical processing and the addition of features like pitch and short-term critical band energy. Results are consistent with previous findings on a different LVCSR task, suggesting that the proposed technique is effective and robust across several conditions. Furthermore, we describe the integration into the RWTH GALE LVCSR system trained on 1600 hours of Mandarin data and present progress across the GALE 2007 and GALE 2008 RWTH systems, resulting in an approximately 20% CER reduction on several data sets.

Keywords: speech recognition, TANDEM features, IM2.AP, Report_VIII
[810] G. Gonzalez, F. Fleuret, and P. Fua. Learning rotational features for filament detection. In Proceedings of the IEEE international conference on Computer Vision and Pattern Recognition (CVPR), 2009. (to appear). [ bib ]
Keywords: IM2.VP, Report_VIII
[811] K. Kumatani, J. McDonough, B. Rauch, P. N. Garner, W. Li, and J. Dines. Maximum kurtosis beamforming with the generalized sidelobe canceller. In Proceedings of INTERSPEECH, September 2008. [ bib | .pdf ]
This paper presents an adaptive beamforming application based on the capture of far-field speech data from a real single speaker in a real meeting room. After the position of a speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so that kurtosis of output signals is maximized. Our beamforming algorithms can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate the effectiveness of our proposed techniques through a series of automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The beamforming algorithm proposed here achieved a 13.6% WER, whereas the simple delay-and-sum beamformer provided a WER of 17.8%.

Keywords: IM2.AP, Report_VIII
[812] S. Marchand-Maillet, E. Szekely, and E. Bruno. Optimizing strategies for the exploration of social-networks and associated data collections. In Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'09) - Special session on "People, Pixels, Peers: Interactive Content in Social Networks", London, UK, 2009. (invited). [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[813] E. Bertini, D. Lalanne, and M. Rigamonti. Extended excentric labeling. International Journal of the Eurographics Association, 28, 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[814] J. Baker, L. Deng, J. Glass, S. Khudanpur, C. H. Lee, N. Morgan, and D. O'Shaughnessy. Research developments and directions in speech recognition and understanding. IEEE Signal Processing Magazine, 26(4):78–85, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[815] N. Noceti, B. Caputo, C. Castellini, L. Baldassarre, A. Barla, L. Rosasco, F. Odone, and G. Sandini. Towards a theoretical framework for learning multi-modal patterns for embodied agents. In International Conference on Image Analysis and Processing, 2009. [ bib | .pdf ]
Keywords: IM2.MPR, Report_VIII
[816] A. Popescu-Belis, P. Poller, J. Kilgour, E. Boertjes, J. Carletta, S. Castronovo, M. Fapso, M. Flynn, A. Nanchen, T. Wilson, J. de Wit, and M. Yazdani. A multimedia retrieval system using speech input. In ICMI-MLMI 2009 (11th International Conference on Multimodal Interfaces and 6th Workshop on Machine Learning for Multimodal Interaction), Cambridge, MA, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[817] D. Jayagopi. Modeling dominance in group conversations using nonverbal activity cues. IEEE Trans. on Audio, Speech, and Language Processing, Special Issue on Multimodal Processing for Speech-based Interactions, 17:501–513, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[818] J. S. Lee, F. De Simone, and T. Ebrahimi. Video coding based on audio-visual attention. In IEEE International Conference on Multimedia and Expo (ICME'09), 2009. [ bib | http ]
This paper proposes an efficient video coding method based on audio-visual attention, which is motivated by the fact that cross-modal interaction significantly affects human perception of multimedia experiences. First, we propose an audio-visual source localization method to locate the sound source in a video sequence. Then, its result is used to apply spatial blurring to the images in order to reduce redundant high-frequency information and achieve coding efficiency. We demonstrate the effectiveness of the proposed method for H.264/AVC coding along with the results of a subjective test.

Keywords: video coding, audio-visual attention, cross-modal interaction, source localization, H.264, perceived audio-visual quality, IM2.MCA, Report_VIII
[819] P. Rajan, S. H. K. Parthasarathi, and H. Murthy. Robustness of phase based features for speaker recognition. In Proceedings of Interspeech [845]. [ bib | .pdf ]
This paper demonstrates the robustness of group-delay based features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates when compared with the traditional MFCC features. We also discuss using feature diversity to dynamically choose the feature for every claimed speaker.

Keywords: IM2.AP, Report_VIII
[820] M. Baechler, J. L. Bloechle, A. Humm, R. Ingold, and J. Hennebert. Labeled images verification using gaussian mixture models. In Proceedings of 24th Annual ACM Symposium on Applied Computing (ACM SAC'09), pages 1331–1336, Honolulu, Hawaii (USA), 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[821] S. Ba, H. Hung, and J. M. Odobez. Visual activity context for focus of attention estimation in dynamic meetings. In IEEE Proc. Int. Conf. on Multimedia and Expo (ICME), New York, 2009. [ bib ]
Keywords: IM2.VP, Report_VIII
[822] D. Lalanne, L. Nigay, P. Palanque, P. Robinson, J. Vanderdonckt, and J. F. Ladry. Fusion engines for multimodal interfaces: a survey. In Proceedings of International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), Cambridge (MA) (USA), 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[823] E. Bruno and S. Marchand-Maillet. Multiview clustering: a late fusion approach using latent models. In Proceedings of the 32nd ACM Special Interest Group on Information Retrieval Conference, SIGIR 09, Boston, USA, 2009. [ bib ]
Keywords: IM2.MCA, Report_VIII
[824] B. Picart. Improved phone posterior estimation through k-nn and mlp-based similarity. Idiap-RR Idiap-RR-18-2009, Idiap, Rue Marconi 19, 1920 Martigny, Switzerland, 2009. [ bib | .pdf ]
In this work, we investigate the possible use of k-nearest neighbour (kNN) classifiers to perform frame-based acoustic phonetic classification, hence replacing Gaussian Mixture Models (GMM) or MultiLayer Perceptrons (MLP) used in standard Hidden Markov Models (HMMs). The driving motivation behind this idea is the fact that kNN is known to be an "optimal" classifier if a very large amount of training data is available (replacing the training of functional parameters by plain memorization of the training examples) and the correct distance metric is found. Nowadays, the amount of training data is no longer an issue. In the current work, we thus specifically focused on the "correct" distance metric, mainly using an MLP to compute the probability that two input feature vectors are part of the same phonetic class or not. This MLP output can thus be used as a distance metric for kNN. While providing a "universal" distance metric, this work also enabled us to consider the speech recognition problem from a different angle, simply formulated in terms of hypothesis tests: "Given two feature vectors, what is the probability that these belong to the same (phonetic) class or not?". Actually, one of the main goals of the present thesis boils down to one interesting question: "Is it easier to classify feature vectors into C phonetic classes or to tell whether or not two feature vectors belong to the same class?". This work was done with standard acoustic features as inputs (PLP) and with posterior features (resulting from another pre-training MLP). Both feature sets indeed exhibit different properties and metric spaces. For example, while the use of posteriors as input is motivated by the fact that they are speaker and environment independent (so they capture much of the phonetic information contained in the signal), they are also no longer Gaussian distributed.
When showing mathematically that using the MLP as a similarity measure makes sense, we discovered that this measure was equivalent to a very simple metric that can be computed analytically without the use of an MLP. This new measure is in fact the scalar product between two posterior feature vectors. Experiments have been conducted on hypothesis tests and on kNN classification. Results of the hypothesis tests show that posterior feature vectors achieve better performance than acoustic feature vectors. Moreover, the use of the scalar product leads to better performance than the use of all other metrics (including the MLP-based distance metric), whatever the input features.

Keywords: IM2.AP, Report_VIII
[825] M. Gurban and J. Ph. Thiran. Information theoretic feature extraction for audio-visual speech recognition. IEEE Trans. on Signal Processing, in press, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[826] G. Bologna, S. Malandain, B. Deville, and T. Pun. The multi-touch see color interface. In ICTA 2009, The 2nd International Conference on Information and Communication Technologies and Accessibility, Hammamet, Tunisia, May 7–9, 2009. [ bib ]
Keywords: IM2.MCA, Report_VIII
[827] X. Perrin, R. Chavarriaga, C. Pradalier, J. del R. Millán, and R. Siegwart. Dialog management technique for brain-computer interfaces. Technical report, Autonomous Systems Lab, ETHZ, 2009. [ bib ]
Keywords: IM2.BMI, Report_VIII
[828] J. Yao and J. M. Odobez. Multi-camera multi-person 3d space tracking with mcmc in surveillance scenarios. In European Conference on Computer Vision, workshop on Multi Camera and Multi-modal Sensor Fusion Algorithms and Applications (ECCV-M2SFA2), 2009. [ bib | .pdf ]
We present an algorithm for the tracking of a variable number of 3D persons in a multi-camera setting with partial field-of-view overlap. The multi-object tracking problem is posed in a Bayesian framework and relies on a joint multi-object state space with individual object states defined in the 3D world. The Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) method is used to efficiently search the state-space and recursively estimate the multi-object configuration. The paper presents several contributions: i) the use and extension of several key features for efficient and reliable tracking (e.g. the use of the MCMC framework for multiple camera MOT; the use of powerful human detector outputs in the MCMC proposals to automatically initialize/update object tracks); ii) the definition of an appropriate prior on the object state, to take into account the effects of 2D image measurement uncertainties on the 3D object state estimation due to depth effects; iii) a simple rectification method aligning people's 3D standing direction with the 2D image vertical axis, allowing us to obtain better object measurements relying on rectangular boxes and integral images; iv) representing objects with multiple reference color histograms, to account for variability in color measurement due to changes in pose, lighting, and importantly multiple camera view points. Experimental results on challenging real-world tracking sequences and situations demonstrate the efficiency of our approach.

Keywords: IM2.VP, Report_VIII
[829] B. Caputo, E. Hayman, M. Fritz, and J. O. Eklundh. Classifying material in the real world. Image and Vision Computing, accepted for publication, 2009. [ bib ]
Keywords: IM2.VP, IM2.MCR, Report_VIII
[830] H. Salamin, S. Favre, and A. Vinciarelli. Automatic role recognition in multiparty recordings: Using social affiliation networks for feature extraction. IEEE Transactions on Multimedia, To Appear, 2009. [ bib | .pdf ]
Automatic analysis of social interactions attracts increasing attention in the multimedia community. This paper considers one of the most important aspects of the problem, namely the roles played by individuals interacting in different settings. In particular, this work proposes an automatic approach for the recognition of roles in both production environment contexts (e.g., news and talk-shows) and spontaneous situations (e.g., meetings). The experiments are performed over roughly 90 hours of material (one of the largest databases used for role recognition in the literature) and show that the recognition effectiveness depends on how much the roles influence the behavior of people. Furthermore, this work proposes the first approach for modeling mutual dependences between roles and assesses its effect on role recognition performance.

Keywords: IM2.MCA, Report_VIII
[831] N. Garg and D. Gatica-Perez. Tagging and retrieving images with co-occurrence models: from corel to flickr. Idiap-RR Idiap-RR-21-2009, Idiap, 2009. [ bib | .pdf ]
This paper presents two models for content-based automatic image annotation and retrieval in web image repositories, based on the co-occurrence of tags and visual features in the images. In particular, we show how additional measures can be taken to address the noisy and limited tagging problems, in datasets such as Flickr, to improve performance. An image is represented as a bag of visual terms computed using edge and color information. The first model begins with a naive Bayes approach and then improves upon it by using image pairs as single documents to significantly reduce the noise and increase annotation performance. The second method models the visual features and tags as a graph, and uses query expansion techniques to improve the retrieval performance. We evaluate our methods on the commonly used 150 concept Corel dataset, and a much harder 2000 concept Flickr dataset.

Keywords: IM2.MCA, Report_VIII
[832] E. Bertini and D. Lalanne. Surveying the complementary roles of automatic data analysis and visualization in knowledge discovery. In Proceedings of ACM SIGKDD Workshop on Visual Analytics and Knowledge Discovery, VAKD '09, 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (VAKD 2009), pages 12–20, Paris (France), 2009. [ bib ]
Keywords: IM2.HMI, Report_VIII
[833] D. Jayagopi, R. Bogdan, and D. Gatica-Perez. Characterising conversational group dynamics using nonverbal behaviour. In Proceedings of ICME 2009, 2009. [ bib | .pdf ]
This paper addresses the novel problem of characterizing conversational group dynamics. It is well documented in social psychology that the dynamics differ depending on the objectives of a group. For example, a competitive meeting has a different objective from that of a collaborative meeting. We propose a method to characterize group dynamics based on the joint description of the group members' aggregated acoustical nonverbal behaviour to classify two meeting datasets (one being cooperative-type and the other being competitive-type). We use 4.5 hours of real behavioural multi-party data and show that our methodology can achieve a classification rate of up to 100%.

Keywords: IM2.MCA, Report_VIII
[834] K. Zhu, A. Drygajlo, and W. Li. Q-stack aging model for face verification. Glasgow, UK, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[835] N. Garg. Co-occurrence models for image annotation and retrieval. Idiap-RR Idiap-RR-22-2009, Idiap, 2009. Ecole Polytechnique Fédérale de Lausanne - Master Thesis. [ bib | .pdf ]
We present two models for content-based automatic image annotation and retrieval in web image repositories, based on the co-occurrence of tags and visual features in the images. In particular, we show how additional measures can be taken to address the noisy and limited tagging problems, in datasets such as Flickr, to improve performance. As in many state-of-the-art works, an image is represented as a bag of visual terms computed using edge and color information. The co-occurrence information of visual terms and tags is used to create models for image annotation and retrieval. The first model begins with a naive Bayes approach and then improves upon it by using image pairs as single documents to significantly reduce the noise and increase annotation performance. The second method models the visual terms and tags as a graph, and uses query expansion techniques to improve the retrieval performance. We evaluate our methods on the commonly used 150 concept Corel dataset, and a much harder 2000 concept Flickr dataset.

Keywords: IM2.MCA, Report_VIII
[836] F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based perceptrons. Journal of Machine Learning Research, accepted for publication, 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[837] D. Vijayasenan, F. Valente, and H. Bourlard. Kl realignment for speaker diarization with multiple feature streams. In 10th Annual Conference of the International Speech Communication Association, 2009. [ bib ]
This paper aims at investigating the use of Kullback-Leibler (KL) divergence based realignment with application to speaker diarization. The KL divergence based realignment operates directly on the speaker posterior distribution estimates and is compared with traditional realignment performed using an HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than Gaussian mixture models in the case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. These experiments reveal that in the case of conventional MFCC features the two approaches yield the same performance, while the KL based system outperforms the HMM/GMM re-alignment in the case of a combination of multiple feature streams (MFCC and TDOA).

Keywords: IM2.AP, Report_VIII
[838] M. Pantic and A. Vinciarelli. Implicit human centered tagging. IEEE Signal Processing Magazine, 26, 2009. [ bib | .pdf ]
Keywords: IM2.MCA, Report_VIII
[839] F. Valente. A novel criterion for classifiers combination in multistream speech recognition. IEEE Signal Processing Letters, 16(7):561–564, 2009. [ bib | DOI | .pdf ]
In this paper we propose a novel information theoretic criterion for optimizing the linear combination of classifiers in multistream automatic speech recognition. We discuss an objective function that achieves a trade-off between the minimization of a bound on the Bayes probability of error and the minimization of the divergence between the individual classifier outputs and their combination. The method is compared with the conventional inverse entropy and minimum entropy combinations on both small and large vocabulary automatic speech recognition tasks. Results reveal that it outperforms other linear combination rules. Furthermore we discuss the advantages of the proposed approach and the extension to other (non-linear) combination rules.

Keywords: IM2.AP, Report_VIII
[840] J. Richiardi, A. Drygajlo, and K. Kryszczuk. Static models of derivative-coordinates phase spaces for multivariate time series classification: an application to signature verification. pages 140–149, Alghero, Italy, 2009. [ bib ]
Keywords: IM2.MPR, Report_VIII
[841] F. Orabona, C. Castellini, B. Caputo, A. E. Fiorilla, and G. Sandini. Model adaptation with least-squares svm for adaptive hand prosthetics. Idiap-RR Idiap-RR-05-2009, Idiap, March 2009. Accepted in ICRA09. [ bib | .pdf ]
The state-of-the-art in control of hand prosthetics is far from optimal. The main control interface is represented by surface electromyography (EMG): the activation potentials of the remnants of large muscles of the stump are used in a non-natural way to control one or, at best, two degrees-of-freedom. This has two drawbacks: first, the dexterity of the prosthesis is limited, leading to poor interaction with the environment; second, the patient undergoes a long training time. As more dexterous hand prostheses are put on the market, the need for a finer and more natural control arises. Machine learning can be employed to this end. A desired feature is that of providing a pre-trained model to the patient, so that a quicker and better interaction can be obtained. To this end we propose model adaptation with least-squares SVMs, a technique that allows the automatic tuning of the degree of adaptation. We test the effectiveness of the approach on a database of EMG signals gathered from human subjects. We show that, when pre-trained models are used, the number of training samples needed to reach a certain performance is reduced, and the overall performance is increased, compared to what would be achieved by starting from scratch.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[842] B. Raducanu, J. Vitria, and D. Gatica-Perez. You are fired! nonverbal role analysis in competitive meetings. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Taiwan, April 2009. [ bib | .pdf ]
This paper addresses the problem of social interaction analysis in competitive meetings, using nonverbal cues. For our study, we made use of "The Apprentice" reality TV show, which features a competition for a real, highly paid corporate job. Our analysis is centered around two tasks regarding a person's role in a meeting: predicting the person with the highest status and predicting the fired candidates. The current study was carried out using nonverbal audio cues. Results obtained from the analysis of a full season of the show, representing around 90 minutes of audio data, are very promising (up to 85.7% accuracy in the first case and up to 92.8% in the second case). Our approach is based only on the nonverbal interaction dynamics during the meeting, without relying on the spoken words.

Keywords: Report_IX, IM2.IP3, Group Bourlard, inproceedings
[843] D. Vijayasenan, F. Valente, and H. Bourlard. Mutual information based channel selection for speaker diarization of meetings data. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, April 2009. [ bib ]
Keywords: IM2.AP, Report_VIII
[844] W. Li, J. Dines, M. Magimai-Doss, and H. Bourlard. Non-linear mapping for multi-channel speech separation and robust overlapping speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2009. [ bib | .pdf ]
This paper investigates a non-linear mapping approach to extract robust features for ASR and separation of overlapping speech. Based on our previous studies, we continue to use two additional sound sources, namely, from the target and interfering speakers. The focus of this work is twofold: 1) we investigate the feature mapping between different domains with consideration of the MMSE criterion and regression optimizations, demonstrating that the mapping of log mel-filterbank energies to MFCCs can be exploited to improve the effectiveness of the regression; 2) we investigate data-driven filtering for speech separation using the mapping method, which can be viewed as a generalized log spectral subtraction and results in better separation performance. We demonstrate the effectiveness of the proposed approach through extensive evaluations on the MONC corpus, which includes both non-overlapping single-speaker and overlapping multi-speaker conditions.

Keywords: binary masking, microphone array, neural network, overlapping speech recognition, speech separation, IM2.AP, Report_VIII
[845] Padmanabhan Rajan, Sree Hari Krishnan Parthasarathi, and Hema A Murthy. Robustness of phase based features for speaker recognition. Idiap-RR Idiap-RR-14-2009, Idiap, June 2009. [ bib | .pdf ]
This paper demonstrates the robustness of group-delay based features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates compared with the traditional MFCC features. We also discuss using feature diversity to dynamically choose the feature for every claimed speaker.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
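The group delay function underlying such features can be computed without explicit phase unwrapping via a standard DFT identity; a minimal sketch (plain group delay, not the modified variants often used for extra robustness):

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay tau(w) = -d(phase)/dw, computed via the identity
    tau = (Xr*Yr + Xi*Yi) / |X|^2 with Y = FFT(n * x[n]),
    which avoids phase unwrapping entirely."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    denom = np.maximum(np.abs(X) ** 2, 1e-10)  # guard spectral zeros
    return (X.real * Y.real + X.imag * Y.imag) / denom
```

For a pure delay (an impulse at sample d) the group delay is d at every frequency, which makes a convenient sanity check.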
[846] N. Scaringella. On the design of audio features robust to the album-effect for music information retrieval. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, June 2009. Thèse EPFL, no 4412 (2009). Dir.: Hervé Bourlard. [ bib ]
Short-term spectral features - and most notably Mel-Frequency Cepstral Coefficients (MFCCs) - are the most widely used descriptors of audio signals and are deployed in a majority of state-of-the-art Music Information Retrieval (MIR) systems. These descriptors have however demonstrated their limitations in the context of speech processing when training and testing conditions of the system do not match, e.g. in noisy conditions or under a channel mismatch. A related problem has been observed in the context of music processing. It has indeed been hypothesized that MIR algorithms relying on short-term spectral features were unexpectedly picking up on similarities in the production/mastering qualities of music albums. This problem has been referred to as the album-effect in the literature, though it has never been studied in depth. It is shown in this thesis how the album-effect relates to the problem of channel mismatch. A measure of robustness to the album-effect is proposed, and channel normalization techniques borrowed from the speech processing community are evaluated to help improve the robustness of short-term spectral features. Alternatively, longer-term features describing critical-band specialized temporal patterns (TRAPs) are adapted to the context of music processing. It is shown how such features can help describe either timbre or rhythm content depending on the scale considered for analysis, and how robust they are to the album-effect. Contrary to more classic short-term spectral descriptors, TRAP-based features encode some form of prior knowledge of the problem considered through a trained feature extraction chain. The lack of appropriately annotated datasets, however, raises new issues when it comes to training the feature extraction chain. Advanced unsupervised learning strategies are considered in this thesis and evaluated against more traditional supervised approaches relying on coarse-grained annotations such as music genres. Specialized learning strategies and specialized architectures are also proposed to compensate for some inherent variability of the data, due either to album-related factors or to the dependence of music signals on the tempo of the performance.

Keywords: channel normalization, machine learning, music information retrieval, neural networks, rhythm, timbre, IM2.AP,Report_VIII
[847] K. Kumatani, J. McDonough, Barbara Rauch, D. Klakow, P. N. Garner, and Weifeng Li. Beamforming with a maximum negentropy criterion. IEEE Transactions on Audio Speech and Language Processing, 17(5):994–1008, July 2009. [ bib | .pdf ]
In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Gamma and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski post-filtering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in the conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with post-filtering achieved a word error rate (WER) of 16.5%. MN beamforming with the Gamma pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%. To the best of our knowledge, no lower error rates have been reported in the literature on this ASR task.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
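Negentropy itself can be approximated directly from samples. The sketch below uses Hyvarinen's one-point log-cosh approximation, as a stand-in for the Gamma / generalized Gaussian pdf models used in the paper:

```python
import numpy as np

def negentropy_logcosh(y, n_ref=200000, seed=0):
    """Approximate negentropy J(y) ~ (E[G(y)] - E[G(v)])^2 with
    G(u) = log cosh(u) and v a standard normal reference sample.
    Larger values mean the signal is further from Gaussian, which is
    what maximum-negentropy beamforming rewards."""
    y = (y - y.mean()) / y.std()            # zero mean, unit variance
    g_y = np.mean(np.log(np.cosh(y)))
    v = np.random.default_rng(seed).standard_normal(n_ref)
    g_v = np.mean(np.log(np.cosh(v)))
    return (g_y - g_v) ** 2
```

Speech is super-Gaussian, so a peaky (e.g. Laplacian) signal scores clearly higher than a Gaussian one.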
[848] Philip N. Garner. Snr features for automatic speech recognition. Idiap-RR Idiap-RR-25-2009, Idiap, September 2009. [ bib | .pdf ]
When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal to Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectral based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further show anecdotal evidence that SNR based features lend themselves well to noise estimates based on low-energy envelope tracking.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[849] A. Roy and S. Marcel. Haar local binary pattern feature for fast illumination invariant face detection. Idiap-RR Idiap-RR-28-2009, Idiap, September 2009. [ bib | .pdf ]
Face detection is the first step in many visual processing systems like face recognition, emotion recognition and lip reading. In this paper, we propose a novel feature called Haar Local Binary Pattern (HLBP) feature for fast and reliable face detection, particularly in adverse imaging conditions. This binary feature compares bin values of Local Binary Pattern histograms calculated over two adjacent image subregions. These subregions are similar to those in the Haar masks, hence the name of the feature. They capture the region-specific variations of local texture patterns and are boosted using AdaBoost in a framework similar to that proposed by Viola and Jones. Preliminary results obtained on several standard databases show that it competes well with other face detection systems, especially in adverse illumination conditions.

Keywords: IM2.VP, Report_VIII
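The feature construction can be sketched as follows; the region geometry and histogram-bin choice here are illustrative, not the boosted configuration selected by AdaBoost in the paper:

```python
import numpy as np

def lbp_codes(img):
    """8-neighbour Local Binary Pattern code for each interior pixel."""
    img = np.asarray(img, dtype=np.float64)
    c = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.int32) << bit)
    return codes

def hlbp_feature(img, rows_a, rows_b, bin_idx):
    """Toy HLBP-style binary feature: compare one bin of the LBP
    histogram between two adjacent subregions (given as row slices
    into the code map), mirroring the Haar-mask comparison idea."""
    codes = lbp_codes(img)
    h_a = np.bincount(codes[rows_a].ravel(), minlength=256)
    h_b = np.bincount(codes[rows_b].ravel(), minlength=256)
    return int(h_a[bin_idx] >= h_b[bin_idx])
```

A flat region produces the uniform code 255 everywhere, so comparing that bin against a textured region yields a discriminative bit.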
[850] A. Roy and S. Marcel. Haar local binary pattern feature for fast illumination invariant face detection. In British Machine Vision Conference 2009 [849]. [ bib | .pdf ]
Face detection is the first step in many visual processing systems like face recognition, emotion recognition and lip reading. In this paper, we propose a novel feature called Haar Local Binary Pattern (HLBP) feature for fast and reliable face detection, particularly in adverse imaging conditions. This binary feature compares bin values of Local Binary Pattern histograms calculated over two adjacent image subregions. These subregions are similar to those in the Haar masks, hence the name of the feature. They capture the region-specific variations of local texture patterns and are boosted using AdaBoost in a framework similar to that proposed by Viola and Jones. Preliminary results obtained on several standard databases show that it competes well with other face detection systems, especially in adverse illumination conditions.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[851] A. Vinciarelli, A. Dielmann, S. Favre, and H. Salamin. Canal9: A database of political debates for analysis of social interactions. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (IEEE International Workshop on Social Signal Processing), pages 1–4, September 2009. Publication Date: 10-12 Sept. 2009. [ bib | DOI | .pdf ]
Automatic analysis of social interactions attracts major attention in the computing community, but relatively few benchmarks are available to researchers active in the domain. This paper presents a new, publicly available corpus of political debates including not only raw data, but also a rich set of socially relevant annotations such as turn-taking (who speaks when and how much), agreement and disagreement between participants, and the role played by each person involved in each debate. The collection includes 70 debates for a total of 43 hours and 10 minutes of material.

Keywords: Report_IX, IM2.IP3, Group Bourlard, inproceedings
[852] J. Dines, J. Yamagishi, and S. King. Measuring the gap between hmm-based asr and tts. In Proceedings of Interspeech, September 2009. [ bib | .pdf ]
The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components, thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems, measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality, and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.

Keywords: speech recognition, speech synthesis, unified models, IM2.AP,Report_VIII
[853] P. Motlicek, S. Ganapathy, and H. Hermansky. Arithmetic coding of sub-band residuals in fdlp speech/audio codec. In 10th Annual Conference of the International Speech Communication Association, pages 2591–2594. ISCA, ISCA 2009, September 2009. [ bib | .pdf ]
A speech/audio codec based on Frequency Domain Linear Prediction (FDLP) exploits auto-regressive modeling to approximate instantaneous energy in critical frequency sub-bands of relatively long input segments. The current version of the FDLP codec operating at 66 kbps has been shown to provide subjective listening quality comparable to state-of-the-art codecs at similar bit-rates, even without employing standard blocks such as entropy coding or simultaneous masking. This paper describes experimental work to increase the compression efficiency of the FDLP codec by employing entropy coding. Unlike the conventional Huffman coding employed in current speech/audio coding systems, we describe an efficient way to exploit arithmetic coding to entropy compress quantized spectral magnitudes of the sub-band FDLP residuals. Such an approach provides an 11% (approximately 3 kbps) bit-rate reduction, compared to the approximately 1 kbps achieved by the Huffman coding algorithm.

Keywords: Arithmetic Coding, Audio Coding, Entropy Coding, Frequency Domain Linear Prediction (FDLP), Huffman Coding, IM2.AP, Report_VIII
[854] Joan-Isaac Biel and D. Gatica-Perez. Wearing a youtube hat: directors, comedians, gurus, and user aggregated behavior. In Proceedings of the 17th ACM International Conference on Multimedia, pages 833–836. ACM, October 2009. [ bib | .pdf ]
While existing studies on YouTube's massive user-generated video content have mostly focused on the analysis of videos, their characteristics, and network properties, little attention has been paid to the analysis of users' long-term behavior as it relates to the roles they self-define and (explicitly or not) play in the site. In this paper, we present a novel statistical analysis of aggregated user behavior in YouTube from the perspective of user categories, a feature that allows people to ascribe to popular roles and to potentially reach certain communities. Using a sample of 270,000 users, we found that a high level of interaction and participation is concentrated on a relatively small, yet significant, group of users, following recognizable patterns of personal and social involvement. Based on our analysis, we also show that by using simple behavioral features from user profiles, people can be automatically classified according to their category with accuracy rates of up to 73%.

Keywords: social networks, user aggregated behavior, video-sharing, YouTube, Report_IX, IM2.IP3, Group Bourlard, inproceedings
[855] Jagannadan Varadarajan and J. M. Odobez. Topic models for scene analysis and abnormality detection. In 9th International Workshop in Visual Surveillance. IEEE, IEEE, October 2009. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[856] Edgar Roman-Rangel, Carlos Pallan, J. M. Odobez, and D. Gatica-Perez. Retrieving ancient maya glyphs with shape context. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. IEEE, October 2009. [ bib | .pdf ]
We introduce an interdisciplinary project for archaeological and computer vision research teams on the analysis of the ancient Maya writing system. Our first task is the automatic retrieval of Maya syllabic glyphs using the Shape Context descriptor. We investigated the effect of several parameters to adapt the shape descriptor to the high complexity of the shapes and their diversity in our data. We propose an improvement in the cost function used to compute similarity between shapes, making it more restrictive and precise. Our results are promising; they are analyzed via standard image retrieval measures.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[857] S. Ganapathy, P. Motlicek, and H. Hermansky. Mdct for encoding residual signals in frequency domain linear prediction. In Audio Engineering Society (AES), 127th Convention, number Preprint 7921 in 127th Convention, New York, USA, October 2009. Audio Engineering Society (AES), Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA;. [ bib | http | .pdf ]
Frequency domain linear prediction (FDLP) uses autoregressive models to represent Hilbert envelopes of relatively long segments of speech/audio signals. Although the basic FDLP audio codec achieves good quality of the reconstructed signal at high bit-rates, there is a need for scaling to lower bit-rates without degrading the reconstruction quality. Here, we present a method for improving the compression efficiency of the FDLP codec by the application of the modified discrete cosine transform (MDCT) for encoding the FDLP residual signals. In the subjective and objective quality evaluations, the proposed FDLP codec provides quality of the reconstructed signal competitive with state-of-the-art audio codecs in the 32-64 kbps range.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
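The MDCT analysis/synthesis chain used for residual coding can be sketched with the common textbook convention: a sine window satisfying the Princen-Bradley condition, 50% overlap, and overlap-add synthesis that cancels the time-domain aliasing:

```python
import numpy as np

def mdct(frame):
    """MDCT of a frame of length 2N -> N coefficients."""
    N = len(frame) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ frame

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    N = len(coeffs)
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ coeffs)

def mdct_roundtrip(x, N=64):
    """50%-overlapped MDCT analysis/synthesis with a sine window on
    both sides; overlap-add cancels the aliasing for interior samples."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    y = np.zeros(len(x))
    for start in range(0, len(x) - 2 * N + 1, N):
        frame = x[start:start + 2 * N] * w
        y[start:start + 2 * N] += w * imdct(mdct(frame))
    return y
```

Samples covered by two overlapping frames are reconstructed exactly, which is the property a codec relies on when quantizing the coefficients in between.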
[858] E. Ricci and J. M. Odobez. Learning large margin likelihood for realtime head pose tracking. In IEEE Int. Conference on Image Processing, Cairo, Egypt. IEEE, October 2009. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[859] R. A. Negoescu, B. Adams, D. Phung, S. Venkatesh, and D. Gatica-Perez. Flickr hypergroups. In Proceedings of the 17th ACM International Conference on Multimedia, October 2009. [ bib | .pdf ]
The amount of multimedia content available online constantly increases, and this leads to problems for users who search for content or similar communities. Users in Flickr often self-organize in user communities through Flickr Groups. These groups are particularly interesting as they are a natural instantiation of the content plus relations social media paradigm. We propose a novel approach to group searching through hypergroup discovery. Starting from roughly 11,000 Flickr groups' content and membership information, we create three different bag-of-words representations for groups, on which we learn probabilistic topic models. Finally, we cast hypergroup discovery as a clustering problem that is solved via probabilistic affinity propagation. We show that the hypergroups so found are generally consistent and can be described through topic-based and similarity-based measures. Our proposed solution could be relatively easily implemented as an application to enrich Flickr's traditional group search.

Keywords: Flickr groups LDA, Report_IX, IM2.IP1, Group Bourlard, inproceedings
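A minimal implementation of affinity propagation (Frey and Dueck's message-passing updates), the clustering step named in the abstract; the damping and iteration count below are illustrative defaults, not the paper's settings:

```python
import numpy as np

def affinity_propagation(S, damping=0.7, n_iter=200):
    """Affinity propagation on a similarity matrix S whose diagonal
    holds the exemplar preferences. Returns exemplar indices and a
    cluster label per point."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(n_iter):
        # responsibilities: r(i,k) = s(i,k) - max_{k'!=k} (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # availabilities: a(i,k) = min(0, r(k,k) + sum max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(Anew).copy()
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels
```

Setting the preferences (diagonal of S) to the median similarity is the usual default and lets the number of clusters emerge from the data.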
[860] S. Ganapathy, S. Thomas, P. Motlicek, and H. Hermansky. Applications of signal analysis using autoregressive models for amplitude modulation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009, WASPAA '09., pages 341–344. IEEE, October 2009. Digital Object Identifier 10.1109/ASPAA.2009.534649. [ bib | http | .pdf ]
Frequency Domain Linear Prediction (FDLP) represents an efficient technique for representing the long-term amplitude modulations (AM) of speech/audio signals using autoregressive models. For the proposed analysis technique, relatively long temporal segments (1000 ms) of the input signal are decomposed into a set of sub-bands. FDLP is applied on each sub-band to model the temporal envelopes. The residual of the linear prediction represents the frequency modulations (FM) in the sub-band signal. In this paper, we present several applications of the proposed AM-FM decomposition technique for a variety of tasks like wide-band audio coding, speech recognition in reverberant environments and robust feature extraction for phoneme recognition.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
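The core FDLP idea, linear prediction on the cosine transform of the signal to model its temporal envelope (the dual of time-domain LP modelling the spectral envelope), can be sketched as follows; sub-band decomposition and the exact transform conventions of the paper are omitted:

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II, computed directly; fine for short signals."""
    N = len(x)
    n = np.arange(N)
    return np.cos(np.pi / N * (n[None, :] + 0.5) * n[:, None]) @ x

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation -> LP coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + a[1:m] @ r[m - 1:0:-1]
        k = -acc / err
        a[1:m + 1] += k * a[:m][::-1].copy()
        err *= (1.0 - k * k)
    return a

def fdlp_envelope(x, order=12):
    """FDLP sketch: LP on the DCT coefficients of x yields an
    all-pole model whose 'spectrum' over [0, pi) traces the temporal
    energy envelope of x."""
    c = dct_ii(x)
    r = np.array([c[:len(c) - i] @ c[i:] for i in range(order + 1)])
    a = levinson(r, order)
    H = np.fft.rfft(a, 2 * len(x))
    return 1.0 / np.abs(H[:len(x)]) ** 2
```

For amplitude-modulated noise, the estimated envelope follows the modulation rather than the noise fine structure.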
[861] D. Korchagin. Out-of-scene av data detection. Idiap-RR Idiap-RR-31-2009, Idiap, P.O. Box 592, CH-1920 Martigny, Switzerland, November 2009. [ bib | .pdf ]
In this paper, we propose a new approach for the automatic audio-based out-of-scene detection of audio-visual data recorded by different cameras, camcorders or mobile phones during social events. All recorded data is clustered into out-of-scene and in-scene datasets based on confidence estimation of cepstral pattern matching against a common master track of the event, recorded by a reference camera. The core of the algorithm is based on perceptual time-frequency analysis and a confidence measure based on distance distribution variance. The results show correct clustering in 100% of cases for a real-life dataset and surpass the performance of cross correlation while imposing lower system requirements.

Keywords: confidence estimation, pattern matching, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, techreport
[862] Dairazalia Sanchez-Cortes, D. Jayagopi, and D. Gatica-Perez. Predicting remote versus collocated group interactions using nonverbal cues. In Proc. Int. Conf. on Multimodal Interfaces, Workshop on Multimodal Sensor-Based Systems and Mobile Phones for Social Computing,, November 2009. [ bib | DOI ]
This paper addresses two problems: firstly, classifying remote and collocated small-group working meetings, and secondly, identifying the remote participant, using in both cases nonverbal behavioral cues. Such classifiers can be used to improve the design of remote collaboration technologies so as to make remote interactions as effective as collocated ones. We hypothesize that the difference in dynamics between collocated and remote meetings is significant and measurable using speech-activity-based nonverbal cues. Our results on a publicly available dataset - the Augmented Multi-Party Interaction with Distance Access (AMIDA) corpus - show that such an approach is promising, although more controlled settings and more data are needed to explore the addressed problems further.

Keywords: Characterizing small groups, Nonverbal behavior, Remote meetings, Report_IX, IM2.IP3, Group Bourlard, inproceedings
[863] D. Korchagin. Out-of-scene av data detection. In Proceedings IADIS International Conference Applied Computing [861], pages 244–248. [ bib | .pdf ]
In this paper, we propose a new approach for the automatic audio-based out-of-scene detection of audio-visual data recorded by different cameras, camcorders or mobile phones during social events. All recorded data is clustered into out-of-scene and in-scene datasets based on confidence estimation of cepstral pattern matching against a common master track of the event, recorded by a reference camera. The core of the algorithm is based on perceptual time-frequency analysis and a confidence measure based on distance distribution variance. The results show correct clustering in 100% of cases for a real-life dataset and surpass the performance of cross correlation while imposing lower system requirements.

Keywords: confidence estimation, pattern matching, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[864] D. Korchagin. Multimodal data flow controller. Idiap-Com Idiap-Com-01-2009, Idiap, P.O. Box 592, CH-1920 Martigny, Switzerland, November 2009. [ bib | .pdf ]
In this paper, we describe a multimodal data flow controller capable of reading from most multichannel sound cards and web cameras, synchronising media streams, acting as a server to stream captured media over TCP in raw format, acting as a client to receive such streams over TCP, and providing a unified interface for online transmission.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[865] C. McCool and S. Marcel. Mobio database for the icpr 2010 face and speech competition. Idiap-Com Idiap-Com-02-2009, Idiap, November 2009. [ bib | .pdf ]
This document presents an overview of the mobile biometry (MOBIO) database. It is written expressly for the face and speech competition organised for the 2010 International Conference on Pattern Recognition.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[866] M. Pronobis and M. Magimai-Doss. Analysis of f0 and cepstral features for robust automatic gender recognition. Idiap-RR Idiap-RR-30-2009, Idiap, November 2009. [ bib | .pdf ]
In this paper, we analyze the applicability of F0 and cepstral features, namely LPCCs, MFCCs and PLPs, for robust Automatic Gender Recognition (AGR). Through gender recognition studies on the BANCA corpus, comprising datasets of varying complexity, we show that the use of voiced speech frames and the modelling of higher spectral detail (i.e. using higher-order cepstral coefficients), along with the use of dynamic features, improve the robustness of the system towards mismatched training and test conditions. Moreover, our study shows that for matched clean training and test conditions and for multi-condition training, the AGR system is less sensitive to the order of cepstral coefficients, and the use of dynamic features gives little-to-no gain. F0 and cepstral features perform equally well under clean conditions; however, under noisy conditions cepstral features yield a more robust system than the F0-based one.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
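A minimal autocorrelation-based F0 estimator of the kind such systems build on (illustrative only; the exact front-end of the report is not specified here):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of a voiced frame by locating the autocorrelation
    peak within a plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag
```

On a clean 200 Hz tone at 16 kHz the peak sits at lag 80, giving F0 = 200 Hz exactly.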
[867] D. Korchagin, P. N. Garner, and J. Dines. Automatic temporal alignment of av data with confidence estimation. Idiap-RR Idiap-RR-40-2009, Idiap, CH-1920 Martigny, Switzerland, December 2009. [ bib | .pdf ]
In this paper, we propose a new approach for the automatic audio-based temporal alignment with confidence estimation of audio-visual data, recorded by different cameras, camcorders or mobile phones during social events. All recorded data is temporally aligned based on ASR-related features with a common master track, recorded by a reference camera, and the corresponding confidence of alignment is estimated. The core of the algorithm is based on perceptual time-frequency analysis with a precision of 10 ms. The results show correct alignment in 99% of cases for a real life dataset and surpass the performance of cross correlation while keeping lower system requirements.

Keywords: pattern matching, reliability estimation, time synchronisation, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, techreport
[868] J. Luo, B. Caputo, and V. Ferrari. Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems 22 (NIPS09). NIPS Foundation, MIT Press, December 2009. [ bib | .pdf ]
Given a corpus of news items consisting of images accompanied by text captions, we want to find out �who�s doing what�, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint �face and pose� model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[869] A. Popescu-Belis, P. Poller, J. Kilgour, M. Flynn, Sebastian Germesin, A. Nanchen, and M. Yazdani. User interface design in a just-in-time retrieval system for meetings. Idiap-RR Idiap-RR-38-2009, Idiap, December 2009. [ bib | .pdf ]
The Automatic Content Linking Device (ACLD) is a just-in-time multimedia retrieval system that monitors and supports the conversation among a small group of people within a meeting. The ACLD retrieves from a repository, at regular intervals, information that might be relevant to the group's activity, and presents it through a graphical user interface (GUI). The repository contains documents from past meetings such as slides or reports along with processed meeting recordings; in parallel, Web searches are run as well. The acceptance by users of such a system depends considerably on the GUI, along with the performance of retrieval. The trade-off between informativeness and unobtrusiveness is studied here through the design of a series of GUIs. The requirements and feedback collected while demonstrating the successive versions show that users vary considerably in their preferences for a given style of interface. After studying two extreme options, a widget vs. a wide-screen UI, we conclude that a modular UI, which can be flexibly structured and resized by users, is the most sensible design for a just-in-time multimedia retrieval system.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[870] Serena Soldo, M. Magimai-Doss, J. P. Pinto, and H. Bourlard. On mlp-based posterior features for template-based asr. Idiap-RR Idiap-RR-37-2009, Idiap, December 2009. [ bib | .pdf ]
We investigate the invariance of posterior features, estimated using an MLP trained on an auxiliary corpus, towards different data conditions, and different distance measures for matching posterior features in the context of template-based ASR. Through ASR studies on an isolated word recognition task, we show that posterior features estimated using an MLP trained on an auxiliary corpus, without any kind of adaptation, can achieve comparable or better performance compared to the case where the MLP is trained on the same corpus as the test set. We also show that, as local scores, weighted symmetric KL-divergence and Bhattacharyya distance yield better systems compared to Hellinger distance, cosine angle, L1-norm, L2-norm, dot product, and cross entropy.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
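The distance measures compared in this report are all simple to compute on posterior vectors; minimal definitions for four of them:

```python
import numpy as np

def sym_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum((p - q) * np.log(p / q)))

def bhattacharyya(p, q):
    """Bhattacharyya distance: -log of the Bhattacharyya coefficient."""
    return float(-np.log(np.sum(np.sqrt(p * q)) + 1e-300))

def hellinger(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def cosine_dist(p, q):
    """One minus the cosine similarity of the two vectors."""
    return 1.0 - float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```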
[871] D. Korchagin. Memoirs of togetherness from audio logs. In Proceedings International ICST Conference on User Centric Media, P.O. Box 592, CH-1920 Martigny, Switzerland, December 2009. [ bib | .pdf ]
In this paper, we propose a new concept for retrieving tempo-social information about moments of togetherness within a social group of people, in the palm of the hand, from social context. The social context is digitised by audio logging on a user-centric device such as a mobile phone. Being asynchronously driven, it allows social events and the parties involved to be logged automatically, and thus helps users to feel at home anywhere, anytime, and to nurture user-to-group relationships. The core of the algorithm is based on perceptual time-frequency analysis via a confidence estimate of dynamic cepstral pattern matching between the audio logs of people within a social group. The results show robust retrieval and surpass the performance of cross correlation while imposing lower system requirements.

Keywords: confidence estimation, pattern matching, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[872] J. P. Pinto, M. Magimai-Doss, and H. Bourlard. Mlp based hierarchical system for task adaptation in asr. In Proceedings of the IEEE workshop on Automatic Speech Recognition and Understanding, pages 365–370, December 2009. [ bib | .pdf ]
We investigate a multilayer perceptron (MLP) based hierarchical approach for task adaptation in automatic speech recognition. The system consists of two MLP classifiers in tandem. A well-trained MLP available off-the-shelf is used at the first stage of the hierarchy. A second MLP is trained on the posterior features estimated by the first, but with a long temporal context of around 130 ms. By using an MLP trained on 250 hours of conversational telephone speech, the hierarchical adaptation approach yields a word error rate of 1.8% on the 600-word Phonebook isolated word recognition task. This compares favorably to the error rate of 4% obtained by the conventional single MLP based system trained with the same amount of Phonebook data that is used for adaptation. The proposed adaptation scheme also benefits from the ability of the second MLP to model the temporal information in the posterior features.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[873] D. Korchagin, P. N. Garner, and J. Dines. Automatic temporal alignment of av data. Idiap-RR Idiap-RR-39-2009, Idiap, December 2009. [ bib | .pdf ]
In this paper, we describe the automatic audio-based temporal alignment of audio-visual data, recorded by different cameras, camcorders or mobile phones during social events like high school concerts. All recorded data is temporally aligned with a common master track, recorded by a reference camera. The core of the algorithm is based on perceptual time-frequency analysis with a precision of 10 ms. The results show correct alignment in 99% of cases for a real life dataset.

Keywords: audio processing, temporal alignment, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, techreport
[874] Lakshmi Saheer, John Dines, Philip N. Garner, and Hui Liang. Implementation of vtln for statistical speech synthesis. Idiap-RR Idiap-RR-32-2010, Idiap, 2010. [ bib | .pdf ]
Vocal tract length normalization (VTLN) is an important feature normalization technique that can be used to perform speaker adaptation when very little adaptation data is available. It was shown earlier that VTLN can be applied to statistical speech synthesis and gives improvements additive to those of CMLLR. This paper presents an EM optimization for estimating more accurate warping factors. The EM formulation helps to embed the feature normalization in the HMM training. This helps in estimating the warping factors more efficiently and enables the use of multiple (appropriate) warping factors for different state clusters of the same speaker.

Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[875] Jagannadan Varadarajan, Remi Emonet, and Jean-Marc Odobez. Probabilistic latent sequential motifs: Discovering temporal activity patterns in video scenes. Idiap-RR Idiap-RR-33-2010, Idiap, 2010. [ bib | .pdf ]
This paper introduces a novel probabilistic activity modeling approach that mines recurrent sequential patterns from documents given as word-time occurrences. In this model, documents are represented as a mixture of sequential activity motifs (or topics) and their starting occurrences. The novelties are threefold. First, unlike previous approaches where topics only modeled the co-occurrence of words at a given time instant, our topics model the co-occurrence and the temporal order in which the words occur within a temporal window. Second, our model accounts for the important case where activities occur concurrently in the document. And third, our method explicitly models with latent variables the starting time of the activities within the documents, enabling the implicit alignment of occurrences of the same pattern during the joint inference of the temporal topics and their starting times. The model and its robustness to the presence of noise have been validated on synthetic data. Its effectiveness is also illustrated in video activity analysis from low-level motion features, where the discovered topics capture frequent patterns that implicitly represent typical trajectories of scene objects.

Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[876] John Dines, Junichi Yamagishi, and Simon King. Measuring the gap between hmm-based asr and tts. Idiap-RR Idiap-RR-34-2010, Idiap, 2010. [ bib | .pdf ]
Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[877] David Imseng and Gerald Friedland. Tuning-robust initialization methods for speaker diarization. Idiap-RR Idiap-RR-35-2010, Idiap, Centre du Parc, Rue Marconi 19, Case Postale 592, CH-1920 Martigny, 2010. [ bib | .pdf ]
This paper investigates a typical Speaker Diarization system regarding its robustness against initialization parameter variation and presents a method to reduce manual tuning of these values significantly. The behavior of an agglomerative hierarchical clustering system is studied to determine which initialization parameters impact accuracy most. We show that the accuracy of typical systems is indeed very sensitive to the values chosen for the initialization parameters and to factors such as the length of the recording. We then present a solution that reduces the system's sensitivity to the initialization values and therefore significantly reduces the need for manual tuning, while at the same time increasing the accuracy of the system. For short meetings extracted from the previous (2006 and 2007) National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluation data, the decrease of the Diarization Error Rate is up to 50% relative. The approach consists of a novel initial parameter estimation method for Speaker Diarization that uses agglomerative clustering with Bayesian Information Criterion (BIC) and Gaussian Mixture Models (GMMs) of frame-based cepstral features (MFCCs). The estimation method leverages the relationship between the optimal value of the seconds of speech data per Gaussian and the duration of the speech data, and is combined with a novel non-uniform initialization method. This approach results in a system that performs better than the current ICSI baseline engine on datasets of the NIST RT evaluations of the years 2006 and 2007.

Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[878] Jagannadan Varadarajan, Remi Emonet, and Jean-Marc Odobez. A sparsity constraint for topic models - application to temporal activity mining. Idiap-RR Idiap-RR-36-2010, Idiap, 2010. [ bib | .pdf ]
We address the mining of sequential activity patterns from document logs given as word-time occurrences. We achieve this using topics that model both the co-occurrence and the temporal order in which words occur within a temporal window. Discovering such topics, which is particularly hard when multiple activities can occur simultaneously, is conducted through the joint inference of the temporal topics and of their starting times, allowing the implicit alignment of occurrences of the same activity in the document. A current issue is that while we would like topic starting times to be represented by sparse distributions, this is not achieved in practice. Thus, in this paper, we propose a method that encourages sparsity by adding regularization constraints on the searched distributions. The constraints can be used with most topic models (e.g. PLSA, LDA) and lead to a simple modified version of the standard EM optimization procedure. The effect of the sparsity constraint on our activity model and the improvement in robustness in the presence of different noises have been validated on synthetic data. Its effectiveness is also illustrated in video activity analysis, where the discovered topics capture frequent patterns that implicitly represent typical trajectories of scene objects.

Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[879] Joel Praveen Pinto, Mathew Magimai.-Doss, and Hervé Bourlard. Hierarchical tandem features for asr in mandarin. Idiap-RR Idiap-RR-39-2010, Idiap, 2010. [ bib | .pdf ]
We apply multilayer perceptron (MLP) based hierarchical Tandem features to large vocabulary continuous speech recognition in Mandarin. Hierarchical Tandem features are estimated using a cascade of two MLP classifiers which are trained independently. The first classifier is trained on perceptual linear predictive coefficients with a 90 ms temporal context. The second classifier is trained using the phonetic class conditional probabilities estimated by the first MLP, but with a relatively longer temporal context of about 150 ms. Experiments on the Mandarin DARPA GALE eval06 dataset show significant reduction (about 7.6% relative) in character error rates by using hierarchical Tandem features over conventional Tandem features.

Keywords: report_X, IM2.IP1, Group Bourlard, techreport
[880] Joan-Isaac Biel and Daniel Gatica-Perez. Vlogcast yourself: Nonverbal behavior and attention in social media. In Proceedings International Conference on Multimodal Interfaces (ICMI-MLMI), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[881] Gelareh Mohammadi, Alessandro Vinciarelli, and Marcello Mortillaro. The voice of personality: Mapping nonverbal vocal behavior into trait attributions. In Proceedings of ACM Multimedia Workshop on Social Signal Processing, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[882] V Murino, M Cristani, and Alessandro Vinciarelli. Socially intelligent surveillance and monitoring: Analysing social dimensions of physical space. In Proceedings of International Workshop on Socially Intelligent Surveillance and Monitoring, pages 51–58, San Francisco, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[883] Alessandro Vinciarelli and Fabio Valente. Social signal processing: Understanding nonverbal communication in social interactions. In Proceedings of Measuring Behavior 2010, Eindhoven (The Netherlands), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[884] Dinesh Babu Jayagopi, Taemie Kim, Alex Pentland, and Daniel Gatica-Perez. Recognizing conversational context in group interaction using privacy-sensitive mobile sensors. In Proceedings of International Conference on Mobile and Ubiquitous Multimedia, Limassol, Cyprus, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[885] Alessandro Vinciarelli, Roderick Murray-Smith, and Hervé Bourlard. Mobile social signal processing: vision and research issues. In Proceedings of the International Workshop on Mobile HCI, pages 513–516, Lisbon, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[886] Katayoun Farrahi and Daniel Gatica-Perez. Mining human location-routines using a multi-level approach to topic modeling. In 2010 IEEE Second International Conference on Social Computing, SIN Symposium, Minneapolis, Minnesota, USA, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[887] Fabio Valente and Alessandro Vinciarelli. Improving speech processing through social signals: Automatic speaker segmentation of political debates using role based turn-taking patterns. In Proceedings of ACM Multimedia Workshop on Social Signal Processing, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[888] Dairazalia Sanchez-Cortes, Oya Aran, Marianne Schmid Mast, and Daniel Gatica-Perez. Identifying emergent leadership in small groups using nonverbal communicative cues. In Proc. ICMI-MLMI '10 International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, Beijing, 2010. ACM New York, NY, USA 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[889] Raul Montoliu and Daniel Gatica-Perez. Discovering human places of interest from multimodal mobile phone data. In Proceedings of 9th International Conference on Mobile and Ubiquitous Multimedia, Limassol, Cyprus, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[890] Trinh-Minh-Tri Do and Daniel Gatica-Perez. By their apps you shall understand them: mining large-scale patterns of mobile phone usage. In The 9th International Conference on Mobile and Ubiquitous Multimedia, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, inproceedings
[891] Andrei Popescu-Belis, Jonathan Kilgour, Peter Poller, Alexandre Nanchen, Erik Boertjes, and Joost de Wit. Automatic content linking: Speech-based just-in-time retrieval for multimedia archives. In Proceedings of the 33rd Annual ACM SIGIR Conference, page 703, 2010. [ bib ]
Keywords: report_X, IM2.IP2, Group Bourlard, inproceedings
[892] Stéphanie Lefèvre and Jean-Marc Odobez. View-based appearance model online learning for 3d deformable face tracking. In Proc. Int. Conf. on Computer Vision Theory and Applications, Angers, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, inproceedings
[893] Mikko Kurimo, William Byrne, John Dines, Philip N. Garner, Matthew Gibson, Yong Guan, Teemu Hirsimäki, Reima Karhila, Simon King, Hui Liang, Keiichiro Oura, Lakshmi Saheer, Matt Shannon, Sayaka Shiota, Jilei Tian, Keiichi Tokuda, Mirjam Wester, Yi-Jian Wu, and Junichi Yamagishi. Personalising speech-to-speech translation in the emime project. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010. Association for Computational Linguistics. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, inproceedings
[894] Jagannadan Varadarajan, Remi Emonet, and Jean-Marc Odobez. A sparsity constraint for topic models - application to temporal activity mining. In NIPS-2010 Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, inproceedings
[895] Fabio Valente, Mathew Magimai.-Doss, Christian Plahl, Suman Ravuri, and Wen Wang. A comparative study of mlp front-ends for mandarin asr. In Proceedings of Interspeech, Japan, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, inproceedings
[896] Alessandro Vinciarelli. Human Behavior Understanding. Springer Verlag, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, book
[897] Alessandro Vinciarelli and Maja Pantic. www.sspnet.eu: A web portal for social signal processing. IEEE Signal Processing Magazine, 27(4):142–144, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group Bourlard, article
[898] Fabio Valente. Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition. Speech Communication, 52(10), 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, article
[899] Fabio Valente. Multi-stream speech recognition based on dempster-shafer combination rule. Speech Communication, 52(3), 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Bourlard, article
[900] J. S. Lee, F. De Simone, and T. Ebrahimi. Video coding based on audio-visual focus of attention. Journal of Visual Communication and Image Representation, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Ebrahimi, article
[901] Apostolos Antonacopoulos, Michael J. Gormish, and Rolf Ingold, editors. Proceedings of the 2010 ACM Symposium on Document Engineering, Manchester, United Kingdom, September 21-24, 2010. ACM, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Ingold, inproceedings
[902] Karim Hadjar and Rolf Ingold. Improving xed for extracting content from arabic pdfs. In Document Analysis Systems, pages 371–376, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Ingold, inproceedings
[903] Florian Verdet, Driss Matrouf, Jean-François Bonastre, and Jean Hennebert. Channel detectors for system fusion in the context of nist lre 2009. In INTERSPEECH, pages 733–736, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Ingold, inproceedings
[904] Dalila Mekhaldi and Denis Lalanne. Multimodal document alignment: Feature-based validation to strengthen thematic links. Journal of Multimedia Processing Technologies, 1(1):30–46, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Ingold, article
[905] Florian Evéquoz, Julien Thomet, and Denis Lalanne. Gérer son information personnelle au moyen de la navigation par facettes. In Conférence Internationale Francophone sur l'Interaction Homme-Machine, IHM '10, pages 41–48. ACM, 2010. [ bib ]
Keywords: report_X, IM2.IP2, Group Ingold, inproceedings
[906] Bruno Dumas, Denis Lalanne, and Rolf Ingold. Description languages for multimodal interaction: a set of guidelines and its illustration with smuiml. Journal on Multimodal User Interfaces, 3(3):237–247, 2010. [ bib ]
Keywords: report_X, IM2.IP2, Group Ingold, article
[907] Pascal Bruegger, Agnes Lisowska, Denis Lalanne, and Beat Hirsbrunner. Enriching the design and prototyping loop: a set of tools to support the creation of activity-based pervasive applications. Journal of Mobile Multimedia, 6(4):339–360, 2010. [ bib ]
Keywords: report_X, IM2.IP2, Group Ingold, article
[908] Matthias Schwaller, Denis Lalanne, and Omar Abou Khaled. Pygmi: creation and evaluation of a portable gestural interface. In NordiCHI, pages 773–776, 2010. [ bib ]
Keywords: report_X, IM2.IP2, Group Ingold, inproceedings
[909] D. Morrison, E. Bruno, and S. Marchand-Maillet. Tagcaptcha: Annotating images with captchas. In ACM MULTIMEDIA 2010 (Demo Program), 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, inproceedings
[910] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus. In 33rd ACM SIGIR, Workshop on Crowdsourcing for Search Evaluation, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, inproceedings
[911] B. Deville, G. Bologna, and T. Pun. Detecting objects and obstacles for visually impaired individuals using visual saliency. In ASSETS 2010, 12th Int. ACM SigAccess Conf. on Computers and Accessibility, Demonstrations Track, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, inproceedings
[912] J. D. Gomez, G. Bologna, and T. Pun. Color-audio encoding interface for visual substitution: See color matlab-based demo. In ASSETS 2010, 12th Int. ACM SigAccess Conf. on Computers and Accessibility, Demonstrations Track, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, inproceedings
[913] F. Crestani, S. Marchand-Maillet, H. H. Chen, E. N. Efthimiadis, and J. Savoy. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010. ACM, New York, USA, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, book
[914] G. Bologna, B. Deville, and T. Pun. Toward local and global perception modules for vision substitution. Neurocomputing, 74(8):1182–1190, 2010. [ bib ]
Keywords: report_X, IM2.IP1, Group Pun, article
[915] S. Pellegrini, A. Ess, M. Tanaskovic, and L. Van Gool. Wrong turn - no dead end: a stochastic pedestrian motion model. In International Workshop on Socially Intelligent Surveillance and Monitoring (SISM), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[916] S. Pellegrini, A. Ess, and L. Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In European Conference on Computer Vision (ECCV), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[917] S. Stalder, H. Grabner, and L. Van Gool. Cascaded confidence filtering for improved tracking-by-detection. In European Conference on Computer Vision (ECCV), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[918] S. Gammeter, T. Quack, D. Tingdahl, and Luc van Gool. Size does matter: improving object recognition and 3d reconstruction with cross-media analysis of image clusters. In European Conference on Computer Vision (ECCV 2010), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[919] N. Razavi, J. Gall, and Luc Van Gool. Backprojection revisited: Scalable multi-view object detection and similarity metrics for detections. In European Conference on Computer Vision, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[920] J. Gall, N. Razavi, and Luc Van Gool. On-line adaption of class-specific codebooks for instance tracking. In British Machine Vision Conference, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[921] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3D SURF for robust three dimensional classification. In Proceedings of the European Conference on Computer Vision, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[922] J. Knopp, M. Prasad, and L. Van Gool. Orientation invariant 3d object classification using hough transform based methods. In Proceedings of the ACM workshop on 3D object retrieval, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[923] G. Veres, H. Grabner, L. Middleton, and L. Van Gool. Automatic workflow monitoring in industrial environments. In Proceedings Asian Conference on Computer Vision (ACCV), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[924] G. Fanelli, A. Yao, P. L. Noel, J. Gall, and L. Van Gool. Hough forest-based facial expression recognition from video sequences. In International Workshop on Sign, Gesture and Activity (SGA) 2010, in conjunction with ECCV 2010, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[925] F. Nater, J. Vangeneugden, H. Grabner, L. Van Gool, and R. Vogels. Discrimination of locomotion direction at different speeds: A comparison between macaque monkeys and algorithms. In ECML Workshop on rare audio-visual cues, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[926] C. Lalos, H. Grabner, L. Van Gool, and T. Varvarigou. Object flow: Learning object displacement. In Proceedings IEEE Workshop on Visual Surveillance, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[927] A. Yao, D. Uebersax, J. Gall, and L. Van Gool. Tracking in broadcast sports. In 32nd Annual Symposium of the German Association for Pattern Recognition, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[928] A. Mansfield, P. Gehler, L. Van Gool, and C. Rother. Visibility maps for improving seam carving. In Media Retargeting Workshop, European Conference on Computer Vision (ECCV), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[929] A. Mansfield, P. Gehler, L. Van Gool, and C. Rother. Scene carving: Scene consistent image retargeting. In European Conference on Computer Vision (ECCV), 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, inproceedings
[930] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, article
[931] G. Fanelli, J. Gall, H. Romsdorfer, T. Weise, and L. Van Gool. A 3-d audio-visual corpus of affective communication. IEEE Transactions on Multimedia, 12(6):591–598, 2010. [ bib ]
Keywords: report_X, IM2.IP3, Group van Gool, article
[932] S. Haegler, P. Wonka, Stefan Mueller Arisona, Luc Van Gool, and P. Müller. Grammar-based encoding of facades. In EGSR, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[933] M. D. Breitenstein, B. Leibe, and Luc Van Gool. Evaluation of agent motion in video: Online tracking-by-detection. In International Conference on Cognitive Systems, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[934] J. Gall, N. Razavi, and L. Van Gool. On-line adaption of class-specific codebooks for instance tracking. In British Machine Vision Conference, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[935] J. Gall, A. Yao, and L. Van Gool. 2d action recognition serves 3d human pose estimation. In European Conference on Computer Vision, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[936] G. Fanelli, J. Gall, H. Romsdorfer, T. Weise, and L. Van Gool. 3d vision technology for capturing multimodal corpora: Chances and challenges. In LREC Workshop on Multimodal Corpora, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[937] Fabian Nater, Helmut Grabner, and Luc Van Gool. Visual abnormal event detection for prolonged independent living. In IEEE Healthcom Workshop on mHealth, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[938] Fabian Nater, Helmut Grabner, and Luc Van Gool. Exploiting simple hierarchies for unsupervised human behavior analysis. In CVPR, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[939] D. Kuettel, M. D. Breitenstein, Luc Van Gool, and V. Ferrari. What's going on? discovering spatio-temporal dependencies in dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[940] A. Yao, J. Gall, and L. Van Gool. A hough transform-based voting framework for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [ bib ]
Keywords: Report_IX, IM2.IP3, Group VanGool, inproceedings
[941] M. Sorci, G. Antonini, J. Cruz Mota, T. Rubin, M. Bierlaire, and J. Ph. Thiran. Modelling human perception of static facial expressions. Image and Vision Computing, 28(5):790–806, 2010. [ bib | DOI ]
Keywords: Face; Facial expression; LTS5, Report_IX, IM2.IP1, Group Thiran, article
[942] S. Koelstra, A. Yazdani, M. Soleymani, C. Muehl, J. S. Lee, A. Nijholt, T. Pun, T. Ebrahimi, and I. Patras. Single trial classification of eeg and peripheral physiological signals for recognition of emotions induced by music videos. In Brain Informatics, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[943] D. Morrison, E. Bruno, and S. Marchand-Maillet. Tagcaptcha: Annotating images with CAPTCHAs. In ACM Multimedia 2010, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, inproceedings
[944] J. Kierkels, M. Soleymani, and T. Pun. Identification of narrative peaks in clips: text features perform best. In VideoCLEF 2009, Cross Language Evaluation Forum (CLEF) Workshop, Post-Conference Proceedings. Springer LNCS, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, incollection
[945] I. Kompatsiaris, S. Marchand-Maillet, S. Marcel, and R. van Zwol. Image and Video Retrieval: Theory and Applications. Springer, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, book
[946] D. Morrison, E. Bruno, and S. Marchand-Maillet. Capturing the semantics of user interaction: A review and case study. In R. Chbeir, Y. Badr, A. Abraham, and A. E. Hassanien, editors, Emergent Web Intelligence: Advanced Information Retrieval. Springer, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, incollection
[947] S. Marchand-Maillet, D. Morrison, E. Szekely, and E. Bruno. Interactive representations of multimodal databases. In H. Bourlard, F. Marques, and J. Ph. Thiran, editors, Multimodal Signal Processing for Human Computer Interaction. Academic Press, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, incollection
[948] Hsin-Hsi Chen, Efthimis N. Efthimiadis, Jacques Savoy, Fabio Crestani, and S. Marchand-Maillet. Proceedings of the ACM-SIGIR 2010 conference. ACM Digital Library, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Pun, book
[949] F. Evéquoz, Julien Thomet, and D. Lalanne. La navigation par facettes appliquée à la gestion de l'information personnelle. In Proceedings of 22ème Conférence Francophone sur l'Interaction Homme-Machine (IHM'10), 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[950] Ilya Boyandin, E. Bertini, and D. Lalanne. Using flow maps to explore migrations over time. In Proceedings of Geospatial Visual Analytics Workshop in conjunction with The 13th AGILE International Conference on Geographic Information Science, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ingold, inproceedings
[951] L. Goldmann, F. De Simone, and T. Ebrahimi. A comprehensive database and subjective evaluation methodology for quality of experience in stereoscopic video. In Proceedings of SPIE, volume 7526, San Jose, California, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[952] L. Goldmann, F. De Simone, and T. Ebrahimi. Impact of acquisition distortion on the quality of stereoscopic images. In Proceedings of International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, Arizona, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[953] F. De Simone, L. Goldmann, D. Filimonov, and T. Ebrahimi. On the limits of perceptually optimized JPEG. In Proceedings of International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, Arizona, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[954] F. De Simone, M. Tagliasacchi, M. Naccari, S. Tubaro, and T. Ebrahimi. A H.264/AVC video database for the evaluation of quality metrics. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2430–2433, Dallas, Texas, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[955] I. Ivanov, P. Vajda, L. Goldmann, J. S. Lee, and T. Ebrahimi. Object-based tag propagation for semi-automatic annotation of images. In Proceedings of the ACM SIGMM International Conference on Multimedia Information Retrieval, pages 497–506, Philadelphia, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[956] P. Vajda, I. Ivanov, L. Goldmann, J. S. Lee, and T. Ebrahimi. 3D object duplicate detection for video retrieval. In Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services, Desenzano del Garda, Italy, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[957] F. De Simone, L. Goldmann, J. S. Lee, T. Ebrahimi, and V. Baroncini. Subjective evaluation of next-generation video compression algorithm: a case study. In Proceedings of SPIE, volume 7798, San Diego, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[958] P. Vajda, I. Ivanov, J. S. Lee, L. Goldmann, and T. Ebrahimi. Propagation of geotags based on object duplicate detection. In Proceedings of SPIE, volume 7798, San Diego, USA, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[959] I. Ivanov, P. Vajda, J. S. Lee, and T. Ebrahimi. Epitome: a social game for photo album summarization. In Proceedings of the International Workshop on Connected Multimedia, Firenze, Italy, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[960] S. Koelstra, A. Yazdani, M. Soleymani, C. Muehl, J. S. Lee, A. Nijholt, T. Pun, T. Ebrahimi, and I. Patras. Single trial classification of EEG and peripheral physiological signals for recognition of emotions induced by music videos. In Proceedings of the International Conference on Brain Informatics, Toronto, Canada, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[961] J. S. Lee, F. De Simone, N. Ramzan, Z. Zhao, E. Kurutepe, T. Sikora, J. Ostermann, E. Izquierdo, and T. Ebrahimi. Subjective evaluation of scalable video coding for content distribution. In Proceedings of the ACM Multimedia International Conference, Firenze, Italy, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[962] S. Buchinger, F. De Simone, E. Hotop, H. Hlavacs, and T. Ebrahimi. Gesture and touch controlled video player interface for mobile devices. In Proceedings of the ACM Multimedia International Conference, Firenze, Italy, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, inproceedings
[963] I. Ivanov, P. Vajda, J. S. Lee, L. Goldmann, and T. Ebrahimi. Geotag propagation in social networks based on user trust model. Multimedia Tools and Applications, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, article
[964] P. Vajda, I. Ivanov, L. Goldmann, J. S. Lee, and T. Ebrahimi. Robust duplicate detection of 2D and 3D objects. International Journal of Multimedia Data Engineering and Management, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Ebrahimi, article
[965] Pierre Dillenbourg and Patrick Jermann. Technology for Classroom Orchestration. In M. S. Khine and I. M. Saleh, editors, New Science of Learning, pages 525–552. Springer Science Business Media, New York, 2010. [ bib | DOI ]
We use different criteria to judge teaching methods and learning environments as researchers and teachers. As researchers, we tend to rely on learning gains measured in controlled conditions. As teachers, the skilled management of classroom constraints results in the impression that a specific design "works well". We describe fourteen design factors related to the metaphors of classroom orchestration and education ecosystems and illustrate their embodiment in three learning environments. These design factors provide a teacher-centric, integrated view of educational technologies in the classroom. We expand this list of factors to include the main constraints that designers should consider to address the difficult methodological issue of generalizing research results about the effectiveness of methods and designs.

Keywords: Educational Technology; Classroom; Orchestration; Ecosystem, Report_IX, IM2.IP2, Group Dillenbourg, incollection
[966] Alexander Sproewitz, Soha Pouya, Stéphane Bonardi, Jesse van den Kieboom, Rico Moeckel, A. Billard, Pierre Dillenbourg, and Auke Ijspeert. Roombots: Reconfigurable Robots for Adaptive Furniture. IEEE Computational Intelligence Magazine, special issue on "Evolutionary and developmental approaches to robotics", 2010. [ bib | DOI ]
Imagine a world in which our furniture moves around like legged robots, interacts with us, and changes shape and function during the day according to our needs. This is the long term vision we have in the Roombots project. To work towards this dream, we are developing modular robotic modules that have rotational degrees of freedom for locomotion as well as active connection mechanisms for runtime reconfiguration. A piece of furniture, e.g. a stool, will thus be composed of several modules that activate their rotational joints together to implement locomotor gaits, and will be able to change shape, e.g. transforming into a chair, by sequences of attachments and detachments of modules. In this article, we first present the project and the hardware we are currently developing. We explore how reconfiguration from a configuration A to a configuration B can be controlled in a distributed fashion. This is done using metamodules (two Roombots modules connected serially) that use broadcast signals and connections to a structured ground to collectively build desired structures without the need of a centralized planner. We then present how locomotion controllers can be implemented in a distributed system of coupled oscillators (one per degree of freedom), similarly to the concept of central pattern generators (CPGs) found in the spinal cord of vertebrate animals. The CPGs are based on coupled phase oscillators to ensure synchronized behavior and have different output filters to allow switching between oscillations and rotations. A stochastic optimization algorithm is used to explore optimal CPG configurations for different simulated Roombots structures.

Keywords: self-reconfiguring modular robots; reconfiguration; adaptive furniture, Report_IX, IM2.IP2, Group Dillenbourg, article
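The distributed CPG described in this abstract (coupled phase oscillators, one per degree of freedom) can be sketched with a minimal simulation. The function name, the Euler integration, and all parameter values below are illustrative assumptions, not taken from the paper:

```python
import math

def cpg_step(phases, amps, freqs, coupling, phase_biases, dt=0.01):
    """One Euler step of a network of coupled phase oscillators:
    d(theta_i)/dt = 2*pi*f_i + sum_j w_ij * sin(theta_j - theta_i - phi_ij).
    The output driving joint i is amps[i] * cos(theta_i)."""
    n = len(phases)
    new_phases = []
    for i in range(n):
        dtheta = 2.0 * math.pi * freqs[i]
        for j in range(n):
            if i != j:
                dtheta += coupling[i][j] * math.sin(
                    phases[j] - phases[i] - phase_biases[i][j])
        new_phases.append(phases[i] + dt * dtheta)
    outputs = [amps[i] * math.cos(new_phases[i]) for i in range(n)]
    return new_phases, outputs
```

With symmetric coupling and zero phase biases, two oscillators started out of phase synchronize, which is the behavior the coupling terms are there to guarantee.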
[967] Khaled Bachour, Frédéric Kaplan, and Pierre Dillenbourg. An Interactive Table for Supporting Participation Balance in Face-to-Face Collaborative Learning. IEEE Transactions on Learning Technologies, 2010. [ bib | DOI | .pdf ]
We describe an interactive table designed for supporting face-to-face collaborative learning. The table, Reflect, addresses the issue of unbalanced participation during group discussions. By displaying on its surface a shared visualization of member participation, Reflect is meant to encourage participants to avoid the extremes of over- and under-participation. We report on a user study that validates some of our hypotheses on the effect the table would have on its users. Namely we show that Reflect leads to more balanced collaboration, but only under certain conditions. We also show different effects the table has on over- and under-participators.

Keywords: Computer-Supported Collaborative Learning; Interactive Furniture; Ubiquitous Computing; Human-Computer Interaction, Report_IX, IM2.IP2, Group Dillenbourg, article
[968] D. Jayagopi and D. Gatica-Perez. Mining group nonverbal conversational patterns using probabilistic topic models. IEEE Transactions on Multimedia, 2010. [ bib | .pdf ]
The automatic discovery of group conversational behavior is a relevant problem in social computing. In this paper, we present an approach to address this problem by defining a novel group descriptor called bag of group-nonverbal-patterns defined on brief observations of group interaction, and by using principled probabilistic topic models to discover topics. The proposed bag of group NVPs allows fusion of individual cues and facilitates the eventual comparison of groups of varying sizes. The use of topic models helps to cluster group interactions and to quantify how different they are from each other in a formal probabilistic sense. Results of behavioral topics discovered on the Augmented Multi-Party Interaction (AMI) meeting corpus are shown to be meaningful using human annotation with multiple observers. Our method facilitates "group behaviour-based" retrieval of group conversational segments without the need of any previous labeling.

Keywords: Report_IX, IM2.IP3, Group Bourlard, article
[969] D. Gatica-Perez and J. M. Odobez. Visual attention, speaking activity, and group conversational analysis in multi-sensor environments. In H. Nakashima, J. Augusto, H. Aghajan (Eds.), Handbook of Ambient Intelligence and Smart Environments. Springer, 2010. [ bib ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, incollection
[970] L. Saheer, P. N. Garner, J. Dines, and H. Liang. VTLN adaptation for statistical speech synthesis. In Proceedings of ICASSP, 2010. [ bib | .pdf ]
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) for synthesis is explored in this paper. VTLN based adaptation requires estimation of a single warping factor, which can be accurately estimated from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors using higher order features is solved by initializing warping factor estimation with the values calculated from lower order features.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
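The single-warping-factor estimation mentioned in this abstract amounts to warping the frequency axis and picking the factor that maximizes the likelihood of the adaptation data. A minimal sketch follows; the piecewise-linear parameterization, knee position, and grid of warp factors are my own illustrative choices, not the paper's:

```python
def warp_frequency(f, alpha, f_nyquist=8000.0, knee_frac=0.875):
    """Piecewise-linear VTLN warp (hypothetical parameterization):
    frequencies below the knee are scaled by alpha; the remainder is
    mapped linearly so that the Nyquist frequency stays fixed."""
    knee = knee_frac * f_nyquist
    if f <= knee:
        return alpha * f
    slope = (f_nyquist - alpha * knee) / (f_nyquist - knee)
    return alpha * knee + slope * (f - knee)

def best_warp(score_fn, alphas=None):
    """Grid search for the single warp factor that maximizes the
    adaptation-data score (e.g. log-likelihood under the model)."""
    if alphas is None:
        alphas = [0.8 + 0.02 * i for i in range(21)]  # 0.80 .. 1.20
    return max(alphas, key=score_fn)
```

Because only one scalar is searched, very little adaptation data suffices to estimate it reliably, which is the point made in the abstract.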
[971] D. Vijayasenan, F. Valente, and H. Bourlard. Multistream speaker diarization beyond two acoustic feature streams. In International Conference on Acoustics, Speech, and Signal Processing, 2010. [ bib | .pdf ]
Speaker diarization systems for meeting data have recently been converging towards multistream approaches. The most common complementary features used in combination with MFCC are Time Delay of Arrival (TDOA) features. Other features have also been proposed, although no improvements on top of MFCC+TDOA systems have been reported. In this work we investigate the combination of other feature sets along with MFCC and TDOA. We discuss issues and problems related to the weighting of four different streams, proposing a solution based on a smoothed version of the speaker error. Experiments are presented on the NIST RT06 meeting diarization evaluation. Results reveal that the combination of four acoustic feature streams yields a 30% relative improvement with respect to the MFCC+TDOA feature combination. To the authors' best knowledge, this is the first successful attempt to improve on the MFCC+TDOA baseline by including other feature streams.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[972] J. P. Pinto. Multilayer Perceptron Based Hierarchical Acoustic Modeling for Automatic Speech Recognition. PhD thesis, Ecole polytechnique fédérale de Lausanne, 2010. Thèse Ecole polytechnique fédérale de Lausanne EPFL, no 4649 (2010), Programme doctoral Génie électrique, Faculté des sciences et techniques de l'ingénieur STI, Institut de génie électrique et électronique IEL (Laboratoire de l'IDIAP LIDIAP). Dir.: Hervé Bourlard. [ bib | .pdf ]
In this thesis, we investigate a hierarchical approach for estimating the phonetic class-conditional probabilities using a multilayer perceptron (MLP) neural network. The architecture consists of two MLP classifiers in cascade. The first MLP is trained in the conventional way using standard acoustic features with a temporal context of around 90 ms. The second MLP is trained on the phonetic class-conditional probabilities (or posterior features) estimated by the first classifier, but with a relatively longer temporal context of around 150-250 ms. The hierarchical architecture is motivated towards exploiting the useful contextual information in the sequence of posterior features, which includes the evolution of the probability values within a phoneme (sub-phonemic) and its transition to/from neighboring phonemes (sub-lexical). As the posterior features are sparse and simple, the second classifier is able to learn the contextual information spanning a context as long as 250 ms. Extensive experiments on the recognition of phonemes on read speech as well as conversational speech show that the hierarchical approach yields significantly higher recognition accuracies. Analysis of the second MLP classifier using Volterra series reveals that it has learned the phonetic-temporal patterns in the posterior feature space, which capture the confusions in phoneme classification at the output of the first classifier as well as the phonotactics of the language as observed in the training data. Furthermore, we show that the second MLP can be simple in terms of the number of model parameters and that it can be trained on less training data. 
The usefulness of the proposed hierarchical acoustic modeling in automatic speech recognition (ASR) is demonstrated using two applications: (a) task adaptation, where the goal is to exploit MLPs trained on large amounts of data and available off-the-shelf for new tasks, and (b) large vocabulary continuous ASR on broadcast news and broadcast conversations in Mandarin. Small vocabulary isolated word recognition and task adaptation studies are performed on the Phonebook database and the large vocabulary speech recognition studies are performed on the DARPA GALE Mandarin database.

Keywords: Report_IX, IM2.IP1, Group Bourlard, phdthesis
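The key ingredient of the hierarchy described in this thesis, feeding the second MLP a long temporal window of the first MLP's posterior features, can be sketched as a context-stacking step. This is a simplified illustration with my own names; with a 10 ms frame shift, k = 10 would span roughly the 210 ms context discussed in the abstract:

```python
def stack_context(posteriors, k):
    """Stack each posterior frame with its +-k neighbouring frames
    (edges padded by repeating the boundary frame) to build the input
    vector of a second-stage classifier."""
    n = len(posteriors)
    stacked = []
    for t in range(n):
        window = []
        for d in range(-k, k + 1):
            idx = min(max(t + d, 0), n - 1)  # clamp at sequence edges
            window.extend(posteriors[idx])
        stacked.append(window)
    return stacked
```

Each stacked vector has (2k+1) times the dimensionality of one posterior frame; because posteriors are sparse and low-dimensional, such long windows remain tractable for the second classifier.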
[973] S. Ba and J. M. Odobez. Multi-person visual focus of attention from head pose and meeting contextual cues. In IEEE Trans. on Pattern Analysis and Machine Intelligence, accepted for publication, November 2009 [334]. IDIAP-RR 08-47. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[974] A. Roy and S. Marcel. Introducing crossmodal biometrics: person identification from distinct audio & visual streams. In IEEE Fourth International Conference on Biometrics: Theory, Applications and Systems, number 4, 2010. [ bib | .pdf ]
Person identification using audio or visual biometrics is a well-studied problem in pattern recognition. In this scenario, both training and testing are done on the same modalities. However, there can be situations where this condition is not valid, i.e. training and testing have to be done on different modalities. This could arise, for example, in covert surveillance. Is there any person-specific information common to both the audio and visual (video-only) modalities which could be exploited to identify a person in such a constrained situation? In this work, we investigate this question in a principled way and propose a framework which can perform this task consistently better than chance, suggesting that such crossmodal biometric information exists.

Keywords: audio and video classification, audio-visual speaker recognition, crossmodal matching, Multimodal biometrics, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[975] R. Bogdan and D. Gatica-Perez. Inferring competitive role patterns in reality TV show through nonverbal analysis. Multimedia Tools and Applications, Special issue on Social Media, 2010. [ bib | .pdf ]
This paper introduces a new facet of social media, namely that depicting social interaction. More concretely, we address this problem from the perspective of nonverbal behavior-based analysis of competitive meetings. For our study, we made use of "The Apprentice" reality TV show, which features a competition for a real, highly paid corporate job. Our analysis is centered around two tasks regarding a person's role in a meeting: predicting the person with the highest status, and predicting the fired candidates. We address this problem by adopting both supervised and unsupervised strategies. The current study was carried out using nonverbal audio cues. Our approach is based only on the nonverbal interaction dynamics during the meeting without relying on the spoken words. The analysis is based on two types of data: individual and relational measures. Results obtained from the analysis of a full season of the show are promising (up to 85.7% of accuracy in the first case and up to 92.8% in the second case). Our approach has been conveniently compared with the Influence Model, demonstrating its superiority.

Keywords: Report_IX, IM2.IP3, Group Bourlard, article
[976] A. Popescu-Belis. Finding without searching. Idiap-Com Idiap-Com-01-2010, Idiap, January 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[977] S. H. K. Parthasarathi, M. Magimai-Doss, H. Bourlard, and D. Gatica-Perez. Evaluating the robustness of privacy-sensitive audio features for speech detection in personal audio log scenarios. In ICASSP 2010, 2010. [ bib | .pdf ]
Personal audio logs are often recorded in multiple environments. This poses challenges for robust front-end processing, including speech/nonspeech detection (SND). Motivated by this, we investigate the robustness of four different privacy-sensitive features for SND, namely energy, zero crossing rate, spectral flatness, and kurtosis. We study early and late fusion of these features in conjunction with modeling temporal context. These combinations are evaluated in mismatched conditions on a dataset of nearly 450 hours. While both yield improvements over individual features, feature combinations generally perform better. Comparisons with a state-of-the-art spectral-based feature set and a privacy-sensitive feature set are also provided.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[978] H. Hung, Y. Huang, G. Friedland, and D. Gatica-Perez. Estimating dominance in multi-party meetings using speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 2010. [ bib | .pdf ]
With the increase in cheap commercially available sensors, recording meetings is becoming an increasingly practical option. With this trend comes the need to summarize the recorded data in semantically meaningful ways. Here, we investigate the task of automatically measuring dominance in small group meetings when only a single audio source is available. Past research has found that speaking length, as a single feature, provides a very good estimate of dominance. For these tasks we use speaker segmentations generated by our automated faster-than-real-time speaker diarization algorithm, where the number of speakers is not known beforehand. From user-annotated data, we analyze how the inherent variability of the annotations affects the performance of our dominance estimation method. We primarily focus on examining how the performance of the speaker diarization and our dominance tasks vary under different experimental conditions and computationally efficient strategies, and how this would impact a practical implementation of such a system. Despite the use of a state-of-the-art speaker diarization algorithm, speaker segments can be noisy. In experiments on almost 5 hours of audio-visual meeting data, our results show that the dominance estimation is robust to increasing diarization noise.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
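The finding cited in this abstract, that speaking length alone is a strong dominance estimate, reduces to accumulating speaking time per diarization label. A minimal sketch, assuming segments are given as (speaker_label, start_sec, end_sec) tuples (the format is my own):

```python
def dominance_ranking(segments):
    """Rank speakers by total speaking time computed from diarization
    output; the top-ranked speaker is the estimated most-dominant one."""
    totals = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    ranking = sorted(totals, key=totals.get, reverse=True)
    return ranking, totals
```

Noisy diarization perturbs the per-speaker totals, but as the abstract notes, the argmax over totals is fairly robust to such noise.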
[979] S. Ganapathy, P. Motlicek, and H. Hermansky. Autoregressive models of amplitude modulations in audio compression. IEEE Transactions on Audio, Speech, and Language Processing, 2010. [ bib | http | .pdf ]
We present a scalable medium bit-rate wide-band audio coding technique based on frequency domain linear prediction (FDLP). FDLP is an efficient method for representing the long-term amplitude modulations of speech/audio signals using autoregressive models. For the proposed audio codec, relatively long temporal segments (1000 ms) of the input audio signal are decomposed into a set of critically sampled sub-bands using a quadrature mirror filter (QMF) bank. The technique of FDLP is applied on each sub-band to model the sub-band temporal envelopes. The residual of the linear prediction, which represents the frequency modulations in the sub-band signal [1], is encoded and transmitted along with the envelope parameters. These steps are reversed at the decoder to reconstruct the signal. The proposed codec utilizes a simple signal-independent, non-adaptive compression mechanism for a wide class of speech and audio signals. The subjective and objective quality evaluations show that the reconstruction signal quality for the proposed FDLP codec compares well with the state-of-the-art audio codecs in the 32-64 kbps range.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
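The FDLP principle behind this entry, applying all-pole (linear prediction) modeling to the DCT of a long temporal segment so that the resulting AR "spectrum" approximates the temporal envelope, can be sketched numerically. This is an illustration of the principle only, not the codec: it uses a naive O(N^2) DCT-II and a textbook Levinson-Durbin recursion, with function names of my own choosing:

```python
import math

def dct_ii(x):
    """Naive DCT-II; FDLP applies linear prediction in this domain."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation -> LPC coefficients."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def fdlp_envelope(signal, order):
    """Approximate the temporal envelope of `signal` by all-pole modeling
    of its DCT coefficients; the AR power spectrum, evaluated over
    [0, pi), maps back onto the time axis of the segment."""
    c = dct_ii(signal)
    n = len(c)
    r = [sum(c[t] * c[t + lag] for t in range(n - lag))
         for lag in range(order + 1)]
    a, err = levinson(r, order)
    env = []
    for t in range(n):
        w = math.pi * t / n
        re = sum(a[j] * math.cos(w * j) for j in range(order + 1))
        im = sum(-a[j] * math.sin(w * j) for j in range(order + 1))
        env.append(err / (re * re + im * im))
    return env
```

A localized burst of energy in the time domain shows up as a sinusoidal component in the DCT domain, so the fitted AR model places a pole there and the envelope peaks near the burst's position.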
[980] Afsaneh Asaei, B. Picart, and H. Bourlard. Analysis of phone posterior feature space exploiting class specific sparsity and mlp-based similarity measure. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[981] J. P. Pinto, G. S. V. S. Sivaram, M. Magimai-Doss, H. Hermansky, and H. Bourlard. Analysis of mlp based hierarchical phoneme posterior probability estimator. IEEE Transactions on Audio, Speech, and Language Processing, 2010. [ bib | .pdf ]
We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this hierarchical setup, the first MLP classifier is trained using standard acoustic features. The second MLP is trained using the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments, and the analysis of the trained second MLP using Volterra series, we show that (a) the hierarchical system yields higher phoneme recognition accuracies - an absolute improvement of 3.5% and 9.3% on TIMIT and CTS respectively - over the conventional single MLP based system, (b) there exists useful information in the temporal trajectories of the posterior feature space, spanning around 230 ms of context, (c) the second MLP learns the phonetic temporal patterns in the posterior features, which include the phonetic confusions at the output of the first MLP as well as the phonotactics of the language as observed in the training data, and (d) the second MLP classifier requires fewer parameters and can be trained using less training data.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[982] Venkatesh Bala Subburaman and S. Marcel. An alternative scanning strategy to detect faces. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, 2010. [ bib | .pdf ]
The sliding window approach is the most widely used technique to detect faces in an image. Usually a classifier is applied on a regular grid and, to speed up the scanning, the grid spacing is increased, which increases the number of missed detections. In this paper we propose an alternative scanning method which minimizes the number of misses, while improving the speed of detection. To achieve this we use an additional classifier that predicts the bounding box of a face within a local search area. Then a face/non-face classifier is used to verify the presence or absence of a face. We propose a new combination of binary features, which we term u-Ferns, for bounding box estimation, which performs comparably to or better than former techniques. Experimental evaluation on a benchmark database shows that we can achieve a 15-30% improvement in detection rate or speed when compared to the standard scanning technique.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[983] P. Motlicek, P. N. Garner, M. Guillemot, and Vincent Bozzo. AMIDA/Klewel mini-project. Idiap-RR Idiap-RR-03-2010, Idiap, Rue Marconi 19, Martigny, January 2010. [ bib | .pdf ]
The goal of the AMIDA mini-project is to transfer some of the technologies developed within the AMIDA project to be used by a Klewel retrieval system. More specifically, the main focus is to develop a speech-to-text application based on the AMIDA Automatic Speech Recognition (ASR) system which could potentially be implemented by Klewel in their conference webcasting system. First, this document describes the experimental setup and results achieved in the project, devoted to the automatic processing of real lecture recordings provided by Klewel. Then, a demonstrator, an application created for demonstrating ASR results, is described.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[984] M. Yazdani and A. Popescu-Belis. A random walk framework to compute textual semantic similarity: a unified model for three benchmark tasks. In Proceedings of the 4th IEEE International Conference on Semantic Computing (ICSC 2010), Carnegie Mellon University, Pittsburgh, PA, USA, 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[985] Oya Aran, H. Hung, and D. Gatica-Perez. A multimodal corpus for studying dominance in small group conversations. In LREC workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, Malta, May 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[986] P. Motlicek, S. Ganapathy, H. Hermansky, and H. Garudadri. Wide-band audio coding based on frequency domain linear prediction. EURASIP Journal on Audio Speech and Music Processing, 2010(856280):14, February 2010. Special Issue: Scalable Audio-Content Analysis. [ bib | DOI | .html | .pdf ]
We revisit an original concept of speech coding in which the signal is separated into the carrier modulated by the signal envelope. A recently developed technique, called frequency-domain linear prediction (FDLP), is applied for the efficient estimation of the envelope. The processing in the temporal domain allows for a straightforward emulation of the forward temporal masking. This, combined with an efficient nonuniform sub-band decomposition and application of noise shaping in the spectral domain instead of the temporal domain (a technique to suppress artifacts in tonal audio signals), yields a codec that does not rely on the linear speech production model but rather uses the well-accepted concept of frequency-selective auditory perception. As such, the codec is not only specific for coding speech but also well suited for coding other important acoustic signals such as music and mixed content. The quality of the proposed codec at 66 kbps is evaluated using objective and subjective quality assessments. The evaluation indicates competitive performance with the MPEG codecs operating at similar bit rates.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[987] A. Pronobis, J. Luo, and Barbara Caputo. The more you learn, the less you store: Memory-controlled incremental SVM for visual place recognition. Image and Vision Computing, February 2010. [ bib | DOI | .pdf ]
The capability to learn from experience is a key property for autonomous cognitive systems working in realistic settings. To this end, this paper presents an SVM-based algorithm, capable of learning model representations incrementally while keeping memory requirements under control. We combine an incremental extension of SVMs with a method reducing the number of support vectors needed to build the decision function without any loss in performance, introducing a parameter which permits a user-set trade-off between performance and memory. The resulting algorithm is able to achieve the same recognition results as the original incremental method while reducing the memory growth. Our method is especially well suited for autonomous systems in realistic settings. We present experiments on two common scenarios in this domain: adaptation in the presence of dynamic changes and transfer of knowledge between two different autonomous agents, focusing in both cases on the problem of visual place recognition applied to mobile robot topological localization. Experiments in both scenarios clearly show the power of our approach.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[988] A. Roy and S. Marcel. Visual processing-inspired fern-audio features for noise-robust speaker verification. In ACM 25th Symposium on Applied Computing, 2010, Sierre, Switzerland. Association for Computing Machinery, March 2010. [ bib | .pdf ]
In this paper, we consider the problem of speaker verification as a two-class object detection problem in computer vision, where the object instances are 1-D short-time spectral vectors obtained from the speech signal. More precisely, we investigate the general problem of speaker verification in the presence of additive white Gaussian noise, which we consider as analogous to visual object detection under varying illumination conditions. Inspired by their recent success in illumination-robust object detection, we apply a certain class of binary-valued pixel-pair based features called Ferns for noise-robust speaker verification. Intensive experiments on a benchmark database according to a standard evaluation protocol have shown the advantage of the proposed features in the presence of moderate to extremely high amounts of additive noise.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[989] G. Garau and H. Bourlard. Using audio and visual cues for speaker diarisation initialisation. In International Conference on Acoustics, Speech and Signal Processing, March 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[990] A. Roy, M. Magimai-Doss, and S. Marcel. Boosted binary features for noise-robust speaker verification. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010. [ bib | .pdf ]
The standard approach to speaker verification is to extract cepstral features from the speech spectrum and model them by generative or discriminative techniques. We propose a novel approach where a set of client-specific binary features carrying maximal discriminative information specific to the individual client are estimated from an ensemble of pair-wise comparisons of frequency components in magnitude spectra, using the AdaBoost algorithm. The final classifier is a simple linear combination of these selected features. Experiments on the XM2VTS database strictly according to a standard evaluation protocol have shown that although the proposed framework yields comparatively lower performance on clean speech, it significantly outperforms the state-of-the-art MFCC-GMM system in mismatched conditions with training on clean speech and testing on speech corrupted by four types of additive noise from the standard Noisex-92 database.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[991] D. Korchagin, P. N. Garner, and J. Dines. Automatic temporal alignment of av data with confidence estimation. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing [867]. [ bib | .pdf ]
In this paper, we propose a new approach for the automatic audio-based temporal alignment, with confidence estimation, of audio-visual data recorded by different cameras, camcorders or mobile phones during social events. All recorded data is temporally aligned, based on ASR-related features, with a common master track recorded by a reference camera, and the corresponding confidence of alignment is estimated. The core of the algorithm is based on perceptual time-frequency analysis with a precision of 10 ms. The results show correct alignment in 99% of cases for a real-life dataset and surpass the performance of cross-correlation while keeping system requirements lower.

Keywords: pattern matching, reliability estimation, time synchronization, time-frequency analysis, Report_IX, IM2.IP1, Group Bourlard, inproceedings
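The alignment task described here amounts to maximizing a similarity measure between feature streams over candidate lags and deriving a confidence from how dominant the best peak is. A simplified sketch over 1-D feature sequences, using normalized cross-correlation (the paper's method outperforms plain cross-correlation; this code, including the peak-ratio confidence, is my own baseline illustration):

```python
def align_offset(master, slave, max_lag):
    """Estimate the lag of `slave` w.r.t. `master` by maximizing the
    normalized cross-correlation of their feature streams; return the
    best lag, its score, and a best-to-second-peak confidence ratio."""
    def ncc(lag):
        pairs = [(master[t], slave[t - lag]) for t in range(len(master))
                 if 0 <= t - lag < len(slave)]
        if len(pairs) < 2:
            return 0.0
        mx = sum(a for a, _ in pairs) / len(pairs)
        my = sum(b for _, b in pairs) / len(pairs)
        num = sum((a - mx) * (b - my) for a, b in pairs)
        dx = sum((a - mx) ** 2 for a, _ in pairs) ** 0.5
        dy = sum((b - my) ** 2 for _, b in pairs) ** 0.5
        return num / (dx * dy) if dx > 0 and dy > 0 else 0.0

    scores = {lag: ncc(lag) for lag in range(-max_lag, max_lag + 1)}
    best = max(scores, key=scores.get)
    # confidence: ratio of the best peak to the best score away from it
    second = max((v for k, v in scores.items() if abs(k - best) > 1),
                 default=0.0)
    confidence = scores[best] / second if second > 0 else float('inf')
    return best, scores[best], confidence
```

A high confidence ratio indicates a single sharp peak; values near 1 flag ambiguous alignments that would need manual checking.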
[992] Gokul Chittaranjan and H. Hung. Are you a werewolf? detecting deceptive roles and outcomes in a conversational role-playing game. In IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010. [ bib | .pdf ]
This paper addresses the task of automatically detecting outcomes of social interaction patterns, using non-verbal audio cues in competitive role-playing games (RPGs). For our experiments, we introduce a new data set which features 3 hours of audio-visual recordings of the popular "Are you a Werewolf?" RPG. Two problems are approached in this paper: detecting lying or suspicious behavior using non-verbal audio cues in a social context, and predicting participants' decisions in a game-day by analyzing speaker turns. Our best classifier exhibits a performance improvement of 87% over the baseline for detecting deceptive roles. Also, we show that speaker-turn-based features can be used to determine the outcomes in the initial stages of the game, when the group is large.

Keywords: deception, Nonverbal behavior, role analysis, Report_IX, IM2.IP3, Group Bourlard, inproceedings
[993] D. Imseng and G. Friedland. An adaptive initialization method for speaker diarization based on prosodic features. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4946–4949, March 2010. [ bib | .pdf ]
The following article presents a novel, adaptive initialization scheme that can be applied to most state-of-the-art Speaker Diarization algorithms, i.e. algorithms that use agglomerative hierarchical clustering with Bayesian Information Criterion (BIC) and Gaussian Mixture Models (GMMs) of frame-based cepstral features (MFCCs). The initialization method is a combination of the recently proposed "adaptive seconds per Gaussian" (ASPG) method and a new pre-clustering and number-of-initial-clusters estimation method based on prosodic features. The presented initialization method has two important advantages. First, the method requires no manual tuning and is robust against file length and speaker count variations. Second, the method outperforms our previously used initialization methods on all benchmark files that were presented in the 2006, 2007, and 2009 NIST Rich Transcription (RT) evaluations and results in a Diarization Error Rate (DER) improvement of up to 67% (relative).

Keywords: Gaussian Mixture Models, Prosodic features, Speaker Diarization, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[994] H. Liang, J. Dines, and L. Saheer. A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for hmm-based speech synthesis. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4598–4601, March 2010. [ bib | .pdf ]
The EMIME project aims to build a personalized speech-to-speech translator, such that spoken input of a user in one language is used to produce spoken output that still sounds like the user's voice, but in another language. This distinctiveness makes unsupervised cross-lingual speaker adaptation one key to the project's success. So far, research has been conducted into the unsupervised and cross-lingual cases separately, by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though automatically obtained transcriptions have a very high phoneme error rate.

Keywords: decision tree marginalization, HMM state mapping, unsupervised cross-lingual speaker adaptation, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[995] J. Luo, F. Orabona, Marco Fornoni, B. Caputo, and Nicolo Cesa-Bianchi. Om-2: An online multi-class multi-kernel learning algorithm. Idiap-RR Idiap-RR-06-2010, Idiap, April 2010. [ bib | .pdf ]
Efficient learning from massive amounts of information is a hot topic in computer vision. Available training sets contain many examples with several visual descriptors, a setting in which current batch approaches are typically slow and do not scale well. In this work we introduce a theoretically motivated and efficient online learning algorithm for the Multi Kernel Learning (MKL) problem. For this algorithm we prove a theoretical bound on the number of multiclass mistakes made on any arbitrary data sequence. Moreover, we empirically show that its performance is on par with, or better than, standard batch MKL (e.g. SILP, SimpleMKL) algorithms.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[996] A. Roy and S. Marcel. Crossmodal matching of speakers using lip and voice features in temporally non-overlapping audio and video streams. In 20th International Conference on Pattern Recognition, Istanbul, Turkey. International Association for Pattern Recognition (IAPR), April 2010. [ bib | .pdf ]
Person identification using audio (speech) and visual (facial appearance, static or dynamic) modalities, either independently or jointly, is a thoroughly investigated problem in pattern recognition. In this work, we explore a novel task: person identification in a cross-modal scenario, i.e., matching the speaker in an audio recording to the same speaker in a video recording, where the two recordings have been made during different sessions, using speaker-specific information which is common to both the audio and video modalities. Several recent psychological studies have shown how humans can indeed perform this task with an accuracy significantly higher than chance. Here we propose two systems which can solve this task comparably well, using purely pattern recognition techniques. We hypothesize that such systems could be put to practical use in multimodal biometric and surveillance systems.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[997] P. Motlicek and F. Valente. Application of out-of-language detection to spoken-term detection. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, April 2010. [ bib | .pdf ]
This paper investigates the detection of English spoken terms in a conversational multi-language scenario. The speech is processed using a large vocabulary continuous speech recognition system. The recognition output is represented in the form of word recognition lattices, which are then used to search for the required terms. Due to potential multi-lingual speech segments at the input, the spoken term detection system is combined with a module performing out-of-language detection to adjust its confidence scores. First, experimental results of spoken term detection are provided on the conversational telephone speech database distributed by NIST in 2006. Then, the system is evaluated on a multi-lingual database with and without employment of the out-of-language detection module, where we are only interested in detecting English terms (stored in the index database). Several strategies to combine these two systems in an efficient way are proposed and evaluated. Around 7% relative improvement over a stand-alone STD system is achieved.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[998] P. N. Garner and J. Dines. Tracter: A lightweight dataflow framework. Idiap-RR Idiap-RR-10-2010, Idiap, May 2010. [ bib | .pdf ]
Tracter is introduced as a dataflow framework particularly useful for speech recognition. It is designed to work on-line in real-time as well as off-line, and is the feature extraction means for the Juicer transducer-based decoder. This paper places Tracter in context amongst the dataflow literature and other commercial and open source packages. Some design aspects and capabilities are discussed. Finally, a fairly large processing graph incorporating voice activity detection and feature extraction is presented as an example of Tracter's capabilities.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[999] Joan-Isaac Biel and Daniel Gatica-Perez. Voices of vlogging. In Proc. AAAI Int. Conf. on Weblogs and Social Media (ICWSM), Washington DC, May 2010. [ bib | .pdf ]
Vlogs have rapidly evolved from the “chat from your bedroom” format to a highly creative form of expression and communication. However, despite the high popularity of vlogging, automatic analysis of conversational vlogs has not been attempted in the literature. In this paper, we present a novel analysis of conversational vlogs based on the characterization of vloggers' nonverbal behavior. We investigate the use of four nonverbal cues extracted automatically from the audio channel to measure the behavior of vloggers and explore the relation to their degree of popularity and that of their videos. Our study is validated on over 2200 videos and 150 hours of data, and shows that one nonverbal cue (speaking time) is correlated with levels of popularity with a medium-size effect.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1000] Harry Bunt, Jan Alexandersson, J. Carletta, Jae-Woong Choe, Alex Fang, Koiti Hasida, Kiyong Lee, Volha Petukhova, A. Popescu-Belis, Laurent Romary, Claudia Soria, and David Traum. Towards a standard for dialogue act annotation. In 7th International Conference on Language Resources and Evaluation, May 2010. [ bib | .html | .pdf ]
This paper describes an ISO project which aims at developing a standard for annotating spoken and multimodal dialogue with semantic information concerning the communicative functions of utterances, the kind of semantic content they address, and their relations with what was said and done earlier in the dialogue. The project, ISO 24617-2 "Semantic annotation framework, Part 2: Dialogue acts", is currently at DIS stage. The proposed annotation schema distinguishes 9 orthogonal dimensions, allowing each functional segment in dialogue to have a function in each of these dimensions, thus accounting for the multifunctionality that utterances in dialogue often have. A number of core communicative functions is defined in the form of ISO data categories, available at http://semantic-annotation.uvt.nl/dialogue-acts/iso-datcats.pdf; they are divided into "dimension-specific" functions, which can be used only in a particular dimension, such as Turn Accept in the Turn Management dimension, and "general-purpose" functions, which can be used in any dimension, such as Inform and Request. An XML-based annotation language, "DiAML" is defined, with an abstract syntax, a semantics, and a concrete syntax.

Keywords: dialogue, semantics, Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1001] Trinh-Minh-Tri Do and Thierry Artieres. Neural conditional random fields. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 177–184. JMLR: W&CP, May 2010. [ bib | .pdf ]
We propose a non-linear graphical model for structured prediction. It combines the power of deep neural networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable probabilistic model that we apply to signal labeling tasks.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1002] S. Marcel, C. McCool, Pavel Matejka, Timo Ahonen, and Jan Cernocky. Mobile biometry (MOBIO) face and speaker verification evaluation. Idiap-RR Idiap-RR-09-2010, Idiap, rue Marconi 19, May 2010. [ bib | .pdf ]
This paper evaluates the performance of face and speaker verification techniques in the context of a mobile environment. The mobile environment was chosen as it provides a realistic and challenging test-bed for biometric person verification techniques to operate. For instance the audio environment is quite noisy and there is limited control over the illumination conditions and the pose of the subject for the video. To conduct this evaluation, a part of a database captured during the “Mobile Biometry” (MOBIO) European Project was used. In total there were nine participants in the evaluation who submitted a face verification system and five participants who submitted speaker verification systems. The nine face verification systems all varied significantly in terms of both verification algorithms and face detection algorithms. Several systems used the OpenCV face detector while the better systems used proprietary software for the task of face detection. This ended up making the evaluation of verification algorithms challenging. The five speaker verification systems were based on one of two paradigms: a Gaussian Mixture Model (GMM) or Support Vector Machine (SVM) paradigm. In general the systems based on the SVM paradigm performed better than those based on the GMM paradigm.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1003] Oya Aran and Lale Akarun. A multi-class classification strategy for Fisher scores: Application to signer-independent sign language recognition. Pattern Recognition, 43(5):1776–1788, May 2010. [ bib | DOI | .pdf ]
Fisher kernels combine the powers of discriminative and generative classifiers by mapping the variable-length sequences to a new fixed-length feature space, called the Fisher score space. The mapping is based on a single generative model and the classifier is intrinsically binary. We propose a multi-class classification strategy that applies a multi-class classification on each Fisher score space and combines the decisions of multi-class classifiers. We experimentally show that the Fisher scores of one class provide discriminative information for the other classes as well. We compare several multi-class classification strategies for Fisher scores generated from the hidden Markov models of sign sequences. The proposed multi-class classification strategy increases the classification accuracy in comparison with the state-of-the-art strategies based on combining binary classifiers. To reduce the computational complexity of the Fisher score extraction and the training phases, we also propose a score space selection method and show that similar or even higher accuracies can be obtained by using only a subset of the score spaces. Based on the proposed score space selection method, a signer adaptation technique is also presented that does not require any re-training.

Keywords: Report_IX, IM2.IP1, Group Bourlard, article
[1004] F. Orabona, J. Luo, and B. Caputo. Online-batch strongly convex multi kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2010. [ bib | .pdf ]
Several object categorization algorithms use kernel methods over multiple cues, as they offer a principled approach to combine multiple cues, and to obtain state-of-the-art performance. A general drawback of these strategies is the high computational cost during training, which prevents their application to large-scale problems. They also do not provide theoretical guarantees on their convergence rate. Here we present a Multiclass Multi Kernel Learning (MKL) algorithm that obtains state-of-the-art performance in a considerably lower training time. We generalize the standard MKL formulation to introduce a parameter that allows us to decide the level of sparsity of the solution. Thanks to this new setting, we can directly solve the problem in the primal formulation. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are enough to reach good solutions. Experiments on three standard benchmark databases support our claims.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1005] Afsaneh Asaei, H. Bourlard, and B. Picart. Investigation of kNN classifier on posterior features towards application in automatic speech recognition. Idiap-RR Idiap-RR-11-2010, Idiap, June 2010. [ bib | .pdf ]
Class posterior distributions can be used to classify directly or as intermediate features, which can be further exploited in different classifiers (e.g., Gaussian Mixture Models, GMM) towards improving speech recognition performance. In this paper we examine the possibility of using a kNN classifier to perform local phonetic classification of class posterior distributions extracted from acoustic vectors. In that framework, we also propose and evaluate a new kNN metric based on the relative angle between feature vectors to define the nearest neighbors. This idea is inspired by the orthogonality characteristic of the posterior features. To fully exploit this attribute, kNN is used in two main steps: (1) the distance is computed as the cosine of the relative angle between the test vector and the training vector, and (2) the nearest neighbors are defined as the samples within a specific relative angle to the test data; test samples which do not have enough labels in such a hyper-cone are considered as uncertainties and left undecided. This approach is evaluated on the TIMIT database and compared to other metrics already used in the literature for measuring the similarity between posterior probabilities. Based on our experiments, the proposed approach yields 78.48% frame-level accuracy while marking 15.17% of the feature space as uncertain.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1006] H. Hung and D. Gatica-Perez. Estimating cohesion in small groups using audio-visual nonverbal behavior. Idiap-RR Idiap-RR-12-2010, Idiap, June 2010. [ bib | .pdf ]
Cohesiveness in teams is an essential part of ensuring the smooth running of task-oriented groups. Research in social psychology and management has shown that good cohesion in groups correlates with team effectiveness or productivity, so automatically estimating group cohesion for team training can be a useful tool. This paper addresses the problem of analyzing group behavior within the context of cohesion. Four hours of audio-visual group meeting data were used for collecting annotations on the cohesiveness of 4-participant teams. We propose a series of audio and video features, which are inspired by findings in the social sciences literature. Our study is validated on a set of 61 two-minute meeting segments which showed high agreement amongst human annotators who were asked to identify meetings with high or low cohesion.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1007] A. Popescu-Belis, J. Kilgour, A. Nanchen, and P. Poller. The ACLD: Speech-based just-in-time retrieval of multimedia documents and websites. Idiap-RR Idiap-RR-26-2010, Idiap, July 2010. [ bib | .pdf ]
The Automatic Content Linking Device (ACLD) is a just-in-time retrieval system that monitors an ongoing conversation or a monologue and enriches it with potentially related documents, including transcripts of past meetings, from local repositories or from the Internet. The linked content is displayed in real-time to the participants in the conversation, or to users watching a recorded conversation or talk. The system can be demonstrated in both settings, using real-time automatic speech recognition (ASR) or replaying offline ASR, via a flexible user interface that displays results and provides access to the content of past meetings and documents.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1008] N. Kiukkonen, J. Blom, O. Dousse, Daniel Gatica-Perez, and J. Laurila. Towards rich mobile phone datasets: Lausanne data collection campaign. In Proc. ACM Int. Conf. on Pervasive Services (ICPS), Berlin, July 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1009] L. Saheer, P. N. Garner, and J. Dines. Study of Jacobian normalization for VTLN. Idiap-RR Idiap-RR-25-2010, Idiap, July 2010. [ bib | .pdf ]
The divergence of the theory and practice of vocal tract length normalization (VTLN) is addressed, with particular emphasis on the role of the Jacobian determinant. VTLN is placed in a Bayesian setting, which brings in the concept of a prior on the warping factor. The form of the prior, together with acoustic scaling and numerical conditioning are then discussed and evaluated. It is concluded that the Jacobian determinant is important in VTLN, especially for the high dimensional features used in HMM based speech synthesis, and difficulties normally associated with the Jacobian determinant can be attributed to prior and scaling.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1010] R. A. Negoescu and D. Gatica-Perez. Modeling and understanding flickr communities through topic-based analysis. Idiap-RR Idiap-RR-19-2010, Idiap, July 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1011] R. A. Negoescu, Alexander Loui, and D. Gatica-Perez. Kodak moments and flickr diamonds: How users shape large-scale media. Idiap-RR Idiap-RR-20-2010, Idiap, July 2010. [ bib | .pdf ]
In today's age of digital multimedia deluge, a clear understanding of the dynamics of online communities is crucial. Users have abandoned their role of passive consumers and are now the driving force behind large-scale media repositories, whose dynamics and shaping factors are not yet fully understood. In this paper we present a novel human-centered analysis of two major photo sharing websites, Flickr and Kodak Gallery. On a combined dataset of over 5 million tagged photos, we investigate fundamental differences and similarities at the level of tag usage and propose a joint probabilistic topic model to provide further insight into semantic differences between the two communities. Our results show that the effects of the users' motivations and needs can be strongly observed in this large-scale data, in the form of what we call Kodak Moments and Flickr Diamonds. They are an indication that system designers should carefully take into account the target audience and its needs.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1012] R. A. Negoescu and D. Gatica-Perez. Flickr groups: Multimedia communities for multimedia analysis. Idiap-RR Idiap-RR-18-2010, Idiap, July 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1013] D. Vijayasenan, F. Valente, and H. Bourlard. An information theoretic combination of MFCC and TDOA features for speaker diarization. Idiap-RR Idiap-RR-22-2010, Idiap, July 2010. [ bib | .pdf ]
This work describes a novel system for speaker diarization of meeting recordings based on the combination of acoustic features (MFCC) and Time Delay of Arrival (TDOA) features. The first part of the paper analyzes differences between MFCC and TDOA features, which possess completely different statistical properties. When Gaussian Mixture Models are used, experiments reveal that the diarization system is sensitive to the different recording scenarios (i.e. meeting rooms with a varying number of microphones). In the second part, a new multistream diarization system is proposed, extending previous work on Information Theoretic diarization. Both speaker clustering and speaker realignment steps are discussed; contrary to current systems, the proposed method avoids performing feature combination by averaging log-likelihood scores. Experiments on meeting data reveal that the proposed approach outperforms the GMM based system when the recording is done with a varying number of microphones.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1014] D. Vijayasenan, F. Valente, and H. Bourlard. Advances in fast multistream diarization based on the information bottleneck framework. Idiap-RR Idiap-RR-23-2010, Idiap, July 2010. [ bib | .pdf ]
Multistream diarization is an effective way to improve diarization performance, MFCC and Time Delay Of Arrival (TDOA) features being the most commonly used. This paper extends our previous work on information bottleneck diarization, aiming to include a large number of features besides MFCC and TDOA while keeping computational costs low. At first, HMM/GMM and IB systems are compared in the case of two and four feature streams and an analysis of errors is performed. Results on a dataset of 17 meetings show that, in spite of comparable oracle performances, the IB system is more robust to feature weight variations. Then a sequential optimization is introduced that further improves the speaker error by 5-8% relative. In the last part, computational issues are discussed. The proposed approach is significantly faster, and its complexity grows only marginally with the number of feature streams, running at 0.75 real-time even with four streams while achieving a speaker error of 6%.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1015] K. Farrahi and D. Gatica-Perez. Probabilistic mining of socio-geographic routines from mobile phone data. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 4(4):746–755, August 2010. [ bib | .pdf ]
There is relatively little work on the investigation of large-scale human data in terms of multimodality for human activity discovery. In this paper we suggest that human interaction data, or human proximity, obtained by mobile phone Bluetooth sensor data, can be integrated with human location data, obtained by mobile cell tower connections, to mine meaningful details about human activities from large and noisy datasets. We propose a model, called bag of multimodal behavior, that integrates the modeling of variations of location over multiple time-scales, and the modeling of interaction types from proximity. Our representation is simple yet robust to characterize real-life human behavior sensed from mobile phones, which are devices capable of capturing large-scale data known to be noisy and incomplete. We use an unsupervised approach, based on probabilistic topic models, to discover latent human activities in terms of the joint interaction and location behaviors of 97 individuals over a period of approximately 10 months, using data from MIT's Reality Mining project. Some of the human activities discovered with our multimodal data representation include “going out from 7pm-midnight alone” and “working from 11am-5pm with 3-5 other people”, further finding that this activity dominantly occurs on specific days of the week. Our methodology also finds dominant work patterns occurring on other days of the week. We further demonstrate the feasibility of the topic modeling framework to discover human routines to predict missing multimodal phone data at specific times of the day.

Keywords: Report_IX, IM2.IP3, Group Bourlard, article
[1016] S. Marcel, C. McCool, Pavel Matejka, Timo Ahonen, Jan Cernocky, et al. On the results of the first mobile biometry (MOBIO) face and speaker verification evaluation. Idiap-RR Idiap-RR-30-2010, Idiap, August 2010. [ bib | .pdf ]
This paper evaluates the performance of face and speaker verification techniques in the context of a mobile environment. The mobile environment was chosen as it provides a realistic and challenging test-bed for biometric person verification techniques to operate. For instance the audio environment is quite noisy and there is limited control over the illumination conditions and the pose of the subject for the video. To conduct this evaluation, a part of a database captured during the “Mobile Biometry” (MOBIO) European Project was used. In total there were nine participants in the evaluation who submitted a face verification system and five participants who submitted speaker verification systems.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1017] R. A. Negoescu and D. Gatica-Perez. Modeling and understanding flickr communities through topic-based analysis. IEEE Transactions on Multimedia, 12(5):399–416, August 2010. [ bib | DOI ]
With the increased presence of digital imaging devices there also came an explosion in the amount of multimedia content available online. Users have transformed from passive consumers of media into content creators and have started organizing themselves in and around online communities. Flickr has more than 30 million users and over 3 billion photos, and many of them are tagged and public. One very important aspect in Flickr is the ability of users to organize in self-managed communities called groups. This paper examines an unexplored problem, which is jointly analyzing Flickr groups and users. We show that although users and groups are conceptually different, in practice they can be represented in a similar way via a bag-of-tags derived from their photos, which is amenable to probabilistic topic modeling. We then propose a probabilistic topic model representation learned in an unsupervised manner that allows the discovery of similar users and groups beyond direct tag-based strategies, and we demonstrate that higher-level information such as topics of interest is a viable alternative. On a dataset containing users of 10,000 Flickr groups and over 1 million photos, we show how this common topic-based representation allows for a novel analysis of the groups-users Flickr ecosystem, which results in new insights about the structure of the entities in this social media source. We demonstrate novel practical applications of our topic-based representation, such as similarity-based exploration of entities, or single and multi-topic tag-based search, which address current limitations in the ways Flickr is used today.

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1018] K. Farrahi and D. Gatica-Perez. Mining human location-routines using a multi-level topic model. Idiap-RR Idiap-RR-28-2010, Idiap, August 2010. [ bib | .pdf ]
Keywords: Report_IX, IM2.IP3, Group Bourlard, techreport
[1019] S. Marcel, C. McCool, Cosmin Atanasoaei, Flavio Tarsetti, Jan Pesan, Pavel Matejka, Jan Cernocky, Mika Helistekangas, and Markus Turtinen. MOBIO: Mobile biometric face and speaker authentication. Idiap-RR Idiap-RR-31-2010, Idiap, rue Marconi 19, August 2010. [ bib | .pdf ]
This paper presents a mobile biometric person authentication demonstration system. It consists of verifying a user's claimed identity by biometric means and more particularly using their face and their voice simultaneously on a Nokia N900 mobile device with its built-in sensors (frontal video camera and microphone).

Keywords: Report_IX, IM2.IP1, Group Bourlard, techreport
[1020] Oya Aran and D. Gatica-Perez. Fusing audio-visual nonverbal cues to detect dominant people in conversations. In 20th International Conference on Pattern Recognition, Istanbul, Turkey, August 2010. [ bib | .pdf ]
This paper addresses the multimodal nature of social dominance and presents multimodal fusion techniques to combine audio and visual nonverbal cues for dominance estimation in small group conversations. We combine the two modalities both at the feature extraction level and at the classifier level via score and rank level fusion. The classification is done by a simple rule-based estimator. We perform experiments on a new 10-hour dataset derived from the popular AMI meeting corpus. We objectively evaluate the performance of each modality and each cue alone and in combination. Our results show that the combination of audio and visual cues is necessary to achieve the best performance.

Keywords: Report_IX, IM2.IP3, Group Bourlard, inproceedings
[1021] P. N. Garner and J. Dines. Tracter: A lightweight dataflow framework. In Proceedings of Interspeech [998]. [ bib | .pdf ]
Tracter is introduced as a dataflow framework particularly useful for speech recognition. It is designed to work on-line in real-time as well as off-line, and is the feature extraction means for the Juicer transducer-based decoder. This paper places Tracter in context amongst the dataflow literature and other commercial and open source packages. Some design aspects and capabilities are discussed. Finally, a fairly large processing graph incorporating voice activity detection and feature extraction is presented as an example of Tracter's capabilities.

Keywords: Report_IX, IM2.IP1, Group Bourlard, inproceedings
[1022] D. Imseng, H. Bourlard, and M. Magimai-Doss. Towards mixed language speech recognition systems. In Proceedings of Interspeech, September 2010. [