We are very pleased to announce that in this edition we will have the following outstanding keynote speakers:


Dr. Jan "Honza" Cernocky, Brno University of Technology (BUT).

Date: 21 November 2012

Hour: 11:00 - 12:00

Location: Salon de Actos, Building A        

Title: Building speech recognizers with limited resources

Speaker: Honza Černocký

With contributions of: Martin Karafiát, Miloš Janda, Mirko Hannemann, Karel Veselý, František Grézl, Ekaterina Egorova


Research groups and industry have achieved excellent results in automatic speech recognition (ASR) for languages with abundant speech and text resources. There is, however, a steady need for recognition, keyword spotting and spoken term detection systems in languages that lack these resources.

This talk will summarize the work that the Brno University of Technology (BUT) speech group has done so far in this field. Supported by a project of the Czech Ministry of Trade and Commerce and by the U.S. IARPA BABEL program, we have investigated approaches that allow data from well-represented languages to be re-used to obtain better results in the recognition of low-resource ones. In feature extraction, we have worked on the language independence of the now very popular bottleneck features generated by neural networks. We have also studied Subspace Gaussian Mixture Models (SGMMs) and Region-Dependent Linear Transforms (RDLT) for acoustic modeling and adaptation, respectively. Finally, we will present several ideas on the automatic building of pronunciation dictionaries.


Speaker Bio: 

Jan "Honza" Cernocky (Ing. [MS] 1993 Brno University of Technology (BUT); Dr. [PhD] 1998 Universite Paris XI and BUT) was with the Institute of Radio-electronics, BUT (Faculty of Electrical Engineering and Computer Science) as assistant professor from 1997. Since February 2002, he has been with the Faculty of Information Technology (FIT), BUT as Associate Professor (Doc.) and Head of the Department of Computer Graphics and Multimedia. With Dr. Burget and Prof. Hermansky, he is leading the BUT Speech@FIT research group.

Dr. Cernocky supervises several PhD students. He has been involved with several European projects: SPEECHDAT-E (4th FP, technical coordination), SpeeCon, Multimodal meeting manager (M4, both 5th FP), Augmented Multiparty interaction (AMI, 6th FP), Augmented Multiparty Interaction with Distance Access (AMIDA, 6th FP), and CareTaker (6th FP). He was BUT’s principal investigator in the EC-sponsored MOBIO project (7th FP) and in projects supported by the Czech Ministries of Industry and Interior. He is currently coordinating a project supported by the Technology Agency of the Czech Republic (TACR) and supervises projects funded by the U.S. Government.

His research interests include signal processing and speech processing (speaker and language recognition, keyword spotting and spoken term detection). He is author or co-author of more than 50 papers in journals and at conferences. He has served as a reviewer for conferences and journals, including IEEE Transactions on Audio, Speech, and Language Processing. He is on the scientific board of FIT, the scientific board of the Text-Speech-Dialogue conference, the editorial board of the journal Radioengineering, the board of the Czechoslovak section of IEEE, and the IEEE SLTC technical committee. He served as general co-chair of ICASSP 2011 in Prague.

As a faculty member of FIT BUT, Dr. Cernocky is also active in teaching: he is responsible for signal processing, pattern recognition, speech and natural language processing courses at all levels of study (bachelor, master, doctoral).

Dr. Cernocky is a senior member of IEEE and member of ISCA.


Dr. Philip Rose, Australian National University.        

Date: 22 November 2012

Hour: 11:00 - 12:00

Location: Salon de Actos, Building A      


Title: Not too bad? – Forensic Voice Comparison in Theory and Reality.


In forensic voice comparison, speech recordings from an unknown voice, usually of an offender, are compared with recordings from a known voice, usually the suspect. The aim, of course, is to help the trier-of-fact decide whether the suspect has said the incriminating speech.

For some time now, the use of a likelihood ratio has been both theoretically recognised as the correct logical framework for the evaluation of forensic evidence, and implemented as a matter of course in some areas, e.g. DNA profiling. It is logically correct, since by Bayes’ Theorem a posterior probability – like “it is highly likely the suspect said the incriminating speech” – cannot be estimated absent prior odds, to which the expert is not usually privy. Since a posterior may well impinge on considerations of ultimate issue, the use of a likelihood ratio may also be the legally correct option.
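As a rough illustration of the point about Bayes’ Theorem, the sketch below applies the odds form of the theorem (posterior odds = prior odds × likelihood ratio) to the same likelihood ratio under two different priors. The function names and all numbers are hypothetical and chosen only for illustration; they are not taken from the talk or from any real case.

```python
def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' Theorem: posterior odds = prior odds * LR."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical likelihood ratio reported by the expert: the evidence is
    # 100 times more likely under the same-speaker hypothesis than under
    # the different-speaker hypothesis.
    lr = 100.0

    # Two hypothetical sets of prior odds, which the expert does not know.
    for prior in (1 / 1000, 1.0):
        post = posterior_odds(prior, lr)
        print(f"prior odds {prior:g} -> posterior probability "
              f"{odds_to_probability(post):.3f}")
```

With these made-up values the same likelihood ratio of 100 yields a posterior probability of about 0.091 when the prior odds are 1 to 1000, and about 0.990 when they are even, which is why an expert who does not know the prior odds can report the likelihood ratio but not the posterior.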

The use of likelihood ratios in forensic voice comparison was an idea whose time came around the beginning of the new century, when its efficacy first began to be demonstrated both with automatic and traditional phonetic features (Morrison 2009a). The results from now well over a decade’s extensive, and continuing, research testing have shown that the approach works rather well – same-speaker speech samples can for example be rather well discriminated from different-speaker speech samples on the basis of their LRs (although in real cases discrimination is not the appropriate model) – and it has been shown that the approach can emulate the DNA gold-standard (Gonzalez-Rodriguez et al. 2007).

However, despite these results, and the many explanatory texts written for the legal profession (e.g. Robertson & Vignaux 1995, Aitken et al. n.d.), it is safe to say the idea has not yet caught on. Although forensic voice comparison is experiencing a remarkable paradigm shift in theory, in reality this is only in a few small areas in the world (appropriately for this conference, Spain and Australia); and even then only sporadically and in conditions where it is not at all clear that the interested parties actually understand the concepts. This is partly because of the inherent conservatism of the legal profession, and partly because, it appears, many of its practitioners, if they are aware of the approach at all, find it difficult to understand. For example, in a recent set of Australian draft standards, representatives of organizations including The Australian Attorney-General's Department, The Australian and New Zealand Forensic Science Society, The University of New South Wales’ Expertise, Evidence and Law Program, and The New South Wales Bar Association maintained that the “Interpretation [of forensic evidence] includes answering the question as to whether or not … items share a common origin … .” (Standards Australia 2012). Not just the legal profession finds it difficult, however. There have been many objections to the implementation of the approach – some well-founded and some simply misconceived – among many, mostly UK, practitioners (Morrison 2009b). And perhaps the best-known phonetics text-book (Ladefoged 2006) defines the likelihood ratio as “… the likelihood that the two voices in question are the same as compared with the likelihood that they are different.” thus confusing it with the prior odds.

In my talk, I want to first rehearse the conceptual background to likelihood ratio-based forensic voice comparison, and say what I see as the two immense barriers to the acceptance of the approach where it matters – in the courtroom. I then describe a real case where likelihood ratio-based forensic voice comparison played a role in the successful prosecution of a $150 million telephone fraud. This will give an idea of what the approach actually involves. The case I describe is important for several reasons unrelated to the size of the sum of money involved. It remains the only one where likelihood ratio-based forensic voice comparison has been received in an Australian court, and is a nice example of where traditional phonetic features, as opposed to automatic features, can come into their own. Finally, it is important because it suggests that at least one of my barriers – the impasse at the boundary between likelihood ratio and its matrix concept of Bayes’ Theorem – may, in the reality of the courtroom as opposed to the theory of our research laboratories, be imaginary.

Aitken, C.G.G., Roberts, P., & Jackson, G. n.d. Fundamentals of Probability and Statistical Evidence in Criminal Proceedings – Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses, Royal Statistical Society.

Gonzalez-Rodriguez, J., Rose, P., Ramos, D., Torre, D., & Ortega-Garcia, J. 2007 “Emulating DNA: Rigorous Quantification of Evidential Weight in Transparent and Testable Forensic Speaker Recognition”, IEEE Trans. on Audio, Speech and Language Processing, 15(7): 2104–2115.

Ladefoged, P. 2006 A Course in Phonetics, 5th ed. Thomson.

Morrison, G.S. 2009a “Forensic voice comparison and the paradigm shift”, Science & Justice, 49: 298–308.

Morrison, G.S. 2009b “Comments on Coulthard & Johnson’s (2007) portrayal of the likelihood-ratio framework”, Australian Journal of Forensic Sciences, 41: 155–161.

Robertson, B., & Vignaux, T. 1995 Interpreting Evidence, Wiley.

Standards Australia 2012 “Forensic Analysis Part 3: Interpretation”, Draft for Public Comment, DR AS 5388.3.

Speaker Bio: 

Phil Rose is adjunct associate professor in the School of Language Studies at the Australian National University, where he taught Phonetics and Chinese Linguistics for thirty years. He studies phonetics and phonology, and is an expert on tone languages and on Chinese and Chinese dialects. He is also an expert on forensic voice comparison and in his 2002 and 2003 books pioneered the application of the Likelihood Ratio of Bayes’ Theorem to traditional forensic voice comparison, thus bringing it in line with forensic DNA profiling. He has done forensic voice comparison case-work on Australian English and varieties of Chinese for nearly 20 years. He has been a British Academy visiting professor at the University of Edinburgh’s Joseph Bell Centre for Forensic Statistics and Legal Reasoning, is chairman of the Forensic Speech Science Committee of the Australasian Speech Science & Technology Association and a member of the Australian Academy of Forensic Sciences. His attempt to retire from teaching in 2009 failed and in the first half of 2012 he was visiting professor at the Hong Kong University of Science and Technology, where he taught courses on Chinese Phonetics and Forensic Voice Comparison in Cantonese.


Dr. Pedro Moreno, Google Inc.                                                                      

Date: 23 November 2012

Hour: 11:00 - 12:00

Location: Salon de Actos, Building A  


Title: Google's speech internationalization project: From 1 to 300 languages and beyond


The speech team at Google has built speech recognition systems in more than 30 languages in little more than 2 years. In this talk we will describe the history of this project and what technologies have been developed to achieve this goal. I'll explore some of the acoustic modeling, lexicon, language modeling, infrastructure and even social engineering techniques used to achieve our ultimate goal: to build speech recognition systems in the top 300 languages of the planet as fast as possible.

Speaker Bio:

Dr. Pedro J. Moreno leads the speech internationalization engineering group at the Android division of Google. His team is in charge of the infrastructure, engineering, and research needed to deploy and maintain multilingual speech recognition services worldwide.

He joined Google 8 years ago after working as a research scientist at HP Labs. While at HP, he worked mostly on audio indexing systems.

Dr. Moreno completed his Ph.D. studies at Carnegie Mellon University under the direction of Prof. Richard Stern. His work there was focused on noise robustness in speech recognition systems.

His Ph.D. studies were sponsored by a Fulbright scholarship.

Before that he completed an Electrical Engineering degree at Universidad Politecnica de Madrid, Spain. 
