Nov. 29: 09:00 - 10:00
Tutorial 1: Rule-based voice conversion derived from expressive speech perception model: How do computers sing a song joyfully? (download)

In this presentation, we introduce two problems to which rule-based voice conversion methods derived from expressive speech perception models are applied.

Although language is an important tool in speech communication, even without understanding a language we can still judge the expressive content of a voice, such as emotions, individuality (age, gender and personality), dialect, social position, etc. When we wish to add expressive content to synthesized speech, voice conversion is one of the most important tools. Much of the earlier work has focused on the relationship between expressive speech and acoustic features characterized by statistics learned from huge databases. The problem is the lack of a model that takes vagueness and human perception into consideration.

We focus on modeling expressive speech perception with a three-layer model that involves a concept called “semantic primitives” -- adjectives for describing voice perception. This concept simplifies and clarifies the discussion of common features in terms of acoustic cues and expressive speech perception categories. Based on this model, we generate rules for voice conversion and apply them to add expressive content to speech.

The first issue is conversion of neutral speech into emotional speech. We apply the three-layer model to the perception of emotional speech: the topmost layer comprises five categories of emotion; the middle layer, human perceptual aspects (the semantic primitives); and the bottommost layer, acoustic features. Some demonstrations of synthesizing emotional speech are presented, using the perception model of emotional speech together with a new voice conversion method based on temporal decomposition.
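The layered mapping described above can be pictured as two small lookup tables: an emotion category (top layer) weights semantic primitives (middle layer), and each primitive in turn scales acoustic features (bottom layer). The sketch below is a hypothetical illustration of this structure only; the primitive names, features, and numeric factors are invented, not the rules derived in the tutorial.

```python
# Hypothetical three-layer mapping: emotion -> semantic primitives -> acoustics.
# All names and numbers here are illustrative placeholders.

# Topmost layer: emotion categories mapped to semantic primitives
# (middle layer), each with a salience weight in [0, 1].
EMOTION_TO_PRIMITIVES = {
    "joy": {"bright": 0.9, "fast": 0.6},
    "sad": {"dark": 0.8, "slow": 0.7},
}

# Middle layer: each primitive linked to acoustic-feature rules
# (bottommost layer), given as multiplicative factors at full salience.
PRIMITIVE_TO_ACOUSTICS = {
    "bright": {"f0_mean": 1.15},
    "fast":   {"speech_rate": 1.20},
    "dark":   {"f0_mean": 0.90},
    "slow":   {"speech_rate": 0.80},
}

def conversion_rules(emotion):
    """Compose acoustic modification factors for one emotion category.

    Each salience weight interpolates a factor between 1.0 (no change)
    and its full-salience value; factors for the same feature multiply.
    """
    rules = {}
    for primitive, weight in EMOTION_TO_PRIMITIVES[emotion].items():
        for feature, factor in PRIMITIVE_TO_ACOUSTICS[primitive].items():
            scaled = 1.0 + weight * (factor - 1.0)
            rules[feature] = rules.get(feature, 1.0) * scaled
    return rules

print(conversion_rules("joy"))  # f0_mean ~ 1.135, speech_rate ~ 1.12
```

The resulting factors would then be applied to the neutral utterance's F0 contour and timing, e.g. by the temporal-decomposition-based conversion mentioned above.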

The second issue is conversion of speaking voice into singing voice, an important topic for investigating how singing voices are perceived and generated, as part of the study of non-linguistic information in speech sounds. Since most speech synthesis methods were designed for spoken-voice rather than singing-voice synthesis, they cannot reveal how singing voices are perceived and generated. This presentation shows the possibility of synthesizing a singing voice by adding non-linguistic information to a speaking voice.

Speaker: Dr. Masato Akagi

Professor, School of Information Science, Japan Advanced Institute of Science and Technology (Japan)

Masato Akagi received the B.E. degree in electronic engineering from Nagoya Institute of Technology in 1979, and the M.E. and Ph.D. Eng. degrees in computer science from Tokyo Institute of Technology in 1981 and 1984, respectively. In 1984, he joined the Electrical Communication Laboratory, Nippon Telegraph and Telephone Corporation (NTT). From 1986 to 1990, he worked at the ATR Auditory and Visual Perception Research Laboratories. Since 1992, he has been with the School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), where he is currently a professor. His research interests include human speech perception mechanisms and speech signal processing. Dr. Akagi received the IEICE Excellent Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE) in 1987, the Sato Prize for Outstanding Paper from the Acoustical Society of Japan (ASJ) in 1998, 2005 and 2010, and the Best Paper Award from the Research Institute of Signal Processing Japan in 2009. He was a vice-president of the ASJ from 2007 to 2009.
Nov. 29: 10:30 - 12:00
Tutorial 2: Decision-Feedback Learning (download)

Machine learning has become a general practice, referring to data-driven methods that estimate decision model parameters for statistical inference from an extensive collection of labeled training data. It has been applied to many large-scale problems, including automatic speech and speaker recognition, machine translation, automatic image annotation, text categorization and bioinformatics, to name a few. Conventional learning criteria, such as maximum likelihood and least squares, are often not designed to optimize the performance of the intended decision operations to be executed by the learned models.

In contrast to such conventional function-based learning, we have recently witnessed a paradigm shift toward designing objective functions that match the expected performance of the decision-making process, so that feedback can be collected by testing the training samples with the current model in order to improve model learning in an iterative manner. This is called the decision-feedback learning (DFL) paradigm. It has been applied to many pattern recognition and verification problems in the speech and language processing community. Approximating discrete errors with smooth 0-1 functions and embedding decision rules into the objective function are two critical components in decision-feedback learning. The sample objectives can then be expressed as continuous and differentiable functions of the model parameters and optimized with generalized probabilistic descent algorithms for any choice of decision models and input feature vectors.
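As a toy illustration of these two components, the sketch below embeds the classification rule into the objective by computing a per-sample misclassification measure, smooths the 0-1 error with a sigmoid, and descends its gradient sample by sample. The linear model, toy data, step size, and smoothness constant are all illustrative assumptions; practical DFL systems apply the same idea to HMMs and other decision models.

```python
import numpy as np

def mce_train(X, y, n_classes, epochs=50, lr=0.1, gamma=2.0):
    """MCE-style decision-feedback learning for a linear classifier.

    Discriminants:            g_k(x) = w_k . x
    Misclassification measure d(x) = g_best_wrong(x) - g_correct(x)
    Smoothed loss:            l = sigmoid(gamma * d), a differentiable
                              surrogate for the discrete 0-1 error.
    """
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(epochs):
        for x, c in zip(X, y):
            scores = W @ x
            scores_wrong = scores.copy()
            scores_wrong[c] = -np.inf
            k = int(np.argmax(scores_wrong))   # strongest competitor
            d = scores[k] - scores[c]          # decision feedback on this sample
            l = 1.0 / (1.0 + np.exp(-gamma * d))
            grad = gamma * l * (1.0 - l)       # dl/dd
            W[c] += lr * grad * x              # push correct class up
            W[k] -= lr * grad * x              # push competitor down
    return W

# Toy usage: two well-separated 2-D classes.
X = np.array([[2.0, 0.1], [1.8, -0.2], [-2.0, 0.3], [-1.7, 0.0]])
y = np.array([0, 0, 1, 1])
W = mce_train(X, y, n_classes=2)
pred = np.argmax(X @ W.T, axis=1)
print(pred)  # expect [0 0 1 1]
```

Note how the current model's own decisions (the strongest competitor k) drive each update, which is exactly the iterative feedback loop described above.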

We first present a few popular choices of learning criteria, such as minimum classification error (MCE), minimum verification error (MVE), maximal figure-of-merit (MFoM), and maximal area under the receiver operating characteristic curve (MAUC). We then focus the remainder of our discussion on a recently proposed learning framework called soft margin estimation (SME) of hidden Markov models. SME attempts to combine DFL with margin-based learning as in support vector machines. We will show that SME not only improves recognition accuracy but also enhances system robustness.

Speaker: Dr. Chin-Hui Lee

Professor, School of Electrical and Computer Engineering, Georgia Institute of Technology (USA)

Chin-Hui Lee is a professor at School of Electrical and Computer Engineering, Georgia Institute of Technology. Dr. Lee received the B.S. degree in Electrical Engineering from National Taiwan University, Taipei, in 1973, the M.S. degree in Engineering and Applied Science from Yale University, New Haven, in 1977, and the Ph.D. degree in Electrical Engineering with a minor in Statistics from University of Washington, Seattle, in 1981.

Dr. Lee started his professional career at Verbex Corporation, Bedford, MA, where he was involved in research on connected word recognition. In 1984, he became affiliated with Digital Sound Corporation, Santa Barbara, where he engaged in research and product development in speech coding, speech synthesis, speech recognition and signal processing for the development of the DSC-2000 Voice Server. Between 1986 and 2001, he was with Bell Laboratories, Murray Hill, New Jersey, where he became a Distinguished Member of Technical Staff and Director of the Dialogue Systems Research Department. His research interests include multimedia communication, multimedia signal and information processing, speech and speaker recognition, speech and language modeling, spoken dialogue processing, adaptive and discriminative learning, biometric authentication, and information retrieval. From August 2001 to August 2002 he was a visiting professor at the School of Computing, The National University of Singapore. In September 2002, he joined the faculty of the Georgia Institute of Technology.

Prof. Lee has participated actively in professional societies. He is a member of the IEEE Signal Processing Society (SPS), the Communications Society, and the International Speech Communication Association (ISCA). From 1991 to 1995, he was an associate editor for the IEEE Transactions on Signal Processing and Transactions on Speech and Audio Processing. During the same period, he served as a member of the ARPA Spoken Language Coordination Committee. From 1995 to 1998 he was a member of the Speech Processing Technical Committee and served as its chairman from 1997 to 1998. In 1996, he helped promote the SPS Multimedia Signal Processing Technical Committee, of which he is a founding member.

Dr. Lee is a Fellow of the IEEE, has published more than 300 papers, and holds 25 patents. He received the SPS Senior Award in 1994 and the SPS Best Paper Award in 1997 and 1999. In 1997, he was awarded the prestigious Bell Labs President's Gold Award for his contributions to the Lucent Speech Processing Solutions product. Dr. Lee frequently gives invited lectures to a wide international audience. In 2000, he was named one of the six Distinguished Lecturers by the IEEE Signal Processing Society. He was also named one of ISCA's two inaugural Distinguished Lecturers for 2007-2008. He won the SPS's 2006 Technical Achievement Award for "Exceptional Contributions to the Field of Automatic Speech Recognition".

Nov. 29: 13:00 - 15:00
Tutorial 3: New and emerging applications of speech synthesis (download)

Until recently, text-to-speech was often just an 'optional extra' which allowed text to be read out loud. But now, thanks to statistical and machine learning approaches, speech synthesis can mean more than just the reading out of text in a predefined voice. New research areas and more interesting applications are emerging.

In this tutorial, after a quick overview of the basic approaches to statistical speech synthesis, including speaker adaptation, we consider some of these new applications of speech synthesis. We look behind each application at the underlying techniques used and describe the scientific advances that have made them possible. The applications we will examine include personalised speech-to-speech translation, 'robust speech synthesis' (making thousands of different voices automatically from imperfect data), clinical applications such as voice reconstruction for patients with disordered speech, and articulatory-controllable statistical speech synthesis.

The really interesting problems still to be solved in speech synthesis go beyond simply improving 'quality' or 'naturalness' (typically measured using Mean Opinion Scores). The key problem of personalised speech-to-speech translation is to reproduce or transfer speaker characteristics across languages. The aim of robust speech synthesis is to create good quality synthetic speech from noisy and imperfect data. The core problems in voice reconstruction centre around retaining or reconstructing the original characteristics of patients, given only samples of their disordered speech.

We illustrate our multidisciplinary approach to speech synthesis, bringing in techniques and knowledge from ASR, speech enhancement and speech production in order to develop the techniques required for these new applications. We will conclude by attempting to predict some future directions of speech synthesis.

Speaker: Dr. Junichi Yamagishi

Research Fellow, Centre for Speech Technology Research, University of Edinburgh (UK)

Junichi Yamagishi was awarded a Ph.D. by Tokyo Institute of Technology in 2006 with a thesis which pioneered the use of adaptation techniques in HMM-based speech synthesis and was awarded the Tejima Prize as the best Ph.D. thesis of Tokyo Institute of Technology in 2007. Since 2007, he has been in the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. In addition to authoring and co-authoring over 70 papers in international journals and conferences, his work has led directly to two large-scale European Commission (EC) FP7 projects and two collaborations based around clinical applications of this technology. In 2010, he was awarded the Itakura Prize (Innovative Young Researchers Prize) from the Acoustical Society of Japan for his achievements in adaptive speech synthesis.

Speaker: Dr. Simon King

Professor, Centre for Speech Technology Research, University of Edinburgh (UK)

Simon King is professor of speech processing at the University of Edinburgh. He leads a team that has developed novel techniques in speech synthesis and acoustic modelling for speech recognition. His interests range across speech synthesis, automatic speech recognition and speech processing. He has published about 100 papers on topics including novel acoustic models for ASR such as Dynamic Bayesian Networks, concatenative and HMM-based speech synthesis, voice conversion, articulatory feature extraction, acoustic-articulatory inversion and spoken term detection. He is particularly interested in how to make use of knowledge about human speech production and perception in speech technology. He coordinates the EC project EMIME, is a member of the IEEE Speech and Language Technical Committee, a member of the editorial board of Computer Speech & Language, and was previously an associate editor of IEEE Trans. Audio, Speech & Language Processing.

TEL: +886 6 2096455
FAX: +886 6 2381422