An Elitist Approach to Articulatory-Acoustic Feature Classification in English and in Dutch

Steven Greenberg, Shawn Chang and Mirjam Wester

Steven Greenberg - International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng [email protected]
Shawn Chang - International Computer Science Institute
http://www.icsi.berkeley.edu/~shawnc [email protected]
Mirjam Wester - A2RT, Department of Language and Speech, Nijmegen University, Netherlands
http://www.lands.let.kun.nl/Tspublic/wester [email protected]

Acknowledgements and Thanks
Automatic Feature Classification and Analysis: Joy Hollenback, Lokendra Shastri, Rosaria Silipo
Research Funding: U.S. National Science Foundation, U.S. Department of Defense

Motivation for Automatic Transcription
Many Properties of Spontaneous Spoken Language Differ from Those of Laboratory and Citation Speech
  There are systematic patterns in real speech that potentially reveal underlying principles of linguistic organization
Phonetic and Prosodic Annotation Material is of Limited Quantity
  Phonetic and prosodic material is important for understanding spoken language and developing superior technology for recognition and synthesis
Manual Annotation of Phonetic and Prosodic Material is a Pain in the Butt to Produce
  Hand labeling and segmentation is time-consuming and expensive
  It is difficult to find qualified transcribers, and training can be arduous
Automatic Alignment Systems (used in speech recognition) are Inaccurate both in Terms of Labeling and Segmentation
  Forced-alignment-based segmentation is poor - ca. 40% off on phone boundaries
  Phone classification error is ca. 30-50%
  Speech recognition systems do not currently deal with prosody
Automatic Transcription is Likely to Aid in the Development of Speech Recognition and Synthesis Technology
  And therefore is worth the effort to develop

Road Map of the Presentation
Introduction
  Motivation for developing automatic phonetic transcription systems
  Rationale for the current focus on articulatory-acoustic features (AFs)
  The development corpus - NTIMIT
  Justification for using NTIMIT for development of AF classifiers
The ELITIST Approach and Its Application to English
  The baseline system
  The ELITIST approach
  Manner-specific classification for place-of-articulation features
Application of the ELITIST Approach to Dutch
  The training and testing corpus - VIOS
  The nature of cross-linguistic transfer of articulatory-acoustic features
  The ELITIST approach to frame selection as applied to the VIOS corpus
  Improvement of place-of-articulation classification using manner-specific training in Dutch
Conclusions and Future Work
  Development of fully automatic phonetic and prosodic transcription systems
  An empirically oriented discipline based on annotated corpora

Part One

INTRODUCTION
Motivation for Developing Automatic Phonetic Transcription Systems
Rationale for the Current Focus on Articulatory-Acoustic Features
Description of the Development Corpus - NTIMIT
Justification for Using the NTIMIT Corpus

Corpus Generation - Objectives
Provides Detailed, Empirical Material for the Study of Spoken Language
  Such data provide an important basis for scientific insight and understanding
  Facilitates development of new models for spoken language
Provides Training Material for Technology Applications
  Automatic speech recognition, particularly pronunciation models
  Speech synthesis, ditto
  Cross-linguistic transfer of technology algorithms
Promotes Development of NOVEL Algorithms for Speech Technology
  Pronunciation models and lexical representations for automatic speech recognition and speech synthesis
  Multi-tier representations of spoken language

Corpus-Centric View of Spoken Language
Our Focus in Today's Presentation is on Articulatory Feature Classification
  Other levels of linguistic representation are also extremely important to annotate

Rationale for Articulatory-Acoustic Features
Articulatory-Acoustic Features (AFs) are the Building Blocks of the Lowest (i.e., Phonetic) Tier of Spoken Language
  AFs can be combined in a variety of ways to specify virtually any speech sound found in the world's languages
  AFs are therefore more appropriate for cross-linguistic transfer than phonetic segments
AFs are Systematically Organized at the Level of the Syllable
  Syllables are a basic articulatory unit in speech
  The pronunciation patterns observed in casual conversation are systematic at the AF level, but not at the phonetic-segment level, and therefore can be used to develop more accurate and flexible pronunciation models than phonetic segments
AFs are Potentially More Effective in Speech Recognition Systems
  More accurate and flexible pronunciation models (tied to syllabic and lexical units)
  AFs are generally more robust under acoustic interference than phonetic segments
  The relatively small number of alternative features along each AF dimension makes classification inherently more robust than for phonetic segments
AFs are Potentially More Effective in Speech Synthesis Systems
  More accurate and flexible pronunciation models (tied to syllabic and lexical units)

Primary Development Corpus - NTIMIT
Sentences Read by Native Speakers of American English
  Quasi-phonetically balanced set of materials
  Wide range of dialect variability, both genders, variation in speaker age
  Relatively low semantic predictability - "She washed his dark suit in greasy wash water all year"
Corpus Manually Labeled and Segmented at the Phonetic-Segment Level
  The precision of phonetic annotation provides an excellent training corpus
  Corpus was annotated at MIT
A Large Amount of Annotated Material
  Over 2.5 hours of material used for training the classifiers
  20 minutes of material used for testing
Relatively Canonical Pronunciation Ideal for Training AF Classifiers
  Formal pronunciation patterns provide a means of deriving articulatory features from phonetic-segment labels via mapping rules (cf. Proceedings paper for details)
NTIMIT is a Telephone Pass-band Version of the TIMIT Corpus
  Sentential material passed through a channel between 0.3 and 3.4 kHz
  Provides capability of transfer to other telephone corpora (such as VIOS)
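The mapping step can be sketched as a simple lookup from each hand-labeled phone to a value along every AF dimension. The table below is a minimal illustration in Python; the phone set and feature assignments are placeholder assumptions, not the actual NTIMIT mapping rules (those are given in the Proceedings paper).

```python
# Illustrative phone-to-AF lookup table; the phone set and feature values
# are placeholders, not the actual NTIMIT mapping rules (see the paper).
PHONE_TO_AF = {
    # phone: (manner, place, voicing)
    "p":  ("stop",      "labial",   "voiceless"),
    "b":  ("stop",      "labial",   "voiced"),
    "s":  ("fricative", "alveolar", "voiceless"),
    "m":  ("nasal",     "labial",   "voiced"),
    "iy": ("vocalic",   "front",    "voiced"),
    "h#": ("silence",   "silence",  "silence"),  # silence is a value in every dimension
}

def af_targets(frame_phones):
    """Map frame-level phone labels to per-dimension AF training targets."""
    manner = [PHONE_TO_AF[p][0] for p in frame_phones]
    place = [PHONE_TO_AF[p][1] for p in frame_phones]
    voicing = [PHONE_TO_AF[p][2] for p in frame_phones]
    return manner, place, voicing

print(af_targets(["h#", "s", "iy", "m"]))
```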

Part Two
THE ELITIST APPROACH
The Baseline System for Articulatory-Acoustic Feature Classification
The ELITIST Approach to Systematic Frame Selection for AF Classification
Improving Place-of-Articulation Classification Using Manner-Specific Training

The Baseline System for AF Classification
Spectro-Temporal Representation of the Speech Signal
  Derived from the logarithmically compressed, critical-band energy pattern
  25-ms analysis windows (i.e., a frame)
  10-ms frame-sampling interval (i.e., 60% overlap between adjacent frames)
Multilayer Perceptron (MLP) Neural Network Classifiers
  Single hidden layer of 200-400 units, trained with back-propagation
  Nine frames of context used in the input
An MLP Network for Each Articulatory Feature (AF) Dimension
  A separate network is trained on voicing, place and manner of articulation, etc.
  Training targets were derived from hand-labeled phonetic transcripts and a fixed phone-to-AF mapping
  Silence was a feature included in the classification of each AF dimension
  All of the results reported are for FRAME accuracy (not segmental accuracy)
Focus on Articulatory Feature Classification Rather than Phone Identity
  Provides a more accurate means of assessing MLP-based classification

Baseline System Performance Summary (NTIMIT corpus)
Classification of Articulatory Features Exceeds 80% Except for Place
Objective: Improve Classification across All AF Dimensions, but Particularly on Place of Articulation
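A minimal sketch of the baseline per-frame classifier follows, assuming 20 critical bands per frame (the band count is our assumption); it shows how nine consecutive frames are stacked into a single input vector and passed through a one-hidden-layer MLP with a softmax output. The random weights stand in for weights that, in the actual system, are trained with back-propagation.

```python
import numpy as np

# Minimal sketch of the baseline per-frame AF classifier, assuming 20
# critical-band log-energy values per 25-ms frame; the band count and
# class count are illustrative assumptions, the hidden-layer size is
# within the 200-400 range noted above.
N_BANDS, CONTEXT, N_HIDDEN, N_CLASSES = 20, 9, 300, 6

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (N_BANDS * CONTEXT, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)
# In the actual system these weights are trained with back-propagation on
# hand-labeled NTIMIT frames; random weights keep the sketch self-contained.

def stack_context(frames, center, half=CONTEXT // 2):
    """Concatenate nine consecutive frames centered on `center` (edges clamped)."""
    idx = np.clip(np.arange(center - half, center + half + 1), 0, len(frames) - 1)
    return frames[idx].ravel()

def posteriors(frames, center):
    """Per-frame AF posteriors: one hidden layer, softmax output."""
    h = np.tanh(stack_context(frames, center) @ W1 + b1)
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

frames = rng.normal(size=(100, N_BANDS))  # stand-in for log critical-band energies
p = posteriors(frames, center=50)
print(p.argmax(), round(p.max(), 3))      # predicted class and its confidence
```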

Not All Frames are Created Equal
Correlation Between Frame Position and Classification Accuracy for MANNER of Articulation Features
  The 20% of the frames closest to the segment BOUNDARIES are 73% correct
  The 20% of the frames closest to the segment CENTER are 90% correct
Correlation Between Frame Position Within a Segment and Classifier Output for MANNER Features
  The 20% of the frames closest to the segment BOUNDARIES have a mean maximum output (confidence) level of 0.797
  The 20% of the frames closest to the segment CENTER have a mean maximum output (confidence) level of 0.892
  This dynamic range of 0.1 (in absolute terms) is HIGHLY significant
Manner Classification is Best for Frames in the Phonetic-Segment Center
MLP Network Confidence Level is Highly Correlated with Frame Accuracy
The Most Confidently Classified Frames are Generally More Accurate
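The position effect can be measured with a few lines of analysis code: bin each frame by its relative position within its hand-labeled segment and compare the center-most and boundary-most quintiles. Below is a minimal sketch with synthetic data; the real analysis uses NTIMIT frames and actual classifier outputs.

```python
import numpy as np

def center_vs_boundary_accuracy(correct, rel_pos):
    """Accuracy for the 20% of frames nearest the segment center vs. the 20%
    nearest its boundaries.

    correct: (n,) boolean, frame classified correctly or not
    rel_pos: (n,) frame position within its segment, 0.0 = onset, 1.0 = offset
    """
    dist = np.abs(rel_pos - 0.5)             # 0 at the center, 0.5 at a boundary
    center = dist <= np.quantile(dist, 0.2)
    boundary = dist >= np.quantile(dist, 0.8)
    return correct[center].mean(), correct[boundary].mean()

# Synthetic demonstration: accuracy decays toward segment boundaries
# (on NTIMIT manner frames the real figures are ca. 90% center, 73% boundary).
rng = np.random.default_rng(2)
pos = rng.random(5000)
correct = rng.random(5000) < (0.90 - 0.34 * np.abs(pos - 0.5))
center_acc, boundary_acc = center_vs_boundary_accuracy(correct, pos)
print(f"center {center_acc:.2f}, boundary {boundary_acc:.2f}")
```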

Selecting a Threshold for Frame Selection
The Correlation Between Neural Network Confidence Level and Frame Position Within the Phonetic Segment Can Be Exploited to Enhance Articulatory Feature Classification
  This insight provides the basis for the ELITIST approach
The Most Confidently Classified Frames are Generally More Accurate
Frames with Confidence Levels Below Threshold are Discarded
  Setting the threshold to 0.7 filters out ca. 20% of the frames
  Boundary frames are twice as likely to be discarded as central frames
Primary Drawback of Using This Threshold for Frame Selection
  6% of the phonetic segments have most of their frames discarded
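The selection rule itself is tiny: a frame survives only if the maximum network output for its AF dimension reaches the confidence threshold. A minimal sketch with synthetic posteriors (the threshold and the reported filtering rate come from the slides; everything else is toy data):

```python
import numpy as np

def elitist_select(posteriors, labels, threshold=0.7):
    """Keep only frames whose maximum MLP output reaches the confidence threshold.

    posteriors: (n_frames, n_classes) network outputs per frame
    labels:     (n_frames,) reference AF labels (integers)
    Returns accuracy over the retained frames and the fraction discarded.
    """
    keep = posteriors.max(axis=1) >= threshold
    predictions = posteriors.argmax(axis=1)
    accuracy = (predictions[keep] == labels[keep]).mean()
    return accuracy, 1.0 - keep.mean()

# Toy demonstration with synthetic posteriors; on NTIMIT the 0.7 threshold
# discards ca. 20% of the frames and raises manner accuracy from 85% to 93%.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.full(6, 0.5), size=1000)
labels = post.argmax(axis=1)
flip = rng.random(1000) < 0.15            # corrupt 15% of labels to simulate errors
labels[flip] = (labels[flip] + 1) % 6
acc, discarded = elitist_select(post, labels)
print(f"retained-frame accuracy {acc:.2f}, frames discarded {discarded:.0%}")
```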

The Elitist Approach to Manner Classification
The Accuracy of MANNER Frame Classification Improves
  Frame-level classification accuracy increases overall from 85% to 93%
Certain Manner Classes Improve Greatly with Frame Selection
  Nasals, stops, fricatives and flaps all show strong improvement in performance

Manner-Dependency for Place of Articulation
Objective: Reduce the Number of Place Features to Classify for Any Single Manner Class
  Although there are NINE distinct place-of-articulation features overall...
  For any single manner class there are only three or four place features
  The specific PLACES of articulation for stops differ from those for fricatives, etc.
  HOWEVER, the SPATIAL PATTERNING of the constriction loci is SIMILAR
Because Classification Accuracy for Manner Features is High, Manner-Specific Training for Place of Articulation is Feasible (as we'll show you)

Manner-Specific Place Classification
Thus, Each Manner Class can be Trained on Comparable Relational Place Features: ANTERIOR - CENTRAL - POSTERIOR
Classifying Place of Articulation in Manner-Specific Fashion Can Improve the Classification Accuracy of this Feature Dimension
  The training material is far more homogeneous under this regime and is thus more reliable and robust
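Concretely, manner-specific training amounts to partitioning the training frames by manner class and collapsing each class's place labels onto the shared relational targets. The groupings in the sketch below are illustrative assumptions, not the exact feature inventories used in the experiments:

```python
# Sketch of manner-specific place relabeling (the groupings below are
# illustrative assumptions, not the exact feature inventories used here).
RELATIONAL_PLACE = {
    "stop":      {"labial": "ANTERIOR", "alveolar": "CENTRAL", "velar": "POSTERIOR"},
    "fricative": {"labiodental": "ANTERIOR", "alveolar": "CENTRAL", "palatal": "POSTERIOR"},
    "nasal":     {"labial": "ANTERIOR", "alveolar": "CENTRAL", "velar": "POSTERIOR"},
    "vocalic":   {"front": "ANTERIOR", "central": "CENTRAL", "back": "POSTERIOR"},
}

def split_by_manner(samples):
    """Partition (features, manner, place) samples into per-manner training
    sets whose place labels are collapsed to the shared relational targets."""
    per_manner = {manner: [] for manner in RELATIONAL_PLACE}
    for features, manner, place in samples:
        per_manner[manner].append((features, RELATIONAL_PLACE[manner][place]))
    return per_manner

samples = [([0.1] * 20, "stop", "velar"), ([0.3] * 20, "vocalic", "front")]
for manner, subset in split_by_manner(samples).items():
    print(manner, [target for _, target in subset])
```

At test time the manner network runs first; each frame is then routed to the place classifier trained for the predicted manner class.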

Manner-Specific Classification - Vowels (NTIMIT telephone corpus)
Knowing the Manner Improves Place Classification for Vowels as Well
Also Improves Height Classification

Manner-Specific Place Classification - Overall (NTIMIT telephone corpus)
Overall, Performance Improves Between 5% and 14% (in absolute terms)
Improvement is Greatest for Stops, Nasals and Flaps

Summary - ELITIST Approach
A Principled Method of Frame Selection (the ELITIST approach) can be Used to Improve the Accuracy of Articulatory Feature Classification
The ELITIST Approach is Based on the Observation that Frames in the Center of Phonetic Segments are More Accurately Classified than Those at Segment Boundaries
Frame Classification Accuracy is Highly Correlated with MLP Network Confidence Level and can be Used to Systematically Discard Frames
  Discarding such low-confidence frames improves AF classification
Manner Classification is Sufficiently Improved as to Enable Manner-Specific Training for Place-of-Articulation Features
Place-of-Articulation Feature Classification Improves Using Manner-Specific Training
  This performance enhancement is probably the result of:
    Fewer features to classify for any given manner class
    More homogeneous place-of-articulation training material
Such Improvements in AF Classification Accuracy Can Be Used to Improve the Quality of Automatic Phonetic Annotation

Part Three
THE ELITIST APPROACH GOES DUTCH
Description of the Development Corpus - VIOS
The Nature of Cross-Linguistic Transfer of Articulatory Features
Application of the ELITIST Approach to Dutch
Manner-Specific, Place-of-Articulation Classification for Dutch

Dutch Development Corpus - VIOS
Extemporaneous, Prompted Human-Machine Telephone Dialogues
  Human speakers querying an automatic system for Dutch Railway timetables
  Wide range of dialect variability, both genders, variation in speaker age
A Portion of the Corpus Manually Labeled at the Phonetic-Segment Level
  Material labeled by speech science students at Nijmegen University
  This component of the corpus served as the testing material
  This portion contained 18 minutes of material
The Major Portion of the Corpus Automatically Labeled and Segmented
  The automatic method incorporated a certain degree of pronunciation-model knowledge derived from language-specific phonological rules
  This part of the corpus served as the training material
  This portion contained 60 minutes of material

How Dutch Differs from English
Dutch and English are Genetically Closely Related Languages
  Perhaps 1500 years of time depth separate the languages
  They share some (but not all - see below) phonetic properties in common
The Dental Place of Articulation is Present in English, but not in Dutch
The Manner Class Flap is Present in English, but not in Dutch
Certain Manner/Place Combinations in Dutch are not Found in English
  For example, the velar fricative associated with orthographic "g"
The Vocalic System (particularly diphthongs) Differs Between Dutch and English

Cross-Linguistic Classification
Classification Accuracy on the VIOS Corpus
  Results depend on whether the classifiers were trained on VIOS (Dutch) or NTIMIT (English) material
  Voicing and manner classification is comparable between the two training corpora
  Place classification is significantly worse when training on NTIMIT
  Other feature dimensions exhibit only slightly worse performance when training on NTIMIT
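One way to realize such a cross-linguistic test is to score an NTIMIT-trained classifier on VIOS frames only over AF values that exist in both languages (Dutch lacks the dental place and the flap manner). The sketch below is a hypothetical scoring harness with illustrative inventories; it is not the evaluation code behind the reported numbers:

```python
# Hypothetical cross-corpus scoring harness (inventories illustrative).
# English-only AF values, e.g. the dental place, are excluded when scoring
# an NTIMIT-trained classifier on Dutch VIOS frames.
ENGLISH_PLACE = {"labial", "dental", "alveolar", "palatal", "velar"}
DUTCH_PLACE = ENGLISH_PLACE - {"dental"}      # Dutch lacks the dental place

def cross_corpus_accuracy(predicted, reference, shared=DUTCH_PLACE):
    """Frame accuracy over reference frames whose AF value exists in both languages."""
    scored = [(p, r) for p, r in zip(predicted, reference) if r in shared]
    return sum(p == r for p, r in scored) / len(scored)

predicted = ["labial", "alveolar", "velar", "alveolar"]
reference = ["labial", "alveolar", "velar", "palatal"]
print(cross_corpus_accuracy(predicted, reference))   # 0.75
```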

The Elitist Approach Applied to Dutch
For VIOS-trained Classifiers
  Frames with confidence levels below threshold are discarded - setting the threshold to 0.7 filters out ca. 15% of the frames, corresponding to 6% of the segments
  The accuracy of MANNER frame classification improves - frame-level classification accuracy increases from 85% to 91%
For NTIMIT-trained Classifiers (but classifying VIOS material)
  Frames with confidence levels below threshold are discarded - setting the threshold to 0.7 filters out ca. 19% of the frames
  The accuracy of MANNER frame classification improves - frame-level classification accuracy increases from 73% to 81%

Place of Articulation is Manner-Dependent
Although There are Nine Distinct Place-of-Articulation Features Overall
For Any Single Manner Class There are Only Three Place Features
The Locus of Articulation Constriction Differs Among Manner Classes
Thus, if the Manner is Classified Correctly, this Information can be Exploited to Enhance Place-of-Articulation Classification
Thus, Each Manner Class can be Trained on Comparable Relational Place Features: ANTERIOR - CENTRAL - POSTERIOR
Knowing the Manner Improves Place Classification for both Consonants and Vowels in DUTCH (VIOS telephone corpus)

Manner-Specific Place Classification - Dutch (VIOS telephone corpus)
Knowing the Manner Improves Place Classification for the Segments in DUTCH

Manner-Specific Place Classification - Dutch Approximants (VIOS telephone corpus)
Knowing the Manner Improves Place Classification for the Approximant Segments in DUTCH
Approximants are Classified as Vocalic Rather Than as Consonantal

Summary - ELITIST Goes Dutch
Cross-Linguistic Transfer of Articulatory Features
  Classifiers are more than 80% correct on all AF dimensions except for place when trained and tested on VIOS
  Voicing and manner classification is comparable between VIOS and NTIMIT
  Place classification (for VIOS) is much worse when trained on NTIMIT
  Other AF dimensions are only slightly worse when trained on NTIMIT
Application of the ELITIST Approach to the VIOS Corpus
  Results improve when the ELITIST approach is used
  Training on VIOS: frame-level classification accuracy increases from 85% to 91% (15% of the frames discarded)
  Training on NTIMIT: frame-level classification accuracy increases from 73% to 81% (19% of the frames discarded)
Manner-Specific Classification for Place-of-Articulation Features
  Knowing the manner improves place classification for vowels and for consonants
  Accuracy increases between 10 and 20% (absolute) for all place features
  Approximants are classified as vocalic, not consonantal; knowing the manner improves place classification for approximant segments

Part Four
INTO THE FUTURE
Towards Fully Automatic Transcription Systems
An Empirically Oriented Discipline Based on Annotated Corpora

The Eternal Pentangle
Phonetic and Prosodic Annotation is Limited in Quantity
  This material is important for understanding spoken language and developing superior technology for recognition and synthesis

I Have a Dream, That One Day...
There will be Annotated Corpora for All Major Languages of the World
  (generated by automatic means, but based on manual annotation)
That Each of These Corpora will Contain Detailed Information About:
  Articulatory-acoustic features
  Phonetic segments
  Pronunciation variation
  Syllable units
  Lexical representations
  Prosodic information pertaining to accent and intonation
  Morphological patterns, as well as syntactic and grammatical material
  Semantics and its relation to the lower tiers of spoken language
  Audio and video detail pertaining to all aspects of spoken language
That a Science of Spoken Language will be Empirically Based
  Using these annotated corpora to perform detailed statistical analyses
  Generating hypotheses about the organization and function of spoken language
  Performing experiments based on insights garnered from such corpora
That Such Corpora will be Used to Develop Wonderful Technology
  To create flawless speech recognition
  And perfect speech synthesis

That's All, Folks
Many Thanks for Your Time and Attention
