A Medley of Potpourri

Wednesday, July 15, 2020

Speech perception

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Speech_perception

Speech perception is the process by which the sounds of language are heard, interpreted and understood. The study of speech perception is closely linked to phonology and phonetics in linguistics, and to cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

The process of perceiving speech begins at the level of the sound signal and the process of audition. (For a complete description of the process of audition see Hearing.) After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.

Acoustic cues

Figure 1: Spectrograms of syllables "dee" (top), "dah" (middle), and "doo" (bottom) showing how the onset formant transitions that define perceptually the consonant [d] differ depending on the identity of the following vowel. (Formants are highlighted by red dotted lines; transitions are the bending beginnings of the formant trajectories.)

Acoustic cues are sensory cues contained in the speech sound signal which are used in speech perception to differentiate speech sounds belonging to different phonetic categories. For example, one of the most studied cues in speech is voice onset time or VOT. VOT is a primary cue signaling the difference between voiced and voiceless plosives, such as "b" and "p". Other cues differentiate sounds that are produced at different places of articulation or manners of articulation. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of phonemes. These representations can then be combined for use in word recognition and other language processes.

It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:

At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.

If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:

One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can distinguish the identity of vowels. Some experts even argue that duration can help in distinguishing what are traditionally called short and long vowels in English.
One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the following vowel (see Figure 1) but they are all interpreted as the phoneme /d/ by listeners.

Linearity and the segmentation problem

Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.

Although listeners perceive speech as a stream of discrete units (phonemes, syllables, and words), this linearity is difficult to see in the physical speech signal (see Figure 2 for an example). Speech sounds do not strictly follow one another; rather, they overlap. A speech sound is influenced by the ones that precede it and the ones that follow it. This influence can even be exerted at a distance of two or more segments (and across syllable and word boundaries).

Because the speech signal is not linear, there is a problem of segmentation. It is difficult to delimit a stretch of speech signal as belonging to a single perceptual unit. As an example, the acoustic properties of the phoneme /d/ will depend on the production of the following vowel (because of coarticulation).

Lack of invariance

The research and application of speech perception must deal with several problems which result from what has been termed the lack of invariance. Reliable constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:

Context-induced variation

Phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants. Likewise, the voice onset time values marking the boundary between voiced and voiceless plosives are different for labial, alveolar and velar plosives, and they shift under stress or depending on the position within a syllable.

Variation due to differing speech conditions

One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, plosives vs. glides, voiced vs. voiceless plosives, etc.) and they are certainly affected by changes in speaking tempo. Another major source of variation is articulatory carefulness vs. sloppiness which is typical for connected speech (articulatory "undershoot" is obviously reflected in the acoustic properties of the sounds produced).

Variation due to different speaker identity

The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices having different pitch. Because speakers have vocal tracts of different sizes (due to sex and age especially) the resonant frequencies (formants), which are important for recognition of speech sounds, will vary in their absolute values across individuals. Research shows that infants at the age of 7.5 months cannot recognize information presented by speakers of different genders; however by the age of 10.5 months, they can detect the similarities. Dialect and foreign accent can also cause variation, as can the social characteristics of the speaker and listener.

Perceptual constancy and normalization

Figure 3: The left panel shows the 3 peripheral American English vowels /i/, /ɑ/, and /u/ in a standard F1 by F2 plot (in Hz). The mismatch between male, female, and child values is apparent. In the right panel formant distances (in Bark) rather than absolute values are plotted using the normalization procedure proposed by Syrdal and Gopal in 1986. Formant values are taken from Hillenbrand et al. (1995)

Despite the great variety of different speakers and different conditions, listeners perceive vowels and consonants as constant categories. It has been proposed that this is achieved by means of a perceptual normalization process in which listeners filter out the noise (i.e. variation) to arrive at the underlying category. Vocal-tract-size differences result in formant-frequency variation across speakers; therefore a listener has to adjust their perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values. This process has been called vocal tract normalization (see Figure 3 for an example). Similarly, listeners are believed to adjust the perception of duration to the current tempo of the speech they are listening to; this has been referred to as speech rate normalization.
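The idea of comparing formant distances rather than absolute frequencies can be sketched in a few lines of Python. This is only an illustration. The Hz-to-Bark conversion below uses Traunmüller's (1990) approximation, which is an assumption here and not necessarily the formula used in any particular normalization procedure. The choice of distances (F1-F0, F2-F1, F3-F2) is likewise assumed, and the values for the two hypothetical speakers are invented for the example rather than taken from measured data.

```python
def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark (Traunmüller 1990 approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_distances(f0, f1, f2, f3):
    """Return distances between neighbouring spectral peaks, in Bark."""
    z0, z1, z2, z3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    return {"F1-F0": z1 - z0, "F2-F1": z2 - z1, "F3-F2": z3 - z2}


# Invented (hypothetical) values in Hz for the vowel /i/: F0, F1, F2, F3.
speakers = {
    "adult_male": (120.0, 300.0, 2300.0, 3000.0),
    "child": (250.0, 450.0, 3100.0, 3700.0),
}

for name, values in speakers.items():
    distances = bark_distances(*values)
    print(name, {key: round(value, 2) for key, value in distances.items()})
```

Whether the distances of different speakers actually come closer together than their raw values in Hz depends on the data; the sketch only shows the transformation itself.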

Whether or not normalization actually takes place, and what its exact nature is, is a matter of theoretical controversy (see theories below). Perceptual constancy is not specific to speech perception; it exists in other types of perception too.

Categorical perception

Figure 4: Example identification (red) and discrimination (blue) functions

Categorical perception is involved in processes of perceptual differentiation. People perceive speech sounds categorically, that is to say, they are more likely to notice the differences between categories (phonemes) than within categories. The perceptual space between categories is therefore warped, the centers of categories (or "prototypes") working like a sieve or like magnets for incoming speech sounds.

In an artificial continuum between a voiceless and a voiced bilabial plosive, each new step differs from the preceding one in the amount of VOT. The first sound is a pre-voiced [b], i.e. it has a negative VOT. As the VOT increases it reaches zero, at which point the plosive is a plain unaspirated voiceless [p]. With the same amount of VOT added at each step, the plosive eventually becomes a strongly aspirated voiceless bilabial [pʰ]. (Such a continuum was used in an experiment by Lisker and Abramson in 1970. The sounds they used are available online.) In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/ with a clear boundary between the two categories. A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see red curve in Figure 4).

In tests of the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other (20 ms for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).

The conclusion to draw from both the identification and the discrimination test is that listeners have different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories is crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
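The relation between the identification and discrimination functions described above can be illustrated with a small Python sketch. It assumes a logistic identification function with a hypothetical boundary at 25 ms and a hypothetical slope. It also assumes that listeners discriminate two sounds only by the category labels they assign to them, which gives a predicted proportion correct of 0.5 + 0.5(p1 - p2)^2. These are simplifying assumptions for illustration, not measured values.

```python
import math


def p_voiceless(vot_ms, boundary_ms=25.0, slope=0.5):
    """Probability of labeling a stimulus as /p/ (logistic identification function)."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))


def predicted_discrimination(vot_a, vot_b):
    """Predicted proportion correct if only category labels are used."""
    difference = p_voiceless(vot_a) - p_voiceless(vot_b)
    return 0.5 + 0.5 * difference ** 2


# Pairs of stimuli that are always 20 ms apart along the continuum.
for start in range(-20, 41, 10):
    pair = (start, start + 20)
    print(pair, round(predicted_discrimination(*pair), 3))
```

Under these assumptions, pairs that lie on the same side of the boundary are predicted to be discriminated near chance, while pairs that straddle it are predicted to be discriminated almost perfectly.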

Top-down influences

In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. Perceptually, his subjects restored the missing speech sound without any difficulty and could not accurately identify which phoneme had been disturbed, a phenomenon known as the phonemic restoration effect. Therefore, the process of speech perception is not necessarily uni-directional.

Another basic experiment compared recognition of naturally spoken words within a phrase versus the same words in isolation, finding that perception accuracy usually drops in the latter condition. To probe the influence of semantic knowledge on perception, Garnes and Bond (1976) similarly used carrier sentences where target words only differed in a single phoneme (bay/day/gay, for example) whose quality changed along a continuum. When put into different sentences that each naturally led to one interpretation, listeners tended to judge ambiguous words according to the meaning of the whole sentence. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in recognition of speech sounds.
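One way to picture how sentence context could bias the interpretation of an ambiguous sound is a simple Bayesian combination of acoustic evidence with contextual expectation. The sketch below is an assumption made for illustration and is not the model used by Garnes and Bond; the function name and all probabilities are hypothetical.

```python
def posterior(acoustic_likelihood, context_prior):
    """Combine acoustic evidence and contextual expectation with Bayes' rule."""
    unnormalized = {
        word: acoustic_likelihood[word] * context_prior[word]
        for word in acoustic_likelihood
    }
    total = sum(unnormalized.values())
    return {word: value / total for word, value in unnormalized.items()}


# A token that is acoustically ambiguous between "bay" and "day" (invented numbers).
likelihood = {"bay": 0.45, "day": 0.45, "gay": 0.10}

# A carrier sentence that strongly favors "day" (invented numbers).
prior = {"bay": 0.05, "day": 0.90, "gay": 0.05}

result = posterior(likelihood, prior)
print(max(result, key=result.get), {w: round(p, 3) for w, p in result.items()})
```

With a flat prior the two ambiguous candidates remain tied; the context is what tips the decision in this sketch.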

It may be the case that it is not necessary, and maybe not even possible, for a listener to recognize phonemes before recognizing higher units such as words. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language. Compensatory mechanisms might even operate at the sentence level, such as in learned songs, phrases and verses, an effect backed up by neural coding patterns consistent with the missed continuous speech fragments, despite the lack of all relevant bottom-up sensory input.

Acquired language impairment

The first hypothesis of speech perception was applied to patients who had acquired an auditory comprehension deficit, also known as receptive aphasia. Since then, many further disabilities have been classified, which resulted in a clearer definition of "speech perception". The term describes the process of interest that employs sublexical contexts in the probe process. It draws on many different linguistic and grammatical levels, such as features, segments (phonemes), syllabic structure (units of pronunciation), phonological word forms (how sounds are grouped together), grammatical features, morphemic information (prefixes and suffixes), and semantic information (the meaning of the words). Early research was mostly concerned with the acoustics of speech, for instance with the differences between /ba/ and /da/, whereas more recent research has been directed at the brain's response to such stimuli.

In recent years, a model known as the dual stream model has been developed to describe how speech perception works. This model has drastically changed how psychologists look at perception. The first section of the dual stream model is the ventral pathway. This pathway incorporates the middle temporal gyrus, the inferior temporal sulcus and perhaps the inferior temporal gyrus. The ventral pathway maps phonological representations onto lexical or conceptual representations, that is, the meaning of the words. The second section of the dual stream model is the dorsal pathway. This pathway includes the Sylvian parietotemporal area, the inferior frontal gyrus, the anterior insula, and the premotor cortex. Its primary function is to take the sensory or phonological stimuli and transform them into an articulatory-motor representation (formation of speech).

Aphasia

Aphasia is an impairment of language processing caused by damage to the brain. Different parts of language processing are impacted depending on the area of the brain that is damaged, and aphasia is further classified based on the location of injury or constellation of symptoms. Damage to Broca's area of the brain often results in expressive aphasia which manifests as impairment in speech production. Damage to Wernicke's area often results in receptive aphasia where speech processing is impaired.

Aphasia with impaired speech perception typically shows lesions or damage located in the left temporal or parietal lobes. Lexical and semantic difficulties are common, and comprehension may be affected.

Agnosia

Agnosia is "the loss or diminution of the ability to recognize familiar objects or stimuli usually as a result of brain damage". There are several different kinds of agnosia that affect every one of our senses, but the two most common related to speech are speech agnosia and phonagnosia.

Speech agnosia: Pure word deafness, or speech agnosia, is an impairment in which a person maintains the ability to hear, produce speech, and even read speech, yet they are unable to understand or properly perceive speech. These patients seem to have all of the skills necessary to process speech properly, yet they appear to have no experience associated with speech stimuli. Patients have reported, "I can hear you talking, but I can't translate it". Even though they are physically receiving and processing the stimuli of speech, without the ability to determine the meaning of the speech they are essentially unable to perceive it at all. No treatments are known, but case studies and experiments indicate that speech agnosia is related to lesions in the left hemisphere or in both hemispheres, specifically right temporoparietal dysfunctions.

Phonagnosia: Phonagnosia is associated with the inability to recognize any familiar voices. In these cases, speech stimuli can be heard and even understood but the association of the speech to a certain voice is lost. This can be due to "abnormal processing of complex vocal properties (timbre, articulation, and prosody—elements that distinguish an individual voice". There is no known treatment; however, there is a case report of an epileptic woman who began to experience phonagnosia along with other impairments. Her EEG and MRI results showed "a right cortical parietal T2-hyperintense lesion without gadolinium enhancement and with discrete impairment of water molecule diffusion". So although no treatment has been discovered, phonagnosia can be correlated to postictal parietal cortical dysfunction.

Infant speech perception

Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific, i.e. they learn how to ignore the differences within phonemic categories of the language (differences that may well be contrastive in other languages: for example, English distinguishes two voicing categories of plosives, whereas Thai has three categories; infants must learn which differences are distinctive in their native language, and which are not). As infants learn how to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes categorical. Infants learn to contrast different vowel phonemes of their native language by approximately 6 months of age. The native consonantal contrasts are acquired by 11 or 12 months of age. Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called statistical learning. Others even claim that certain sound categories are innate, that is, they are genetically specified (see discussion about innate vs. acquired categorical distinctiveness).
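The statistical learning proposal mentioned above can be pictured as finding clusters in the distribution of a cue without being told the category labels. The following Python sketch uses a one-dimensional two-cluster k-means procedure as a stand-in; this choice of algorithm is an assumption made for illustration and is not a claim about how infants actually learn. The VOT values are invented.

```python
def two_means(values, iterations=20):
    """Split one-dimensional data into two clusters and return their centers."""
    low, high = min(values), max(values)  # deterministic starting centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_low or not near_high:
            break
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


# Invented (hypothetical) VOT values in ms heard during passive listening.
heard_vot = [0, 5, 8, 10, 12, 15, 50, 55, 60, 62, 65, 70]

center_short, center_long = two_means(heard_vot)
boundary = (center_short + center_long) / 2.0
print(round(center_short, 1), round(center_long, 1), round(boundary, 1))
```

The midpoint between the two learned centers plays the role of a category boundary; a language with three voicing categories would require three clusters instead of two.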

If day-old babies are presented with their mother's voice speaking normally, abnormally (in monotone), and a stranger's voice, they react only to their mother's voice speaking normally. When a human and a non-human sound are played, babies turn their head only to the source of the human sound. It has been suggested that auditory learning begins already in the pre-natal period.

One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby is sucking a special nipple while presented with sounds. First, the baby's normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time the sucking rate increases but as the baby becomes habituated to the stimulation the sucking rate decreases and levels off. Then, a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus the sucking rate will show an increase. The sucking-rate and the head-turn method are some of the more traditional, behavioral methods for studying speech perception. Among the new methods (see Research methods below) that help us to study speech perception, near-infrared spectroscopy is widely used in infants.

It has also been discovered that even though infants' ability to distinguish between the different phonetic properties of various languages begins to decline around the age of nine months, it is possible to reverse this process by giving them sufficient exposure to a new language. In a research study by Patricia K. Kuhl, Feng-Ming Tsao, and Huei-Mei Liu, it was discovered that if infants are spoken to and interacted with by a native speaker of Mandarin Chinese, they can be conditioned to retain their ability to distinguish different speech sounds within Mandarin that are very different from speech sounds found within the English language. This suggests that, given the right conditions, it is possible to prevent infants' loss of the ability to distinguish speech sounds in languages other than their native language.

Cross-language and second-language

A large amount of research has studied how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition.

Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English will have problems with identifying or distinguishing English liquid consonants /l/ and /r/ (see Perception of English /r/ and /l/ by Japanese speakers).

Best (1995) proposed a Perceptual Assimilation Model which describes possible cross-language category assimilation patterns and predicts their consequences. Flege (1995) formulated a Speech Learning Model which combines several hypotheses about second-language (L2) speech acquisition and which predicts, in simple words, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound (because it will be perceived as more obviously "different" by the learner).

In language or hearing impairment

Research in how people with language or hearing impairment perceive speech is not only intended to discover possible treatments. It can provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as an example:

Listeners with aphasia

Aphasia affects both the expression and reception of language. The two most common types, expressive aphasia and receptive aphasia, both affect speech perception to some extent. Expressive aphasia causes moderate difficulties for language understanding. The effect of receptive aphasia on understanding is much more severe. It is agreed that people with aphasia suffer from perceptual deficits. They usually cannot fully distinguish place of articulation and voicing. As for other features, the difficulties vary. It has not yet been established whether low-level speech-perception skills are affected in people with aphasia or whether their difficulties are caused by higher-level impairment alone.

Listeners with cochlear implants

Cochlear implantation restores access to the acoustic signal in individuals with sensorineural hearing loss. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize the speech of people they know even without visual cues. For cochlear implant users, it is more difficult to understand unknown speakers and sounds. The perceptual abilities of children who received an implant after the age of two are significantly better than those of people who were implanted in adulthood. A number of factors have been shown to influence perceptual performance, specifically: duration of deafness prior to implantation, age of onset of deafness, age at implantation (such age effects may be related to the critical period hypothesis) and the duration of using an implant. There are differences between children with congenital and acquired deafness. Postlingually deaf children have better results than the prelingually deaf and adapt to a cochlear implant faster. In both children with cochlear implants and children with normal hearing, sensitivity to vowels and voice onset time develops before the ability to discriminate the place of articulation. Several months following implantation, children with cochlear implants can normalize speech perception.

Noise

One of the fundamental problems in the study of speech is how to deal with noise. This is shown by the difficulty that computer recognition systems have in recognizing human speech. While they can do well at recognizing speech if trained on a specific speaker's voice and under quiet conditions, these systems often do poorly in more realistic listening situations where humans would understand speech with relatively little difficulty. To emulate processing patterns that would be held in the brain under normal conditions, prior knowledge is a key neural factor, since a robust learning history may to an extent override the extreme masking effects involved in the complete absence of continuous speech signals.

Music-language connection

Research into the relationship between music and cognition is an emerging field related to the study of speech perception. Originally it was theorized that the neural signals for music were processed in a specialized "module" in the right hemisphere of the brain. Conversely, the neural signals for language were to be processed by a similar "module" in the left hemisphere. However, utilizing technologies such as fMRI machines, research has shown that two regions of the brain traditionally considered exclusively to process speech, Broca's and Wernicke's areas, also become active during musical activities such as listening to a sequence of musical chords. Other studies, such as one performed by Marques et al. in 2006 showed that 8-year-olds who were given six months of musical training showed an increase in both their pitch detection performance and their electrophysiological measures when made to listen to an unknown foreign language.

Conversely, some research has revealed that, rather than music affecting our perception of speech, our native speech can affect our perception of music. One example is the tritone paradox. In the tritone paradox, a listener is presented with two computer-generated tones (such as C and F-sharp) that are half an octave (or a tritone) apart and is then asked to determine whether the pitch of the sequence is descending or ascending. One such study, performed by Diana Deutsch, found that the listener's interpretation of ascending or descending pitch was influenced by the listener's language or dialect, showing variation between those raised in the south of England and those in California, or between those from Vietnam and those in California whose native language was English. A second study, performed in 2006 on a group of English speakers and three groups of East Asian students at the University of Southern California, discovered that English speakers who had begun musical training at or before age 5 had an 8% chance of having perfect pitch.

Speech phenomenology

The experience of speech

Casey O'Callaghan, in his article Experiencing Speech, analyzes whether "the perceptual experience of listening to speech differs in phenomenal character" with regard to understanding the language being heard. He argues that an individual's experience when hearing a language they comprehend, as opposed to their experience when hearing a language they have no knowledge of, displays a difference in phenomenal features, which he defines as "aspects of what an experience is like" for an individual.

If a subject who is a monolingual native English speaker is presented with a stimulus of speech in German, the string of phonemes will appear as mere sounds and will produce a very different experience than if exactly the same stimulus was presented to a subject who speaks German.

He also examines how speech perception changes when one is learning a language. If a subject with no knowledge of the Japanese language was presented with a stimulus of Japanese speech, and then was given the exact same stimulus after being taught Japanese, this same individual would have an extremely different experience.

Research methods

The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods.

Behavioral methods

Behavioral experiments are based on an active role of a participant, i.e. subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a discrimination test, similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.

Sinewave speech

Speech perception has also been analyzed through sinewave speech, a form of synthetic speech where the human voice is replaced by sine waves that mimic the frequencies and amplitudes present in the original speech. When subjects are first presented with this speech, the sinewave speech is interpreted as random noises. But when the subjects are informed that the stimulus actually is speech and are told what is being said, "a distinctive, nearly immediate shift occurs" in how the sinewave speech is perceived.
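The construction of such a stimulus can be sketched as a sum of a few sinusoids whose frequencies and amplitudes change from frame to frame. The Python sketch below is a minimal illustration under stated assumptions: the function name, the sample rate, the frame length and the frequency tracks are all hypothetical, and in real sinewave speech the tracks would be derived from the formants of a recorded utterance.

```python
import math


def sinewave_speech(frames, sample_rate=8000, frame_ms=10):
    """Synthesize samples from frames of (frequency_hz, amplitude) pairs.

    Each frame holds one pair per sinusoid; every frame must have the same
    number of pairs. Phase is accumulated so that frequency changes between
    frames do not cause jumps in the waveform.
    """
    samples_per_frame = sample_rate * frame_ms // 1000
    tone_count = len(frames[0])
    phases = [0.0] * tone_count
    samples = []
    for frame in frames:
        for _ in range(samples_per_frame):
            value = 0.0
            for index, (frequency, amplitude) in enumerate(frame):
                phases[index] += 2.0 * math.pi * frequency / sample_rate
                value += amplitude * math.sin(phases[index])
            samples.append(value / tone_count)
    return samples


# Invented tracks: three tones, the second one gliding upward over 20 frames.
demo_frames = [
    [(500.0, 1.0), (1000.0 + 50.0 * step, 0.5), (2500.0, 0.25)]
    for step in range(20)
]
signal = sinewave_speech(demo_frames)
print(len(signal), "samples")
```

The resulting list of samples could be written to an audio file for listening; that step is left out to keep the sketch short.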

Computational methods

Computational modeling has also been used to simulate how speech may be processed by the brain to produce behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, and how speech information is used for higher-level processes, such as word recognition.

Neurophysiological methods

Neurophysiological methods rely on utilizing information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, the subject may not show sensitivity to the difference between two speech sounds in a discrimination test, but brain responses may reveal sensitivity to these differences. Methods used to measure neural responses to speech include event-related potentials, magnetoencephalography, and near-infrared spectroscopy. One important response used with event-related potentials is the mismatch negativity, which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
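A mismatch response of this kind is commonly examined as a difference wave: the averaged response to the repeated (standard) stimulus is subtracted from the averaged response to the changed (deviant) stimulus. The Python sketch below shows only that arithmetic. The function names are hypothetical and the voltage values are invented, with each inner list standing for one recorded epoch sampled at the same time points.

```python
def average_epochs(epochs):
    """Average several equally long epochs sample by sample."""
    count = len(epochs)
    return [sum(column) / count for column in zip(*epochs)]


def difference_wave(standard_epochs, deviant_epochs):
    """Return the deviant-minus-standard average response."""
    standard = average_epochs(standard_epochs)
    deviant = average_epochs(deviant_epochs)
    return [d - s for d, s in zip(deviant, standard)]


# Invented voltages (microvolts) at five time points per epoch.
standard_epochs = [[0.0, 1.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.5, 0.0, 0.0]]
deviant_epochs = [[0.0, 1.0, -1.5, -0.5, 0.0], [0.0, 1.0, -2.5, -1.5, 0.0]]

wave = difference_wave(standard_epochs, deviant_epochs)
peak_index = min(range(len(wave)), key=wave.__getitem__)
print(wave, "most negative at sample", peak_index)
```

A negative deflection in the difference wave is what would be read as a mismatch response in this simplified picture; real analyses also involve filtering, artifact rejection and statistical testing.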

Neurophysiological methods were introduced into speech perception research for several reasons:

Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask a speaker's ability to recognize sounds based on lower-level acoustic distributions.

Without the necessity of taking an active part in the test, even infants can be tested; this feature is crucial in research into acquisition processes.

The possibility of observing low-level auditory processes independently from the higher-level ones makes it possible to address long-standing theoretical issues such as whether or not humans possess a specialized module for perceiving speech or whether or not some complex acoustic invariance (see lack of invariance above) underlies the recognition of a speech sound.

Theories

Motor theory

Some of the earliest work in the study of how humans perceive speech sounds was conducted by Alvin Liberman and his colleagues at Haskins Laboratories. Using a speech synthesizer, they constructed speech sounds that varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing were varying continuously. Based on these results, they proposed the notion of categorical perception as a mechanism by which humans can identify speech sounds.

More recent research using different tasks and methods suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.

To provide a theoretical account of the categorical perception data, Liberman and colleagues worked out the motor theory of speech perception, where "the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production" (this is referred to as analysis-by-synthesis). For instance, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (see above), yet all /d/'s as perceived by a listener fall within one category (voiced alveolar plosive) and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments". When describing units of perception, Liberman later abandoned articulatory movements and proceeded to the neural commands to the articulators and even later to intended articulatory gestures, thus "the neural representation of the utterance that determines the speaker's production is the distal object the listener perceives". The theory is closely related to the modularity hypothesis, which proposes the existence of a special-purpose module, which is supposed to be innate and probably human-specific.

The theory has been criticized in terms of not being able to "provide an account of just how acoustic signals are translated into intended gestures" by listeners. Furthermore, it is unclear how indexical information (e.g. talker-identity) is encoded/decoded along with linguistically relevant information.

Exemplar theory

Exemplar models of speech perception differ from the other theories discussed in this section, which suppose that there is no connection between word- and talker-recognition and that the variation across talkers is "noise" to be filtered out.

The exemplar-based approaches claim listeners store information for both word- and talker-recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of e.g. a syllable stored in the listener's memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker's identity is determined. Supporting this theory are several experiments reported by Johnson that suggest that our signal identification is more accurate when we are familiar with the talker or when we have visual representation of the talker's gender. When the talker is unpredictable or the sex misidentified, the error rate in word-identification is much higher.
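A minimal sketch of this idea, assuming that each stored trace is a pair of formant values tagged with a word and a talker, and that similarity decays exponentially with distance: the same memory store is queried once for the word and once for the talker. The formant values, labels and the sensitivity parameter are hypothetical.

```python
import math
from collections import defaultdict

def classify(stimulus, exemplars, sensitivity=1.0):
    """Return the label whose stored exemplars are jointly most similar to the stimulus."""
    support = defaultdict(float)
    for features, label in exemplars:
        distance = math.dist(stimulus, features)
        support[label] += math.exp(-sensitivity * distance)
    return max(support, key=support.get)

# Hypothetical memory traces: (F1, F2) in kHz, tagged with (word, talker).
memory = [
    ((0.30, 2.30), ("heed", "talker_A")),
    ((0.32, 2.70), ("heed", "talker_B")),
    ((0.70, 1.20), ("hod", "talker_A")),
    ((0.85, 1.35), ("hod", "talker_B")),
]

word_traces = [(features, word) for features, (word, talker) in memory]
talker_traces = [(features, talker) for features, (word, talker) in memory]

stimulus = (0.31, 2.65)
word = classify(stimulus, word_traces)
talker = classify(stimulus, talker_traces)
print(word, talker)
```

Because every trace keeps its talker tag, no separate normalization step is needed in this sketch: word and talker decisions both fall out of the same similarity computation.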

The exemplar models have to face several objections, two of which are (1) insufficient memory capacity to store every utterance ever heard and, concerning the ability to produce what was heard, (2) whether the talker's own articulatory gestures are also stored or computed when producing utterances that would sound like the auditory memories.

Acoustic landmarks and distinctive features

Kenneth N. Stevens proposed acoustic landmarks and distinctive features as a relation between phonological features and auditory properties. According to this view, listeners are inspecting the incoming signal for the so-called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them. Since these gestures are limited by the capacities of humans' articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. Bundles of them uniquely specify phonetic segments (phonemes, syllables, words).

In this model, the incoming acoustic signal is believed to be first processed to determine the so-called landmarks which are special spectral events in the signal; for example, vowels are typically marked by higher frequency of the first formant, consonants can be specified as discontinuities in the signal and have lower amplitudes in lower and middle regions of the spectrum. These acoustic features result from articulation. In fact, secondary articulatory movements may be used when enhancement of the landmarks is needed due to external conditions such as noise. Stevens claims that coarticulation causes only limited and moreover systematic and thus predictable variation in the signal which the listener is able to deal with. Within this model therefore, what is called the lack of invariance is simply claimed not to exist.

Landmarks are analyzed to determine certain articulatory events (gestures) which are connected with them. In the next stage, acoustic cues are extracted from the signal in the vicinity of the landmarks by means of mental measuring of certain parameters such as frequencies of spectral peaks, amplitudes in low-frequency region, or timing.

The next processing stage comprises acoustic-cues consolidation and derivation of distinctive features. These are binary categories related to articulation (for example [+/- high], [+/- back], [+/- round lips] for vowels; [+/- sonorant], [+/- lateral], or [+/- nasal] for consonants).

Bundles of these features uniquely identify speech segments (phonemes, syllables, words). These segments are part of the lexicon stored in the listener's memory. Its units are activated in the process of lexical access and mapped on the original signal to find out whether they match. If not, another attempt with a different candidate pattern is made. In this iterative fashion, listeners thus reconstruct the articulatory events which were necessary to produce the perceived speech signal. This can be therefore described as analysis-by-synthesis.
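The notion that a bundle of binary features picks out a segment can be sketched as a simple lookup. The small vowel inventory below and the function name are illustrative assumptions, not part of the model as published; True stands for [+feature] and False for [-feature].

```python
# Hypothetical bundles for four vowels.
VOWELS = {
    "i": {"high": True, "back": False, "round": False},
    "y": {"high": True, "back": False, "round": True},
    "u": {"high": True, "back": True, "round": True},
    "ɑ": {"high": False, "back": True, "round": False},
}

def segments_matching(observed):
    """Return all segments consistent with the (possibly partial) observed feature values."""
    return [
        segment
        for segment, bundle in VOWELS.items()
        if all(bundle[name] == value for name, value in observed.items())
    ]

partial = segments_matching({"high": True, "back": False})
full = segments_matching({"high": True, "back": False, "round": True})
print(partial, full)
```

With only two features specified, two candidates remain; adding the third feature narrows the set to a single segment, which mirrors the claim that a complete bundle identifies a segment uniquely.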

This theory thus posits that the distal objects of speech perception are the articulatory gestures underlying speech. Listeners make sense of the speech signal by referring to them. The model belongs to those referred to as analysis-by-synthesis.

Fuzzy-logical model

The fuzzy logical theory of speech perception developed by Dominic Massaro proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes. Within each prototype various features may combine. However, features are not just binary (true or false), there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and values of particular prototypes. The final decision is based on multiple features or sources of information, even visual information (this explains the McGurk effect). Computer models of the fuzzy logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.
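The "relative goodness of match" idea can be sketched for the simplest case of two response alternatives and two sources of information. The sketch assumes that each source provides a fuzzy support value between 0 and 1 for alternative A, that support for alternative B is the complement, and that sources are combined by multiplication; the function name and the example values are hypothetical.

```python
def flmp_probability(auditory_support, visual_support):
    """Probability of choosing alternative A from fuzzy support values in [0, 1].

    Support for the competing alternative B is taken as the complement.
    The result is undefined when the two sources fully contradict each other
    (one value is 0 and the other is 1), because both matches are then zero.
    """
    match_a = auditory_support * visual_support
    match_b = (1.0 - auditory_support) * (1.0 - visual_support)
    return match_a / (match_a + match_b)

ambiguous_audio_clear_video = flmp_probability(0.5, 0.9)
both_fairly_clear = flmp_probability(0.8, 0.8)
print(ambiguous_audio_clear_video, both_fairly_clear)
```

In this sketch an ambiguous auditory cue lets the visual cue decide the outcome, and two moderately informative cues together yield a more confident decision than either would alone.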

Speech mode hypothesis

The speech mode hypothesis is the idea that the perception of speech requires the use of specialized mental processing. The speech mode hypothesis is an offshoot of Fodor's modularity theory (see modularity of mind). It utilizes a vertical processing mechanism where limited stimuli are processed by special-purpose areas of the brain that are stimulus-specific.

Two versions of speech mode hypothesis:

Weak version – listening to speech engages previous knowledge of language.
Strong version – listening to speech engages specialized speech mechanisms for perceiving speech.

Three important experimental paradigms have evolved in the search to find evidence for the speech mode hypothesis. These are dichotic listening, categorical perception, and duplex perception. Through the research in these categories it has been found that there may not be a specific speech mode but instead one for auditory codes that require complicated auditory processing. Also it seems that modularity is learned in perceptual systems. Despite this the evidence and counter-evidence for the speech mode hypothesis is still unclear and needs further research.

Direct realist theory

The direct realist theory of speech perception (mostly associated with Carol Fowler) is a part of the more general theory of direct realism, which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. For speech perception, the theory asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the Motor Theory) but because information in the acoustic signal specifies the gestures that form it. By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of lack of invariance.

Neurocomputational speech processing

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Neurocomputational_speech_processing

Neurocomputational speech processing is computer-simulation of speech production and speech perception by referring to the natural neuronal processes of speech production and speech perception, as they occur in the human nervous system (central nervous system and peripheral nervous system). This topic is based on neuroscience and computational neuroscience.

Overview

Neurocomputational models of speech processing are complex. They comprise at least a cognitive part, a motor part and a sensory part.

The cognitive or linguistic part of a neurocomputational model of speech processing comprises the neural activation or generation of a phonemic representation on the side of speech production (e.g. the neurocomputational and extended version of the Levelt model developed by Ardi Roelofs: WEAVER++) as well as the neural activation or generation of an intention or meaning on the side of speech perception or speech comprehension.

The motor part of a neurocomputational model of speech processing starts with a phonemic representation of a speech item, activates a motor plan and ends with the articulation of that particular speech item (see also: articulatory phonetics).

The sensory part of a neurocomputational model of speech processing starts with an acoustic signal of a speech item (acoustic speech signal), generates an auditory representation for that signal and activates a phonemic representation for that speech item.

Neurocomputational speech processing topics

Neurocomputational speech processing is speech processing by artificial neural networks. Neural maps, mappings and pathways as described below, are model structures, i.e. important structures within artificial neural networks.

Neural maps

Fig. 1: 2D neuronal map with a local activation pattern. magenta: neuron with highest degree of activation; blue: neurons with no activation

An artificial neural network can be separated into three types of neural maps, also called "layers":

input maps (in the case of speech processing: primary auditory map within the auditory cortex, primary somatosensory map within the somatosensory cortex),
output maps (primary motor map within the primary motor cortex), and
higher level cortical maps (also called "hidden layers").

The term "neural map" is favoured here over the term "neural layer", because a cortical neural map should be modeled as a 2D-map of interconnected neurons (e.g. like a self-organizing map; see also Fig. 1). Thus, each "model neuron" or "artificial neuron" within this 2D-map is physiologically represented by a cortical column since the cerebral cortex anatomically exhibits a layered structure.

Neural representations (neural states)

A neural representation within an artificial neural network is a temporarily activated (neural) state within a specific neural map. Each neural state is represented by a specific neural activation pattern. This activation pattern changes during speech processing (e.g. from syllable to syllable).

Fig. 2: 2D neuronal map with a distributed activation pattern. Example: "neural spectrogram" (This auditory neural representation is speculative; see ACT model, below)

In the ACT model (see below), it is assumed that an auditory state can be represented by a "neural spectrogram" (see Fig. 2) within an auditory state map. This auditory state map is assumed to be located in the auditory association cortex.

A somatosensory state can be divided into a tactile and a proprioceptive state and can be represented by a specific neural activation pattern within the somatosensory state map. This state map is assumed to be located in the somatosensory association cortex.

A motor plan state can be assumed for representing a motor plan, i.e. the planning of speech articulation for a specific syllable or for a longer speech item (e.g. word, short phrase). This state map is assumed to be located in the premotor cortex, while the instantaneous (or lower level) activation of each speech articulator occurs within the primary motor cortex (see motor cortex).

The neural representations occurring in the sensory and motor maps (as introduced above) are distributed representations (Hinton et al. 1968): Each neuron within the sensory or motor map is more or less activated, leading to a specific activation pattern.

The neural representation for speech units occurring in the speech sound map (see below: DIVA model) is a punctual or local representation. Each speech item or speech unit is represented here by a specific neuron (model cell, see below).

Neural mappings (synaptic projections)

Fig. 3: Neural mapping between phonetic map (local activation pattern for a specific phonetic state), motor plan state map (distributed activation pattern) and auditory state map (distributed activation pattern) as part of the ACT model. Only neural connections with the winner neuron within the phonetic map are shown

A neural mapping connects two cortical neural maps. Neural mappings (in contrast to neural pathways) store training information by adjusting their neural link weights. Neural mappings are capable of generating or activating a distributed representation (see above) of a sensory or motor state within a sensory or motor map from a punctual or local activation within the other map (see for example the synaptic projection from speech sound map to motor map, to auditory target region map, or to somatosensory target region map in the DIVA model, explained below; or see for example the neural mapping from phonetic map to auditory state map and motor plan state map in the ACT model, explained below and Fig. 3).

Neural mappings between two neural maps are compact or dense: Each neuron of one neural map is interconnected with (nearly) each neuron of the other neural map. Because of this density criterion for neural mappings, neural maps which are interconnected by a neural mapping are not far apart from each other.
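How a dense mapping turns a local activation into a distributed one can be sketched with a small weight matrix. The map sizes, link weights and function name below are hypothetical; the only property taken from the text is that every neuron of the source map is linked to every neuron of the target map.

```python
def project(local_activation, weights):
    """Propagate activation through a dense mapping (every source neuron links to every target neuron)."""
    n_targets = len(weights[0])
    return [
        sum(local_activation[i] * weights[i][j] for i in range(len(weights)))
        for j in range(n_targets)
    ]

# Hypothetical link weights from a 3-neuron local map to a 4-neuron state map.
weights = [
    [0.9, 0.1, 0.0, 0.4],
    [0.0, 0.8, 0.7, 0.1],
    [0.3, 0.3, 0.3, 0.3],
]

winner_only = [0.0, 1.0, 0.0]  # local representation: a single active neuron
pattern = project(winner_only, weights)
print(pattern)
```

With a single winner neuron active, the resulting pattern is simply that neuron's row of link weights, so the distributed state associated with a speech item is stored in the weights themselves.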

Neural pathways

In contrast to neural mappings, neural pathways can connect neural maps which are far apart (e.g. in different cortical lobes, see cerebral cortex). From the functional or modeling viewpoint, neural pathways mainly forward information without processing this information. A neural pathway, in comparison to a neural mapping, needs far fewer neural connections. A neural pathway can be modelled by using a one-to-one connection of the neurons of both neural maps.

Example: In the case of two neural maps, each comprising 1,000 model neurons, a neural mapping needs up to 1,000,000 neural connections (many-to-many-connection), while only 1,000 connections are needed in the case of a neural pathway connection.

Furthermore, the link weights of the connections within a neural mapping are adjusted during training, while the neural connections in the case of a neural pathway need not be trained (each connection is maximally excitatory).
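The difference in connection count and in function can be made concrete with a short sketch. The function names are hypothetical; the numbers reproduce the 1,000-neuron example given above.

```python
def mapping_connection_count(n_source, n_target):
    """Many-to-many: one trainable link per (source, target) pair."""
    return n_source * n_target

def pathway_connection_count(n_neurons):
    """One-to-one: one fixed link per pair of corresponding neurons."""
    return n_neurons

def forward_through_pathway(activation):
    """A pathway forwards the activation pattern unchanged (fixed link weights, no training)."""
    return list(activation)

print(mapping_connection_count(1000, 1000))  # 1000000
print(pathway_connection_count(1000))        # 1000
print(forward_through_pathway([0.2, 0.9, 0.0]))
```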

DIVA model

The leading approach in neurocomputational modeling of speech production is the DIVA model developed by Frank H. Guenther and his group at Boston University. The model accounts for a wide range of phonetic and neuroimaging data but - like each neurocomputational model - remains speculative to some extent.

Structure of the model

Fig. 4: Organization of the DIVA model; This figure is an adaptation following Guenther et al. 2006

The organization or structure of the DIVA model is shown in Fig. 4.

Speech sound map: the phonemic representation as a starting point

The speech sound map - assumed to be located in the inferior and posterior portion of Broca's area (left frontal operculum) - represents (phonologically specified) language-specific speech units (sounds, syllables, words, short phrases). Each speech unit (mainly syllables; e.g. the syllable and word "palm" /pam/, the syllables /pa/, /ta/, /ka/, ...) is represented by a specific model cell within the speech sound map (i.e. punctual neural representations, see above). Each model cell (see artificial neuron) corresponds to a small population of neurons which are located at close range and which fire together.

Feedforward control: activating motor representations

Each neuron (model cell, artificial neuron) within the speech sound map can be activated and subsequently activates a forward motor command towards the motor map, called articulatory velocity and position map. The activated neural representation on the level of that motor map determines the articulation of a speech unit, i.e. controls all articulators (lips, tongue, velum, glottis) during the time interval for producing that speech unit. Forward control also involves subcortical structures like the cerebellum, not modelled in detail here.

A speech unit represents a set of speech items which can be assigned to the same phonemic category. Thus, each speech unit is represented by one specific neuron within the speech sound map, while the realization of a speech unit may exhibit some articulatory and acoustic variability. This phonetic variability is the motivation to define sensory target regions in the DIVA model (see Guenther et al. 1998).

Articulatory model: generating somatosensory and auditory feedback information

The activation pattern within the motor map determines the movement pattern of all model articulators (lips, tongue, velum, glottis) for a speech item. In order not to overload the model, no detailed modeling of the neuromuscular system is done. The Maeda articulatory speech synthesizer is used in order to generate articulator movements, which allows the generation of a time-varying vocal tract form and the generation of the acoustic speech signal for each particular speech item.

In terms of artificial intelligence the articulatory model can be called the plant (i.e. the system which is controlled by the brain); it represents a part of the embodiment of the neuronal speech processing system. The articulatory model generates sensory output which is the basis for generating feedback information for the DIVA model (see below: feedback control).

Feedback control: sensory target regions, state maps, and error maps

On the one hand the articulatory model generates sensory information, i.e. an auditory state for each speech unit which is neurally represented within the auditory state map (distributed representation), and a somatosensory state for each speech unit which is neurally represented within the somatosensory state map (distributed representation as well). The auditory state map is assumed to be located in the superior temporal cortex while the somatosensory state map is assumed to be located in the inferior parietal cortex.

On the other hand, the speech sound map, if activated for a specific speech unit (single neuron activation; punctual activation), activates sensory information by synaptic projections between speech sound map and auditory target region map and between speech sound map and somatosensory target region map. Auditory and somatosensory target regions are assumed to be located in higher-order auditory cortical regions and in higher-order somatosensory cortical regions respectively. These target region sensory activation patterns - which exist for each speech unit - are learned during speech acquisition (by imitation training; see below: learning).

Consequently, two types of sensory information are available if a speech unit is activated at the level of the speech sound map: (i) learned sensory target regions (i.e. intended sensory state for a speech unit) and (ii) sensory state activation patterns resulting from a possibly imperfect execution (articulation) of a specific speech unit (i.e. current sensory state, reflecting the current production and articulation of that particular speech unit). Both types of sensory information are projected to sensory error maps, i.e. to an auditory error map which is assumed to be located in the superior temporal cortex (like the auditory state map) and to a somatosensory error map which is assumed to be located in the inferior parietal cortex (like the somatosensory state map) (see Fig. 4).

If the current sensory state deviates from the intended sensory state, both error maps generate feedback commands which are projected towards the motor map and which are capable of correcting the motor activation pattern and subsequently the articulation of a speech unit under production. Thus, in total, the activation pattern of the motor map is not only influenced by a specific feedforward command learned for a speech unit (and generated by the synaptic projection from the speech sound map) but also by a feedback command generated at the level of the sensory error maps (see Fig. 4).
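The interplay of feedforward and feedback commands can be sketched in one dimension. This is a strongly simplified illustration rather than the DIVA implementation: the sensory state is a single number, the target region is an interval, and the feedback gain and all names are hypothetical.

```python
def region_error(current, target_region):
    """Signed distance from the current state to the nearest edge of the target region (0.0 inside)."""
    low, high = target_region
    if current < low:
        return low - current
    if current > high:
        return high - current
    return 0.0

def motor_command(feedforward, current_state, target_region, feedback_gain=0.5):
    """Total command = learned feedforward command + feedback correction derived from the error."""
    return feedforward + feedback_gain * region_error(current_state, target_region)

target = (650.0, 750.0)  # hypothetical auditory target region (arbitrary units)

print(motor_command(1.0, 700.0, target))  # inside the region: feedforward command only
print(motor_command(1.0, 600.0, target))  # below the region: command is raised
print(motor_command(1.0, 800.0, target))  # above the region: command is lowered
```

Treating the target as a region rather than a point means that any production falling inside the interval produces no error and therefore no correction, which is one way to accommodate the phonetic variability mentioned above.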

Learning (modeling speech acquisition)

While the structure of a neuroscientific model of speech processing (given in Fig. 4 for the DIVA model) is mainly determined by evolutionary processes, the (language-specific) knowledge as well as the (language-specific) speaking skills are learned and trained during speech acquisition. In the case of the DIVA model it is assumed that the newborn does not yet have an already structured (language-specific) speech sound map; i.e. no neuron within the speech sound map is related to any speech unit. Rather the organization of the speech sound map as well as the tuning of the projections to the motor map and to the sensory target region maps is learned or trained during speech acquisition. Two important phases of early speech acquisition are modeled in the DIVA approach: Learning by babbling and by imitation.

Babbling

During babbling the synaptic projections between sensory error maps and motor map are tuned. This training is done by generating a number of semi-random feedforward commands, i.e. the DIVA model "babbles". Each of these babbling commands leads to the production of an "articulatory item", also labeled as "pre-linguistic (i.e. non language-specific) speech item" (i.e. the articulatory model generates an articulatory movement pattern on the basis of the babbling motor command). Subsequently, an acoustic signal is generated.

On the basis of the articulatory and acoustic signal, a specific auditory and somatosensory state pattern is activated at the level of the sensory state maps (see Fig. 4) for each (pre-linguistic) speech item. At this point the DIVA model has available the sensory and associated motor activation pattern for different speech items, which enables the model to tune the synaptic projections between sensory error maps and motor map. Thus, during babbling the DIVA model learns feedback commands (i.e. how to produce a proper (feedback) motor command for a specific sensory input).
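The logic of babbling can be sketched with a toy one-dimensional plant. The sketch assumes a linear articulatory model whose gain is unknown to the learner, and it fits the sensory-to-motor relation by least squares from random motor commands. The plant, the fitting method and all names are illustrative assumptions and do not reproduce how DIVA tunes its projections.

```python
import random

def plant(motor_command):
    """Stand-in for the articulatory model: maps a motor value to a sensory value."""
    return 2.0 * motor_command

def babble(n_items, rng):
    """Issue semi-random motor commands and record (sensory state, motor command) pairs."""
    commands = [rng.uniform(-1.0, 1.0) for _ in range(n_items)]
    return [(plant(m), m) for m in commands]

def fit_inverse_gain(pairs):
    """Least-squares gain g such that motor is approximately g * sensory."""
    numerator = sum(s * m for s, m in pairs)
    denominator = sum(s * s for s, m in pairs)
    return numerator / denominator

pairs = babble(50, random.Random(0))
gain = fit_inverse_gain(pairs)
print(gain)
```

After babbling, the learner in this sketch holds a gain that converts a sensory difference into the motor change needed to remove it, which is the role the text assigns to the tuned projections between the sensory error maps and the motor map.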

Imitation

During imitation the DIVA model organizes its speech sound map and tunes the synaptic projections between speech sound map and motor map - i.e. tuning of forward motor commands - as well as the synaptic projections between speech sound map and sensory target regions (see Fig. 4). Imitation training is done by exposing the model to a number of acoustic speech signals representing realizations of language-specific speech units (e.g. isolated speech sounds, syllables, words, short phrases).

The tuning of the synaptic projections between speech sound map and auditory target region map is accomplished by assigning one neuron of the speech sound map to the phonemic representation of that speech item and by associating it with the auditory representation of that speech item, which is activated at the auditory target region map. Auditory regions (i.e. a specification of the auditory variability of a speech unit) occur, because one specific speech unit (i.e. one specific phonemic representation) can be realized by several (slightly) different acoustic (auditory) realizations (for the difference between speech item and speech unit see above: feedforward control).

The tuning of the synaptic projections between speech sound map and motor map (i.e. tuning of forward motor commands) is accomplished with the aid of feedback commands, since the projections between sensory error maps and motor map were already tuned during babbling training (see above). Thus the DIVA model tries to "imitate" an auditory speech item by attempting to find a proper feedforward motor command. Subsequently, the model compares the resulting sensory output (current sensory state following the articulation of that attempt) with the already learned auditory target region (intended sensory state) for that speech item. Then the model updates the current feedforward motor command by the current feedback motor command generated from the auditory error map of the auditory feedback system. This process may be repeated several times (several attempts). The DIVA model is capable of producing the speech item with a decreasing auditory difference between current and intended auditory state from attempt to attempt.
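The attempt-by-attempt refinement described above can be sketched as a simple loop. The example assumes a one-dimensional toy plant and a deliberately imperfect sensory-to-motor gain, so that several attempts are needed; all names and values are hypothetical.

```python
def imitate(target, plant, sensory_to_motor_gain, attempts=5):
    """Refine a feedforward command by adding the feedback command after each attempt."""
    feedforward = 0.0
    errors = []
    for _ in range(attempts):
        produced = plant(feedforward)        # current sensory state
        error = target - produced            # intended minus current state
        errors.append(abs(error))
        feedforward += sensory_to_motor_gain * error  # update by the feedback command
    return feedforward, errors

def toy_plant(motor_command):
    """Stand-in for the articulatory model."""
    return 2.0 * motor_command

command, errors = imitate(target=1.0, plant=toy_plant, sensory_to_motor_gain=0.3)
print(command, errors)
```

In this toy setting the auditory error shrinks by a constant factor on every attempt, which illustrates the decreasing difference between current and intended auditory state from attempt to attempt.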

During imitation the DIVA model is also capable of tuning the synaptic projections from speech sound map to somatosensory target region map, since each new imitation attempt produces a new articulation of the speech item and thus produces a somatosensory state pattern which is associated with the phonemic representation of that speech item.

Perturbation experiments

Real-time perturbation of F1: the influence of auditory feedback

While auditory feedback is most important during speech acquisition, it may be activated less if the model has learned a proper feedforward motor command for each speech unit. But it has been shown that auditory feedback needs to be strongly coactivated in the case of auditory perturbation (e.g. shifting a formant frequency, Tourville et al. 2005). This is comparable to the strong influence of visual feedback on reaching movements during visual perturbation (e.g. shifting the location of objects by viewing through a prism).

Unexpected blocking of the jaw: the influence of somatosensory feedback

Like auditory feedback, somatosensory feedback can also be strongly coactivated during speech production, e.g. in the case of unexpected blocking of the jaw (Tourville et al. 2005).

ACT model

A further approach in neurocomputational modeling of speech processing is the ACT model developed by Bernd J. Kröger and his group at RWTH Aachen University, Germany (Kröger et al. 2014, Kröger et al. 2009, Kröger et al. 2011). The ACT model is in accord with the DIVA model in large parts. The ACT model focuses on the "action repository" (i.e. repository for sensorimotor speaking skills, comparable to the mental syllabary, see Levelt and Wheeldon 1994), which is not spelled out in detail in the DIVA model. Moreover, the ACT model explicitly introduces a level of motor plans, i.e. a high-level motor description for the production of speech items (see motor goals, motor cortex). The ACT model - like any neurocomputational model - remains speculative to some extent.

Structure

Fig. 5: Organization of the ACT model

The organization or structure of the ACT model is given in Fig. 5.

For speech production, the ACT model starts with the activation of a phonemic representation of a speech item (phonemic map). In the case of a frequent syllable, a co-activation occurs at the level of the phonetic map, leading to a further co-activation of the intended sensory state at the level of the sensory state maps and to a co-activation of a motor plan state at the level of the motor plan map. In the case of an infrequent syllable, an attempt for a motor plan is generated by the motor planning module for that speech item by activating motor plans for phonetically similar speech items via the phonetic map (see Kröger et al. 2011). The motor plan or vocal tract action score comprises temporally overlapping vocal tract actions, which are programmed and subsequently executed by the motor programming, execution, and control module. This module gets real-time somatosensory feedback information for controlling the correct execution of the (intended) motor plan. Motor programming leads to an activation pattern at the level of the primary motor map and subsequently activates neuromuscular processing. Motoneuron activation patterns generate muscle forces and subsequently movement patterns of all model articulators (lips, tongue, velum, glottis). The Birkholz 3D articulatory synthesizer is used in order to generate the acoustic speech signal.

Articulatory and acoustic feedback signals are used for generating somatosensory and auditory feedback information via the sensory preprocessing modules, which is forwarded towards the auditory and somatosensory map. At the level of the sensory-phonetic processing modules, auditory and somatosensory information is stored in short-term memory and the external sensory signals (ES, Fig. 5, which are activated via the sensory feedback loop) can be compared with the already trained sensory signals (TS, Fig. 5, which are activated via the phonetic map). Auditory and somatosensory error signals can be generated if external and intended (trained) sensory signals are noticeably different (cf. DIVA model).

The light green area in Fig. 5 indicates those neural maps and processing modules, which process a syllable as a whole unit (specific processing time window around 100 ms and more). This processing comprises the phonetic map and the directly connected sensory state maps within the sensory-phonetic processing modules and the directly connected motor plan state map, while the primary motor map as well as the (primary) auditory and (primary) somatosensory map process smaller time windows (around 10 ms in the ACT model).

Fig. 6: Hypothetical location of brain regions for neural maps of the ACT model

The hypothetical cortical location of neural maps within the ACT model is shown in Fig. 6. The hypothetical locations of primary motor and primary sensory maps are given in magenta, the hypothetical locations of motor plan state map and sensory state maps (within sensory-phonetic processing module, comparable to the error maps in DIVA) are given in orange, and the hypothetical locations for the mirrored phonetic map are given in red. Double arrows indicate neuronal mappings. Neural mappings connect neural maps, which are not far apart from each other (see above). The two mirrored locations of the phonetic map are connected via a neural pathway (see above), leading to a (simple) one-to-one mirroring of the current activation pattern for both realizations of the phonetic map. This neural pathway between the two locations of the phonetic map is assumed to be a part of the fasciculus arcuatus (AF, see Fig. 5 and Fig. 6).

For speech perception, the model starts with an external acoustic signal (e.g. produced by an external speaker). This signal is preprocessed, passes the auditory map, and leads to an activation pattern for each syllable or word on the level of the auditory-phonetic processing module (ES: external signal, see Fig. 5). The ventral path of speech perception (see Hickok and Poeppel 2007) would directly activate a lexical item, but is not implemented in ACT. Rather, in ACT the activation of a phonemic state occurs via the phonemic map and thus may lead to a coactivation of motor representations for that speech item (i.e. dorsal pathway of speech perception; ibid.).

Action repository

Fig. 7: Visualization of synaptic link weights for a section of the phonetic map, trained for the 200 most frequent syllables of Standard German. Each box represents a neuron within the self-organizing phonetic map. Each of the three link weight representations refers to the same section within the phonetic map and thus refers to the same 10×10 neurons

The phonetic map together with the motor plan state map, sensory state maps (occurring within the sensory-phonetic processing modules), and phonemic (state) map form the action repository. The phonetic map is implemented in ACT as a self-organizing neural map and different speech items are represented by different neurons within this map (punctual or local representation, see above: neural representations). The phonetic map exhibits three major characteristics:

More than one phonetic realization may occur within the phonetic map for one phonemic state (see phonemic link weights in Fig. 7: e.g. the syllable /de:m/ is represented by three neurons within the phonetic map)
Phonetotopy: The phonetic map exhibits an ordering of speech items with respect to different phonetic features (see phonemic link weights in Fig. 7). Three examples: (i) the syllables /p@/, /t@/, and /k@/ occur in an upward ordering at the left side within the phonetic map; (ii) syllable-initial plosives occur in the upper left part of the phonetic map while syllable-initial fricatives occur in the lower right half; (iii) CV syllables and CVC syllables occur in different areas of the phonetic map.
The phonetic map is hypermodal or multimodal: The activation of a phonetic item at the level of the phonetic map coactivates (i) a phonemic state (see phonemic link weights in Fig. 7), (ii) a motor plan state (see motor plan link weights in Fig. 7), (iii) an auditory state (see auditory link weights in Fig. 7), and (iv) a somatosensory state (not shown in Fig. 7). All these states are learned or trained during speech acquisition by tuning the synaptic link weights between each neuron within the phonetic map, which represents a particular phonetic state, and all neurons within the associated motor plan and sensory state maps (see also Fig. 3).
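How a self-organizing map comes to represent different items by different neurons can be sketched with a generic Kohonen-style training loop. This is a textbook illustration, not the ACT training procedure: the grid size, learning rate, neighbourhood width, decay schedule, and the three-dimensional feature vectors below are all hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid index (row, col) of the neuron whose weights are closest to x."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(data, rows=4, cols=4, epochs=50, lr=0.5, sigma=1.5, seed=0):
    """Train a small map; learning rate and neighbourhood shrink over time."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs      # falls linearly from 1.0 towards 0
        s = max(sigma * decay, 0.1)       # neighbourhood width on the grid
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * s * s))
                    step = lr * decay * h
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += step * (x[i] - w[i])
    return weights

# Hypothetical feature vectors (invented values, not taken from the model).
items = {"pa": [0.0, 0.0, 1.0], "ta": [0.5, 0.0, 1.0], "sa": [0.5, 1.0, 1.0]}
som = train_som(list(items.values()))
for name, vec in items.items():
    print(name, best_matching_unit(som, vec))
```

Because neighbouring neurons are updated together, items with similar feature vectors tend to end up in nearby map regions, which is the kind of ordering the text calls phonetotopy.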

The phonetic map implements the action-perception-link within the ACT model (see also Fig. 5 and Fig. 6: the dual neural representation of the phonetic map in the frontal lobe and at the intersection of temporal lobe and parietal lobe).

Motor plans

A motor plan is a high-level motor description for the production and articulation of a speech item (see motor goals, motor skills, articulatory phonetics, articulatory phonology). In the neurocomputational model ACT, a motor plan is quantified as a vocal tract action score. Vocal tract action scores quantitatively determine the number of vocal tract actions (also called articulatory gestures) that need to be activated in order to produce a speech item, their degree of realization and duration, and the temporal organization of all vocal tract actions building up a speech item (for a detailed description of vocal tract action scores see e.g. Kröger & Birkholz 2007). The detailed realization of each vocal tract action (articulatory gesture) depends on the temporal organization of all vocal tract actions building up a speech item, and especially on their temporal overlap. Thus the detailed realization of each vocal tract action within a speech item is specified below the motor plan level in ACT (see Kröger et al. 2011).
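The ingredients of a vocal tract action score (which actions, their degree of realization, duration, and temporal organization) can be pictured as a simple data structure. The sketch below is an assumption about one possible encoding, not the format used in ACT; the class name, field names, action labels, and all timing values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class VocalTractAction:
    name: str           # illustrative label for the articulatory gesture
    onset_ms: float     # start time within the speech item
    duration_ms: float  # how long the action is active
    degree: float       # degree of realization, 0.0 to 1.0

    @property
    def offset_ms(self):
        return self.onset_ms + self.duration_ms

def overlap_ms(a, b):
    """Temporal overlap of two actions in milliseconds (0.0 if disjoint)."""
    return max(0.0, min(a.offset_ms, b.offset_ms) - max(a.onset_ms, b.onset_ms))

def score_duration_ms(score):
    """Total time span covered by all actions in the score."""
    return max(a.offset_ms for a in score) - min(a.onset_ms for a in score)

# Hypothetical score for a CV syllable; every number here is made up.
score = [
    VocalTractAction("labial closure", 0.0, 80.0, 1.0),
    VocalTractAction("glottal voicing", 0.0, 250.0, 1.0),
    VocalTractAction("vocalic tongue lowering", 50.0, 200.0, 0.8),
]
print(overlap_ms(score[0], score[2]), score_duration_ms(score))
```

The overlap between the consonantal and vocalic actions is the quantity on which, according to the text, the detailed realization of each action depends.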

Integrating sensorimotor and cognitive aspects: the coupling of action repository and mental lexicon

A severe problem of phonetic or sensorimotor models of speech processing (like DIVA or ACT) is that the development of the phonemic map during speech acquisition is not modeled. A possible solution to this problem could be a direct coupling of action repository and mental lexicon without explicitly introducing a phonemic map at the beginning of speech acquisition (even at the beginning of imitation training; see Kröger et al. 2011, PALADYN Journal of Behavioral Robotics).

Experiments: speech acquisition

A very important issue for all neuroscientific or neurocomputational approaches is to separate structure and knowledge. While the structure of the model (i.e. of the human neuronal network, which is needed for processing speech) is mainly determined by evolutionary processes, the knowledge is gathered mainly during speech acquisition by processes of learning. Different learning experiments were carried out with the ACT model in order to learn (i) a five-vowel system /i, e, a, o, u/ (see Kröger et al. 2009), (ii) a small consonant system (voiced plosives /b, d, g/ in combination with all five vowels acquired earlier, as CV syllables) (ibid.), (iii) a small model language comprising the five-vowel system, voiced and unvoiced plosives /b, d, g, p, t, k/, nasals /m, n/ and the lateral /l/ and three syllable types (V, CV, and CCV) (see Kröger et al. 2011) and (iv) the 200 most frequent syllables of Standard German for a 6-year-old child (see Kröger et al. 2011). In all cases, an ordering of phonetic items with respect to different phonetic features can be observed.

Experiments: speech perception

Although the ACT model in its earlier versions was designed as a pure speech production model (including speech acquisition), it is capable of exhibiting important basic phenomena of speech perception, namely categorical perception and the McGurk effect. In the case of categorical perception, the model shows that categorical perception is stronger for plosives than for vowels (see Kröger et al. 2009). Furthermore, the ACT model was able to exhibit the McGurk effect when a specific mechanism for inhibiting neurons at the level of the phonetic map was implemented (see Kröger and Kannampuzha 2008).

Chinese speech synthesis

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Chinese_speech_synthesis

Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties due to Chinese characters (which frequently have different pronunciations in different contexts), the complex prosody, which is essential to convey the meaning of words, and sometimes the difficulty of obtaining agreement among native speakers on the correct pronunciation of certain phonemes.

Concatenation (Ekho and KeyTip)

Recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; these synthesizers are also inflexible in terms of speed and expression. However, because these synthesizers do not rely on a corpus, there is no noticeable degradation in performance when they are given more unusual or awkward phrases.

Ekho is an open-source TTS which simply concatenates sampled syllables. It currently supports Cantonese, Mandarin, and experimentally Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials".
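The basic idea of syllable concatenation can be shown in a few lines. This is a generic sketch and not Ekho's code: the inventory keys, the tiny sample lists, and the function names are hypothetical, and the optional crossfade at each join is an illustrative addition rather than something the text attributes to either synthesizer.

```python
def crossfade_concat(units, fade=0):
    """Join lists of samples; optionally blend `fade` samples at each join."""
    out = []
    for unit in units:
        n = min(fade, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)         # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

def synthesize(syllables, inventory, fade=0):
    """Look up each syllable's recording and concatenate them in order."""
    missing = [s for s in syllables if s not in inventory]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return crossfade_concat([inventory[s] for s in syllables], fade)

# Hypothetical inventory: tone-numbered pinyin keys, fake four-sample "recordings".
inventory = {"ni3": [0.1, 0.2, 0.3, 0.4], "hao3": [0.5, 0.6, 0.7, 0.8]}
audio = synthesize(["ni3", "hao3"], inventory)
print(len(audio))
```

With `fade=0` the recordings are simply placed end to end, which is why the joins in this kind of synthesis can sound abrupt; any syllable sequence covered by the inventory can be produced, with no corpus to match against.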

cjkware.com used to ship a product called KeyTip Putonghua Reader which worked similarly; it contained 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase).

Lightweight synthesizers (eSpeak and Yuet)

The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has experimented with Mandarin and Cantonese. eSpeak was used by Google Translate from May 2010 until December 2010.

The commercial product "Yuet" is also lightweight (it is intended to be suitable for resource-constrained environments like embedded systems); it was written from scratch in ANSI C starting from 2013. Yuet claims a built-in NLP model that does not require a separate dictionary; the speech synthesised by the engine claims clear word boundaries and emphasis on appropriate words. Communication with its author is required to obtain a copy.

Both eSpeak and Yuet can synthesise speech for Cantonese and Mandarin from the same input text, and can output the corresponding romanisation (for Cantonese, Yuet uses Yale and eSpeak uses Jyutping; both use Pinyin for Mandarin). eSpeak does not concern itself with word boundaries when these do not affect which syllable should be spoken.

Corpus-based

A "corpus-based" approach can sound very natural in most cases but can err in dealing with unusual phrases if they can't be matched with the corpus. The synthesiser engine is typically very large (hundreds or even thousands of megabytes) due to the size of the corpus.

iFlyTek

Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language to produce a mark-up language called Chinese Speech Synthesis Markup Language (CSSML) which can include additional markup to clarify the pronunciation of characters and to add some prosody information. The amount of data involved is not disclosed by iFlyTek but can be seen from the commercial products that iFlyTek have licensed their technology to; for example, Bider's SpeechPlus is a 1.3 Gigabyte download, 1.2 Gigabytes of which is used for the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.
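The idea of attaching a reading to an ambiguous character can be illustrated with the `phoneme` element of standard W3C SSML. This is not CSSML syntax, which is not reproduced here; the helper name and the vendor-style alphabet value `x-pinyin` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def annotate_reading(before, char, reading, after=""):
    """Wrap one character in an SSML <phoneme> element giving its reading."""
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "zh-CN"})
    speak.text = before
    # "x-pinyin" is a hypothetical vendor-specific alphabet name.
    phoneme = ET.SubElement(speak, "phoneme",
                            {"alphabet": "x-pinyin", "ph": reading})
    phoneme.text = char
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")

# The character 行 is read "hang2" in 银行 (bank) but "xing2" elsewhere.
markup = annotate_reading("银", "行", "hang2")
print(markup)
```

Whether a given engine honours such an annotation depends on the engine; as the text notes, adding pinyin in this way does not always work with the iFlyTek synthesiser.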

NeoSpeech

There is an online interactive demonstration for NeoSpeech speech synthesis, which accepts Chinese characters and also pinyin if it's enclosed in their proprietary "VTML" markup.

Mac OS

Mac OS had Chinese speech synthesizers available up to version 9. They were removed in 10.0 and reinstated in 10.7 (Lion).

Historical corpus-based synthesizers (no longer available)

A corpus-based approach was taken by Tsinghua University in SinoSonic, with the Harbin dialect voice data taking 800 Megabytes. This was planned to be offered as a download but the link was never activated. Nowadays, references to it can be found only on Internet Archive.

Bell Labs' approach, which was demonstrated online in 1997 but subsequently removed, was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6). The former employee responsible for the project, Chilin Shih (who subsequently worked at the University of Illinois), put some notes about her methods on her website.

World Bicycle Day

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/World_Bicycle_Day

Woman riding bicycle in Copenhagen

Bicycle rally on World Bicycle Day, 2018, New Delhi

Bicycle riders in Paris

Copenhagen bicycle riders

In April 2018, the United Nations General Assembly declared June 3 as International World Bicycle Day. The resolution for World Bicycle Day recognizes "the uniqueness, longevity and versatility of the Bicycle, which has been in use for two centuries, and that it is a simple, affordable, reliable, clean and environmentally fit sustainable means of transport."

The Founding of World Bicycle Day

Professor Leszek Sibilski from the United States led a grassroots campaign with his sociology class to promote a UN Resolution for World Bicycle Day, eventually gaining the support of Turkmenistan and 56 other countries. The original UN Blue and White #June3WorldBicycleDay logo was designed by Isaac Feld, and the accompanying animation was done by Professor John E. Swanson. It depicts bicyclists of various types riding around the globe. At the bottom of the logo is the hashtag #June3WorldBicycleDay. The main message is to show that the bicycle belongs to and serves all of humanity. The current blue and white #WorldBicycleDay logo was again designed by Isaac Feld, with the accompanying animation again done by Professor John E. Swanson.

The Significance of World Bicycle Day

World Bicycle Day is a global holiday meant to be enjoyed by all people regardless of any characteristic. The bicycle as a symbol of human progress and advancement "[promotes] tolerance, mutual understanding and respect and [facilitates] social inclusion and a culture of peace." The bicycle further is a "symbol of sustainable transport and conveys a positive message to foster sustainable consumption and production, and has a positive impact on climate."

World Bicycle Day is now being associated with promoting a healthy lifestyle for those with Type 1 and Type 2 diabetes.
