The “cocktail party problem” requires us to discern individual sound sources from mixtures of sources. The brain must use knowledge of natural sound regularities for this purpose. One much-discussed regularity is the tendency for frequencies to be harmonically related (integer multiples of a fundamental frequency). To test the role of harmonicity in real-world sound segregation, we developed speech analysis/synthesis tools to perturb the carrier frequencies of speech, disrupting harmonic frequency relations while maintaining the spectrotemporal envelope that determines phonemic content. We find that violations of harmonicity cause individual frequencies of speech to segregate from each other, impair the intelligibility of concurrent utterances despite leaving intelligibility of single utterances intact, and cause listeners to lose track of target talkers. However, additional segregation deficits result from replacing harmonic frequencies with noise (simulating whispering), suggesting additional grouping cues enabled by voiced speech excitation. Our results demonstrate acoustic grouping cues in real-world sound segregation.
Auditory scenes with multiple sound sources are ubiquitous in our lives, and the ability to segregate a particular source of interest from the sound mixture that enters the ears is critical for communication and recognition 1, 2, 3, 4, 5, 6, 7, 8. Humans with normal hearing can usually segregate sounds successfully, helping us solve the “cocktail party problem.” Although spatial information contributes to our success in this domain 9, 10, 11, 12, 13, humans segregate sounds remarkably well from monaural signals, as when listening to mono music. This is possible because natural sounds exhibit statistical regularities that the brain can use to group acoustic energy that is likely to have originated from the same source. Understanding sound segregation thus requires uncovering the statistical regularities of natural sounds and the processes that make use of them.
Prior studies of sound segregation, driven by intuitions about the structure of natural sounds, have indicated the importance of a small number of acoustic grouping cues: common onset 14, 15, harmonicity 16, 17, 18, 19, 20, and repetition 21, 22. Although these sound properties presumably derive their importance from natural sound statistics, they have been studied primarily using relatively simple artificial sounds. Because speech, music, and other everyday sounds are more complex and varied than the artificial stimuli used in most psychoacoustic studies of acoustic grouping, it is not obvious whether effects observed with synthetic stimuli will transfer to real-world conditions.
The goal of the present study was to explore the role of harmonicity in the segregation of natural speech. Harmonicity refers to the situation in which sound frequencies are integer multiples of a common fundamental frequency (f0). Harmonicity is believed to underlie pitch perception 23 and musical harmony 24 and may be detected by single neurons in the primate auditory system 25. Because harmonic frequencies typically result from a single sound-generating process that is periodic in time, their presence also provides a cue that they were generated by a common source. We sought to test whether the classic psychoacoustic grouping effects of harmonicity on synthetic tones 16, 17, 18, 19 would replicate with speech and whether harmonicity would be critical for extracting speech information from mixtures of talkers.
The main obstacle to investigating sound segregation with natural sounds such as speech has been the difficulty of manipulating such sounds for experimental purposes. Prior attempts to explore speech segregation have relied on relatively limited synthetic approximations. Numerous studies have examined the perception of concurrent synthetic vowels and support a role for harmonicity in their segregation 26, 27, 28, 29. But to our knowledge, there has been only one attempt to synthesize inharmonic speech utterances (words, phrases, sentences, etc.) for segregation experiments, and the limitations of resynthesis necessitated the use of speech with a flattened pitch contour 30.
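To make the harmonic/inharmonic distinction used in this section concrete, the following minimal Python sketch builds a harmonic series (frequencies at integer multiples of f0) and a perturbed version in which each partial above the fundamental is shifted by a random offset. It is an illustration only, not the analysis/synthesis method of the study: the function names, the `max_jitter` parameter, its default value, and the choice to leave the fundamental in place are all assumptions made for this example.

```python
import random


def harmonic_partials(f0, n_partials):
    """Frequencies at exact integer multiples of f0 (a harmonic series)."""
    return [k * f0 for k in range(1, n_partials + 1)]


def jittered_partials(f0, n_partials, max_jitter=0.3, seed=0):
    """Shift each partial above the fundamental by a random offset.

    max_jitter is a hypothetical parameter: the largest allowed offset,
    expressed as a fraction of f0. The fundamental is left in place.
    """
    rng = random.Random(seed)
    partials = []
    for k in range(1, n_partials + 1):
        offset = 0.0 if k == 1 else rng.uniform(-max_jitter, max_jitter) * f0
        partials.append(k * f0 + offset)
    return partials


def is_harmonic(freqs, f0, tol=1e-9):
    """True if every frequency lies within tol Hz of an integer multiple of f0."""
    return all(abs(f - round(f / f0) * f0) <= tol for f in freqs)
```

For example, `harmonic_partials(200.0, 5)` yields partials at 200, 400, 600, 800, and 1000 Hz, for which `is_harmonic` returns `True`. The output of `jittered_partials(200.0, 5)` keeps each partial within 60 Hz of those values but no longer at integer multiples of 200 Hz, so it fails the same check.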