Short Research Article

The Effect of Talker Familiarity on Sentence Recognition Accuracy in Complex Noise

Published Online: https://doi.org/10.1027/1618-3169/a000509

Abstract. The familiar talker advantage is the finding that a listener’s ability to perceive and understand a talker is facilitated when the listener is familiar with the talker. However, it is unclear when the benefits of familiarity emerge and whether they strengthen over time. To better understand the time course of the familiar talker advantage, we assessed the effects of long-term, implicit voice learning on 89 young adults’ sentence recognition accuracy in the presence of four-talker babble. A university professor served as the target talker in the experiment. Half the participants were students of the professor and familiar with her voice. The professor was a stranger to the remaining participants. We manipulated the listeners’ degree of familiarity with the professor over the course of a semester. We used mixed effects modeling to test for the effects of the two independent variables: talker and hours of exposure. Analyses revealed a familiar talker advantage in the listeners after 16 weeks (∼32 h) of exposure to the target voice. These results imply that talker familiarity (outside of the confines of a long-term, familial relationship) seems to be a much quicker-to-emerge, reliable cue for bootstrapping spoken language perception than previous literature suggested.

When a listener is familiar with a talker’s voice, the ability to perceive and process the talker’s speech is often better than if they are listening to a stranger. This phenomenon is known as the familiar talker advantage. Researchers have noted this familiar talker advantage for listeners across the lifespan, ranging from infancy (Barker & Newman, 2004) to childhood (Levi et al., 2019) to late adulthood (Yonan & Sommers, 2000). They have also uncovered the familiar talker advantage within a variety of experimental paradigms such as speech shadowing (Newman & Evers, 2007), sentence recognition with speech-shaped noise in the background (Souza et al., 2013), spoken word recognition following explicit voice-learning tasks (Levi et al., 2019; Nygaard & Pisoni, 1998), sentence recognition following implicit voice-learning tasks (Case et al., 2018b), and closed-set sentence recognition in the presence of same-language maskers (Holmes & Johnsrude, 2020). It is theorized that the familiar talker advantage is a result of the relationship between the linguistic information and indexical, talker-specific information in the speech signal (Abercrombie, 1967) and how this information is encoded in and retrieved from the listener’s brain over time (Pisoni, 1997). However, studies to date provide us with limited knowledge of how and when listeners might utilize talker familiarity cues during everyday communication, particularly in noisy listening environments when the listeners’ degree of familiarity with the talker varies and subsequently changes over time.

The Nature of Familiarity

One limitation of research examining the familiar talker advantage is the elusive nature of familiarity; it is a concept that is challenging to define and measure. Furthermore, experimental evidence of the familiar talker advantage seems highly dependent on (1) the manner and time during which a listener has learned (i.e., become familiar with) a talker’s voice and (2) the research methods employed to assess the benefits of talker familiarity on spoken language processing. For example, in many studies investigating the familiar talker advantage, researchers exploited participants’ familial or multidecade relationships in very specific perceptual tasks. Two examples are the infants in Barker and Newman’s (2004) work, who participated in auditory simultaneous stream segregation with their mothers’ voices serving as the target talker, and the adults in Souza et al.’s (2013) study, in which the listeners’ spouses served as the familiar, target talkers in a speech recognition task. In both of these studies, the listeners had extensive, implicit exposure to the familiar target voices over a period of at least 7 months or 7 years (respectively), thus yielding a familiar talker advantage despite the differences in the perceptual tasks’ duration of exposure to the target voices. On the other hand, researchers have also demonstrated a familiar talker advantage when they used laboratory-based training paradigms to establish listeners’ familiarity with the target talkers. For example, Nygaard and Pisoni (1998) were some of the first to demonstrate a familiar talker advantage with young adults following 9 days of laboratory voice training during their sentence recognition task. Similarly, Levi et al. (2019) revealed a familiar talker advantage in their talker training experiment with school-age children following their 5-day familiarization with German–English bilingual talkers.

When Does the Unfamiliar Become Familiar?

Taken together, this current body of literature leaves one to wonder – when does a novel voice become familiar enough to yield a familiar talker advantage? Domingo et al. (2020) recently took an initial step toward answering this question and explored the familiar talker advantage in pairs of spouses and “friends”; spouses reported having known each other for a median of 27 years and friends for a median of 5 years. In their study, listeners completed a closed-set sentence recognition task in the presence of a masking sentence (both the target and masking stimuli were sentences from the Boston University Gerald corpus; Kidd et al., 2008). The sentences were spoken either by a talker familiar to the listener (i.e., their spouse/friend) or by an unfamiliar talker. The results revealed a significant improvement in sentence recognition accuracy when listeners heard familiar talkers compared to the unfamiliar talkers, but there was no difference in the magnitude of the familiar talker advantage between the friends and spouses. The authors subsequently suggested that the familiar talker advantage plateaus after knowing someone for 1.5 years without any additive familiarity advantage thereafter. However, they did not speculate on the amount of time necessary (i.e., degree of familiarity) to yield a familiar talker advantage in the first place.

Current Study

To further explore the question of when a voice becomes familiar enough to yield a familiar talker advantage, we examined the effects of long-term, implicit voice learning on young adult listeners’ sentence recognition performance in the presence of four-talker babble. The familiar talker in the present study was the listeners’ professor. Implicit familiarization with her voice occurred throughout the duration of the university’s 16-week-long semester for half of the participating listeners. The professor was a stranger to the remaining half. Our research questions were as follows:

Research Question 1 (RQ1): Does implicit, nonfamilial, talker familiarity aid sentence recognition in the presence of four-talker babble?

Research Question 2 (RQ2): When do the benefits of said familiarity emerge (as measured by listeners’ performance on a sentence recognition task in the presence of four-talker babble)?

Research Question 3 (RQ3): Are the benefits of said familiarity bolstered over the passage of time (i.e., with a listener’s increased experience with the target voice)?

We hypothesized there would be a significant benefit of talker familiarity resulting in a high accuracy rate on the sentence recognition task, similar to the familiar talker advantage noted in the aforementioned research. We likewise predicted a familiar talker advantage would be notable given the complex background noise (i.e., multitalker babble) included in the present experimental manipulation. As in past work (e.g., Case et al., 2018a), such results could theoretically be explained by the time-course hypothesis of specificity effects (McLennan & Luce, 2005), which states that talker-specific information will benefit a listener when their cognitive processing is slow and effortful, as compared to rapid and automatic. Predicting whether the familiar talker advantage would emerge as early as 12 h of exposure was challenging. Infant literature (Barker & Newman, 2004) and training literature with adults (Holmes et al., 2018) exploring the familiar talker advantage suggest that benefits of talker familiarity would be expected to emerge between 6 and 7 months of consistent exposure to a voice. Thus, it would be unprecedented to uncover a familiar talker advantage in the present study at ∼12 h of exposure over about 3 weeks. For our third hypothesis, we predicted the familiar talker advantage would be strengthened as exposure to the talker increased. In other words, sentence recognition accuracy would be significantly better when assessed at ∼32 h of exposure over about 16 weeks compared to ∼12 h of exposure. This prediction was based on the fact that the listeners’ exposure to their professor increased as the semester unfolded, thus increasing their level of familiarity with her voice. At the same time, the duration of their relationships with their professor would not reach 1.5 years by the end of the semester, thus reducing the probability of hitting a familiarity plateau (Domingo et al., 2020) and allowing for continued growth in familiar talker advantage magnitude.

Method

Participants

A total of 89 native speakers of US English participated in the study. Participants reported no history of speech, language, or cognitive disorders. All participants had typical hearing as indicated by a hearing screening at 25 dB HL at 500, 1,000, 2,000, and 4,000 Hz using a Grason-Stadler (Grason-Stadler, Inc., Eden Prairie, MN) 61 audiometer. Data were not collected from 21 additional individuals because they either did not present with typical hearing upon screening or were not native speakers of US English.

We divided the participants into two groups based on the participants’ familiarity with the talker who recorded the target stimuli. We recruited participants in the familiar talker group (n = 37; 35 females and 2 males) from a senior-level undergraduate course during the first week of the semester. The professor for this course was the female talker who recorded the target stimuli. The participants’ ages ranged from 19 to 28 years (M = 22.8 years). These individuals participated longitudinally and were tested in the 3rd/4th and 15th/16th weeks of the semester. All participants were present for both assessments (i.e., ∼12 h of exposure and ∼32 h of exposure; see Table 1 for details). Participants in the novel talker group (n = 52; 32 females and 20 males) were recruited from the university’s research participant pool, which consists of undergraduate students volunteering to participate in studies for course credit or compensation. The participants’ ages ranged from 18 to 26 years (M = 20.0 years). These participants were tested during the 6th/7th and 14th/15th weeks of the semester. Only six of the individuals tested at the first testing session were available to complete the task at the second testing session. Therefore, 46 of the participants in the novel talker group had data at only one time point. All participants were compensated with either course credit or extra credit in their classes.

Table 1 Number of listeners participating at each test point across groups

Materials

We used E-Prime 2.0 experimental software (Psychology Software Tools, Pittsburgh, PA) to execute the experiment on a Dell OptiPlex 7040 computer (Dell GmbH, Frankfurt a.M., Germany) and monitor equipped with Sennheiser HD 280 Pro circumaural headphones (Sennheiser GmbH & Co. KG, Wedemark, Germany). The experimental setup was located in a double-walled sound booth.

Sentence Stimuli

The aforementioned professor of the undergraduate course recorded a total of 80 English sentences randomly chosen from the Institute of Electrical and Electronics Engineers’ (IEEE) corpus (“Harvard Sentences”; 1969) to serve as the target stimuli. This corpus consists of low-predictability, phonetically balanced sentences that match the frequency of phonemes in English with five key words in each sentence (e.g., A tame squirrel makes a fine pet.).

Forty sentences were presented at ∼12 h of exposure, and the remaining 40 sentences were presented at ∼32 h of exposure. The professor was a female, US English, native speaker, aged 29 years. We recorded her using a Shure Professional SM81-LC microphone (Shure Distribution GmbH, Eppingen, Germany) and a Mackie 1202VLZ4 mixer connected to the Dell computer (Dell GmbH, Frankfurt a.M., Germany) in a double-walled sound booth. We recorded the stimuli in mono sound at 44.1 kHz. After recording, we edited the stimuli and equated their total root-mean-square (r.m.s.) values using Adobe Audition (Version CC; Adobe Systems, 2015) sound editing software.
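As an illustration of what equating r.m.s. values across stimuli involves, the following is a minimal Python sketch. It is not the authors' procedure (the equating was done in Adobe Audition); the function names, the list-of-samples representation, and the target level of 0.1 are assumptions chosen for the example.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    if not samples:
        raise ValueError("cannot compute r.m.s. of an empty signal")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def equate_rms(signals, target_rms):
    """Scale every signal so that its total r.m.s. equals target_rms.

    `signals` is a list of sample lists; `target_rms` is a hypothetical
    common level, not a value reported in the article.
    """
    equated = []
    for samples in signals:
        current = rms(samples)
        if current == 0:
            raise ValueError("a silent signal cannot be rescaled")
        gain = target_rms / current
        equated.append([s * gain for s in samples])
    return equated


# Two toy "sentences" with different levels, brought to the same r.m.s.
sentences = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
equated_sentences = equate_rms(sentences, target_rms=0.1)
```

After scaling, every stimulus carries the same overall energy, so differences in recognition accuracy cannot be attributed to some sentences simply being louder than others.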

Background Noise

The background noise in this study was four-talker babble. The four-talker babble consisted of an audio mix of two female and two male talkers, each speaking 10 sentences from the IEEE corpus (1969) that were not used as target stimuli for this study. This type of babble has been shown to function as both a linguistic and an energetic masker (Simpson & Cooke, 2005). Given the number of competing speech streams (i.e., four), it was still possible for the listener to perceive and discern the individual words and sentences from each talker in the babble noise.

We made the babble by first editing and equating the sentences’ total r.m.s. values in Adobe Audition. We then mixed the 40 sentences from all the talkers. The four-talker babble had an average F0 of 214 Hz with a maximum F0 of 318 Hz and a minimum F0 of 63 Hz. Finally, we increased the amplitude of the mixed, four-talker babble to yield a −2 dB signal-to-noise ratio (SNR). We chose −2 dB SNR because pilot testing indicated that listeners achieved between 40% and 60% accuracy on the sentence recognition task at this level, thus minimizing the risk of floor and ceiling effects in our final data collection.
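The scaling step can be sketched in Python under the common definition of SNR as 20·log10 of the ratio of target r.m.s. to masker r.m.s. This is an assumption about the definition, not a description of the authors' Adobe Audition workflow, and the function names and toy signals are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def babble_gain_for_snr(target, babble, snr_db):
    """Gain to apply to the babble so that target vs. babble sits at snr_db.

    Assumes SNR (dB) = 20 * log10(rms(target) / rms(scaled babble)).
    """
    desired_babble_rms = rms(target) / (10 ** (snr_db / 20))
    return desired_babble_rms / rms(babble)


def mix_at_snr(target, babble, snr_db):
    """Add the rescaled babble to the target, sample by sample."""
    gain = babble_gain_for_snr(target, babble, snr_db)
    length = min(len(target), len(babble))
    return [target[i] + gain * babble[i] for i in range(length)]


# Toy signals standing in for a target sentence and the babble track.
target = [1.0, -1.0, 1.0, -1.0]
babble = [0.5, -0.5, 0.5, 0.5]
gain = babble_gain_for_snr(target, babble, snr_db=-2.0)
mixed = mix_at_snr(target, babble, snr_db=-2.0)
```

At −2 dB the babble ends up with an r.m.s. about 1.26 times that of the target (10^(2/20) ≈ 1.26), which is why the masker amplitude had to be increased rather than decreased.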

Procedure

The participant sat down in front of the computer and monitor setup in a double-walled sound booth. Then, the experiment began and the participant was presented with the following instructions on the monitor: “You will listen to a woman, speaking a number of different sentences while noise plays in the background. Your job is to type exactly what you hear her say, while ignoring the background noise.” For participants in the familiar talker group, the instructions explicitly stated the talker’s name and how the participant knew the target talker. For the participants in the novel talker group, the instructions did not reference the talker’s identity. After the instructions, the participant completed 2 practice trials (without feedback) followed by 20 test trials, a break, and 20 final test trials. The sentences and the four-talker babble were presented diotically to the participants via the circumaural headphones during both the practice and test trials. For each trial, after the target sentence was presented, the listener was instructed to type out the target sentence using the computer’s keyboard. Sentences were presented pseudorandomly without replacement across all participants. The experimental session was self-paced, and each participant completed the task within 1 h.

After completing all the test trials, each participant filled out a questionnaire reporting whether they recognized the talker, if they could identify the talker, and how often they attended the professor’s class each week. This was to confirm novel participants were unfamiliar with the talker and familiar participants could accurately and explicitly identify the talker. Data from the questionnaire suggested that all the participants in the familiar talker group knew the talker and identified her as their professor. Participants in the familiar talker group also reported a minimum of 3 h of exposure a week to her voice based on class attendance. Finally, all the participants in the novel talker group reported they did not know the target talker.

Results

Following data collection, researchers hand-scored the participants’ final responses using keyword accuracy (Bradlow et al., 1996). Obvious spelling errors were counted as correct, but added or deleted morphemes were counted as wrong. Keyword accuracy served as the unit of analysis.
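To make the scoring rule concrete, here is a rough Python sketch of keyword accuracy for a single trial. Scoring in the study was done by hand; this sketch assumes exact word matching after lowercasing and stripping punctuation, so it reproduces the rule that added or deleted morphemes count as wrong but cannot reproduce the human judgment that forgave obvious spelling errors. The function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase the text, strip punctuation, and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def keyword_accuracy(keywords, response):
    """Proportion of keywords that appear in the typed response.

    Matching is exact, so "make" does not match "makes". Each response
    word can satisfy at most one keyword.
    """
    remaining = normalize(response)
    hits = 0
    for keyword in keywords:
        word = keyword.lower()
        if word in remaining:
            remaining.remove(word)
            hits += 1
    return hits / len(keywords)


# The example sentence from the IEEE corpus with its five keywords.
keywords = ["tame", "squirrel", "makes", "fine", "pet"]
score = keyword_accuracy(keywords, "A tame squirrel make a fine pet.")
```

In this example the listener dropped the final morpheme of "makes", so four of the five keywords are credited.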

To test for the effects of the two independent variables, talker (familiar and novel) and hours of exposure (approximately 12 h and approximately 32 h), we employed two main analysis procedures: descriptive statistics and mixed effects modeling. Descriptive statistics were calculated based on stratification of the groups. A line plot shows the differences in average performance between the groups (Figure 1). Because of the repeated measures, we fit the mixed effects models using R software, version 3.4.3 (Bates et al., 2015; R Core Team, 2016). Using mixed effects modeling, we assessed the relationship between the two independent variables and the dependent variable (keyword accuracy) while controlling for the clustering of the observations (i.e., the repeated measures). Finally, we assessed assumptions regarding the mixed effects modeling. The data and code for replicating the analyses herein are provided at osf.io/d4zar.

Figure 1 Group mean accuracy plots showing the overall pattern across talker and hours of exposure. Data from the participants listening to the familiar talker are plotted in black, and data from the participants listening to the novel talker are plotted in light gray. On the x-axis, 12 h of exposure to the talker in the familiar group reflects data gathered in the 4th week of the semester and 32 h of exposure reflects data gathered in the 16th week of the semester. The error bars show the 95% CIs.

Descriptive Statistics

We calculated M and SD for keyword accuracy by talker. This information can be found in Table 2. Figure 1 highlights the distributions of keyword accuracy across talker and hours of exposure.

Table 2 Descriptive statistics of the sample
Table 3 Results of the mixed effect models

Mixed Model Results

Two mixed effects models were used to assess the effect of talker on keyword accuracy (see Table 3). First, the intraclass correlation was assessed. The intraclass correlation was high (r = .72), demonstrating the need for using linear mixed effects models over other approaches that depend on the independence of observations (e.g., ANOVA and linear regression).
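For readers less familiar with the intraclass correlation, the sketch below shows the usual definition for a random-intercept model: the share of total variance that lies between participants. The variance components used here are made-up values chosen only to reproduce a ratio of .72; the article reports the correlation but not the underlying components.

```python
def intraclass_correlation(between_variance, residual_variance):
    """ICC for a random-intercept model.

    between_variance: variance of the participant intercepts.
    residual_variance: within-participant (residual) variance.
    """
    total = between_variance + residual_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between_variance / total


# Hypothetical variance components, not taken from the article.
icc = intraclass_correlation(between_variance=0.72, residual_variance=0.28)
```

A value this high means that most of the variability in keyword accuracy is attributable to stable differences between listeners, so observations from the same listener are far from independent.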

To assess the main effects of talker, Model 1 included the main effects of both talker and hours of exposure. The model structure accounted for variance across participants, while hours of exposure was the fixed effect used to predict accuracy: lmer(scale(value) ~ exposure + (1 | ID)). The Akaike information criterion (AIC) of this model was 315.47, while the AIC of the null comparison model was 368.95, suggesting that this model accounts for the data better than the null. Both the intercept (p < .001) and slope (p < .001) of the model were statistically significant, suggesting that hours of exposure predicted keyword accuracy on this task. Both talker and hours of exposure significantly predicted keyword accuracy. The novel talker condition had 0.08 lower accuracy across both time points (p < .001; all p values reported for the mixed effects models use the Satterthwaite approximation to degrees of freedom; Satterthwaite, 1946). Additionally, across both talker conditions, time predicted a 0.10 increase in keyword accuracy (p < .001).
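The AIC comparison above reduces to simple arithmetic, sketched here in Python. The definition AIC = 2k − 2·ln(L) is the standard one; the helper names are hypothetical, and only the two AIC values come from the article.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(null_aic, model_aic):
    """How much lower the model's AIC is than the null model's AIC.

    Positive values favor the model over the null.
    """
    return null_aic - model_aic


# AIC values reported for the null model and for Model 1.
improvement = delta_aic(null_aic=368.95, model_aic=315.47)
```

The difference is roughly 53.5 AIC units in favor of Model 1. Because lower AIC indicates a better trade-off between fit and complexity, a gap of this size is consistent with the conclusion that exposure adds explanatory value beyond the null model.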

Model 2 separated out hours of exposure at 4 weeks (∼12 h) and again at 16 weeks (∼32 h). The model structure was as follows: lmer(value ~ factor(exposure) + (1 | ID), data = mod_data). The intercept (p < .001, CI [−.61, −.13], ES = −0.37) and exposure at 16 weeks were statistically significant (4 weeks: p = .31, CI [−.18, .56], ES = 0.19; 16 weeks: p < .001, CI [.74, 1.48], ES = 1.11). This suggests that ∼12 h of exposure (4 weeks) to the target talker did not significantly predict performance. However, after ∼32 h of exposure (16 weeks), exposure did predict performance on the sentence recognition task.

Discussion

In the present study, we revealed a familiar talker advantage in a sentence-recognition-in-noise task that emerged after only ∼32 h (16 weeks) of exposure to the target voice in a natural listening environment (i.e., a college lecture course). These results support our first hypothesis and are in concert with past work showing a familiar talker advantage in the presence of background noise (Souza et al., 2013) and with a familiar yet unrelated talker (Newman & Evers, 2007). This finding is also supported by predictions of the time-course hypothesis of specificity effects (McLennan & Luce, 2005) given what we know about the cognitive demands placed on a listener when tasked with sentence recognition while simultaneously listening to four-talker babble in the background (Rosen et al., 2013).

Our findings also provide new insights into our second research question: When do the benefits of said talker familiarity first emerge? Prior to the present study, the shortest relationship documented in the literature that naturally yielded a familiar talker advantage was that of 7.5-month-old infants and their mothers (Barker & Newman, 2004). In the talker training literature, the shortest duration of explicit familiarization with a voice, resulting in a familiar talker advantage with adults, was 6 months (∼120 h; Holmes et al., 2018). Our current study suggests that listeners require remarkably less exposure to a talker before yielding a familiar talker advantage – about 32 h distributed over 16 weeks. Our data also demonstrated that implicit exposure to a target talker’s voice for ∼12 h over ∼4 weeks was not enough to produce a familiar talker advantage. Taken together, these data suggest that, although the exact milestone when an unfamiliar voice becomes beneficially familiar remains elusive, listeners need more than a month of natural exposure to a nonfamilial talker before they may be able to benefit from a familiar talker advantage during spoken language processing.

Our third hypothesis, that the familiar talker advantage would be strengthened over time, was supported by our data from the individuals in the familiar talker condition. These results suggest a classic practice effect (Payne & Wenger, 1996). This was not surprising given that it has long been known that human learners demonstrate improvement on tasks after repeated exposure; furthermore, similar results were found by Case et al. (2018b) and Domingo et al. (2020) in their experimental tasks. Participants listening to the novel voice in the present study did not demonstrate improvement over time, a foreseeable outcome given our inability to test novel participants longitudinally. Despite the high attrition of the novel group, it is important to note that the patterns in our data (Figure 1) do replicate what Case et al. (2018a) found in their familiarity training task – sentence recognition accuracy in the novel talker condition did not improve over time (p = .088) while the accuracy of the participants in the familiar talker condition did significantly improve (p < .001).

It is unclear whether the familiar talker advantage seen here is similar in strength to the familiarity benefits seen with family members, and this question needs further exploration. One could expect that performance would improve further if the talker were a family member. However, it is important to highlight the novelty of our finding that a strong familiar talker advantage emerged after only ∼32 h of implicit training in an undergraduate class over the course of a semester. This finding suggests that listeners make use of the familiar talker advantage sooner than previously thought.

Limitations and Future Directions

Despite the present findings revealing a familiar talker advantage for sentence recognition in the presence of four-talker babble after ∼32 h of implicit voice learning in a natural environment, it is important to consider the limitations to this study. The foremost limitation is the differential attrition between the two groups of participants, with the novel talker group having high attrition (n = 30 at 12 h of exposure, and only six of those participants were tested again at 32 h of exposure, n = 28) and the familiar talker group having none. We took appropriate steps in our analyses to reduce the impact of this limitation using mixed effects models, but the lack of matched experimental groups limits the full comparative power of the novel talker group. Another limitation is the fact that the participants in the novel talker group were not matched to the familiar talker group; thus, there is a potential that other factors beyond that of being familiar with the talker (e.g., language skills or working memory capacity) may have contributed to the group differences. However, given that there were no significant differences between the two groups’ keyword accuracy at 12 h of exposure, it stands to reason that such potential extraneous factors did not impact the present results. Nevertheless, an important future direction would be to complete an approximate replication of the present study with matched groups to better isolate the limits and scope of the familiar talker advantage.

Conclusion

In summary, in the present study, we showed that the benefits of talker familiarity emerged in young adult listeners after only ∼32 h (16 weeks) of exposure to the target voice in a college classroom setting. These data suggest that talker familiarity (outside of the confines of a familial relationship) can be a reliable cue for bootstrapping spoken language perception during sentence recognition in the presence of four-talker babble. This is a first in the talker familiarity literature, and it helps the field begin to home in on the amount of exposure required for an unfamiliar talker to become advantageously familiar. Further testing the limits and time course of the familiar talker advantage will not only benefit the field’s broader understanding of spoken language processing but may also result in important clinical applications for listeners with hearing loss and other listening challenges (e.g., talker training; Tye-Murray et al., 2016).

References