
Strategies for Increasing the Accuracy of Interviewer Observations of Respondent Features

Evidence From the US National Survey of Family Growth

Published Online: https://doi.org/10.1027/1614-2241/a000142

Abstract

Abstract. Because survey response rates are consistently declining worldwide, survey researchers strive to obtain as much auxiliary information on sampled units as possible. Surveys using in-person interviewing often request that interviewers collect observations on key features of all sampled units, given that interviewers are the eyes and ears of the survey organization. Unfortunately, these observations are prone to error, which decreases the effectiveness of nonresponse adjustments based on the observations. No studies have investigated the strategies being used by interviewers tasked with making these observations, or examined whether certain strategies improve observation accuracy. This study is the first to examine the associations of observational strategies used by survey interviewers with the accuracy of observations collected by those interviewers. A qualitative analysis followed by multilevel models of observation accuracy shows that focusing on relevant correlates of the feature being observed and considering a diversity of cues are associated with increased observation accuracy.

Interviewers in “face-to-face” surveys are often tasked with observing key features of sampled units, given that the interviewers are the eyes and ears of the survey organization. Because response rates in surveys of all formats have been consistently declining worldwide (Baruch & Holtom, 2008; Biener, Garrett, Gilpin, Roman, & Currivan, 2004; Brick & Williams, 2013; Cull, O’Connor, Sharp, & Tang, 2005; Curtin, Presser, & Singer, 2005; de Leeuw & de Heer, 2002; Tolonen et al., 2006; Williams & Brick, 2017), interviewers are asked to do this in an effort to obtain as much relevant auxiliary information on all sampled units as possible. The survey methodology literature has clearly established the need for auxiliary variables used for nonresponse adjustment of survey estimates to be related to both survey variables of interest and response propensity (Beaumont, 2005; Bethlehem, 2002; Groves, 2006; Kreuter et al., 2010; Lessler & Kalsbeek, 1992; Little & Vartivarian, 2005), so that the adjustments will reduce both the bias and the variance in the ultimate survey estimates. Survey organizations will therefore request that interviewers attempt to collect observations on auxiliary variables having these optimal properties (in theory) for both respondents and nonrespondents, and then evaluate the potential of the observations for nonresponse adjustment purposes (e.g., West, 2013a).

Unfortunately, interviewer observations can be prone to error. The various conceptualizations of total survey error (TSE) that have been published over the years (Groves & Lyberg, 2010) consistently acknowledge the problem of nonresponse bias that can arise in surveys. Errors in estimation are less often acknowledged as a key part of TSE (Biemer, 2010; Deming, 1944). From a TSE perspective that also considers errors in estimation, errors in interviewer observations may lead to nonresponse adjustments that introduce more bias in survey estimates than was present before the nonresponse adjustments (Lessler & Kalsbeek, 1992; Stefanski & Carroll, 1985; West, 2013a, 2013b). This underscores the need for interviewer observations to be of sufficient accuracy, so that nonresponse adjustments based in part on the observations will do an effective job of repairing nonresponse bias. Researchers have started to consider design strategies for improving observation accuracy as a result (West & Kreuter, 2015).

More generally, the survey methodology literature has recently begun to examine the non-negligible error properties of these observations (Campanelli, Sturgis, & Purdon, 1997; Casas-Cordero, Kreuter, Wang, & Babey, 2013; Groves, Wagner, & Peytcheva, 2007; McCulloch, Kreuter, & Calvano, 2010; Pickering, Thomas, & Lynn, 2003; Sinibaldi, Durrant, & Kreuter, 2013; Tipping & Sinibaldi, 2010; West, 2013a; West & Kreuter, 2013, 2015; West, Kreuter, & Trappmann, 2014). Several recent studies have suggested that interviewers working on the same survey and receiving the same general training will vary substantially in terms of the accuracy of these observations, even when controlling for relevant respondent-, interviewer-, and area-level covariates (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015; West et al., 2014).

So why might interviewers working on the same study, and having received the same training, vary substantially in terms of observation accuracy (even after adjustment for covariates that might influence the accuracy)? No previous studies have assessed the possibility that interviewers may be using different strategies to collect these observations (i.e., looking for different cues that might serve as indicators of a particular feature being observed), and considered whether different strategies in a given context may lead to more or less accurate observations. Because survey organizations may use these interviewer observations for nonresponse adjustment purposes, and interviewers have repeatedly been demonstrated to vary substantially in terms of the accuracy of the observations, this study sought to (1) explore the variability in the strategies that interviewers use to collect these observations, and (2) understand whether there is a relationship between the strategy used and observation accuracy. The identification of strategies that are associated with more accurate observations has clear implications for improving interviewer training, and the overall improvements in accuracy that may arise from more standardized training on this process may in turn improve the effectiveness of nonresponse adjustments based in part on these observations.

Why might interviewers vary in terms of the observational strategies that they use? In the absence of standardized training on the observation process, survey interviewers might utilize different “native” observational strategies, based on varying prior expectations of how the social world around them is organized (e.g., “a respondent who is currently in a romantic relationship will have features X, Y, and Z, so I will only look for features X, Y, and Z when making this judgment…”; Funder, 1987, 1995; Manderson & Aaby, 1992a, 1992b; McCall, 1984; Tversky & Kahneman, 1974). However, empirical evidence of this assumption is largely lacking in the methodological literature. Do interviewers really vary in terms of the cues that they are looking for when attempting to record these types of observations? From a psychological perspective, interviewer perceptions may be influenced by specific environmental and behavioral cues (Forgas & Brown, 1977). The literature cited above suggests that interviewers with different backgrounds and expectations may be influenced in different ways by different types of cues (both when recording observations and survey responses), meaning that only certain types of cues will influence the observations that they record. In the context of the present study, these more influential cues would define the “strategies” employed by interviewers when recording their observations. No studies to date have considered whether these naturally-varying strategies may ultimately lead to variability among interviewers in observation accuracy.

What observational strategies might be associated with increased accuracy? Some interviewers may resort to considering features of the areas/environments in which they are working in the absence of any household-specific cues. This could be helpful if the interviewers are familiar with the areas and the areas are fairly homogeneous in terms of the feature(s) being observed. But if areas tend to be more heterogeneous, interviewers may incorrectly apply expectations that all households in that area will have the same features, or assume that if several households have been similar, the next will also have similar features (Babbie, 2001; Das, 1983; Harris, Jerome, & Fawcett, 1997; Manderson & Aaby, 1992a, 1992b; Millen, 2000; Repp, Nieminen, Olinger, & Brusca, 1998; Seidler, 1974; Tversky & Kahneman, 1974).

For other interviewers, the observational task may be quite difficult (e.g., inability to access a locked building of apartments, working in crowded urban areas, etc.), leading to a failure to pick up on important external visual cues and subsequent guessing or “going on hunches.” In these situations, observations would be expected to have reduced accuracy (Feldman, Hyman, & Hart, 1951; Funder, 1987; Graham, 1984; Jones, Riggs, & Quattrone, 1979; Kazdin, 1977; Most, Scholl, Clifford, & Simons, 2005; Simons & Jensen, 2009). Other interviewers may attempt to pick up on several different relevant predictors of a feature being observed as well as specific features of a given household or respondent. These strategies reflecting diversity in the cues used (depending on the context) and an ability to detect specific features of a given respondent would be expected to have increased accuracy (Funder, 1995; Kazdin, 1977; West & Kreuter, 2015; West et al., 2014). The social psychology literature also suggests that observations based on first impressions in the presence of limited information will tend to have increased accuracy (Ambady, Hallahan, & Conner, 1999; Patterson & Stockbridge, 1998). Whether or not these theoretical expectations related to accurate observational strategies are borne out in the survey interviewing context remains an open research question.

This article presents an examination of observational strategies that are associated with the accuracy of a key interviewer judgment in the US National Survey of Family Growth (NSFG). NSFG interviewers (each of whom is female) first attempt to conduct a screening interview with an adult informant within a randomly sampled housing unit. The primary purpose of the screening interview is to determine whether the sampled housing unit contains a person between the ages of 15 and 49 (the target population for the NSFG). Upon identification of all eligible persons within a household, one age-eligible person is selected at random for the main (face-to-face) NSFG interview (possibly at a later date that is convenient for the selected respondent). Only about 77% of these selected individuals ultimately participate in the main interview, and this number has continued to decline gradually over time. Beginning with the seventh cycle of the NSFG (June 2006–June 2010; Lepkowski, Mosher, Davis, Groves, & Van Hoewyk, 2010), interviewers were tasked with judging whether this randomly selected age-eligible person was currently in a sexually active relationship with a member of the opposite sex, immediately after completion of the screening interview.

Why might the NSFG interviewers be asked to record this type of subjective judgment after each screening interview, given that this is not a directly observable feature of potential respondents? Unlike other types of auxiliary information commonly recorded by interviewers (e.g., contact observations, socio-demographics, housing unit features, etc.), this NSFG-specific judgment has a strong association with a number of key NSFG variables, along with the propensity of persons screened in the NSFG to respond to the main survey request (West, 2013a). These judgments therefore define an ideal auxiliary variable for nonresponse adjustment purposes (Little & Vartivarian, 2005). The NSFG in part focuses on several key variables related to sexual behavior and activity, and being able to get some indication from the interviewers as to whether potential respondents (selected from the initial screening interview) are currently in a sexually active relationship provides useful auxiliary information for persons who do not eventually respond to the main NSFG interview. NSFG staff can use this information (along with other auxiliary variables on the sampling frame) to predict key outcomes related to being in sexual relationships for main interview nonrespondents. These types of “tailored” interviewer observations, which survey managers can specifically design as potential correlates of key survey variables of interest, have also been shown to be more effective for nonresponse adjustments than other data sources, including linked information from commercial data sources (Sinibaldi, Trappmann, & Kreuter, 2014).

These interviewer judgments could also be validated based on actual respondent reports of sexual activity collected in the main NSFG interview, which is important for assessing their accuracy. In addition to being asked to record these judgments after completing screening interviews, NSFG interviewers in the last two quarters of Cycle 7 data collection were also asked to provide open-ended justifications for why they made their judgments, and what cues they noted when recording the judgments. This study sought to leverage this qualitative information and assess the interviewer-specific observational strategies evident in these justifications, along with the amount of variability among interviewers in judgment accuracy that was explained by these strategies.

Given that no prior research has examined observational strategies being used by field interviewers tasked with making these types of judgments, the present study aimed to answer the following two research questions:

  1. Do NSFG interviewers tend to fall into distinct “strategy” clusters based on the justifications used for their sexual activity judgments?
  2. Were certain observational strategies associated with increased accuracy of the sexual activity judgments?

Data

Coding of Open-Ended Justifications

In the last two quarters of data collection for the NSFG (Cycle 7), 45 interviewers were asked to record (on laptop applications) open-ended justifications for their post-screener judgments of perceived current sexual activity for selected persons (see Figure 1). The interviewers were trained to provide the justifications immediately after the judgments were made, and the interviewers could not proceed with main interview tasks until after the judgments and their justifications had been entered (along with all other observations from the screening interview). This means that all judgments and justifications were recorded prior to the main NSFG interview, and there were no missing data for the judgments or justifications. The interviewers were not prompted for specific justifications or limited in any way (e.g., justification length), and interviewer training sessions on this process did not suggest any specific strategies to use when recording the judgments. The interviewers were simply told to make their best judgments based on what they had seen and/or heard, opening the door for the use of the aforementioned “native” observational strategies.

Figure 1 NSFG interviewers entered open-ended justifications for their sexual activity judgments into the box labeled “Rsex rel with opposite sex partner.”

In total, the 45 interviewers provided 3,992 open-ended justifications of widely varying lengths during these two quarters of data collection. Two examples of real recorded justifications follow:

  1. “He works and goes to school and lives here with his twin – I do not think he could have someone over as the carpet is all taken up and it smells badly of dog poo.” A justification for a judgment of not currently sexually active, reflecting the use of cues describing features of the housing unit.
  2. “He has a tattoo, ‘Carol’, over his heart.” A justification for a judgment of currently sexually active, reflecting an ability to pick up on the respondent’s appearance during the screening interview.

The 3,992 justifications were coded on 13 different indicator variables (1 = mentioned in justification, 0 = not mentioned), with all indicators coded for each justification:

  • Living arrangement (living with spouse, parents, etc.).
  • Relationship status (mention of spouse, partner, etc.).
  • Age.
  • Housing unit characteristics (presence of children, cleanliness, etc.).
  • Appearance (references to physical appearance, ethnicity, etc.).
  • Neighborhood characteristics.
  • Shyness.
  • Going on Hunches/Guessing (indication of a gut feeling, or not being sure).
  • Incorrect (the interviewer realized, while writing her justification, that the judgment she had already entered and saved in the system was incorrect, and noted this in the justification; NSFG interviewers cannot go back and change their prior entries).
  • Conservative (a conservative or strict household/parents).
  • Health (reference to health or physical disability).
  • Personality (reference to the person’s personality).
  • Occupation (reference to the person’s occupation).

In addition, the number of words used for each justification was coded as a proxy for the effort dedicated to the observational task. For example, the first justification given above was coded as having 35 words, and assigned “1” for living arrangement, housing unit characteristics, and occupation, and “0” for all other indicators. All coding of the justifications was performed twice with the assistance of an undergraduate research assistant. The inter-rater reliability of the codes was quite high; of the 3,992 coded justifications, the percentages of codes on a given indicator that did not agree between the two coders ranged from 0.32% (Shyness) to 7.53% (Housing unit characteristics). Agreement was therefore higher than 92% for every coded indicator. Discrepancies in coding were detected using PROC COMPARE in the SAS software, and any discrepancies in coding or word counts were discussed and resolved. The percentage of justifications falling into each of these 13 categories was then computed for each interviewer, along with the mean word count for the interviewer.
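As an illustration of how such double-coding discrepancies could be flagged, the following SAS sketch compares two coders' indicator files by justification ID, in the spirit of the PROC COMPARE step described above. The data set and variable names are hypothetical stand-ins; the actual NSFG file layout is not documented in this article.

```sas
/* Hypothetical data sets CODER1 and CODER2: one row per justification    */
/* (keyed by JUST_ID), with the 13 indicator variables and the word count */
/* as coded independently by each coder.                                  */
proc sort data=coder1; by just_id; run;
proc sort data=coder2; by just_id; run;

/* Write only the records and variables that disagree to DISAGREE, so the */
/* two coders can review and resolve each discrepancy.                    */
proc compare base=coder1 compare=coder2
             out=disagree outnoequal noprint;
  id just_id;
  var living_arr rel_status age hu_char appearance nbhd_char shyness
      hunch incorrect_flag conservative health personality occupation
      word_count;
run;
```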

Measures of Observation Accuracy

To compute dependent variables measuring observation accuracy for each interviewer, the total number of judgments of sexual activity, the total number of sexually active respondents (based on respondent reports of at least one sexual partner in the past year, from completed main interviews), the total number of sexually inactive respondents, and the total number of discordant judgments (i.e., judgments inconsistent with survey reports) were determined for each of the 45 interviewers. From these measures, we computed the overall gross difference rate (i.e., the proportion of judgments that were incorrect), the false positive rate (i.e., the proportion of sexually inactive respondents who were judged to be sexually active), and the false negative rate (i.e., the proportion of sexually active respondents who were judged to not be sexually active) for each interviewer. For purposes of this study, respondent reports of current sexual activity (i.e., at least one opposite-sex partner in the past year) were assumed to be true.
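A minimal sketch of how these interviewer-level rates could be computed is given below, assuming a hypothetical judgment-level data set JUDGMENTS with one row per validated judgment, 0/1 variables JUDGED_ACTIVE (interviewer judgment) and REPORTED_ACTIVE (respondent report from the main interview), and an interviewer identifier IWER_ID; all names are illustrative.

```sas
/* Logical comparisons in SAS resolve to 1/0, so MEAN() of a comparison is */
/* a proportion. The CASE expressions restrict the denominator to sexually */
/* inactive respondents (false positive rate) or sexually active           */
/* respondents (false negative rate).                                      */
proc sql;
  create table iwer_accuracy as
  select iwer_id,
         count(*) as n_judgments,
         mean(judged_active ne reported_active) as gross_diff_rate,
         mean(case when reported_active = 0 then judged_active end)
           as false_pos_rate,
         mean(case when reported_active = 1 then 1 - judged_active end)
           as false_neg_rate
  from judgments
  group by iwer_id;
quit;
```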

Two interviewers had insufficient information available for the false positive rate analyses, because all of their main interview respondents reported being sexually active. An additional two interviewers were found to be outliers in the subsequent cluster analyses and removed from further analysis. Our models of observation accuracy were thus based on 43 interviewers and 3,044 observations (for gross difference rates), 43 interviewers and 2,347 observations (for false negative rates), or 41 interviewers and 697 observations (for false positive rates).

Covariates

We also extracted a number of covariates describing features of the areas where an interviewer was assigned to work, motivated by prior studies examining correlates of observation accuracy (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015). These included the percentage of the interviewer’s assigned households in urban areas (i.e., Metropolitan Statistical Areas or MSAs), the percentage of households in areas with access problems (e.g., gated communities), the percentage of households in primarily residential areas, the percentage of households in areas with evidence of non-English speakers or Spanish speakers, the percentage of households in areas with safety concerns, the percentage of households in multiunit buildings, the percentage of households with physical impediments (e.g., security gates), and the percentage of households where females were the selected respondents. We describe the roles of these covariates in our models of accuracy below.

Analytic Approach

To address our first research question, we performed an exploratory cluster analysis to determine whether distinct groups of interviewers existed in terms of the percentages of justifications falling into each category and effort spent on the observational task. To do so, the 13 percentages and the mean word counts for the 45 interviewers who completed main interviews were initially standardized. An agglomerative hierarchical clustering approach was then applied (Everitt, Landau, Leese, & Stahl, 2011), using squared Euclidean distances based on the 14 standardized variables as distance measures between interviewers and Ward’s (1963) minimum within-cluster variance method to define the clusters. This approach was selected for its established superiority in identifying known clusters when using continuous measures (Punj & Stewart, 1983). We examined descriptive statistics for the derived clusters, and then determined conceptual labels for the clusters by comparing the distributions of the percentages and means between them using the nonparametric, independent samples Kruskal-Wallis H test.
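The clustering step could be sketched in SAS roughly as follows; the data set and variable names are hypothetical stand-ins for the 13 interviewer-level percentages and the mean word count. PROC CLUSTER with METHOD=WARD operates on Euclidean distances computed from the standardized inputs, and PROC NPAR1WAY with the WILCOXON option produces the Kruskal-Wallis H test when the class variable has more than two levels.

```sas
/* Standardize the 14 clustering variables (13 percentages + mean words) */
proc stdize data=iwer_profiles out=iwer_std method=std;
  var pct_living pct_relstat pct_age pct_hu pct_appear pct_nbhd pct_shy
      pct_hunch pct_incorrect pct_conserv pct_health pct_person pct_occup
      mean_words;
run;

/* Agglomerative hierarchical clustering, Ward's minimum-variance method */
proc cluster data=iwer_std method=ward outtree=tree noprint;
  id iwer_id;
  var pct_: mean_words;
run;

/* Cut the dendrogram to obtain four clusters of interviewers */
proc tree data=tree out=iwer_clusters nclusters=4 noprint;
  id iwer_id;
run;

/* Kruskal-Wallis comparisons of the profile variables across clusters     */
/* (after merging IWER_CLUSTERS back onto IWER_PROFILES by IWER_ID)        */
proc npar1way data=profiles_with_clusters wilcoxon;
  class cluster;
  var pct_relstat mean_words; /* repeat for the remaining variables */
run;
```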

Next, to address our second research question, we fit a sequence of two multilevel logistic regression models to each of the three dependent accuracy measures (the gross difference rates, false positive rates, and false negative rates). The first model included random interviewer effects, capturing between-interviewer variability (and within-interviewer correlation) in a given accuracy indicator, and fixed effects of the P = 9 covariates described above, to account for the effects of these area-level features on the different accuracy measures. This initial model is shown in Equation (1):

\mathrm{logit}(\phi_i) = \beta_0 + \sum_{p=1}^{P} \beta_p x_{pi} + u_{0i} \qquad (1)

where
  • ϕ_i = probability of an incorrect judgment/false positive/false negative for IWER i,
  • x_pi = value of covariate p for IWER i,
  • u_0i ~ N(0, τ²), with τ² = variance of the random interviewer effects.

Next, given an initial estimate of the variance of the random interviewer effects, we added fixed effects of the specific interviewer “strategy” clusters identified from the cluster analysis to the initial model (omitting one cluster as a reference category):

\mathrm{logit}(\phi_i) = \beta_0 + \sum_{p=1}^{P} \beta_p x_{pi} + \sum_{c=2}^{4} \gamma_c z_{ci} + u_{0i} \qquad (2)

where
  • ϕ_i = probability of an incorrect judgment/false positive/false negative for IWER i,
  • x_pi = value of covariate p for IWER i,
  • z_ci = 1 if IWER i was assigned to strategy cluster c and 0 otherwise (Cluster 1 serving as the omitted reference category), with γ_c the corresponding fixed cluster effect,
  • u_0i ~ N(0, τ²), with τ² = variance of the random interviewer effects.

We then computed the percentage of variance in the random interviewer effects explained by the inclusion of the fixed effects of the strategy clusters in Equation (2). In each model, the variance of the random interviewer effects was tested against zero using a likelihood ratio test based on a mixture of chi-square distributions (Zhang & Lin, 2008).
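One natural way to express this percentage, assuming it is computed from the estimated variance components of the two fitted models, is

\[ 100 \times \frac{\hat{\tau}^2_{(1)} - \hat{\tau}^2_{(2)}}{\hat{\tau}^2_{(1)}}, \]

where \(\hat{\tau}^2_{(1)}\) and \(\hat{\tau}^2_{(2)}\) denote the estimated variances of the random interviewer effects from the models in Equations (1) and (2), respectively. For a single variance component tested on the boundary of the parameter space, the reference distribution for the likelihood ratio statistic is a 50:50 mixture of chi-square distributions with zero and one degrees of freedom.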

Models (1) and (2) were fitted using the GLIMMIX procedure in the SAS software (Version 9.4), specifically using a Laplace approximation for estimation purposes (Kim, Choi, & Emery, 2013).
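A hedged sketch of such a GLIMMIX call for the model in Equation (2) follows; the data set and covariate names are hypothetical (the article does not give the NSFG variable names), and the outcome INCORRECT would be replaced by the false positive or false negative indicator for the other two dependent measures. METHOD=LAPLACE requests the Laplace approximation, and the COVTEST statement requests a likelihood ratio test of whether the interviewer variance component is zero.

```sas
proc glimmix data=judgments method=laplace;
  class iwer_id strat_cluster(ref='1');            /* Cluster 1 = reference */
  model incorrect(event='1') = pct_urban pct_access pct_residential
        pct_nonenglish pct_spanish pct_safety pct_multiunit pct_impediment
        pct_female strat_cluster
        / dist=binary link=logit solution oddsratio;
  random intercept / subject=iwer_id;              /* random IWER effects   */
  covtest 'Zero interviewer variance' zerog;       /* LR test of tau^2 = 0  */
run;
```

Fitting the same call without STRAT_CLUSTER gives the Equation (1) model; the drop in the estimated interviewer variance between the two fits corresponds to the percentage of between-interviewer variance explained by the strategy clusters, as described above.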

Results

We first consider descriptive summaries of the variables computed for all 45 interviewers. Descriptive statistics for the interviewer-specific percentages and mean word counts are shown in Table 1. A large amount of variability among the 45 interviewers is evident, in terms of the justification strategies and the average number of words used for the justifications.

Table 1 Descriptive statistics for the variables used in the cluster analysis

Table 2 presents descriptive statistics for the three dependent variables computed for each interviewer. Interviewers made correct judgments of current sexual activity 79.4% of the time, with a minimum of 0% errors (i.e., some interviewers were correct all the time) and a maximum of 43.9% errors. There were more false positive judgments than false negative judgments, suggesting that interviewers tended to err on the side of assuming sexual activity (West & Kreuter, 2015).

Table 2 Descriptive statistics for the interviewer-level error rates

The initial cluster analysis provided evidence of two interviewers that could be considered outliers (Figure 2), with one interviewer citing neighborhood features in 55.41% of her justifications (the next highest percentage being 17.12%), and another interviewer citing health reasons for 43.14% of her justifications (the next highest percentage being 27.50%). After dropping these two interviewers, the second cluster analysis presented evidence of four unique clusters of interviewers based on scaled distances between the clusters (Figure 3); that is, there were in fact distinct groups of interviewers based on their justification tendencies. Descriptive statistics on the 14 variables for each cluster are shown in Table 3.

Figure 2 Dendrogram showing results of initial cluster analysis, with evidence of two outliers (Interviewers 36 and 40).
Figure 3 Dendrogram showing results of second cluster analysis, with evidence of four distinct groups of interviewers (based on rescaled cluster distances greater than 10).
Table 3 Descriptive statistics for interviewer-level justification tendencies and mean word counts within four distinct clusters of interviewers (n = 43 total)

The results in Table 3 suggest that the first cluster of interviewers is largely defined by a tendency to notice living arrangements and housing unit characteristics. The second cluster is largely defined by references to appearance and personality, and a relatively large word count; interviewers in this cluster appeared to use the widest diversity of cues, including references to age and relationship status. The third cluster is primarily defined by references to relationship status and guesses/gut feelings, while the fourth cluster focuses primarily on age, occasionally referring to relationship status and household characteristics but hardly anything else.

We now consider differences among these clusters in terms of the various accuracy measures for the sexual activity judgments. Table 4 presents estimates of the fixed effects and variance components for the two multilevel logistic regression models fitted to each dependent accuracy measure.

Table 4 Parameter estimates (logit scale) in multilevel logistic regression models indicating the relationships of selected covariates and interviewer “strategy” cluster membership with gross difference rates, false positive rates, and false negative rates (n = 43 interviewers for Models 1 and 3; n = 41 for Model 2)

The observational “strategy” clusters describing interviewers based on their justification tendencies are able to explain significant portions of the unexplained variance among interviewers in terms of observation accuracy. In the case of overall gross difference rates, Clusters 2 (use of a diversity of cues) and 3 (focus on relationship status, or “gut feelings”) have significantly reduced odds [odds ratio for Cluster 2 = exp(−0.40) = 0.67, or 33% lower odds, and odds ratio for Cluster 3 = exp(−0.35) = 0.70, or 30% lower odds, respectively] of making an error relative to Cluster 1. Furthermore, 36.2% of the unexplained variance among interviewers (when adjusting for the fixed effects of the area-level covariates) is accounted for by these “strategy” clusters.

In the case of false positive rates, Cluster 4 (focus primarily on age) had substantially increased odds of making a false positive error [odds ratio for Cluster 4 = exp(1.70) = 5.47, or 447% higher odds], and the fixed strategy cluster effects explain about 15% of the unexplained variance in accuracy among interviewers. In the case of false negative rates, Clusters 2 and 3 once again have significantly reduced odds of making a false negative error relative to Cluster 1 [odds ratio for Cluster 2 = exp(−0.87) = 0.42, or 58% lower odds, and odds ratio for Cluster 3 = exp(−0.99) = 0.37, or 63% lower odds, respectively]. We also note that Cluster 4 has reduced odds of making a false negative error, but this outcome occurred too infrequently to permit reliable estimation of the effect (see Table 3). Collectively, the fixed cluster effects were able to explain nearly 44% of the unexplained variance in false negative rates among interviewers.

We therefore find consistent evidence that the observational strategies are significantly related to the error properties of the sexual activity judgments:

  • Focusing primarily on relationship status and gut feelings was associated with the highest accuracy;
  • Judging primarily on the basis of age was associated with systematic false positives;
  • Considering a diversity of cues, including appearance, personality, and other external features, was also associated with improved accuracy;
  • Focusing on living arrangement and housing unit features was detrimental to accuracy and associated with systematic false negatives.

What might cause the remaining unexplained variance in accuracy among interviewers? All of the NSFG interviewers were female, and the vast majority were white, married, and had previous NSFG experience. For as many interviewers as possible (42 of 45), we extracted their overall years of interviewing experience, their age, and the number of children that they had from a voluntary interviewer survey (3 of the 45 interviewers chose not to participate in the voluntary survey). NSFG managers examined these 42 values for each variable and reported no evidence of measurement error given their knowledge of these particular interviewers. In exploratory analyses, we added fixed effects of these interviewer-level covariates to the “full” models in Table 4. For the gross difference rates, we found that interviewers with more children had significantly reduced odds of making an error overall, with the effects of the strategy clusters remaining the same, and that the interviewer variance component was reduced to the point where it was no longer significantly greater than zero. For the false positive rates, we found that older interviewers had significantly reduced odds of making a false positive error, again further reducing the variance component. Finally, for the false negative rates, we found that older interviewers had significantly increased odds of making a false negative error; older interviewers appeared to err in the direction of no sexual activity. Relevant interviewer features therefore did seem to explain additional variance in accuracy, which is consistent with the existing literature (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015) and could have training implications depending on the type of observation being collected.

Discussion

This study has demonstrated that the collection and analysis of open-ended justifications for the observations that field interviewers are often asked to record while conducting face-to-face surveys is feasible in practice. The analyses provide interesting insights into the observational strategies used by NSFG interviewers who make more accurate judgments regarding a specific respondent characteristic. With regard to our first research question, we found evidence of distinct clusters of interviewers based on the justifications that they tended to use for their judgments. This finding suggests that with only minimal guidance and training on the observation process provided by NSFG staff, different interviewers did in fact use different observational strategies in the field when recording these types of judgments, consistent with our theoretical expectations. This finding certainly needs to be replicated in other survey contexts to further understand this phenomenon.

With regard to our second research question, the four clusters of interviewers identified based on the observational strategies evident in their justifications were found to vary significantly in terms of the error properties of this specific interviewer observation, when adjusting for the effects of other area- and interviewer-level covariates on accuracy. This finding suggests that variance in error rates on these types of observations may in fact be a function of varying observational strategies being employed in the field, which has important implications for standardized training on the process of recording interviewer observations (e.g., Dahlhamer, 2012).

Our results were mainly consistent with theoretical expectations, in that a focus on highly relevant cues (Funder, 1995) and more diversity in the cues used (Kazdin, 1977) were found to improve the error properties of the observations. Slightly contrary to theoretical expectations was the finding that a combination of focusing on a highly relevant cue (mention of relationship status during the screening interview) and guessing or “going on hunches” (expected to result in lower observation accuracy) was found to produce favorable error properties. A reasonable suggestion for practice would thus be to first try to identify highly relevant correlates of a feature being observed, and then go with general impressions or best guesses if those correlates are not readily available. Identification of these highly relevant features for a particular survey will require replications of this study, or at least discussions with interviewers who are found to produce highly accurate observations in a given survey. Either way, we hope that the methods described in this study will be used by other survey researchers to better understand effective observational strategies in other survey contexts.

Importantly, the different strategies used by interviewers to record the observations could also be reflective of the approaches that they take to recording the actual survey measurements in the interview. For example, an interviewer who, based on her background and expectations, uses age cues exclusively to justify her current sexual activity judgments may ultimately make comments or express opinions about risky sexual behavior in older or younger people during the actual interview. This could lead respondents to answer questions about current sexual activity in different (and possibly error-prone) ways, depending on statements made by the interviewer. A review of the literature on measurement of sexual behaviors in surveys (Fenton, Johnson, McManus, & Erens, 2001) suggests that interviewer gender can influence self-reports of sexual behavior (which would not be relevant in the NSFG, given that all interviewers are female), but also that establishment of rapport between the interviewer and the respondent can lead to reports of more frequent (and possibly exaggerated) sexual behavior. Interviewers may become more conversational and go off on tangents when rapport has been established, increasing the probability of interviewers communicating (either verbally or nonverbally) their expectations regarding the topic at hand (West & Blom, 2017), and this is when their “native” observational strategies might play a role in affecting measurement. If this were the case in practice, comparing the “strategy clusters” in terms of observation accuracy may be misleading, but this requires future research (possibly taking advantage of computer audio-recorded interviewing (CARI) technologies for recording the survey interviews).

Alternatively, if older NSFG respondents tend to provide more socially desirable responses about sexual activity with respect to their age (Fenton et al., 2001), then in the group of interviewers that tended to use age exclusively in their justifications (Cluster 4 in this study), we may be estimating the error rates associated with this strategy incorrectly. For example, if (1) respondents between the ages of 20 and 49 tend to report being sexually active when they actually are not, (2) an interviewer tends to judge current sexual activity based on age (e.g., older respondents are more likely to be sexually active), and (3) the social desirability bias is larger for older individuals than younger individuals, what seem like correct judgments for older individuals may actually be incorrect at higher rates. This speculation would require additional research as well, and each of these possibilities speaks to the importance of having good validation data for establishing the accuracy of the interviewer judgments.

Setting aside these possibilities of a link between the observational strategy used and the accuracy of the respondent reports, the findings of this study have direct practical implications for future interviewer training. At a minimum, NSFG interviewers could be provided with verbal guidance about strategies to avoid (e.g., focusing on living arrangement) and strategies to employ (e.g., using a diversity of cues and attempting to detect some hint of relationship status in the screening interview) when recording these judgments. More generally, NSFG interviewers could watch brief videos of “staged” hypothetical screening interviews, where particular cues are mentioned (e.g., “My boyfriend isn’t home right now…”), and then be asked more generally about what judgments they might record and why. Thinking even more broadly about other surveys and other types of interviewer observations, some survey programs are starting to incorporate practice sessions for recording interviewer observations based on real photographs of housing units and neighborhoods into interviewer training (e.g., Dahlhamer, 2012; Stähli, 2010). Importantly, these sessions and the “correct” responses for what observations to record should be based on the type of empirical evidence generated in the present study.

There are several opportunities for future research in this area. First, the interviewer judgments in this study were only recorded for housing units where a screening interview was successfully completed (about 91% of housing units sampled for the NSFG). In general, the literature in this area would benefit from more case studies discussing the collection and analysis of “tailored” observations that are correlated with key survey variables and response propensity for all sampled units in a given survey. More evidence of such observations contributing to effective nonresponse adjustments would provide empirical support for the continued use of this practice. Second, the observations could only be validated using information provided by survey respondents. The possibility that observations on non-responding units may have suffered from reduced quality could not be considered in this study, and future studies would need to consider alternative sources of validation data to investigate this possibility further. Third, we spoke with a lawyer about the possibility that privacy regulations affecting data collection practices in Europe may eventually impact the ability of survey organizations to collect interviewer observations, and the lawyer indicated that interviewer observations are not an issue with respect to the current European privacy laws. It will therefore be interesting to see if future debates about privacy protection make it more difficult to collect these types of observations, both in Europe and other countries.

Brady T. West is a Research Associate Professor in the Survey Methodology Program, located within the Survey Research Center of the Institute for Social Research at the University of Michigan-Ann Arbor, and also in the Joint Program in Survey Methodology at the University of Maryland-College Park.

Frauke Kreuter is a Professor in the Joint Program in Survey Methodology at the University of Maryland-College Park, Professor of Statistics and Methodology at the University of Mannheim, and Head of the Statistical Methods Research Department at the Institute for Employment Research (IAB) in Nürnberg, Germany.

Funding for this research was provided by NIH Grant #1-R03-HD-075979-01-A1. The National Survey of Family Growth (NSFG) is carried out under a contract with the CDC’s National Center for Health Statistics, Contract # 200-2010-33976. Funding comes from several agencies, including NCHS, NICHD, CDC, OPA and ACF. The views expressed here do not represent those of NCHS or the other funding agencies. We are indebted to Ziming Liao from the University of Michigan Undergraduate Research Opportunity Program (UROP) for his contributions to this work.

References

  • Ambady, N., Hallahan, M. & Conner, B. (1999). Accuracy of judgments of sexual orientation from thin slices of behavior. Journal of Personality and Social Psychology, 77, 538–547. https://doi.org/10.1037/0022-3514.77.3.538

  • Babbie, E. R. (2001). The practice of social research (9th ed.). Belmont, CA: Wadsworth/Thomson Learning.

  • Baruch, Y. & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. Human Relations, 61, 1139–1160. https://doi.org/10.1177/0018726708094863

  • Beaumont, J.-F. (2005). On the use of data collection process information for the treatment of unit nonresponse through weight adjustment. Survey Methodology, 31, 227–231.

  • Bethlehem, J. G. (2002). Weighting nonresponse adjustments based on auxiliary information. In R. Groves, D. Dillman, J. Eltinge & R. Little (Eds.), Survey nonresponse (pp. 275–287). New York, NY: Wiley.

  • Biemer, P. P. (2010). Total survey error: Design, implementation, and evaluation. Public Opinion Quarterly, 74, 817–848. https://doi.org/10.1093/poq/nfq058

  • Biener, L., Garrett, C. A., Gilpin, E. A., Roman, A. M. & Currivan, D. B. (2004). Consequences of declining survey response rates for smoking prevalence estimates. American Journal of Preventive Medicine, 27, 254–257. https://doi.org/10.1016/j.amepre.2004.05.006

  • Brick, J. M. & Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional surveys. The Annals of the American Academy of Political and Social Science, 645, 36–59. https://doi.org/10.1177/0002716212456834

  • Campanelli, P., Sturgis, P. & Purdon, S. (1997). Can you hear me knocking: An investigation into the impact of interviewers on survey response rates. London, UK: SCPR.

  • Casas-Cordero, C., Kreuter, F., Wang, Y. & Babey, S. (2013). Assessing the measurement error properties of interviewer observations of neighbourhood characteristics. Journal of the Royal Statistical Society (Series A), 176, 227–250. https://doi.org/10.1111/j.1467-985X.2012.01065.x

  • Cull, W. L., O’Connor, K. G., Sharp, S. & Tang, S. S. (2005). Response rates and response bias for 50 surveys of pediatricians. Health Services Research, 40, 213–226. https://doi.org/10.1111/j.1475-6773.2005.00350.x

  • Curtin, R., Presser, S. & Singer, E. (2005). Changes in telephone survey nonresponse over the past quarter century. Public Opinion Quarterly, 69, 87–98. https://doi.org/10.1093/poq/nfi002

  • Dahlhamer, J. (2012). New observation questions. Presentation at the 2013 NHIS Centralized Refresher Training and Conference, December 3–6, 2012, Hyattsville, MD. Available from the author ([email protected]).

  • Das, T. H. (1983). Qualitative research in organizational behaviour. Journal of Management Studies, 20, 301–314. https://doi.org/10.1111/j.1467-6486.1983.tb00209.x

  • de Leeuw, E. & de Heer, W. (2002). Trends in household survey nonresponse: A longitudinal and international comparison. In R. Groves, D. Dillman, J. Eltinge & R. Little (Eds.), Survey nonresponse (pp. 41–54). New York, NY: Wiley.

  • Deming, W. E. (1944). On errors in surveys. American Sociological Review, 9, 359–369.

  • Everitt, B. S., Landau, S., Leese, M. & Stahl, D. (2011). Cluster analysis (5th ed.). Series in Probability and Statistics. New York, NY: Wiley.

  • Feldman, J. J., Hyman, H. & Hart, C. W. (1951). A field study of interviewer effects on the quality of survey data. Public Opinion Quarterly, 15, 734–761. https://doi.org/10.1086/266357

  • Fenton, K. A., Johnson, A. M., McManus, S. & Erens, B. (2001). Measuring sexual behavior: Methodological challenges in survey research. Sexually Transmitted Infections, 77, 84–92. https://doi.org/10.1136/sti.77.2.84

  • Forgas, J. P. & Brown, L. B. (1977). Environmental and behavioral cues in the perception of social encounters: An exploratory study. The American Journal of Psychology, 90, 635–644. https://doi.org/10.2307/1421737

  • Funder, D. C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75–90. https://doi.org/10.1037/0033-2909.101.1.75

  • Funder, D. C. (1995). On the accuracy of personality judgment: A realistic approach. Psychological Review, 102, 652–670. https://doi.org/10.1037/0033-295X.102.4.652

  • Graham, R. J. (1984). Anthropology and O.R.: The place of observation in management science process. The Journal of the Operational Research Society, 35, 527–536. https://doi.org/10.2307/2581799

  • Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70, 646–675.

  • Groves, R. M. & Lyberg, L. (2010). Total survey error: Past, present, and future. Public Opinion Quarterly, 74, 849–879. https://doi.org/10.1093/poq/nfq065

  • Groves, R. M., Wagner, J. & Peytcheva, E. (2007). Use of interviewer judgments about attributes of selected respondents in post-survey adjustments for unit nonresponse: An illustration with the National Survey of Family Growth. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, Salt Lake City, UT.

  • Harris, K. J., Jerome, N. W. & Fawcett, S. B. (1997). Rapid assessment procedures: A review and critique. Human Organization, 56, 375–378. https://doi.org/10.17730/humo.56.3.w525025611458003

  • Jones, E. E., Riggs, J. M. & Quattrone, G. (1979). Observer bias in the attitude attribution paradigm: Effect of time and information order. Journal of Personality and Social Psychology, 37, 1230–1238.

  • Kazdin, A. E. (1977). Artifact, bias, and complexity of assessment: The ABCs of reliability. Journal of Applied Behavior Analysis, 10, 141–150. https://doi.org/10.1901/jaba.1977.10-141

  • Kim, Y., Choi, Y.-K. & Emery, S. (2013). Logistic regression with multiple random effects: A simulation study of estimation methods and statistical packages. The American Statistician, 67, 171–182. https://doi.org/10.1080/00031305.2013.817357

  • Kreuter, F., Olson, K., Wagner, J., Yan, T., Ezzati-Rice, T., Casas-Cordero, C., … Raghunathan, T. E. (2010). Using proxy measures of survey outcomes to adjust for survey nonresponse. Journal of the Royal Statistical Society (Series A), 173, 389–407. https://doi.org/10.1111/j.1467-985X.2009.00621.x

  • Lepkowski, J. M., Mosher, W. D., Davis, K. E., Groves, R. M. & Van Hoewyk, J. (2010). The 2006–2010 National Survey of Family Growth: Sample design and analysis of a continuous survey. National Center for Health Statistics. Vital and Health Statistics, 2, 1–36.

  • Lessler, J. & Kalsbeek, W. (1992). Nonresponse: Dealing with the problem. In Nonsampling errors in surveys (pp. 161–233). New York, NY: Wiley-Interscience.

  • Little, R. J. & Vartivarian, S. (2005). Does weighting for nonresponse increase the variance of survey means? Survey Methodology, 31, 161–168.

  • Manderson, L. & Aaby, P. (1992a). An epidemic in the field? Rapid assessment procedures and health research. Social Science & Medicine, 35, 839–850. https://doi.org/10.1016/0277-9536(92)90098-B

  • Manderson, L. & Aaby, P. (1992b). Can rapid anthropological procedures be applied to tropical diseases? Health Policy and Planning, 7, 46–55. https://doi.org/10.1093/heapol/7.1.46

  • McCall, G. J. (1984). Systematic field observation. Annual Review of Sociology, 10, 263–282.

  • McCulloch, S. K., Kreuter, F. & Calvano, S. (2010, May 14). Interviewer observed vs. reported respondent gender: Implications on measurement error. Paper presented at the 2010 Annual Meeting of the American Association for Public Opinion Research, Chicago, IL.

  • Millen, D. R. (2000). Rapid ethnography: Time deepening strategies for HCI field research. In Proceedings of DIS ’00: Designing Interactive Systems: Processes, Practices, Methods, and Techniques (pp. 280–286). Brooklyn, NY: ACM.

  • Most, S. B., Scholl, B. J., Clifford, E. R. & Simons, D. J. (2005). What you see is what you get: Sustained inattentional blindness and the capture of awareness. Psychological Review, 112, 217–242. https://doi.org/10.1037/0033-295X.112.1.217

  • Patterson, M. L. & Stockbridge, E. (1998). Effects of cognitive demand and judgment strategy on person perception accuracy. Journal of Nonverbal Behavior, 22, 253–263. https://doi.org/10.1023/A:1022996522793

  • Pickering, K., Thomas, R. & Lynn, P. (2003, July). Testing the shadow sample approach for the English house condition survey. Prepared for the Office of the Deputy Prime Minister by the National Centre for Social Research, London, UK.

  • Punj, G. & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20, 134–148. https://doi.org/10.2307/3151680

  • Repp, A. C., Nieminen, G. S., Olinger, E. & Brusca, R. (1998). Direct observation: Factors affecting the accuracy of observers. Exceptional Children, 55, 29–36. https://doi.org/10.1177/001440298805500103

  • Seidler, J. (1974). On using informants: A technique for collecting quantitative data and controlling measurement error in organization analysis. American Sociological Review, 39, 816–831.

  • Simons, D. J. & Jensen, M. S. (2009). The effects of individual differences and task difficulty on inattentional blindness. Psychonomic Bulletin and Review, 16, 398–403. https://doi.org/10.3758/PBR.16.2.398

  • Sinibaldi, J., Durrant, G. B. & Kreuter, F. (2013). Evaluating the measurement error of interviewer observed paradata. Public Opinion Quarterly, 77, 173–193. https://doi.org/10.1093/poq/nfs062

  • Sinibaldi, J., Trappmann, M. & Kreuter, F. (2014). Which is the better investment for nonresponse adjustment: Purchasing commercial auxiliary data or collecting interviewer observations? Public Opinion Quarterly, 78, 440–473. https://doi.org/10.1093/poq/nfu003

  • Stähli, M. E. (2010). Examples and experiences from the Swiss interviewer training on observable data (neighborhood characteristics) for ESS 2010 (R5). Paper presented at the NC Meeting, Mannheim, Germany, March 31–April 1, 2011.

  • Stefanski, L. A. & Carroll, R. J. (1985). Covariate measurement error in logistic regression. The Annals of Statistics, 13, 1335–1351. https://doi.org/10.1214/aos/1176349741

  • Tipping, S. & Sinibaldi, J. (2010, June 15). Examining the trade off between sampling and targeted non-response error in a targeted non-response follow-up. Paper presented at the 2010 International Total Survey Error Workshop, Stowe, Vermont.

  • Tolonen, H., Helakorpi, S., Talala, K., Helasoja, V., Martelin, T. & Prattala, R. (2006). 25-year trends and socio-demographic differences in response rates: Finnish adult health behaviour survey. European Journal of Epidemiology, 21, 409–415. https://doi.org/10.1007/s10654-006-9019-8

  • Tversky, A. & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. https://doi.org/10.1126/science.185.4157.1124

  • Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.

  • West, B. T. (2013a). An examination of the quality and utility of interviewer observations in the National Survey of Family Growth (NSFG). Journal of the Royal Statistical Society (Series A), 176, 211–225. https://doi.org/10.1111/j.1467-985X.2012.01038.x

  • West, B. T. (2013b). The effects of error in paradata on weighting class adjustments: A simulation study. In F. Kreuter (Ed.), Improving surveys with paradata: Making use of survey process information (pp. 361–388). Hoboken, NJ: Wiley.

  • West, B. T. & Blom, A. G. (2017). Explaining interviewer effects: A research synthesis. Journal of Survey Statistics and Methodology, 5, 175–211. https://doi.org/10.1093/jssam/smw024

  • West, B. T. & Kreuter, F. (2013). Factors impacting the accuracy of interviewer observations: Evidence from the National Survey of Family Growth (NSFG). Public Opinion Quarterly, 77, 522–548. https://doi.org/10.1093/poq/nft016

  • West, B. T. & Kreuter, F. (2015). A practical technique for improving the accuracy of interviewer observations of respondent characteristics. Field Methods, 27, 144–162. https://doi.org/10.1177/1525822X14549429

  • West, B. T., Kreuter, F. & Trappmann, M. (2014). Is the collection of interviewer observations worthwhile in an economic panel survey? New evidence from the German labor market and social security (PASS) study. Journal of Survey Statistics and Methodology, 2, 159–181. https://doi.org/10.1093/jssam/smu002

  • Williams, D. & Brick, J. M. (2017). Trends in U.S. face-to-face household survey nonresponse and level of effort. Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smx019

  • Zhang, D. & Lin, X. (2008). Variance component testing in generalized linear mixed models for longitudinal/clustered data and other related topics. In D. B. Dunson (Ed.), Random effect and latent variable model selection (pp. 19–36). New York, NY: Springer. https://doi.org/10.1007/978-0-387-76721-5_2

Brady T. West, Survey Methodology Program (SMP), Survey Research Center (SRC), Institute for Social Research (ISR), University of Michigan-Ann Arbor, Ann Arbor, MI 48109, USA,