Improving Psychological Science Requires Theory, Data, and Caution: Reflections on Lynott et al. (2014)

Lawrence E. Williams

University of Colorado, Boulder, CO, USA

Lynott and colleagues (2014) replicate the procedures reported in Study 2 of Williams and Bargh’s (2008) paper examining the relationship between experiences with physical warmth and feelings of psychological warmth. The replication research reported by Lynott and colleagues is competently executed, and the researchers took special care to recreate the procedures and experiences of the original study. Indeed, the replication research likely improves upon the original approach, as the replication researchers had sufficient resources to coordinate across sites and collect data on a large sample of participants, ensuring greater experimental power. Despite these improvements, across multiple investigations at multiple experimental sites, they fail to detect the effect of the temperature manipulation reported in the original study.

Replication attempts are certainly worthwhile and necessary, and in the aggregate they do well to independently verify scientific findings; the work of Lynott and colleagues is no exception. Indeed, Lynott et al.’s replication attempt is a welcome addition to the literature for all researchers interested in the link between physical and psychological warmth. That said, the value of replication research in social psychology generally would be enhanced by deeper engagement with psychological theory, a broader consideration of the data supporting the existence of an effect, and increased caution when declaring original findings invalid. These three considerations can help us in our efforts to improve the quality of psychological science, by fostering a collaborative environment in which replication researchers and authors of original findings work together on their mutual goal of increasing the quantity and quality of our scientific knowledge.

The Necessity of Theory

Direct replication attempts can be more instructive and generative if they engage with the original findings on a deeper, theory-driven level. Without an appeal to theory, the reasons for the discrepant findings are unclear. The effects reported by Williams and Bargh (2008) were guided by the basic principles of behavioral priming research, which are grounded in long-standing theories regarding the associative nature of memory (Tulving & Schacter, 1990), the interplay between thinking and action (Lashley, 1951; Prinz, 1987), and the integration of bodily information into psychological outputs (Barsalou, 1999; Clark, 1998). Such foundational knowledge led us to hypothesize that experiences with physical warmth would be capable of activating the concept of psychological warmth in memory. This association between physical and psychological warmth is expected due to our early life interactions with caregivers, which feature the repeated co-occurrence of physical warmth (close contact with a warm human body) and psychological warmth (being cared for; cf. Bowlby, 1969; Harlow, 1958). Given spreading activation, we suspected that associated concepts such as generosity would be more available in working memory following exposure to a physically warm (vs. cold) stimulus. Direct replications that engage with the theories that guided the original findings can fruitfully assist with theory refinements, pushing the field ahead.

Taking Conceptual Replications Into Account

On the basis of their investigations, Lynott and colleagues (2014) conclude “there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs” (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

Accordingly, there have been several conceptual replications and extensions of Williams and Bargh (2008), which support the proposition that bodily experiences with physical warmth influence outcomes tied to psychological warmth. For example:

  • In a recent study examining the shared neural mechanisms for physical and social warmth experiences, Inagaki and Eisenberger (2013) find that exposure to a warm pack (vs. a room temperature ball) led participants to report stronger feelings of social connection to close friends and family, and that reading positive messages from close friends and family (vs. neutral messages) led participants to report feeling physically warmer. These researchers also found that exposure to physical and social warmth cues similarly activated regions of the middle insula and ventral striatum (this neural overlap was hypothesized but not measured in Williams & Bargh, 2008).
  • In other convergent work in neuroendocrinology, exposure to warm temperatures (vs. room temperature) activated serotonergic neural systems in rats, which is expected to shape not only thermoregulation but affective experience as well (Hale, Dady, Evans, & Lowry, 2011; Lowry, Lightman, & Nutt, 2009).
  • Finally, in a series of related papers, IJzerman and colleagues demonstrate the close relationship between physical and psychological warmth (IJzerman, Karremans, Thomsen, & Schubert, 2013; IJzerman & Semin, 2009, 2010; Szymkow, Chandler, IJzerman, Parchukowski, & Wojciszke, 2013). In one study, the authors exposed children to warm (vs. cold) ambient temperatures and later assessed their generosity toward other children (measured by the number of stickers they were willing to share). For children who were securely attached, the temperature manipulation significantly affected their pattern of giving, such that those in the warm room gave more stickers away, compared to children in the cold room. The temperature manipulation had no effect on children who were insecurely attached (IJzerman et al., 2013). This moderation pattern is consistent with a view that the relationship between physical and psychological warmth is established via close, physical, caring contact with caregivers early in life; such contact is typically absent for insecurely attached children (Ainsworth, 1979).

Thus, across a wide spectrum of empirical approaches and experimental settings, a set of research findings converges on the idea that there is a meaningful relationship between exposure to physical warmth and (psychological, neurobiological, and behavioral) outcomes related to psychological warmth.

Proceeding With Caution

Against the backdrop of these conceptual replications, what do we learn from Lynott and colleagues’ failed replication? The most important lesson to be gleaned here is that we researchers must exercise more caution in the way in which we draw conclusions from studies, original findings and replications alike.

As Simons (2014) notes, “…a single failure to replicate should not be treated as definitive evidence against the existence of an effect” (p. 78). Yet the other side of the coin is similarly true; worthwhile research methods courses caution would-be scientists that a single statistically significant finding should not be treated as definitive evidence that an effect exists. Replication is always necessary, but the conclusions that can be drawn from a small handful of studies are limited and must be circumspect (cf. Tversky & Kahneman, 1971). In the same way that it may be tempting for a researcher to believe that she discovered something true about the world with a single significant finding, it is similarly tempting for replication researchers to believe that they discovered that a previously published finding is truly false with a single failure to replicate. This is not the way science works. As Cesario (2014) notes, “there needs to be an appreciation…that every study is merely one data point in the cumulative, ongoing practice of science” (p. 45).


Social psychology is facing a crisis of character, which has led some to start pushing for a more valid science. By engaging with theory, taking both direct and conceptual replications into account, and exercising caution when interpreting findings, replication researchers will find that they have a host of allies pursuing the same goal.

High Quality Direct Replications Matter

Response to Williams (2014)

Katherine S. Corker, Dermot Lynott, Jessica Wortman, Louise Connell, M. Brent Donnellan, Richard E. Lucas, and Kerry O’Brien

Kenyon College, Gambier, OH, USA; Lancaster University, Lancaster, UK; Michigan State University, USA; Monash University, Australia; University of Manchester, UK

Abstract. We respond to Williams’ (2014) comments on our three failures to replicate Study 2 from Williams and Bargh (2008). We clarify our conclusions on this topic, making clear that although the results of our studies cast doubt on the specific effect reported in Williams and Bargh (i.e., that instant hot and cold packs influence choice of reward for self or friend), a more complete understanding of the embodiment hypothesis in question requires consideration of relevant conceptual replications. Accordingly, we consider the strength of the evidence in the conceptual replications that Williams identifies and find that small samples appear to be the norm. We conclude that in order for researchers to move forward, future studies must take seriously issues of power, researcher degrees of freedom, and file drawer problems. Doing so will ensure that future studies are more informative tests of this hypothesis.

Keywords: replication, embodiment, temperature

We thank Dr. Williams for his thoughtful comments on our three replication attempts of Williams and Bargh (2008), Study 2. We also want to acknowledge that Dr. Williams participated fully in the preregistration process of our studies, and we appreciate that he shared the original data and clarified aspects of the original procedure for us. Our efforts were strengthened by his input and graciousness. Indeed, as he notes in his reply, we are surely “allies pursuing the same goal” of truth (p. 6). There are but a few issues we wish to clarify in response to his concerns.

Making Our Conclusions More Precise

First, Williams (2014) rightly notes that a single failure to replicate does not provide definitive evidence that an effect is not real. The results from failed replication attempts should not be privileged, just as results from original studies should not be given more weight just because they were published first. We regret that we were not clearer in our abstract when we wrote that “there is not evidence that brief exposure to warm therapeutic packs induces greater prosocial responding.” We should have been more careful to emphasize that this conclusion referred only to the results of our three studies. It is certainly possible that despite our failure to find evidence for this phenomenon, the effect does exist.

However, as more and more evidence accumulates, the weight of that evidence may begin to suggest that the original finding was dependent on the precise contextual features of the original study or that the hypothesis underlying the effect was not correct. We reported three independent failures to replicate the original results, each with a sample size over four times larger than that of the original study. Taking all of this into account, the cumulative evidence for this specific effect (i.e., the ability of instant hot and cold packs to influence the choice of a treat for self vs. friend) is not especially strong.

Scientific Judgment and the Importance of Direct vs. Conceptual Replications

In his response to our paper, Williams (2014) appeals to the need for strong theory to guide research on behavioral priming studies like those we attempted to replicate. He makes a compelling case that the theory that led to this study would predict the very result he (and we) set out to examine. However, he also appears to argue that the strong theoretical basis for this prediction somehow lessens the importance of our repeated failures to replicate the original result. We find this position to be somewhat surprising – if this finding is so clearly predicted by theory, then the failure to find such an effect in multiple, large-sample replications should be especially problematic for that theory. Thus, we suggest that direct replications inform theories more so than the other way around. Indeed, direct replications are generative to the extent that they motivate future revisions to a theory and generate additional experimental tests involving different ways of operationalizing variables.

In discussing the theoretical context for these results, Williams (2014) also highlights several studies that serve as conceptual replications of the original Williams and Bargh (2008) finding. Conceptual replications “try to operationalize the underlying theoretical variables using different manipulations and/or measures” (Stroebe & Strack, 2014, p. 60). We acknowledge that there is debate within the field about the relative merits of conceptual versus direct replications. However, we strongly disagree with the position espoused by Stroebe and Strack (which Williams seems to endorse) that conceptual replication serves the same goals or can take the place of direct replication. The problem is that in designing conceptual replications additional decisional flexibility can make false positives more likely (Simmons, Nelson, & Simonsohn, 2011). Further, when reviewing bodies of evidence, researchers must decide what “counts” and what doesn’t count as a conceptual replication, adding “researcher degrees of freedom” to the research process (Simmons et al., 2011). A conceptual replication that does not support a theory may be ignored in a review, or it may be tucked away in the file drawer where it can’t be considered by other researchers (LeBel & Peters, 2011).

Weighing the Evidence

Regardless, let us consider these conceptual replications together with the current direct replications. Table 1 displays the effect size estimates (converted to Cohen’s d) for Williams and Bargh (2008), the current replication studies, and relevant conceptual replications identified by Williams. The table contains only studies that link temperature manipulations to prosocial or related behavior.¹ We also included one additional study not mentioned by Williams (Kang, Williams, Clark, Gray, & Bargh, 2011), which tests the effect of hot and cold instant packs on trust.

Table 1. Effect sizes for temperature on prosocial outcomes

One clear message from Table 1 is that the average sample size in these conceptual replications (omitting the current replications) is extremely small, with an average of just 38 participants (SD = 13.71).² Our direct replication studies are each 6.5–8 times larger than this average. Sample sizes in the range of 176 are typically required to have adequate power (80%) to detect the average d = .43 effect size in social psychology (see Richard, Bond, & Stokes-Zoota, 2003). The effect sizes in Table 1 are much larger than this average. However, publication bias and the file-drawer effect make it likely that the actual effect is much smaller than these estimates. Thus, the literature cited by Williams (2014) is largely based on underpowered samples. Future tests of this hypothesis – both conceptual and direct – should be conducted with an eye toward achieving adequate power.
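The power figures above can be roughly checked with a short calculation. The following is a minimal sketch, not the procedure used for the numbers in the text: it assumes a two-group between-subjects design with equal group sizes, a two-sided alpha of .05, and a normal approximation to the t test. The function names `n_per_group` and `approx_power` are illustrative, and splitting the average sample of 38 into two groups of 19 is an assumption made only for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate participants needed per group to detect effect size d.

    Normal approximation for a two-sided, two-sample comparison with
    equal group sizes (an assumption of this sketch).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = _STD_NORMAL.inv_cdf(power)          # quantile for desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n_group, alpha=0.05):
    """Approximate power with n_group participants in each of two groups.

    Ignores the negligible probability of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n_group / 2) - z_alpha)


if __name__ == "__main__":
    needed = n_per_group(0.43)
    print("per group:", needed, "total:", 2 * needed)
    # Hypothetical even split of the average sample of 38 participants.
    print("power at 19 per group:", round(approx_power(0.43, 19), 2))
```

Under these assumptions, d = .43 calls for roughly 85 participants per group (about 170 in total), slightly below the figure of 176 cited above, as expected because the normal approximation understates what an exact t-based calculation requires. With 19 participants per group, the approximate power to detect d = .43 is only about one in four.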

In the end, we hope that there is a process in psychological science that would lead researchers to abandon theories that lack robust empirical support, and we believe direct replications play an important role in this process. Nonetheless, we realize that psychological theories do not live and die by the results of a single study or set of studies. We agree with Williams (2014) that more work is needed to better understand how temperature priming impacts thoughts, feelings, and behaviors. Our hope is that advocates of temperature priming research do not dismiss strong failures to replicate theoretically important results by focusing on conceptual replications that confirm their original beliefs. Instead, we urge researchers to conduct future tests of the theory with large samples so that results are informative.

¹ Studies that tested the opposite hypothesis of behavior impacting judgments of temperature are excluded (IJzerman & Semin, 2010; Szymkow, Chandler, IJzerman, Parchukowski, & Wojciszke, 2013; Zhong & Leonardelli, 2008). All of these studies failed to meet the basic requirement of a conceptual replication in that they did not test the same hypothesis as Williams and Bargh (2008). However, these studies evinced roughly similar effect and sample sizes to the studies in the table. Also excluded from the table are Hale, Dady, Evans, and Lowry (2011; a study of rats with N = 24) and Lowry, Lightman, and Nutt (2009; a review paper).

² Including the remaining studies with human participants (IJzerman & Semin, 2010; Szymkow et al., 2013; Zhong & Leonardelli, 2008) increases the average sample size to N = 50 (SD = 19.18).