
Randomized Control Trials: Really Gold?

March 14, 2018 | Pieta Blakely

A randomized control trial (RCT) is a study design in which participants are randomly divided into two groups; one group is given an intervention (the treatment group) while the other is not (the control group). It's often referred to as "the gold standard" for research design. But for the evaluation of human service programs, it might not be the best choice, or even feasible at all.


An RCT design is meant to avoid two main threats to correct findings: history effects and selection bias. History effects are the changes that would have happened anyway, given that time has passed between the beginning and the end of the program (we'll call it an "intervention" here, which is researcher-speak for "the thing that we did").

For example, let's say that I want to see whether a new curriculum helps elementary school students learn to read better. In this case, our intervention is a new reading curriculum and our evaluation will compare it to the old reading curriculum. If I test at the beginning and end of the school year, I will see that my students do, in fact, read better. But I'll have a hard time understanding how much of that is attributable to the new curriculum and how much is due to the fact that the students are growing older and being exposed to a lot of materials at home and at school during the year. If I were to set up an RCT study to evaluate the new curriculum, I might assign half the students to the new curriculum and half to the old curriculum. At the end of the year, the difference in average reading ability between the two groups should be attributable to the new curriculum.
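The comparison at the heart of an RCT can be sketched in a few lines of Python. This is a toy simulation, not real data: the roster size, the scores, and the 5-point curriculum boost are all invented for illustration.

```python
import random

random.seed(42)

# Hypothetical roster of 200 students, identified by number.
students = list(range(200))
random.shuffle(students)

# Random assignment: first half gets the new curriculum, second half the old one.
treatment = set(students[:100])

def simulate_post_score(student_id):
    """Toy end-of-year reading score: everyone improves over the year
    (the history effect), and the new curriculum adds a small boost on top."""
    growth = random.gauss(60, 10)  # growth that happens regardless of curriculum
    boost = 5 if student_id in treatment else 0
    return growth + boost

scores = {s: simulate_post_score(s) for s in students}

control = [s for s in students if s not in treatment]
treat_mean = sum(scores[s] for s in treatment) / len(treatment)
control_mean = sum(scores[s] for s in control) / len(control)

# Because assignment was random, the history effect hits both groups equally,
# so the difference in means estimates the curriculum's effect (about 5 points).
print(round(treat_mean - control_mean, 1))
```

Because both groups experience the same year of growth, subtracting the control mean cancels the history effect and leaves only the curriculum's contribution, up to sampling noise.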

The reason I would assign the two groups randomly is to avoid selection bias. If I let the teachers choose, they might have the lowest-performing students adopt the new curriculum, hoping that they would get the most benefit out of it. Or, if we let parents choose, the children of the most vocal and engaged parents might stay with the old curriculum, while students whose parents are less engaged would interact with the untested new curriculum. If we let the students choose, students who are enjoying the current curriculum might elect to stay put while students who struggle with the current materials would elect to adopt the new curriculum. Each of those would mean that the new curriculum group was different from the old curriculum group in ways that would have affected their end-of-year reading ability whether or not there was an intervention. We would again have a hard time attributing differences in outcomes to the curricula themselves.

One of the significant advantages of RCT is that the two groups will usually be similar in composition and ability. If assignment is truly random, they should have roughly the same numbers of each demographic category, and also be well mixed in terms of harder-to-measure elements like commitment, wherewithal, or determination.

But these advantages rest on the assumption that the researcher controls both what the treatment group gets and what the control group gets. In a resource-rich environment like Boston, where I live, that's often not the case. Here is an example: a co-worker of mine once told me that she wanted to enroll her son (age 7) in an after-school program at the YMCA. She was especially interested in this program because it included a mentoring component. But her son was not eligible to participate that year because of his age. Concerned about the extra time he would have on his hands, she signed him up for soccer instead, where he would interact with a different caring adult: the coach.

In this case, one of the issues with RCT is that you might not be sure exactly what you are comparing your intervention to. The participants who were assigned to the control group might sign themselves up for various types of alternatives, making it hard for you to understand whether your program is being compared to nothing or some other collection of interventions.

Other issues with RCT in the context of human services are that it's expensive and onerous for the participants in the control group. You'll have to find a way to follow up with them even though you aren't serving them. This might not be the best use of your evaluation resources or their time.

Another concern is that the treatment group and the control group will interact and share information. If the students who have access to the new curriculum really love it, won't they share some of their learning with their classmates? A more detailed analysis of some of the reasons why you might not want to use a randomized control design is outlined here.

So what to do instead? There are a few quasi-experimental methods that can yield strong results. One method is called regression discontinuity design. In this method, we have to select a cutoff. Let's return to the example of students taking a reading assessment. We will let the teachers have their way and assign the lower-performing students to the new curriculum. Imagine that the assessment is scored on a scale from 0 to 100. Every student who scores 49 points or less on the pre-test will be placed in the new curriculum group, while those with scores at or above 50 stay in the old curriculum group. 

The key with a regression discontinuity is that there is probably not actually much difference in the reading ability between a student who earned 49 points and one who earned 50 points. That difference is attributable to luck or having eaten a better breakfast on the day of the assessment. So we can compare the change from pre-test to post-test for students whose scores were very close to the cut-off in order to understand the effect of the new curriculum. A detailed set of instructions is here.
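That near-the-cutoff comparison can be sketched in Python. All of the scores below are invented, and a real regression discontinuity analysis would fit regression lines on each side of the cutoff rather than simply comparing mean gains in a narrow band; this just shows the core idea.

```python
# Toy regression-discontinuity comparison around the 50-point cutoff.
# (pre_score, post_score) pairs; students scoring under 50 on the pre-test
# were placed in the new curriculum group.
records = [
    (44, 58), (46, 61), (47, 60), (48, 63), (49, 62),  # below cutoff: new curriculum
    (50, 57), (51, 58), (52, 60), (53, 59), (55, 61),  # at/above cutoff: old curriculum
]

CUTOFF = 50
BANDWIDTH = 6  # only compare students within 6 points of the cutoff

below = [post - pre for pre, post in records if CUTOFF - BANDWIDTH <= pre < CUTOFF]
above = [post - pre for pre, post in records if CUTOFF <= pre < CUTOFF + BANDWIDTH]

gain_new = sum(below) / len(below)
gain_old = sum(above) / len(above)

# Students scoring 49 vs. 50 are essentially interchangeable, so the gap in
# their average pre-to-post gains is attributed to the new curriculum.
effect = gain_new - gain_old
print(round(effect, 1))
```

The bandwidth is a judgment call: a narrower band makes the two groups more comparable but leaves fewer students to average over.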

Another method is called difference-in-differences. This method assumes that while some students are ahead of others, all of them would be expected to advance at the same pace, so the gap in average scores between the two groups would stay constant over time. If the gap at the end is not the same as it was at the beginning, the change in the gap is attributable to the intervention.

A diagram illustrates this: pre- and post-scores for the two groups form two lines with non-parallel slopes, and the difference in slopes is attributed to the intervention.
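The arithmetic behind difference-in-differences is simple enough to show directly. The group averages below are invented for illustration.

```python
# Toy difference-in-differences: two groups measured before and after.
# Average reading scores (pre, post) for each group.
treated_pre, treated_post = 42.0, 58.0  # got the new curriculum
control_pre, control_post = 55.0, 63.0  # stayed with the old one

# Each group's raw change mixes the intervention with history effects.
treated_change = treated_post - treated_pre  # 16 points
control_change = control_post - control_pre  # 8 points

# Subtracting the control group's change strips out the shared history effect;
# what remains is attributed to the intervention.
did_estimate = treated_change - control_change
print(did_estimate)  # prints 8.0
```

Note that the treated group can start well behind the control group; difference-in-differences only requires that the gap would have stayed constant without the intervention (the "parallel trends" assumption).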

These are just two methods -- there are many more, some of which require very large sample sizes or advanced statistical methods. Some good sources of information on these methods and more are here and in this excellent video.



Tags: methods, program evaluation