When the RCT Meets the Road: Why Randomized Controlled Trials Look Different in the Field Than They Do in Class
It starts with curiosity. Once something has our curiosity, we begin to study it. After a period of study, we begin to believe that we understand it. At this point, we find ways to put our understanding to the test. Using the test results, we update our understanding. It is through this iterative process that we deepen our insight.
At Busara, we’re always trying to find ways to gain as much high-quality insight from each iteration as possible through a rigorous research process. This usually means balancing four competing objectives: representativeness, high power, low risk of spillovers, and budget constraints. All of these trade-offs can initially be represented by clean formulas and numbers, making this the ideal challenge for any number-savvy researcher. However, all good researchers also know that reality is much messier than the mathematically cleanest approach would like it to be; that’s the nature of the curiosity game.
This is why, when planning a study to understand the impact of watching a new TV show on literacy, confidence and curiosity, and gender norms, we were ready for the field to challenge all our initial assumptions. It is also why we were not surprised to find ourselves, along with EdTech Hub, a partner on this study, needing to pivot multiple times, showing once more that mathematics and creativity are not separate but complementary elements of an iterative, responsive, and ultimately viable research strategy.
Let’s start at the beginning.
First, a meticulous methodology
We chose to examine our questions in a randomized controlled trial (RCT) of naturalistic viewing of the show (at-home TV) to make the results as policy-relevant as possible. Following this decision, we were privileged to have enough time to think through our research design thoroughly. In various conversations with experts from academia, practice, the public sector, and our internal field teams, we agreed on an optimal trade-off.
To maximize representativeness, we randomly selected a mix of six eligible counties, each in a different region of Kenya. An eligible county was defined as one containing at least one sub-county in which schools averaged at least 40 children in grades 1–3 living in a household with access to a TV. We used the former provinces as regions: Coast, North Eastern, Eastern, Central, Rift Valley, Western, and Nyanza. We excluded Nairobi due to the risk that the area was already over-saturated with the intervention under study, as well as the increased risk of spillovers in its dense and mobile population.
At this point, we only had publicly available datasets to work with: the Kenya Population and Housing Census (2019) and the Kenya Basic Education Statistical Booklet (2019). Using this information, we needed to ensure that the county mix included, in total, at least 3,000 public schools within eligible sub-counties. This would give us enough leeway to find an appropriate sample while also making fieldwork more logistically feasible, since we would only consider areas where we expected a large enough study-relevant population.
We would then subset to schools with an expected 20 eligible boys and 20 eligible girls. From these schools, we planned to randomly sample 400 (350 plus 50 buffer schools), each at least 5 km from the others, in order to minimize the risk of spillovers.
Finally, we would visit each sampled school and, from their roster, randomly choose 10 girls and 10 boys to survey.
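The spacing constraint in this sampling plan can be sketched as a greedy filter: shuffle the geotagged schools, then keep a school only if it lies at least 5 km from every school already kept. Below is a minimal Python sketch of that idea; the list-of-dicts data structure and field names are hypothetical, not the study’s actual data format or procedure.

```python
import math
import random

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def sample_spaced_schools(schools, n, min_km=5.0, seed=0):
    """Randomly draw up to n schools whose pairwise distances are all >= min_km.

    `schools` is a list of dicts with a 'coords' (lat, lon) key, a
    hypothetical structure used here purely for illustration.
    """
    rng = random.Random(seed)
    pool = schools[:]
    rng.shuffle(pool)  # random order => a random admissible subset
    chosen = []
    for school in pool:
        # Keep the school only if it is far enough from all kept schools.
        if all(haversine_km(school["coords"], c["coords"]) >= min_km
               for c in chosen):
            chosen.append(school)
        if len(chosen) == n:
            break
    return chosen
```

A greedy pass like this does not guarantee the largest possible spaced subset, but with seven times as many candidate schools as needed (as in the plan above), it comfortably finds enough.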
Our strategy included several conservative considerations (so that we could be extra sure that everything would go smoothly). For example, the counties we picked were expected to have more than seven times the number of schools we needed. Our assumption that all eight grades would have an equal number of children also likely underestimated the number of children in grades 1–3, which are typically larger due to later dropouts. Similarly, we aimed for 40 eligible children per school, although we only needed 15.
Then, reality struck
Once we had a plan in hand, we got our hands dirty and worked through the data, yet we were unable to find a sample that fulfilled all our criteria. Information on TV ownership and school location was both limited and somewhat outdated. Even after manually geotagging hundreds of primary schools using Google Maps, we had to relax our eligibility criteria to 12 eligible girls and 12 eligible boys per school, with schools 4 km apart. Given our actual needs, we still felt pretty good about these assumptions.
Pivot point #1: Unexpected lack of permissions
With the initial challenges averted, we were ready to plan fieldwork. But after months of engagement, we failed to get the necessary permissions to survey children in public schools: the TV show’s content had not yet been approved by the relevant government institutions, which left the study without school-based access to our sample. Back at the drawing board, we recognized the need to discuss the following questions:
Methodological considerations

• Should the research design remain clustered at the school level, or should we consider increasing our power by randomizing at the household level, given that we cannot access the sample via schools?
• Might village or neighborhood clustering improve our power without increasing the risk of spillovers, which we would expect to be high under household randomization?
• Will the sample still be representative enough if we cannot sample from the full population of children within a given primary school?
• How would a non-school-based approach influence attrition? Do we expect attrition to be higher when surveying at schools (children may not be present on the day we survey) or at households (parents and children aren’t present all day, and consent rates may be lower during leisure time)?

Practical considerations

• Can we practically find (“recruit”) enough eligible children to survey?
• Can we still manage to survey the sample we need within the initially envisioned cost and timeframe?
• Can we change our surveying method to be household-based given our acquired Institutional Review Board (ethics) approval?
We felt strongly that the risk of spillovers across households within a neighborhood was much more problematic than the power loss from clustering. As such, we held on to our sample of schools, and our field team found a way to recruit by phone via village chiefs, with the support of the County Commissioner’s Office, and through subsequent snowballing.
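The power cost of clustering that sits behind this trade-off is usually summarized by the design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC the intra-cluster correlation. A small Python illustration (the ICC value below is purely illustrative, not an estimate from this study):

```python
def design_effect(cluster_size, icc):
    """Standard design effect for equal cluster sizes: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations a clustered sample is 'worth'."""
    return n_total / design_effect(cluster_size, icc)
```

For example, with 20 children per school and a hypothetical ICC of 0.10, DEFF is 2.9, so a sample of 7,000 clustered children behaves like roughly 2,400 independent observations; household-level randomization avoids that penalty but, as noted above, raises the spillover risk.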
Within days, we set up a team to complete an eligibility survey with as many parents as possible. The idea was to first find all eligible children for a given school, and then randomly sample an equal number of boys and girls in order to reach our full sample with equal cluster sizes. We understood that being unable to sample from a full school roster risked a slight bias in our sample. However, initial attempts suggested that this method could reach almost all eligible children in our sampled schools.
Pivot point #2: The fallacy of assumptions
As we completed our eligibility surveys in more and more schools, we learned that TV ownership was anything but uniformly distributed across sub-counties, even though uniformity was a key assumption in our initial calculations. Although we had already sampled our schools based on the expected number of eligible children, there were many schools in which we could not find enough children with TVs at home to maintain equal cluster sizes and an equal number of boys and girls. Once more, we investigated the situation at hand:
Methodological considerations

• How large is the loss of power if we accept unequal cluster sizes, compared to dropping schools where we cannot reach our minimum cluster size?
• Can we redistribute our sample across counties if some counties have many eligible children and others have few, or would that affect representativeness too much?
• Given the diminishing returns (on power) of each additional child in schools with many eligible children, is there a maximum number of children we should survey in any given school?
• Should and can we still stratify on gender, even if we cannot reach an equal gender split in every school?

Practical considerations

• Does it make sense to travel to areas where there are very few eligible children?
• Can we hold our trained field team together long enough to evaluate and resolve the necessary methodological adjustments?
• Can we increase our overall sample within the given budget through efficiency gains from having larger teams in dense areas and smaller teams elsewhere?
We had committed to a clustered design at this point, and losing clusters appeared to be the costliest option. Moreover, it seemed reasonable to survey more children in counties with more eligible children, as this suggests better overall representativeness than an equal number per county. Finally, there were no cost savings in artificially surveying fewer children in schools with many eligible children, so we did not bother to define a maximum number.
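The first methodological question above, how much power unequal cluster sizes cost, has a standard approximation due to Eldridge and colleagues: DEFF ≈ 1 + ((CV² + 1) · m̄ − 1) · ICC, where m̄ is the mean cluster size and CV its coefficient of variation. A sketch in Python with purely illustrative numbers, not this study’s actual cluster sizes or ICC:

```python
import statistics

def design_effect_unequal(cluster_sizes, icc):
    """Approximate design effect under unequal cluster sizes
    (Eldridge et al.): DEFF ~= 1 + ((CV^2 + 1) * m_bar - 1) * ICC,
    where m_bar is the mean cluster size and CV its coefficient
    of variation. Reduces to 1 + (m - 1) * ICC when sizes are equal."""
    m_bar = statistics.mean(cluster_sizes)
    cv = statistics.pstdev(cluster_sizes) / m_bar
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * icc
```

With a mean cluster size of 20 and an ICC of 0.10, moving from equal clusters to sizes of 10, 15, 25, and 30 raises the design effect from 2.9 to about 3.2, a modest penalty compared with dropping whole clusters, which is consistent with the decision above.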
Ultimately, having to react and adapt multiple times had its advantages. For example, finding out about the number of eligible children in advance avoided a large field team arriving at a school with almost no eligible children. This would have had significant cost repercussions, both in terms of money and lost sample size.
Eventually, we were able to survey 346 schools and almost 4,400 children, putting us on the road to meaningful and important study results for both policymakers and education professionals, with considerable potential for tangible impact on children’s skills. We used baseline information to make up for some of the power losses through a multivariate quadruplet-matched randomization technique. In this happy ending, the assumption universe was finally on our side: our baseline data shows much lower intra-cluster correlations (ICCs) than expected and, as such, the research team feels more “power-ful(l)” than ever, ready to face another learning phase as we roll out our intervention and encouragement design.
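As a rough illustration of the quadruplet-matching idea (the study’s actual multivariate procedure may differ in how it matches units), one can score units on a composite of standardized baseline covariates, group consecutive blocks of four similar units, and randomize two of each quadruplet to treatment. The data structures below are hypothetical:

```python
import random
import statistics

def quadruplet_randomize(units, covariates, seed=0):
    """Sketch of quadruplet-matched randomization.

    `units` is a list of unit ids; `covariates` maps each id to a list
    of baseline measures (hypothetical, purely for illustration).
    Units are sorted on the mean of their standardized covariates,
    grouped into consecutive quadruplets, and within each quadruplet
    half are randomly assigned to treatment.
    """
    rng = random.Random(seed)
    k = len(next(iter(covariates.values())))
    means = [statistics.mean(covariates[u][j] for u in units) for j in range(k)]
    sds = [statistics.pstdev(covariates[u][j] for u in units) or 1.0
           for j in range(k)]
    # Composite score: mean z-score across covariates.
    score = {u: sum((covariates[u][j] - means[j]) / sds[j] for j in range(k)) / k
             for u in units}
    ordered = sorted(units, key=score.get)
    assignment = {}
    for i in range(0, len(ordered), 4):
        quad = ordered[i:i + 4]
        treated = rng.sample(quad, k=len(quad) // 2)
        for u in quad:
            assignment[u] = "treatment" if u in treated else "control"
    return assignment
```

Because treatment and control are balanced within every block of four baseline-similar schools, chance imbalance on the matched covariates is mechanically limited, which is what recovers some of the power lost to clustering.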
¹We only considered public schools.
²We expected many children to watch TV and, consequently, the show, which would make it difficult to find a compliant control group.