Module 12: Evaluation Study Designs

Qualitative Design

It is uncommon for evaluators to use only qualitative methods when evaluating a program. Some clients are more at ease making decisions based on quantitative data, while others are wary of the significant costs and time associated with qualitative data collection. It is also very difficult to generalize to other populations based on qualitative data, which makes program replication and scale-up a challenge. On the other hand, supporters of qualitative evaluations believe that context is such a large factor in program implementation success that generalizing to other populations is not possible regardless of the methodology chosen.(1) Nevertheless, qualitative evaluation methods are extremely beneficial in providing rich program feedback, and evaluators should consider integrating them into the evaluation plan, i.e., using a "mixed methods" approach. For more information on qualitative data collection methods to incorporate into program evaluations, please refer to the certificate in Global Health Research.

Quantitative Designs

If clients or stakeholders are interested in a quantitative evaluation plan, there are a number of design sequences that can be used, depending on time, money, and availability of data. Depending on the needs of the stakeholders and the purpose of the evaluation, study designs can be experimental (with a traditional control group), quasi-experimental (with a comparison group that may not necessarily be a control group), or non-experimental (where no formal control or comparison group exists).(2) It is important to note how evaluation differs from monitoring: monitoring data, such as monthly reporting forms, stock-outs for supply chain management, and some disease tracking, are collected consistently and frequently throughout the life of a project. Please refer to http://www.uniteforsight.org/metrics-course/monitoring-evaluation for more information on the differences between monitoring and evaluation processes.

In many cases, it is impossible to randomly assign individuals to experimental and control groups, as is often done in clinical or randomized controlled trials (RCTs). When an RCT is not feasible, comparison groups are selected that match the target population on a number of population characteristics, so that they closely resemble the group that receives the intervention. This is the main difference between experimental and quasi-experimental designs.(3) For example, if the intervention is held in a school, a classroom that is not receiving the intervention may be used as a comparison group; this helps ensure that the children in the intervention and comparison groups have had somewhat similar school experiences, are most likely in the same age range, and may live in the same geographic area.(4)

Quasi-experimental design (QED) is the most common quantitative design in global health evaluation. There are various schedules of data collection, differing in level of robustness according to time, money, and availability of data (see: Constraints on Evaluation).

It is recommended that QED data collection occur at four different time periods: pre-intervention, mid-intervention, post-intervention, and ex-post intervention. Pre-intervention data (also referred to as baseline data) is collected on relevant evaluation indicators prior to intervention. Evaluators and program implementers look for changes in the intervention group from pre- to post-intervention data, signaling the possibility of project impact. Baseline data may also be collected far in advance of program development as a way to determine the needs of the community in more participatory approaches. Mid-intervention measures occur around the mid-point of intervention, which will vary according to the timeline of intervention. Post-intervention data is collected immediately after the intervention has ended, and ex-post data is collected much later, possibly 5 to 10 years after the intervention has ended, in order to assess the impact or sustained effects of the program.(5) At each of these times, data may be collected from the intervention group, the comparison group, or both. The source and time of data collection will depend on the level of available resources and the needs of the evaluation.

In an ideal situation with surplus resources, evaluators would collect data on relevant indicators from both groups at all four times. Although it lacks the randomization of experimental designs, this design is the closest to a randomized controlled trial and can be visually represented as follows:


| Group        | Baseline | Mid | Post | Ex-Post |
|--------------|----------|-----|------|---------|
| Intervention | X        | X   | X    | X       |
| Comparison   | X        | X   | X    | X       |

The "X"s represent data collected at that time point.

Time, budget, and data constraints often deter the use of this robust and extensive schedule in health program evaluations. First, evaluators may not be brought into the process until the intervention is already in place, which eliminates the chance of collecting baseline data unless a pre-existing record is available. Second, budgets may not allow for mid-intervention measurements or guarantee ex-post measurements, which may be considered less critical than baseline and post-intervention measures for determining project outcomes. Third, depending on how large the required sample size is, data collection that is considered non-essential may be cut in order to afford an adequate sample within the budget. Finally, budget, time, and data constraints may eliminate the possibility of a comparison group: a valid comparison group may not exist, there may not be sufficient funding for comprehensive data collection, or late intervention may preclude the creation of a comparison group. As the number and type of measurements are reduced, the model of data collection becomes less robust and more susceptible to invalidity.(6)

The least robust QED is represented only by a post-intervention measure in the intervention group, shown below:


| Group        | Baseline | Mid | Post | Ex-Post |
|--------------|----------|-----|------|---------|
| Intervention |          |     | X    |         |
| Comparison   |          |     |      |         |

Although it occurs frequently in global health evaluation, this schedule can show neither change over time in the intervention group nor the impact of the intervention relative to a comparison group, which are two main goals of evaluation.(7) Instead, evaluators collect data on participants in the program (those who have received the intervention) and attempt to make the most relevant and conclusive statements possible about the success of the program. At this stage, evaluators may choose to supplement quantitative data with qualitative methods in order to bolster the findings. Recall is a useful strategy for estimating a rough baseline (e.g., asking people what their income used to be or how many times they went to the doctor five years ago); however, it creates a risk of bias if it is not triangulated with other data collection methods. This is an appropriate place to incorporate qualitative methods to triangulate and reinforce the recalled quantitative data in the post-intervention period.(8) With only one measurement and no effective comparison group, this particular study design may also be categorized as non-experimental.

Hypothetical Case: Evaluating Demand for a Health Program(9)

If an evaluator is only interested in looking at the demand for a particular health program (for example, safe-sex education), as well as what kind of people are attending the lectures, they may perform a post-test only design among the intervention group. After conducting a number of health education sessions for a population, the evaluator may survey the attendants and ask various questions to determine why they chose to come to the sessions, along with standard demographic information collection. In this situation, the data collected would most likely be formative in the sense that the evaluator would report this information to program developers and they could determine the best health education program for this population based on the evaluation reports.

Between the most and least robust study designs, there are other options that may suit available resources; evaluators and clients may discuss which measurements they believe are most crucial to determining program success (or obtaining continued funding) and proceed from there. It should be noted, however, that when resources are limited, it is better to take one measurement in each of the comparison and intervention groups than to take two measurements at different times from only the intervention group. For example, it is (usually) better to do this:


| Group        | Baseline | Mid | Post | Ex-Post |
|--------------|----------|-----|------|---------|
| Intervention |          |     | X    |         |
| Comparison   |          |     | X    |         |

than this:


| Group        | Baseline | Mid | Post | Ex-Post |
|--------------|----------|-----|------|---------|
| Intervention | X        |     | X    |         |
| Comparison   |          |     |      |         |

The latter works well when the theory in practice is well established or the client and evaluators are only interested in determining adequacy of the intervention. However, in pilot programs or evaluations that want to determine probability or plausibility, only measuring twice among the intervention group is not sufficient. As discussed in the Purpose of Evaluation module, probability and plausibility evaluations are more involved but have the ability to derive more pertinent and significant program information than adequacy assessments. By collecting indicator data from both the comparison and intervention groups, differences can better estimate how well the project can be generalized to other populations.(10) However, in both cases, the final decision for the study design will come down to the ability to access a comparison group and the aim of the evaluation.
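The trade-off between these schedules can be made concrete with a small sketch. The Python below is illustrative only: the dictionary layout, the function name supported_comparisons, and the three labels it returns are assumptions introduced here, not terminology from this module. It simply reports which kinds of comparison a given schedule makes possible.

```python
# Hypothetical representation of a data-collection schedule: for each group,
# the set of time points at which indicator data is collected.
TIME_POINTS = ("baseline", "mid", "post", "ex-post")


def supported_comparisons(schedule):
    """Report which comparisons a schedule allows.

    schedule: dict with keys "intervention" and "comparison", each mapping
    to a set of time points drawn from TIME_POINTS.
    """
    intervention = set(schedule.get("intervention", ()))
    comparison = set(schedule.get("comparison", ()))
    shared = intervention & comparison  # time points measured in both groups

    return {
        # Change within the intervention group needs at least two time points.
        "change_over_time": len(intervention) >= 2,
        # Intervention vs. comparison needs at least one shared time point.
        "between_groups": len(shared) >= 1,
        # Comparing the change in both groups needs two shared time points.
        "change_vs_comparison": len(shared) >= 2,
    }


# The two intermediate schedules discussed above, plus the full schedule.
post_only_two_groups = {"intervention": {"post"}, "comparison": {"post"}}
pre_post_one_group = {"intervention": {"baseline", "post"}, "comparison": set()}
full_schedule = {"intervention": set(TIME_POINTS), "comparison": set(TIME_POINTS)}

for name, schedule in [
    ("post only, two groups", post_only_two_groups),
    ("pre/post, one group", pre_post_one_group),
    ("full schedule", full_schedule),
]:
    print(name, supported_comparisons(schedule))
```

Under these assumed labels, the post-only schedule with two groups supports a between-group comparison but no change over time, the pre/post schedule with one group supports the reverse, and only a schedule with at least two shared time points supports both.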

Threats to Validity

Evaluations become less robust and less valid as the number of measurements is reduced and the comparison group is compromised. In particular, the amount of control in an evaluation determines the level of internal validity.(11) Consequently, the evaluation schedule with a single post-test measurement of the intervention group would have little internal validity because few control measures are in place (i.e., no comparison group, limited measurements, no randomization).

When only one group is measured, the evaluation is exposed to "maturation". This is the risk that evaluation indicators are measuring only growth that would have occurred naturally through the maturation process, rather than growth due to program activities.(12) For instance, measuring the math skills of 5th graders over time will most likely show improvement simply because of the normal educational process and cognitive development, not necessarily because of a tutoring program. History threats are similar and reflect events or changes in the surrounding environment, rather than program activities, that may have altered the results.(13) For example, if a country's economy is improving and it is spending more on healthcare, a decline in child mortality may be attributable to these external factors rather than to a specific intervention. It is in the interest of the evaluator to examine potential confounding variables and processes of maturation in the final evaluation.
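The tutoring example can be illustrated with a short calculation. The sketch below uses invented mean math scores, and the function names are hypothetical. It compares the pre-to-post change in the intervention group alone with the same change after subtracting the change seen in a comparison group. This adjustment is commonly called a difference-in-differences estimate, a term this module does not use, and it rests on the assumption that both groups would have improved by the same amount without the program.

```python
def pre_post_change(baseline_mean, post_mean):
    """Change in an indicator from baseline to post-intervention."""
    return post_mean - baseline_mean


def comparison_adjusted_change(intervention, comparison):
    """Intervention-group change minus comparison-group change.

    intervention, comparison: (baseline_mean, post_mean) tuples.
    Assumes both groups would have changed equally without the program.
    """
    return pre_post_change(*intervention) - pre_post_change(*comparison)


# Invented mean math scores for illustration only (not real data).
tutored = (60.0, 72.0)       # intervention group: baseline, post
not_tutored = (61.0, 69.0)   # comparison group: baseline, post

naive = pre_post_change(*tutored)                            # 12.0
adjusted = comparison_adjusted_change(tutored, not_tutored)  # 12.0 - 8.0 = 4.0

print(f"Change in tutored group alone: {naive}")
print(f"Change after subtracting comparison-group change: {adjusted}")
```

In this made-up case, two thirds of the apparent 12-point gain also appears in the group that received no tutoring, which is what a maturation or history threat looks like in the data. A single-group schedule has no way to separate the two.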


Footnotes

(1) Bamberger, M., Rugh, J., and Mabry, L. (2006). Real world evaluation: Working under budget, time, data, and political constraints. Thousand Oaks, CA: Sage Publications, Inc.

(2) The World Bank. (n.d.). Impact evaluation: Evaluation designs. Poverty Reduction and Equity.

(3) Gribbons, B., and Herman, J. (1997). True and quasi-experimental designs. Practical Assessment, Research and Evaluation, 5(14).

(4) Harrell, A., Burt, M., Hatry, H., Rossman, S., Roth, J., and Sabol, W. (n.d.). Evaluation strategies for human service programs: A guide for policymakers and providers. Washington, DC: The Urban Institute.

(5) United Nations World Food Programme. (n.d.). Monitoring and evaluation guidelines: What is RBM oriented M&E? Rome, Italy: United Nations World Food Programme Office of Evaluation and Monitoring.

(6) Bamberger, M., Rugh, J., and Mabry, L. (2006).

(7) Shadish, W.R., Cook, T.D., and Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

(8) Bamberger, M., Rugh, J., and Mabry, L. (2006).

(9) Fisher, A. and Foreit, J. (2002). Designing HIV/AIDS intervention studies: An operations research handbook. Washington, DC: Population Council.

(10) Ibid.

(11) Slack, M.K. and Draugalis, J.R. (2001). Establishing validity of experimental studies: Threats to internal validity. American Journal of Health-System Pharmacy, 58(22).

(12) Trochim, W.M.K. and Donnelly, J.P. (2006). Research methods knowledge base. Cengage Learning.

(13) Issel, L.M. (2009). Health program planning and evaluation: A practical, systematic approach for community health (2nd ed.). Sudbury, MA: Jones and Bartlett Publishers.