Propositional Evaluation & Outcomes Assurance by Andrew Hawkins is licensed under CC BY 4.0



What is science?

1. The scientific method is a process of hypothesis generation and testing using systematic observation, experiment, and measurement.
2. Applying the scientific method to a hypothesis is the work of science.
3. A hypothesis is a claim about what some part of reality is or how it behaves. To be scientific it must be testable. This allows other scientists to agree or disagree that there is evidence to support a claim in preference to other claims or explanations for a phenomenon.

Some basic requirements for science.

4. For scientific claims to be possible, or for that matter for any claim to be possible, the subject of the claim must be definable. For a scientific claim, the subject must have a consistent form with reliable behaviour or causal properties that are discoverable. Gravity, interest rates, intelligence and racism are all subjects about which claims may be made, or hypotheses developed and tested (even if it is not easy).
5. While observable phenomena (or even abstract ones) such as outcomes will be used in an experiment to test a hypothesis, a hypothesis itself is an abstract concept about a deeper and unobservable reality that causes the manifest world to appear in the way we experience it.
6. The purpose of science is generating knowledge about the world. Experiments are a means of testing our hypotheses about the world; they are a means to an end. Experimental results matter only for what they say about our hypotheses about the world.

What are programs?

7. Public and not-for-profit policies and programs are names given to propositions for collective action intended (at least nominally) for the public or social good, as conceptualized by the funder.

Why programs usually do not have the conditions required for applying the scientific method.

8. Scientific program evaluation must assume that the program is a hypothesis (or theory, maybe a ‘theory of change’) about which scientific claims are possible. That is, it must have reliable and discoverable behaviour or causal properties.
9. Social policy programs don’t appear to be knowable in this sense and do not seem to have reliable causal properties: different people interacting with different parts of a program, delivered by different people at different times in different ways, form different thoughts, intentions and behaviours. Intuitively, people know that a homelessness service will have different effects on different people, depending on who delivers what to whom and the state of mind of both parties. To treat this as a problem of program fidelity would be to assume that the ideal program is a robot provider that does the same thing for all participants regardless of their unique needs.
10. Realist evaluators have explained why social policy and programs do not have causal properties that may be discovered. Program activities do not directly cause behaviour change – it’s the way thoughts and behaviours are shaped by people’s interpretations of program activities that generates change for different people in different ways. The causal power of a program lies in how people interpret offers of support or sanction within their broader social context.
11. The use of science in program evaluation should be restricted to research questions that can be studied scientifically. Scientific realists conclude the appropriate focus of study in a program evaluation is an abstract causal mechanism in a particular context. That is, a hypothesis about how a purported causal mechanism in context may be ‘triggered’ by program activities to generate outcomes that may sometimes be observed.
12. Scientific knowledge is very useful for program design but will not usually be created by evaluation – that is the domain of research. Research cannot be replaced with single-trial evaluation studies involving difficult-to-define and dynamic entities such as programs.
13. Even when applied to some phenomenon discoverable by science, experiments must be replicated numerous times to find replicable patterns before scientific knowledge can progress. Researchers investigating new drugs, for example, take 10-15 years to develop a sufficient level of understanding about a relatively simple causal mechanism before a randomized controlled trial (RCT) is considered appropriate.
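
The intuition in point 9, that the same program has different effects on different people, can be put as simple arithmetic. What follows is only a sketch: the group names, effect sizes and site mixes are hypothetical numbers chosen for illustration, not data from any study. It shows how the average effect a single trial would report may depend on who happened to take part.

```python
# Hypothetical per-person effects of the same program on three kinds of
# participants (illustrative numbers only, not data from any study).
EFFECT_BY_GROUP = {"helped": 2.0, "unaffected": 0.0, "harmed": -1.0}


def average_effect(mix):
    """Average effect for a site, given the share of each participant group.

    `mix` maps group name -> share of participants; shares must sum to 1.
    """
    total_share = sum(mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("participant shares must sum to 1")
    return sum(EFFECT_BY_GROUP[group] * share for group, share in mix.items())


# Two hypothetical delivery contexts with different participant mixes.
site_a = {"helped": 0.6, "unaffected": 0.3, "harmed": 0.1}
site_b = {"helped": 0.2, "unaffected": 0.3, "harmed": 0.5}

effect_a = average_effect(site_a)  # 0.6*2.0 + 0.3*0.0 + 0.1*(-1.0)
effect_b = average_effect(site_b)  # 0.2*2.0 + 0.3*0.0 + 0.5*(-1.0)

print(f"Site A average effect: {effect_a:+.2f}")
print(f"Site B average effect: {effect_b:+.2f}")
```

Under these assumed numbers, a trial at site A would report that the program worked and a trial at site B would report that it did not, although the program itself is the same in both. Neither average describes a stable causal property of the program.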

What happens when we treat a program as an object for scientific discovery?

14. A program is not a hypothesis that can be studied using the scientific method, even if elements of that program, such as components or mechanisms in context, may be studied in this way.
15. Applying the scientific method to a concept, entity or thing that cannot be studied scientifically is not science.
16. As a corollary, program evaluation using the scientific method, such as with RCTs where the concept to be explained is a program that amalgamates many different activities with many different people in different ways, is not science.
17. In other words, scientific program evaluation is a contradiction in terms, because programs do not exist as replicable hypotheses to be tested, nor do they provide a unit of analysis amenable to scientific study.

Why program evaluation is more often history than science when the scientific method is applied.

18. Program evaluation using the scientific method applied to the whole program is an exercise in telling the story or providing a history of the program. It shows what happened when a particular program was delivered within a particular context, at a particular time. It tells us what proportion of any changes we chose to observe can be attributed to the program. It tends to be silent on results we didn’t intend, such as spillovers and negative externalities.
19. Outcomes of a program measured using scientific methods are changes in apparent phenomena or conditions that we can observe and attribute to program activities using an experiment like an RCT.
20. Changes in an apparent phenomenon in an experiment may provide data to test a hypothesis – but the scientific study is of the hypothesis, not changes in the observed phenomenon per se. Outcomes of programs only matter in a scientific sense if they are the result of replicable causal mechanisms or a testable and falsifiable hypothesis.
21. The application of the scientific method to a program is at best a method for providing a history of a program. History is useful for informing future action, but it can only provide evidence of what worked, not what works.
22. History is very useful even if it is not a complete history (all history is from a certain perspective), but a history of activities and their relationship to a set of intended outcomes is not science.

Evaluation as concerned with value – not history or science.

23. The merit, worth or significance of measuring outcomes in evaluation using an experimental design cannot be assumed. Measurement is a criterion for science. Measurement may also be used to establish facts if the goal is to write history. Measurement has a precise meaning in the philosophy of science and relates to quantifiable phenomena.
24. Programs of any substantial nature are plans, not any type of theory or causal hypothesis. They may fail or succeed for myriad reasons that cannot be reduced, using measurements, to a simple statement of whether they can, do or did work.
25. Evaluation uses science but should not be considered a type of science, not even the vaguely defined concept of applied science, just as engineers use physics but don’t do physics. You wouldn’t evaluate a bridge by testing theories. You don’t evaluate a program by testing theories.
26. Evaluation, if it is to survive, must deliberate about the extent to which it is a science or a management discipline. This is necessary if it is to stay relevant in the face of 100 years of claims that it can generate scientific knowledge to solve complex social issues, which it has not done.
27. The purpose of evaluation is most commonly to provide instrumentally useful information for decision-making. It may provide information for accountability; it may claim to be contributing to an evidence base of ‘what works’ (claims that seem groundless for any class of phenomena more complex than a ‘nudge’ or some form of pedagogy), but its major contribution is to inform sense-making conversations and decision-making about policy and programs in the here and now.
28. The boundary between a generalizable hypothesis that can be studied scientifically with an RCT and a specific proposition for action that may or may not be sound, valid or well-grounded is difficult to prescribe – wisdom, experience and a familiarity with complexity science are required.

The importance of high-quality evaluation

29. It is all too common to treat programs like medical interventions: a single definable intervention to address a single definable problem that exists within a closed system. Policy makers are often looking for the social policy equivalent of the ‘pill’ to fix the problem, and are prepared to spend large amounts on evaluations that test those pills or purport to provide this knowledge. The lack of such pills is a difficult truth to swallow (thanks to Toby Lowe for this one).
30. High-quality evaluation about a possible course of action always has been, and always will be, the difference between the survival and extinction of our species, as it is for any species faced with decisions that affect its survival.
31. Evaluation may be best considered as a core requirement for democracy and a means of guiding deliberations about whether a proposed course of action is, was or could be a good idea.
32. Evaluation leverages science but is different to science – it is time we gave it the respect it deserves.

Andrew Hawkins, September 2023
