So, continuing my series of posts on methods, I’d like to offer my thoughts on inter-rater reliability.
What is inter-rater reliability? It’s when two (or more) people independently code up qualitative data and then compare how many of their codes match. The more codes that match, the more confidence the reader is meant to have in the resulting analysis.
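To make the "compare matching codes" step concrete, here is a minimal sketch of one common agreement measure, Cohen’s kappa, which corrects raw percent agreement for agreement expected by chance. The coder names and code labels below are hypothetical, invented purely for illustration.

```python
# Sketch: Cohen's kappa for two coders who each assign one code per item.
# Kappa = (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Agreement between two raters, corrected for chance."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: fraction of items where both coders assigned the same code.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: probability both pick the same code independently,
    # given each coder's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes from two coders over five interview excerpts.
coder_1 = ["trust", "trust", "privacy", "usability", "privacy"]
coder_2 = ["trust", "privacy", "privacy", "usability", "privacy"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # prints 0.69
```

Note that kappa, like any such statistic, only measures agreement on the codes themselves, which is exactly the limitation discussed below.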
I have mixed feelings about inter-rater reliability. I think it has a place, but like many things involving research methodology, it requires judgement. And I think that judgement turns on the relationship between the codes and the final analysis.
What do I mean by that?
Well, in some circumstances (interviews, say, or some other qualitative procedure), the development of codes may be a very close approximation to the analysis itself. Perhaps, for example, the interview is a small component of the overall method or study, and the goal is to generate codes that illustrate common themes: what recurred, and what did not. Of course that’s not statistically generalizable, but perhaps here inter-rater reliability provides some assurance to the reader that what is being presented has been seen by multiple sets of eyes.
Where I find inter-rater reliability less compelling is when there is some distance between the codes and the analysis; in other words, where the analysis or further cycles of data collection take substantial time and energy after the codes are developed. The problem with inter-rater reliability in those situations is that the codes are an interim product, the first or an early step in the analytic process. And there’s a nice study that suggests that even when coders agree on codes, their analyses are framed differently.
For example, to return to Grounded Theory, there’s no mention of inter-rater reliability in any of the theoretical or practical elaborations. Why? Because codes are merely an interim product, and they are not the only interim product generated (e.g., the memoing). There is also substantial distance that must be travelled during analysis between the time codes are generated and the end result. Codes, for example, may turn out to be incomplete, particularly in the process of selective coding; that triggers another round of data collection and more code development. But most crucially, the analysis is more than the sum of the codes: it’s an interpretation, an explanation grounded in them but accompanied by other scholarship, related work, analytic insight, etc… It’s that piece of the process, in addition to the codes, that generates the final analysis, or grounded theory. And knowing that two people could generate the same set of codes just isn’t a measure of whether the theory is compelling.
A set of criteria that I like comes from Christine Halverson (although I take application very broadly; a practical outcome for me might be that I understand something about the relationship between people and technology better).
- Descriptive power: make sense of and describe the world.
- Rhetorical power: describe the world through naming important aspects clearly and map them to the real world. Should help us communicate and persuade others.
- Inferential power: supports inferences about phenomena that we do not yet completely understand, and helps predict the consequences of deploying a technology into an environment.
- Application: can we apply the theory in such a way that we get design, or some other practical outcome?