Yesterday morning I heard Bob Groves speak at the StatCan conference. His talk was titled, “Towards a Quality Framework for Blends of Designed and Organic Data.” Or, more simply put, survey data and big data. For those who may not know him, Bob has been our most significant evangelist of the total survey error (TSE) model over the last 30 years. His book is the TSE bible for many of us. He is one of the smartest people I know.
He began by noting that TSE essentially is about ensuring data quality by controlling two things: measurement of the phenomena we are trying to understand (generally via a questionnaire) and representativeness (via a good sample). But big data is designed by others or not designed at all. So how can we think about assessing, let alone ensuring, data quality?
A few specifics:
- There is no sample frame with big data and coverage can be difficult to assess. (My comment here: let’s dismiss the silly assertion by many big data enthusiasts that N=all.)
- While we are used to datasets with lots of variables, many big data sources have few variables. Think of point-of-sale (POS) or GPS data. The only way to get more variables is through linking with other sources, and that is fraught with problems.
- While some big data may lack a useful number of variables, their specificity can be remarkable.
- Identifying the characteristics of people associated with a social media post or a sensor can be extremely difficult.
- The data come in all manner of forms and data structures.
- The data generally are owned by someone else and getting hold of them as well as the ability to fully assess their provenance can be difficult.
Bob’s view is that we need to be looking at blending sample surveys and big data if we want to get valid estimates. This is not an easy task and involves three things:
- Development of some very complex models.
- Existence of shared covariates in the survey and big data source(s).
- Careful assessment of measurement properties of the big data items and their coverage.
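To make the role of shared covariates a little more concrete, here is a minimal sketch of one simple blending approach, post-stratification. It is my illustration, not something Bob presented, and it is far simpler than the complex models he has in mind. It assumes a big data source with one shared covariate (an age group) and a survey that supplies the population share of each group. The function name `poststratified_mean` and all of the numbers are hypothetical.

```python
from collections import defaultdict


def poststratified_mean(big_data, survey_shares):
    """Reweight big data group means by survey-based population shares.

    big_data: list of (group, value) pairs from the big data source.
    survey_shares: dict mapping group -> estimated population share
                   (taken from the survey; shares should sum to 1).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in big_data:
        totals[group] += value
        counts[group] += 1

    # A group the survey knows about but the big data never covers
    # is a coverage gap that reweighting alone cannot repair.
    uncovered = [g for g in survey_shares if counts[g] == 0]
    if uncovered:
        raise ValueError(f"no big data records for groups: {uncovered}")

    return sum(share * totals[g] / counts[g]
               for g, share in survey_shares.items())


# Made-up illustration: the big data over-represents younger people.
big_data = [("18-34", 10.0), ("18-34", 12.0), ("18-34", 14.0), ("35+", 20.0)]
survey_shares = {"18-34": 0.4, "35+": 0.6}  # hypothetical survey estimates

naive = sum(value for _, value in big_data) / len(big_data)
adjusted = poststratified_mean(big_data, survey_shares)
print(naive, adjusted)
```

In this toy case the unadjusted mean is pulled toward the over-represented group, and the adjusted figure moves toward the under-represented one. The adjustment only helps to the extent that the shared covariate actually explains who ends up in the big data source, which is exactly why the third item on the list matters.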
One of the things people working with administrative data in national statistical institutes are already discovering is that merging these sources results in lots of missing data. So presumably much of that complex modeling referenced above involves imputation, not something that we in market research (MR) generally do with our datasets.
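For readers who have not worked with linked data, the sketch below shows how the gaps arise and what the crudest form of imputation looks like. It is an assumption-laden toy: the record layout, the field names (`id`, `region`, `income`) and the helpers `link_records` and `impute_group_mean` are all hypothetical, and filling in a group mean is a stand-in for the much more careful model-based imputation a statistical agency would use. Among other things, single mean imputation understates uncertainty.

```python
def link_records(survey, admin, key="id"):
    """Left-join admin income onto survey rows; unmatched rows get None."""
    admin_by_key = {row[key]: row for row in admin}
    linked = []
    for row in survey:
        match = admin_by_key.get(row[key], {})
        linked.append({**row, "income": match.get("income")})
    return linked


def impute_group_mean(rows, field, group_field):
    """Fill missing values of `field` with the observed mean of the row's group.

    Rows in a group with no observed values are left as None.
    """
    observed = {}
    for row in rows:
        if row[field] is not None:
            observed.setdefault(row[group_field], []).append(row[field])
    means = {g: sum(vals) / len(vals) for g, vals in observed.items()}

    filled = []
    for row in rows:
        value = row[field]
        if value is None:
            value = means.get(row[group_field])
        filled.append({**row, field: value})
    return filled


# Made-up records: only half of the survey respondents link to the admin file.
survey = [
    {"id": 1, "region": "A"},
    {"id": 2, "region": "A"},
    {"id": 3, "region": "B"},
    {"id": 4, "region": "B"},
]
admin = [{"id": 1, "income": 30000}, {"id": 3, "income": 50000}]

linked = link_records(survey, admin)
missing_rate = sum(row["income"] is None for row in linked) / len(linked)
imputed = impute_group_mean(linked, "income", "region")
print(missing_rate, [row["income"] for row in imputed])
```

Even in this tiny example, half of the linked records have no income value, and every imputed value is a copy of a single observed one. Scale that up and it is easy to see why the modeling gets complicated quickly.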
He concluded by arguing that much of the future for research will be model based. That has at least three important implications.
- Transparency is going to be more important than ever.
- We will need an understanding of statistics that is well beyond what we are accustomed to.
- There will be a need for uncertainty measures (much like we have margin of error now), which have yet to be invented.
In a brief conversation afterwards we agreed that issues of consent and privacy protection were also significantly more challenging than what we now face. He just didn’t have the time to get into them.
As I listened it struck me that he was describing a level of rigor that I have yet to hear in discussions about big data within MRX. Maybe all that says is that the “rigor gap” that currently exists between MR and those who work in social policy research will just continue into the future unaddressed. Sorry to say it, but that would not surprise me.