
Web Survey Bibliography

Title: Is Clean Data Good Data? Data Cleaning and Bias Reduction
Year: 2017
Access date: 13.04.2017

Relevance and Research Question:

Many researchers have argued that, in order to improve accuracy, we should clean our data by excluding participants who exhibit sub-optimal behaviors, such as speeding or non-differentiation. Some researchers have gone so far as to incorporate ‘trap’ questions in their surveys to catch such participants. Increasingly, researchers are suggesting more extensive cleaning criteria to identify larger portions of respondents for removal and replacement. This not only raises questions about the validity of the survey results, but also has cost implications, as replacement sample is often required. Our research question focused on the effects of the extent of data cleaning on data quality.

Methods and Data:

We used data from three surveys containing items that allowed us to estimate bias, including items for which external benchmarks existed from reputable sample surveys along with actual election outcomes. Survey 1 had 1,847 participants from GfK’s U.S. probability-based KnowledgePanel® and 3,342 participants from non-probability online samples (NPS) in a study of the 2016 Florida presidential primary. Survey 2 had 1,671 participants from KnowledgePanel and 3,311 from non-probability online samples, fielded for the 2014 general elections in Georgia and Illinois. Survey 3 was a 2016 national election study with 2,367 respondents from KnowledgePanel. Each study included questions that paralleled benchmarks established with high-quality federal data.


We examined how varying the proportion of respondents removed based on increasingly aggressive data cleaning criteria (e.g., speeding) affected bias and external validity of survey estimates. We compared using all cases versus cleaning out from 2.5% up to 50% of the sample cases based on speed of completion.
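The trimming procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual analysis: the simulated respondents, the variable names, and the benchmark value are all hypothetical, and the "cleaning" criterion is simply dropping the fastest completers at each cut level.

```python
import random

random.seed(42)

# Hypothetical respondents: completion time in minutes and a yes/no item.
# In this simulation, speed is unrelated to the response, so trimming
# speeders should not systematically change the estimate.
respondents = [
    {"minutes": random.uniform(2, 30), "supports": random.random() < 0.55}
    for _ in range(2000)
]

BENCHMARK = 0.55  # assumed external benchmark proportion (illustrative)

def bias_after_trim(data, trim_share):
    """Drop the fastest `trim_share` of cases; return estimate minus benchmark."""
    kept = sorted(data, key=lambda r: r["minutes"])[int(len(data) * trim_share):]
    estimate = sum(r["supports"] for r in kept) / len(kept)
    return estimate - BENCHMARK

# Compare using all cases versus cleaning out 2.5% up to 50% by speed.
for share in (0.0, 0.025, 0.05, 0.10, 0.25, 0.50):
    print(f"trim {share:>5.1%}: bias = {bias_after_trim(respondents, share):+.4f}")
```

Because speed is independent of the answers in this toy setup, deeper trimming mainly shrinks the sample and adds sampling noise rather than removing error, which is one mechanism consistent with the finding that aggressive cleaning does not reliably reduce bias.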

Results:

Consistent with our initial investigation of other studies, we found that the NPS had higher bias than the probability-based KnowledgePanel sample; however, more rigorous case deletion generally did not reduce bias for either sample source, and in some cases higher levels of cleaning slightly increased bias.

Added Value:

Some cleaning might not affect data estimates and correlational measures; however, excessive cleaning may increase bias, achieving the opposite of the intended effect while also increasing survey costs.

Year of publication: 2017
Bibliographic type: Conferences, workshops, tutorials, presentations
