
Web Survey Bibliography

Title Predictive inference for non-probability samples: a simulation study
Source Statistics Netherlands (2015)
Year 2016
Access date 07.02.2016
Full text pdf (2.3 MB)
Abstract Non-probability samples provide a challenging source of information for official statistics, because the data generating mechanism is unknown. Making inference from such samples therefore requires a novel approach compared with the classic approach of survey sampling. Design-based inference is a powerful technique for random samples obtained via a known survey design, but cannot legitimately be applied to non-probability samples such as big data and voluntary opt-in panels. We propose a framework for such non-probability samples based on predictive inference. Three classes of methods are discussed. Pseudo-design-based methods are the simplest and apply traditional design-based estimation despite the absence of a survey design; model-based methods specify an explicit model and use that for prediction; algorithmic methods from the field of machine learning produce predictions in a non-linear fashion through computational techniques. We conduct a simulation study with a real-world data set containing annual mileages driven by cars for which a number of auxiliary characteristics are known. A number of data generating mechanisms are simulated and, in the absence of a survey design, a range of methods for inference are applied and compared to the known population values. The first main conclusion from the simulation study is that unbiased inference from a selective non-probability sample is possible, but access to the variables explaining the selection mechanism underlying the data generating process is crucial. Second, exclusively relying on familiar pseudo-design-based methods is often too limited. Model-based and algorithmic methods of inference are more powerful in situations where data are highly selective.
Thus, when considering the use of big data or other non-probability samples for official statistics, the statistician must attempt to obtain auxiliary variables or features that could explain the data generating mechanism, and in addition must consider the use of a wider variety of methods for predictive inference than those in typical use at statistical agencies today.
Year of publication 2015
Bibliographic type Reports, seminars
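
The abstract's first conclusion (that inference from a selective sample can be corrected when the variable driving selection is observed) can be illustrated with a toy simulation. This is a minimal sketch and not the report's actual study: the population, the two car age classes, the mileage distributions, and the inclusion probabilities are all invented for illustration, and the "model" is just a per-group mean, which in this simple case coincides with post-stratification.

```python
import random
from statistics import mean


def simulate(n_pop=20000, seed=1):
    """Compare a naive sample mean with a prediction-based estimate.

    All numbers below are hypothetical; they are not taken from the report.
    """
    rng = random.Random(seed)

    # Hypothetical population: one auxiliary variable (car age class,
    # 0 = newer, 1 = older) and a target variable (annual mileage).
    population = []
    for _ in range(n_pop):
        age_class = rng.randint(0, 1)
        # Assumed relationship: newer cars are driven more on average.
        mileage = rng.gauss(20000 if age_class == 0 else 10000, 3000)
        population.append((age_class, mileage))

    # Selective data generating mechanism: newer cars are far more
    # likely to end up in the non-probability sample.
    inclusion_prob = {0: 0.30, 1: 0.05}
    sample = [(a, y) for a, y in population if rng.random() < inclusion_prob[a]]

    true_mean = mean(y for _, y in population)

    # Naive estimate: treat the selective sample as if it were random.
    naive_mean = mean(y for _, y in sample)

    # Predictive estimate: fit a mean per age class on the sample, then
    # predict mileage for every population unit from its age class.
    group_mean = {g: mean(y for a, y in sample if a == g) for g in (0, 1)}
    predicted_mean = mean(group_mean[a] for a, _ in population)

    return true_mean, naive_mean, predicted_mean


if __name__ == "__main__":
    true_mean, naive_mean, predicted_mean = simulate()
    print(f"true population mean : {true_mean:.0f}")
    print(f"naive sample mean    : {naive_mean:.0f}")
    print(f"predictive estimate  : {predicted_mean:.0f}")
```

In this setup the naive mean is pulled toward the over-represented newer cars, while the predictive estimate lands close to the population mean because the age class fully explains the selection. If the selection depended on a variable that is not observed, the same correction would not remove the bias.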