Title Sampling online communities: using triplets as basis for a (semi-) automated hyperlink web crawler.
Author Veny, Y.
Year 2013
Access date 27.03.2013

Relevance & Research Question:
Blogs and opinion sites have become a widely studied object in the social sciences. However, sampling problems quickly arise when one tries to include pages from the long tail or from a specific online community. New tools are thus needed to overcome these problems. We propose a conditional (semi-) automated algorithm based on hyperlink analysis, which uses Holland and Leinhardt's triad census to sample online communities. Triplets serve as proxies for the overall clustering of the network when one has only partial knowledge of that network.
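The idea of conditioning inclusion on triplets can be illustrated with a minimal sketch. It is not the author's R implementation: the abstract does not state which triad types are admitted, so the rule used here (a candidate page is kept only if it closes a triplet with two already sampled pages that are themselves linked) is an assumption, as are the function names and the toy link data.

```python
from itertools import combinations


def connected(links, a, b):
    """True if a hyperlink exists between a and b in either direction."""
    return b in links.get(a, set()) or a in links.get(b, set())


def closes_triplet(links, sample, candidate):
    """Assumed inclusion rule: the candidate is linked to two sampled
    pages that are also linked to each other (a closed triplet)."""
    neighbours = [s for s in sample if connected(links, s, candidate)]
    return any(connected(links, a, b) for a, b in combinations(neighbours, 2))


def triplet_sample(links, seeds, max_rounds=10):
    """Grow the sample from the seeds, keeping only candidates that
    satisfy the triplet condition; stop when a round adds nothing."""
    sample = set(seeds)
    for _ in range(max_rounds):
        candidates = set()
        for page in sample:
            candidates |= links.get(page, set())  # pages reachable by outlinks
        candidates -= sample
        added = {c for c in candidates if closes_triplet(links, sample, c)}
        if not added:
            break
        sample |= added
    return sample


# Hypothetical hyperlink data: page -> set of pages it links to.
links = {
    "seed_a": {"seed_b", "blog_x", "portal"},
    "seed_b": {"seed_a", "blog_x"},
    "blog_x": {"seed_a", "blog_y"},
    "blog_y": {"blog_x"},
    "portal": {"news", "shop"},
}
sample = triplet_sample(links, {"seed_a", "seed_b"})
```

In this toy network only `blog_x` joins the two seeds: `portal` and `blog_y` are each linked to a single sampled page, so neither closes a triplet. This reflects the conservativeness attributed to triplets, at the cost of few new actors.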
Methods & Data:
Several methods were used to sample blogs from the French-speaking 'political ecology' blog community. Fourteen blogs of ecologist candidates running in the 2009 regional election were selected as the starting set, and each tool was run on these same 14 blogs. The results are compared on two criteria: how many new blogs are retrieved, and what proportion of them is relevant.
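The two comparison criteria can be made concrete with a short sketch. How relevance was judged is not described in the abstract, so the `relevant` set below stands in for a hypothetical manual coding, and all page names are invented for illustration.

```python
def evaluate_sample(seeds, retrieved, relevant):
    """Return (number of new pages, share of new pages judged relevant).

    `relevant` is assumed to come from manual coding of the pages;
    the abstract does not specify how relevance was assessed.
    """
    new_pages = set(retrieved) - set(seeds)
    n_new = len(new_pages)
    share = len(new_pages & set(relevant)) / n_new if n_new else 0.0
    return n_new, share


# Hypothetical output of one sampling tool.
seeds = {"seed_a"}
retrieved = {"seed_a", "blog_x", "portal", "news"}
relevant = {"blog_x"}
n_new, share_relevant = evaluate_sample(seeds, retrieved, relevant)
```

Here three new pages are retrieved and one of them is relevant, so the tool scores 3 new pages with a relevant share of one third.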
The results show significant differences between the methods. An unconditional web crawler is problematic because the sample very quickly becomes overwhelming and most of the included webpages cannot be considered similar to the original set of blogs. A 'conditional (semi-) automated tool' returns different results depending on which triads are included in the model. Triplets seem to be the most effective way to sample online communities because of their conservativeness, even if the number of new actors remains low.
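For contrast, the behaviour of an unconditional crawler can be sketched as a plain breadth-first traversal that keeps every page it reaches. This is an illustrative assumption about what "unconditional" means here; the depth limit, function name, and link data are hypothetical.

```python
from collections import deque


def unconditional_crawl(links, seeds, max_depth=2):
    """Breadth-first crawl that includes every page reachable from the
    seeds within max_depth link steps, with no inclusion condition."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, set()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical hyperlink data: page -> set of pages it links to.
links = {
    "seed_a": {"seed_b", "blog_x", "portal"},
    "seed_b": {"seed_a", "blog_x"},
    "blog_x": {"seed_a", "blog_y"},
    "blog_y": {"blog_x"},
    "portal": {"news", "shop"},
}
crawled = unconditional_crawl(links, {"seed_a", "seed_b"})
```

Two link steps are enough to pull in every page of this toy network, including the generic `portal`, `news`, and `shop` pages, which shows how quickly an unconditioned sample can drift away from the original community.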
Added Value:
The proposed algorithm has proved effective in sampling an online community. It was developed and written in the open-source statistical environment "R" and can thus be implemented by anyone interested in sampling online communities with an adaptive and rigorous tool.

