(together with Adalbert Wilhelm and George Ioannidis)
Sampling methodology applied to the Internet traditionally centers on users rather than on content. However, many applications in social sciences, media studies and market research focus on the content of the World Wide Web and require a representative and valid sample of servers, websites or web pages.
Our primary interest stems from a trans-disciplinary research project on key visuals presented in different media. This project requires a selection of images and movies from the Internet that allows statistically sound and valid comparisons between web offerings of different origin. We use cluster sampling in order to implement a representative sampling of web content. In our application of cluster sampling, IP addresses and domain names form clusters. Within clusters, web pages and content within web pages constitute the units of analysis.
Pictures (and any other web content) are stored on computers (servers) that are connected to the Internet. Each server is reachable via an IP address. An IP address is a 32-bit number that is usually displayed as four decimal numbers ranging from 0 to 255 (e.g. 18.104.22.168); it can address about 4.3 billion devices. Addressing servers by their IP is rather inconvenient, so the domain name system was created. It maps human-readable names (like "ard.de") to IP addresses. The domain name system is hierarchical, and the levels of the hierarchy are separated by dots. The top-level domain for Germany is "de", an example of a second-level domain is "ard", and a third-level domain might be "sport". The server behind the latter domain can be addressed as "sport.ard.de" (see figure 1). This domain name translates into the IP address (e.g. 22.214.171.124) of the server serving the sport news of the German broadcaster ARD. Pictures are located on the server, again in a hierarchical structure derived from the underlying file system (nowadays, web content is often served dynamically; the general structure, however, remains the same). The picture is stored in a file (here: "zuma_dpa_405.jpg") that resides in a (sub-)directory (here: "/sp/fussball/news200603/04/img/"). The full address of the picture from the example in figure 1 is therefore "sport.ard.de/sp/fussball/news200603/04/img/zuma_dpa_405.jpg". This string identifies the server, the directory of the picture on the server, and the filename of the picture.
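The address pieces from figure 1 can be taken apart with a few lines of standard-library Python; this sketch (using the example URL from the text, with an "http://" scheme prepended for parsing) illustrates the name hierarchy and the 32-bit width of an IPv4 address:

```python
from urllib.parse import urlsplit
import ipaddress
import posixpath

url = "http://sport.ard.de/sp/fussball/news200603/04/img/zuma_dpa_405.jpg"
parts = urlsplit(url)

host = parts.hostname                      # "sport.ard.de"
labels = host.split(".")                   # hierarchy, read right to left
top_level = labels[-1]                     # "de"
second_level = ".".join(labels[-2:])       # "ard.de"

directory = posixpath.dirname(parts.path)  # "/sp/fussball/news200603/04/img"
filename = posixpath.basename(parts.path)  # "zuma_dpa_405.jpg"

# An IPv4 address is a 32-bit number, so there are 2**32 (~4.3 billion) of them.
ip = ipaddress.IPv4Address("22.214.171.124")
assert int(ip) < 2**32
```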
Web content on a web server is usually connected by hyperlinks: one web page points to other web pages or to other content such as pictures or videos. By following all links starting from the entry page, one can access the content of a web server. Content that is not linked, directly or indirectly, to the entry page of a web server is not considered in the present context.
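The link-following procedure amounts to a breadth-first traversal of the link graph. A minimal sketch, where a hypothetical dictionary stands in for the fetching and parsing that a real crawl would perform:

```python
from collections import deque

def reachable_pages(entry, links):
    """Pages reachable from the entry page by following links.

    `links` maps each page to the pages/content it links to; in a real
    crawl this mapping would come from fetching and parsing the pages.
    """
    seen = {entry}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical miniature server: page "d" is not linked from the entry
# page and is therefore invisible to the sampling procedure.
site = {"index": ["a", "b"], "a": ["img1.jpg"], "b": ["a"], "d": ["img2.jpg"]}
reachable_pages("index", site)  # {"index", "a", "b", "img1.jpg"}
```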
Drawing such a sample is fairly straightforward if a list of second-level domains is available. Some registrars offer a full list of registered second-level domains (e.g. for the .org top-level domain). One can then define the second-level domains as clusters, draw clusters at random from the full list, and recursively download the web content of each domain by following its links, thereby determining the size of the cluster. This is a standard form of sampling found in the literature: cluster sampling with unknown cluster size.
However, for many top-level domains, no list of second-level domains is available as a sampling frame (see figure 2 for the German DENIC). Is there a replacement for such a list? To address this question, one must look at the relationship between IP addresses and the domain name system. As noted above, every web server can be reached via an IP address. Their distribution for Europe and Central Asia is coordinated by RIPE, which has assigned a total of about 262 million IP addresses to providers; these allocations are known. To obtain a list of primary sampling units, IP addresses are drawn randomly from the RIPE address pool and probed for the existence of a web server with a ".de" top-level domain. This can be done via reverse DNS ("PTR") records, which map IP addresses back to domain names (more than one domain may be assigned to an IP address, however; reverse DNS retrieves only one of them). The second-level domains found by this technique constitute the "central" domains of the sampling clusters.
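The probing step can be sketched as follows. The reverse-DNS lookup itself is a standard-library call (`socket.gethostbyaddr`) and requires network access; the address pool is a hypothetical stand-in for the published RIPE allocations:

```python
import random
import socket

def second_level_de(hostname):
    """Return the second-level ".de" domain for a hostname, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) >= 2 and labels[-1] == "de":
        return ".".join(labels[-2:])
    return None

def probe_random_ip(pool):
    """Draw one IP from an address pool and try a reverse-DNS (PTR) lookup.

    `pool` is a list of (first, last) integer ranges standing in for the
    RIPE allocations. A PTR lookup returns at most one name, even if
    several domains share the address.
    """
    first, last = random.choice(pool)
    ip_int = random.randint(first, last)
    ip = ".".join(str((ip_int >> s) & 0xFF) for s in (24, 16, 8, 0))
    try:
        name, _, _ = socket.gethostbyaddr(ip)  # network access required
    except OSError:
        return None  # no PTR record, or lookup failed
    return second_level_de(name)
```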
The content of these domains is downloaded and analyzed for links to other second-level domains. This is repeated for all domains found. In total, a cluster consists of the second-level domain identified via reverse DNS plus all recursively linked second-level domains (here: up to two recursion levels, see figure 3).
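The cluster construction can be sketched as a bounded-depth expansion. Here `linked_domains` is a hypothetical callback standing in for downloading a domain's content and extracting the second-level domains it links to:

```python
def build_cluster(central_domain, linked_domains, levels=2):
    """Cluster = the central second-level domain plus all second-level
    domains reachable by following links, up to `levels` recursions.

    `linked_domains(domain)` stands in for downloading a domain's
    content and extracting the second-level domains it links to.
    """
    cluster = {central_domain}
    frontier = {central_domain}
    for _ in range(levels):
        discovered = set()
        for domain in frontier:
            discovered |= set(linked_domains(domain))
        frontier = discovered - cluster  # only expand newly found domains
        cluster |= frontier
    return cluster

# Hypothetical link structure for illustration.
links = {"a.de": ["b.de", "c.de"], "b.de": ["d.de"], "d.de": ["e.de"]}
build_cluster("a.de", lambda d: links.get(d, []))
# -> {"a.de", "b.de", "c.de", "d.de"}  ("e.de" is three links away)
```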
The clusters generated this way are the basis for a multistage cluster sampling: domains are randomly drawn from the clusters, and their content is analyzed. In our example, we sampled 161 domain names via reverse DNS from the RIPE address pool. We analyzed these domains for links to other second-level domains, and repeated the analysis for the domains found. We ended up with about 21,000 domain names in 161 clusters, an average cluster size of about 130. From these clusters, we randomly selected 502 domains (about three per cluster) and analyzed their content, downloading a total of about 100,000 pictures.
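A quick consistency check of these figures:

```python
# ~21,000 domains in 161 clusters gives an average cluster size of ~130,
# and 502 second-stage domains amount to roughly three per cluster.
n_clusters = 161
n_domains = 21_000
n_selected = 502

avg_cluster_size = n_domains / n_clusters      # ≈ 130.4
domains_per_cluster = n_selected / n_clusters  # ≈ 3.1
```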
Domains that are well linked to other web offerings (such as www.heise.de) have a higher probability of entering the clusters. On the one hand, this might be a feature, since it leads to an oversampling of central and important websites. On the other hand, one can adjust for the centrality of a web page with the help of Google's "link:" feature, which returns the number of links pointing to a site (see figure 3). These numbers can be used to adjust the weights of the clusters.
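One plausible adjustment — the text does not fix the exact formula — treats a domain with k inbound links as roughly k times as likely to enter a cluster, so each cluster is weighted inversely to its central domain's inbound-link count:

```python
def centrality_weights(link_counts):
    """Assumed adjustment scheme (not specified exactly in the text):
    weight each cluster inversely to the inbound-link count of its
    central domain, then normalize the weights to sum to 1.
    """
    inv = {d: 1.0 / max(k, 1) for d, k in link_counts.items()}
    total = sum(inv.values())
    return {d: w / total for d, w in inv.items()}

# Hypothetical inbound-link counts (as a "link:" query would return them):
# the heavily linked site gets a correspondingly smaller weight.
centrality_weights({"heise.de": 1000, "small-site.de": 10})
```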
Using the formulas for two-stage cluster sampling and adjusting for centrality, one can estimate parameters of the pictures on German domains: the average file size of a picture on the German Internet is 6221.4 bytes with a standard error of 284.23 bytes. The total size of all pictures in the top two hierarchy levels on second-level domains below the ".de" top-level domain is 8.25 TB with a standard error of 3.18 TB (3.177 × 10^12 bytes), which amounts to about 140 images per .de second-level domain. One may of course use the sample for more sophisticated analyses (like a content analysis, possibly based on automatic image analysis).
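The shape of such an estimator can be sketched with the standard ratio estimator for cluster sampling. This is a simplified one-stage version without the centrality weights or finite-population correction used in the study, and the per-cluster totals below are purely illustrative:

```python
import math

def cluster_mean_se(cluster_totals, cluster_sizes):
    """Ratio estimator of a per-unit mean under cluster sampling
    (simplified sketch; the study uses the full two-stage formulas
    with centrality-adjusted weights).
    """
    n = len(cluster_totals)
    m_bar = sum(cluster_sizes) / n                 # mean cluster size
    r = sum(cluster_totals) / sum(cluster_sizes)   # estimated per-unit mean
    resid_ss = sum((y - r * m) ** 2
                   for y, m in zip(cluster_totals, cluster_sizes))
    se = math.sqrt(resid_ss / (n - 1) / n) / m_bar
    return r, se

# Hypothetical per-cluster picture-size totals (bytes) and picture counts.
cluster_mean_se([600_000, 1_250_000, 900_000], [100, 200, 150])
```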
The representativity of the sampling method applied rests on a few assumptions:
Müller, J., Wilhelm, A. & Ioannidis, G. (2006). Selecting Images, Web Pages or Web Sites: Sampling Strategies for Internet Content. Presentation at the 8th German Online Research Conference, Bielefeld.