Sampling of Internet Content

(together with Adalbert Wilhelm and George Ioannidis)

Sampling methodology applied to the Internet traditionally centers on users rather than on content. However, many applications in social sciences, media studies and market research focus on the content of the World Wide Web and require a representative and valid sample of servers, websites or web pages.

Our primary interest stems from a trans-disciplinary research project on key visuals presented in different media. This project requires a selection of images and movies from the Internet that allows statistically sound and valid comparisons between web offerings of different origin. We use cluster sampling in order to implement a representative sampling of web content. In our application of cluster sampling, IP addresses and domain names form clusters. Within clusters, web pages and content within web pages constitute the units of analysis.

Figure 1: The location of a picture on a web server.

Pictures (and any other web content) are stored on computers (servers) that are connected to the Internet. Each server is assigned a unique IP address. An IP address is a 32-bit number that is usually displayed as four decimal numbers ranging from 0 to 255 (e.g. 80.237.145.192). It can address about 4.3 billion devices. Addressing servers by their IP address is rather inconvenient. Therefore, the domain name system was created. A domain name maps a written address (like "ard.de") to an IP address. The domain name system is hierarchical. The different levels of the hierarchy are separated by dots. The top-level domain for Germany is "de". An example of a second-level domain is "ard", and a third-level domain might be "sport". The server behind the latter domain ("sport") can be addressed as "sport.ard.de" (see figure 1). This domain name translates into the IP address (e.g. 85.183.195.40) of the server serving the sport news of the German broadcaster ARD. Pictures are located on the server, again in a hierarchical structure that is derived from the underlying file system (nowadays, web content is often served dynamically; however, the general structure remains the same). The picture is stored in a file (here: "zuma_dpa_405.jpg") that is located in a (sub-)directory (here: "/sp/fussball/news200603/04/img/"). The full address of the picture from the example in figure 1 is therefore "sport.ard.de/sp/fussball/news200603/04/img/zuma_dpa_405.jpg". This string identifies the server, the directory of the picture on the server, and the filename of the picture.
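The decomposition of such an address into server, directory, and filename can be sketched in a few lines of Python, using the example address from figure 1 (the "http://" scheme is added here only so the standard URL parser accepts the string):

```python
from urllib.parse import urlsplit

# Example address from figure 1, prefixed with a scheme for parsing.
url = "http://sport.ard.de/sp/fussball/news200603/04/img/zuma_dpa_405.jpg"
parts = urlsplit(url)

server = parts.hostname                          # the domain name of the server
directory, _, filename = parts.path.rpartition("/")  # split path at the last "/"

print(server)     # sport.ard.de
print(directory)  # /sp/fussball/news200603/04/img
print(filename)   # zuma_dpa_405.jpg
```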


Web content on a web server is usually connected by hyperlinks: one web page points to other web pages or to other content such as pictures or videos. By following all links on a web server, one can access its content. Content that is not linked, directly or indirectly, to the entry page of a web server is not considered in the present context.
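This "follow all links from the entry page" traversal can be illustrated with a minimal sketch. To stay self-contained, the web server is replaced by a dictionary mapping page paths to HTML; all page names and contents are invented for illustration:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, entry):
    """Visit every page reachable from the entry page by following links.

    `site` stands in for a web server: a dict mapping page paths to HTML.
    Pages not linked (directly or indirectly) from `entry` stay unvisited.
    """
    seen, queue = {entry}, [entry]
    while queue:
        page = queue.pop(0)
        parser = LinkParser()
        parser.feed(site.get(page, ""))
        for target in parser.links:
            if target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Toy server: "/orphan.html" is not reachable from the entry page.
site = {
    "/index.html": '<a href="/a.html">a</a>',
    "/a.html": '<a href="/index.html">back</a>',
    "/orphan.html": "",
}
print(sorted(crawl(site, "/index.html")))  # ['/a.html', '/index.html']
```

The unreachable page is exactly the kind of content that the sampling procedure described above never sees.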

The question: How to draw a representative sample of visual web content?

Drawing such a sample is fairly straightforward if a list of second-level domains is available. Some registries of domain names offer a full list of registered second-level domains (e.g. for the .org top-level domain). One can then define the second-level domains as clusters, draw clusters randomly from the full list, and recursively download the web content of each domain by following its links, thereby determining the size of the cluster. This is a standard form of sampling that can be found in the literature: cluster sampling with unknown cluster size.

Figure 2: A list of second-level domains is not always available.

However, for many top-level domains, no list of second-level domains is available to serve as a frame for the primary sampling units (see figure 2 for the German DENIC). Is there a replacement for such a list? To address this question, one must look at the relationship between IP addresses and the domain name system. As written above, every web server can be reached via an IP address. Their distribution for Europe and Central Asia is coordinated by RIPE. RIPE has assigned a total of about 262 million IP addresses to providers. These numbers are known. In order to get a list of primary sampling units, IP addresses are drawn randomly from the RIPE address pool and probed for the existence of a web server with a ".de" top-level domain. This can be done via reverse DNS (PTR) records, which map IP addresses back to domain names (however, more than one domain might be assigned to an IP address; reverse DNS retrieves only one of them). The second-level domains found by this technique constitute the "central" domains of the sampling clusters.
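A minimal sketch of such a probe in Python, using the standard library's PTR lookup. The address pool is parameterized by a (hypothetical) starting address and size; the reduction of a hostname to its second-level domain is factored out so it can be checked without network access:

```python
import random
import socket

def to_second_level(hostname, tld=".de"):
    """Reduce a hostname to its second-level domain, or return None if it
    is not under the given top-level domain ("sport.ard.de" -> "ard.de")."""
    if not hostname.endswith(tld):
        return None
    return ".".join(hostname.split(".")[-2:])

def probe_random_ip(pool_start, pool_size):
    """Draw one IP address from a provider pool (start address and size are
    hypothetical parameters) and map it back to a domain via reverse DNS."""
    ip_int = pool_start + random.randrange(pool_size)
    ip = socket.inet_ntoa(ip_int.to_bytes(4, "big"))
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # PTR record lookup
    except (socket.herror, socket.gaierror):
        return None  # no PTR record: no domain found at this address
    return to_second_level(hostname)
```

Note the limitation from the text: if several domains share the IP address, `gethostbyaddr` reports only one of them.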

Figure 3: A first domain is found via reverse DNS. Besides this domain, all second-level domains within two link steps are added to the cluster.

The content of these domains is downloaded and analyzed for links to other second-level domains. This is repeated for all the domains that have been found. In total, a cluster consists of the second-level domain that was identified via reverse DNS and all recursively linked second-level domains (here: up to two recursion levels, see figure 3).
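The cluster construction can be sketched as a breadth-first expansion over a link graph between second-level domains. Here the graph is given as a toy dictionary (standing in for downloading and parsing each domain's pages); all domain names are invented:

```python
def build_cluster(seed_domain, linked_domains, depth=2):
    """Form a sampling cluster: the seed second-level domain found via
    reverse DNS plus every second-level domain reachable from it within
    `depth` link steps. `linked_domains` maps each domain to the domains
    its pages link to."""
    cluster = {seed_domain}
    frontier = {seed_domain}
    for _ in range(depth):
        frontier = {d for f in frontier
                    for d in linked_domains.get(f, ())} - cluster
        cluster |= frontier
    return cluster

# Toy link graph: "far.de" is three steps away and stays outside the cluster.
links = {
    "seed.de": ["a.de", "b.de"],
    "a.de": ["c.de"],
    "c.de": ["far.de"],
}
print(sorted(build_cluster("seed.de", links)))
# ['a.de', 'b.de', 'c.de', 'seed.de']
```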


The clusters that have been generated are the basis for a multistage cluster sampling: domains are randomly sampled from the clusters found and their content is analyzed. In our example, we randomly sampled 161 domain names via reverse DNS from the RIPE address pool. We analyzed these domains for links to other second-level domain names and repeated the analysis for the domain names found. We ended up with about 21,000 domain names in 161 clusters, for an average cluster size of about 130. From these clusters, we randomly selected 502 domains (about three per cluster) and analyzed their content. We downloaded a total of about 100,000 pictures.

Figure 4: The "link:" feature of the Google web search engine returns the number of links to a web site.

Domains that are well linked to other web offerings (such as www.heise.de) have a higher probability of entering the clusters. On the one hand, this might be a feature: it leads to an oversampling of central and important web sites. Alternatively, one can adjust for the centrality of a web site with the help of Google's "link:" feature, which returns the number of links to a site (see figure 4). These numbers can be used to adjust the weights of the clusters.
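One simple way to use these counts is to weight each cluster inversely to the number of inbound links of its central domain, in the spirit of a Horvitz-Thompson correction (the exact weighting scheme is a design choice, not prescribed above; the link counts here are invented for illustration):

```python
# Hypothetical inbound-link counts, as a search engine's "link:" query
# would report them for each cluster's central domain.
inbound_links = {"www.heise.de": 250_000, "small-site.de": 12, "club.de": 85}

# A well-linked domain is more likely to enter some cluster, so its cluster
# is down-weighted by the inverse of the link count; weights are normalized.
weights = {d: 1.0 / n for d, n in inbound_links.items()}
total = sum(weights.values())
norm_weights = {d: w / total for d, w in weights.items()}
```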


Using the formulas for two-stage cluster sampling and adjusting for centrality, one can estimate parameters of the pictures on German domains: the average file size of a picture on the German Internet is 6221.4 Bytes with a standard error of 284.23 Bytes. The total size of all pictures within the top two link levels on second-level domains below the ".de" top-level domain is 8.25 TB with a standard error of 3.177e+12 Bytes (which amounts to about 140 images per .de second-level domain). One may of course use the sample for more sophisticated analyses (such as a content analysis, possibly based on automatic image analysis).
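The reported estimates can be checked against each other with a little arithmetic (all figures taken from the text; "TB" is read here as 10^12 Bytes, an assumption):

```python
# Estimates from the text.
total_bytes = 8.25e12   # estimated total size of all pictures
mean_bytes = 6221.4     # estimated average picture file size

# Implied number of pictures: roughly 1.33 billion.
pictures = total_bytes / mean_bytes
print(round(pictures / 1e9, 2))  # 1.33

# At about 140 images per domain this implies roughly 9.5 million .de
# second-level domains, which is of the right order of magnitude for the
# DENIC registry at the time.
domains = pictures / 140
print(round(domains / 1e6, 1))  # 9.5
```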

The representativeness of the sampling method applied rests on a few assumptions:

  • The population is composed of pictures on second-level domains only within the top two link levels. However, these values are rather arbitrary and might be changed if more or fewer resources are available.
  • A "small world" assumption: every second-level domain can be reached within two (or: n) steps from a domain that has been found via reverse DNS. The problem here is virtual web servers: multiple web servers may share a single IP address, and reverse DNS returns only one of them. High-volume sites (like www.heise.de) tend to be hosted on their own dedicated servers with their own IP addresses, whereas low-volume sites such as private websites tend to share an IP address. The latter therefore have a lower chance of being selected as primary sampling units. However, this is evened out by following links to other domains: the more links are followed, the better the result.

Müller, J., Wilhelm, A. & Ioannidis, G. (2006). Selecting Images, Web Pages or Web Sites: Sampling Strategies for Internet Content. Presentation at the 8th German Online Research Conference, Bielefeld.
