DatCat℠ Internet Measurement Data Catalog


Collection: Web API usage in the Alexa 10k

Measurements of how often browser features are used in the Alexa 10k, in an unmodified browser and with popular extensions.

Collection Details

Summary: This dataset describes a measurement of the Web API (DOM) browser features used on the open web. Each site in the Alexa 10k was visited with an instrumented web browser that counts feature usage, using the automated measurement technique described in Snyder et al., “Browser feature usage on the modern web,” in the Proceedings of the 2016 Internet Measurement Conference (2016). Each site was visited 10 times: 5 times with an unmodified browser, and 5 times with the AdBlock Plus and Ghostery browsing add-ons installed. The dataset records how often 1,390 different Web API / DOM endpoints, taken from 75 Web API standards documents, were used on each site, as well as the difference in feature usage when ad- and tracking-blocking extensions are present.
Start Time: 2016-04-01 07:00 UTC (+0000)
End Time: 2016-04-30 07:00 UTC (+0000)
Duration: 29 days 00:00:00 (2505600.0 s)
Data formats: PostgreSQL
Network location: University of Illinois at Chicago
Platform: Ubuntu 16.04.1 LTS
Primary contact: Peter Snyder
Creators: Peter Snyder, Chris Kanich, Cynthia Taylor, Lara Ansari
Keywords: alexa 10k, privacy, web browser
Description
The structure of the dataset is documented in per-table and per-column comments in the database. A higher-level description of the structure is provided below.

The dataset records the use of Web API features in the Alexa 10k. Each domain is represented by a row in the *domains* table, and each Web API feature is represented by a row in the *features* table. The W3C (or similar standards organization) document defining each feature is represented by a row in the *standards* table.

Each site in the Alexa 10k was visited multiple times under different configurations (first in an unmodified browser, then again with popular browser extensions installed, etc). Each of these different configurations (or test conditions) is represented by a row in the *conditions* table.

We visited each domain in the Alexa 10k 5 times under each browser configuration. Each of these visits is a row in the *crawls* table, and each page (URL) visited during a visit to a domain is a row in the *pages* table. Each Web API feature used by each page is a row in the *features_pages* table.

Finally, the reported security vulnerabilities we were able to associate with each Web API standard are described in the *cves* table.
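
The relationships between these tables can be illustrated with a small in-memory sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL, and every column name in it (id, name, standard_id, crawl_id, and so on) is an assumption made for illustration, as are the sample rows. The authoritative names are in the per-table and per-column comments of the database itself.

```python
import sqlite3

# Toy stand-in for the dataset's PostgreSQL schema. Table names follow the
# description above; ALL column names and sample rows are hypothetical.
SCHEMA = """
CREATE TABLE standards  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE features   (id INTEGER PRIMARY KEY, name TEXT,
                         standard_id INTEGER REFERENCES standards(id));
CREATE TABLE cves       (id TEXT PRIMARY KEY,
                         standard_id INTEGER REFERENCES standards(id));
CREATE TABLE domains    (id INTEGER PRIMARY KEY, domain TEXT);
CREATE TABLE conditions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE crawls     (id INTEGER PRIMARY KEY,
                         domain_id INTEGER REFERENCES domains(id),
                         condition_id INTEGER REFERENCES conditions(id));
CREATE TABLE pages      (id INTEGER PRIMARY KEY, url TEXT,
                         crawl_id INTEGER REFERENCES crawls(id));
CREATE TABLE features_pages (feature_id INTEGER REFERENCES features(id),
                             page_id INTEGER REFERENCES pages(id));
"""

SAMPLE_ROWS = """
INSERT INTO standards VALUES (1, 'Example Standard');
INSERT INTO features VALUES (1, 'Example.prototype.method', 1);
INSERT INTO domains VALUES (1, 'example.com'), (2, 'example.org');
INSERT INTO conditions VALUES (1, 'default'), (2, 'blocking');
INSERT INTO crawls VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 2, 2);
INSERT INTO pages VALUES (1, 'http://example.com/', 1),
                         (2, 'http://example.com/', 2),
                         (3, 'http://example.org/', 3),
                         (4, 'http://example.org/', 4);
INSERT INTO features_pages VALUES (1, 1), (1, 3), (1, 4);
"""


def build_toy_db():
    """Create the toy schema in memory and load the made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript(SAMPLE_ROWS)
    return conn


def domains_using_feature(conn):
    """Count the distinct domains on which each feature was seen, per condition."""
    query = """
    SELECT c.name, f.name, COUNT(DISTINCT cr.domain_id)
    FROM features_pages fp
    JOIN features f   ON f.id = fp.feature_id
    JOIN pages p      ON p.id = fp.page_id
    JOIN crawls cr    ON cr.id = p.crawl_id
    JOIN conditions c ON c.id = cr.condition_id
    GROUP BY c.name, f.name
    ORDER BY c.name, f.name
    """
    return conn.execute(query).fetchall()
```

The join path (features_pages to pages to crawls to conditions) mirrors the description above: a feature use belongs to a page, a page to a crawl, and a crawl to one domain under one test condition.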
Creation process
The following is an abbreviated discussion of the methodology used to generate this data set. A full, detailed description can be found in Snyder et al., “Browser feature usage on the modern web,” in the Proceedings of the 2016 Internet Measurement Conference (2016).

We first identified 1,390 distinct Web API features implemented in Firefox 46.0.1, and tied them to their 75 defining standards documents. This was done by reviewing the Firefox source and inspecting the DOM implemented in the browser.

Next, we built a browser extension that instruments the browser to count the number of times each of these browser features is used by a website.

This browser extension also implements a spidering technique that attempts to elicit all the functionality used on unauthenticated parts of a website. The extension interacts with the visited page by clicking on elements in the page, filling in form elements, and scrolling the page, among other common user activities. The extension watches the URLs that would have been visited by these actions (e.g., by clicking on links) and then selects 3 URLs with dissimilar paths to visit. The extension then visits these URLs and repeats the process.
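
The text above does not say how path dissimilarity is measured. One plausible heuristic, offered purely as an assumption and not as the extension's actual logic, is a greedy pick that maximizes the Jaccard distance between sets of path segments. All function names below are hypothetical.

```python
from urllib.parse import urlparse


def path_segments(url):
    """Return the set of non-empty path segments of a URL."""
    return {seg for seg in urlparse(url).path.split("/") if seg}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical (distance 0)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pick_dissimilar(urls, k=3):
    """Greedily pick up to k URLs whose paths are mutually dissimilar.

    Assumed heuristic: start from the first URL seen, then repeatedly add the
    candidate whose nearest already-chosen URL is farthest away.
    """
    candidates = list(dict.fromkeys(urls))  # de-duplicate, keep first-seen order
    if not candidates:
        return []
    chosen = [candidates.pop(0)]
    while candidates and len(chosen) < k:
        best = max(
            candidates,
            key=lambda u: min(
                jaccard_distance(path_segments(u), path_segments(c))
                for c in chosen
            ),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

Under this heuristic, two links such as /news/a and /news/b share a segment and are unlikely to be chosen together when links to other sections of the site are available.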

The spidering technique described above is repeated to a depth of two for each site visit, resulting in up to 13 pages being visited (the initial page, three children of the initial page, and three children of each of those pages).
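
The shape of this crawl can be sketched as a breadth-first walk with a fixed branching factor and depth. The `crawl` function, the in-memory link graph it takes, and the rule of simply taking the first three unseen links are stand-ins for illustration; the real extension discovers links by interacting with live pages.

```python
def crawl(start, links, branching=3, depth=2):
    """Visit `start`, then up to `branching` children per page, `depth` levels deep.

    `links` maps a URL to the URLs discovered on that page (hypothetical data).
    Returns the visited URLs in visit order.
    """
    visited = [start]
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            # Skip anything already visited so no page is counted twice.
            children = [u for u in links.get(page, []) if u not in visited]
            for child in children[:branching]:
                visited.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return visited
```

Sites that expose fewer than three new links on some page yield fewer visited pages, which is why the page count is an upper bound rather than a fixed number.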

We then installed the above spidering-and-instrumenting extension in two browsers, first in an otherwise unmodified version of Firefox 46.0.1, and second in Firefox 46.0.1 with the AdBlock Plus and Ghostery extensions installed.

Finally, we visited each domain in the Alexa 10k five times with each browser, resulting in 96,609 domain visits (some sites were unreachable, either globally or from our measurement location). The features used by each domain were recorded and are included in the present data set.
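
As a rough check on that figure, the planned number of visits follows from the numbers given above. This sketch reads 96,609 as a count of visits and assumes the list contained exactly 10,000 domains, which the text implies but does not state. The constant and function names are hypothetical.

```python
DOMAINS = 10_000          # assumed size of the Alexa 10k list
CONFIGURATIONS = 2        # unmodified; with AdBlock Plus and Ghostery
VISITS_PER_CONFIG = 5     # visits per domain under each configuration
RECORDED_VISITS = 96_609  # figure reported above


def planned_visits(domains=DOMAINS, configs=CONFIGURATIONS,
                   visits=VISITS_PER_CONFIG):
    """Number of domain visits attempted if every domain were reachable."""
    return domains * configs * visits


def coverage(recorded=RECORDED_VISITS):
    """Fraction of planned visits that appear in the recorded figure."""
    return recorded / planned_visits()
```

Under these assumptions, `planned_visits()` gives 100,000, so the recorded figure corresponds to roughly 96.6% of the planned visits.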
Member of: (none)
Contents
1 data file (480 MiB)

Record Details

Handle: imdc.datcat.org/collection/1-0723-8=Web-API-usage-in-the-Alexa-10k
Contributor: Peter Snyder
Contributed: 2016-08-23 18:54:08.872 UTC (+0000)
Last Modified: 2016-08-24 18:21:00.490 UTC (+0000)