The latest release of Privacy Badger gives it the power to detect and block a new class of evasive, pervasive third-party trackers, including Google Analytics.
Most blocking tools, like uBlock Origin, Ghostery, and Firefox’s native blocking mode (using Disconnect’s block lists), use human-curated lists to decide whether to block or allow third-party resources. But Privacy Badger is different. Rather than rely on a list of known trackers, it discovers and learns to block new trackers in the wild. It works using heuristics, or patterns of behavior, to identify trackers.
Last week, we updated Privacy Badger with a new heuristic to help it identify trackers that have flown under its radar in the past. Here’s how it works.
What makes a tracker a tracker?
All tracker-blockers have to grapple with a fundamental question: what is a tracker, anyway? Often, it’s obvious. When an ad network sets a third-party cookie and uses it to build a profile of your browsing history, it’s tracking you. But other times, it’s not so straightforward. Is a content delivery network (CDN) that serves static files across the web a tracker? What about a third-party image host? How do you decide what to block, and what to allow?
List-based blockers have humans make those decisions on a case-by-case basis. The idea is to keep a list of every single domain or URL that might be tracking you on the web. EasyPrivacy, the popular tracker list used by AdBlock Plus and others, has almost 17,000 entries. Creating a list like that requires tens of thousands of judgment calls by human beings, and maintaining it means making those decisions over and over again, for as long as the list is in use.
Privacy Badger takes a different approach. We try to define what tracking behavior looks like, so that any third-party domain that acts that way is likely to be a tracker, and any one that doesn’t is likely to be benign. Then we let the extension decide for us on each request it sees. When new trackers appear on the web, or formerly benign companies start tracking users, Privacy Badger learns to block them without any help from us. If a company that doesn’t intend to track its users becomes blocked anyway, it can adopt a legally binding Do Not Track (DNT) policy that commits the company to respecting users’ privacy. Privacy Badger will then automatically unblock that company’s resources.
This makes our choice of heuristics extremely important. Trying to write rules that will identify every single tracker on the Web, present and future, is a Sisyphean task. Tracking on the Web is always developing, and the most creative trackers will probably always be able to circumvent our detection. Our goal is to detect and block the vast majority of trackers that users encounter on a daily basis—and to make the surveillance business model a little less profitable.
Cookie Sharing: third-party tracking at a first-party price
For most of its history, Privacy Badger has used three main heuristics to identify tracking behavior:
- Third-party cookies. Cookies are the simplest and most common tracking tools on the web. Privacy Badger considers a domain to be tracking if it sets a third-party cookie with enough information to uniquely identify an individual user. In our own experiments, we’ve found that around 98% of the tracking activity identified by Privacy Badger uses third-party cookies.
- Local storage “supercookies.” Third-party domains that can run JavaScript are able to set values in the browser’s local storage, then retrieve them later to track user activity across sites. Because these values act a lot like cookies but can evade common cookie-blocking tactics, they are sometimes referred to as “supercookies.” Privacy Badger looks for reads and writes of lots of information to third-party local storage and marks those as tracking.
- Canvas fingerprinting. Trackers can also use JavaScript to try to extract a “browser fingerprint,” a value that can uniquely identify your device without the use of stored values like cookies. Privacy Badger looks for some of the most common kinds of fingerprinting using the HTML canvas, and marks those actions as tracking.
These heuristics help Privacy Badger identify and block the majority of tracking requests on the web. But a while ago, we noticed that one particularly notorious data collector was evading our filters: Google Analytics.
Because Google Analytics doesn’t use third-party cookies, local storage supercookies, or browser fingerprinting to collect data about users, it wasn’t caught by any of Privacy Badger’s existing heuristics. However, it is a silent passenger on a huge portion of the Web, and one that collects information about users and sends that data back to Google. Moreover, Google Analytics is included on nearly every popular human-curated block list, including EasyPrivacy and Disconnect. Any intuitive definition of “tracking” probably includes what Google Analytics does, but Privacy Badger’s definition didn’t.
What Google Analytics does make use of is cookie sharing. Cookie sharing is a tracking technique most often used by third-party analytics services. It can also help trackers sidestep restrictions on third-party cookies, like Safari’s Intelligent Tracking Protection (ITP) and Firefox’s default content blocking.
It works like this: When you visit a website, the page loads a piece of JavaScript from a third-party server. That JavaScript runs in a first-party context and sets a cookie associated with the first-party domain, like “example.com.” Your browser allows the third-party JavaScript (running as part of the first-party page) to read and update the cookie. Then, the JavaScript sends off a request to the third-party tracker. Normally, cookies are automatically sent alongside requests, and the browser controls who sees what cookies—it wouldn’t allow a first-party cookie to be sent to a third party like Google. However, since Google’s script is able to access the cookie, it can stick the cookie value right into the request itself (specifically, into the “query string” portion of the request). Google receives the identifier from the first-party cookie and uses it to link the request back to a user profile.
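To make the flow concrete, here is a minimal Python sketch of what such a script effectively does. It is an illustration only, not any real tracker's code: the host `tracker.example`, the `/collect` path, the cookie name `_tid`, and the `id` and `page` parameter names are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical tracker endpoint; not a real service.
TRACKER_ENDPOINT = "https://tracker.example/collect"


def build_request_url(first_party_cookies, page_url):
    """Copy a first-party cookie value into the query string of a third-party request.

    The browser would never attach the first-party cookie to this request
    on its own, so the script carries the value over by hand.
    """
    visitor_id = first_party_cookies["_tid"]  # hypothetical cookie name
    query = urlencode({"id": visitor_id, "page": page_url})
    return TRACKER_ENDPOINT + "?" + query


# Cookies as the script would see them on the first-party page.
cookies = {"_tid": "abc123def456", "theme": "dark"}
url = build_request_url(cookies, "https://example.com/article")

# The tracker's server can read the identifier straight out of the URL.
print(parse_qs(urlsplit(url).query)["id"])
```

The point of the sketch is that nothing here relies on third-party cookies: the identifier travels as ordinary request data, which is why cookie-blocking alone does not stop it.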
Cookie-sharing trackers, including Google Analytics, often rely on “tracking pixels” to work. A tracking pixel is typically an invisible, 1x1 “image” that is placed on a web page for the sole purpose of triggering a request to a third party. That means most cookie sharing is undetectable to all but the most tech-savvy users.
On its own, cookie sharing is usually not as effective at tracking users as traditional third-party cookies. Because the first-party sites don’t share cookie data with each other, a third-party tracker will associate the same user with a different cookie value on each first-party site the user visits. This makes it more difficult for the tracker to link data from different first-party sites to the same user. But if the tracker sees requests from the same user on two different first-party sites in rapid succession, it can use other identifying information, like IP address or TLS state, to link different cookie values to the same user. By some measures, Google Analytics is present on as much as 80% of the Web, so it gets information about nearly every site most users visit, and linking those identities together is a cinch. But don’t take our word for it; Google’s privacy policy says as much (emphasis ours):
Google Analytics relies on first-party cookies, which means the cookies are set by the Google Analytics customer. Using our systems, data generated through Google Analytics can be linked by the Google Analytics customer and by Google to third-party cookies that are related to visits to other websites.
Let’s look at how this might work in practice. Imagine a user browses to a handful of sites over the course of a day, each of which has the same cookie-sharing tracker on it. The table below shows the information that the tracker gets from each visit.
| Time | Source | Shared cookie ID | IP address |
|---|---|---|---|
| 10:25 am | theguardian.com | abc123 | 192.168.124.101 |
| 10:46 am | studentaid.ed.gov | def456 | 192.168.124.101 |
| 2:41 pm | newegg.com | cba321 | 192.168.131.92 |
| 2:55 pm | plannedparenthood.org | xyz789 | 192.168.131.92 |
| 3:02 pm | theguardian.com | abc123 | 192.168.131.92 |
On the first two sites the user visits, the tracker sets a first-party cookie in the user’s browser that is unique to the first-party site. The user’s ID cookie is “abc123” on The Guardian’s website, and “def456” on ed.gov. But because the user visits the two sites in rapid succession, their IP address remains the same, so the tracker can infer that the “abc123” user and the “def456” user are one and the same.
Later in the day, the same user gets back online and visits Newegg and Planned Parenthood’s websites. The user’s IP address has changed, so the tracker doesn’t know that the “cba321” and “xyz789” identities point to the same user as before. However, the user then re-visits The Guardian’s website, and the tracker sees another request from user “abc123” coming from the new IP. This tells the tracker that the requests from Planned Parenthood and Newegg probably came from the same user, and lets it link all the recent activity it’s seen back to a single identity.
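The linking described above can be sketched in a few lines of Python. This is a simplified illustration, not a description of how any real tracker works: it merges any two identities that were ever seen from the same IP address and ignores timing, and the function name `link_identities` is our own.

```python
from collections import defaultdict

# The five requests from the example: (time, site, shared cookie ID, IP address).
observations = [
    ("10:25", "theguardian.com", "abc123", "192.168.124.101"),
    ("10:46", "studentaid.ed.gov", "def456", "192.168.124.101"),
    ("14:41", "newegg.com", "cba321", "192.168.131.92"),
    ("14:55", "plannedparenthood.org", "xyz789", "192.168.131.92"),
    ("15:02", "theguardian.com", "abc123", "192.168.131.92"),
]


def link_identities(observations):
    """Group per-site cookie IDs that share an IP address, directly or transitively."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # Each request ties one (site, cookie ID) identity to one IP address.
    for _time, site, cookie_id, ip in observations:
        union(("cookie", site, cookie_id), ("ip", ip))

    # Collect the cookie identities that ended up in the same group.
    clusters = defaultdict(set)
    for node in list(parent):
        if node[0] == "cookie":
            clusters[find(node)].add((node[1], node[2]))
    return list(clusters.values())


for cluster in link_identities(observations):
    print(sorted(cluster))
```

In this sketch the repeat visit to The Guardian is what joins the morning and afternoon sessions: "abc123" appears with both IP addresses, so all four per-site identities end up in a single group.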
Detecting Cookie Sharing
In the latest update, Privacy Badger has a new heuristic to detect cookie sharing. Every time it sees a third-party request, it runs a series of checks:
- Is the request an image request? The majority of cookie sharing uses 1x1 pixel “images,” so we ignore other kinds of requests.
- Does the request URL have query arguments? These are usually used to convey extra information with an image request.
- Do any of the query arguments contain a long segment of a first-party cookie? This is what we’re really looking for. If any of the query arguments have a large chunk of information (8 characters or more) in common with any of the first-party cookies on the page, the request is probably trying to share a tracking cookie.
If all of the above conditions match, Privacy Badger logs the request as a tracking action.
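The three checks can be sketched roughly as follows. This is a Python illustration under our own assumptions, not Privacy Badger's actual code: the function names, the `"image"` request-type label, and the choice to compare cookie values (rather than names) against query values are all hypothetical. Only the 8-character threshold comes from the description above.

```python
from urllib.parse import urlsplit, parse_qsl

MIN_SHARED_LENGTH = 8  # "8 characters or more" in common


def longest_common_substring_length(a, b):
    """Length of the longest run of characters that appears in both strings."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def looks_like_cookie_sharing(request_type, request_url, first_party_cookies):
    """Apply the three checks to a single third-party request."""
    # 1. Only image requests are considered.
    if request_type != "image":
        return False
    # 2. The URL must carry query arguments.
    query_args = parse_qsl(urlsplit(request_url).query)
    if not query_args:
        return False
    # 3. Some argument must share a long segment with a first-party cookie.
    for _name, value in query_args:
        for cookie_value in first_party_cookies.values():
            shared = longest_common_substring_length(value, cookie_value)
            if shared >= MIN_SHARED_LENGTH:
                return True
    return False


cookies = {"_tid": "u-83f1c2a9d4b7"}  # hypothetical first-party cookie
pixel = "https://tracker.example/pixel.gif?id=83f1c2a9d4b7&event=pageview"
print(looks_like_cookie_sharing("image", pixel, cookies))
```

Matching on a shared substring rather than on exact equality means the check still fires when a tracker wraps the identifier in a prefix or suffix before sending it.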
After building the new heuristic, we tested it using Badger Sett, our in-house tool for scanning the web with Privacy Badger. We scanned the top 10,000 first-party websites on the Majestic Million and recorded the number of times each third-party domain was logged taking a particular tracking action. This allows us to see which new domains Privacy Badger will learn to block using the new heuristic, as well as to make sure it doesn’t mark too many benign requests as tracking.
The table below shows the five domains that Privacy Badger newly identified as tracking on the most sites:
| Tracking domain | Business | Number of first-party sites domain was seen tracking on |
|---|---|---|
| google-analytics.com | Third-party analytics | 5479 |
| chartbeat.net | Third-party analytics | 659 |
| nexac.com | Advertising | 220 |
| bouncex.net | Identity resolution | 151 |
| alexametrics.com | Analytics, market research | 140 |
Google Analytics is by far the most common tracker identified by the new heuristic, but all of the top five are what we would consider “trackers.” Four of the five are included on the Disconnect blocklist (the list used by Firefox’s content-blocking feature): Google Analytics, Chartbeat, Nexac, and Amazon’s Alexa Metrics. The fifth, BounceX, advertises itself as a service to “accurately recognize and market to the actual person behind every visit in real-time.” Sounds an awful lot like tracking to us.
Track Changes
The techniques used by trackers are always evolving, so Privacy Badger’s countermeasures have to evolve, too. In the process of developing the new cookie-sharing heuristic, we learned more about how to evaluate and iterate on our detection metrics. As a result, Privacy Badger is stronger than ever. When the next generation of corporate surveillance technology hits the web, we’ll be ready.