InContext / An inside look at the business of digital content

Is your data leaking?

January 6, 2016 | By Don Marti @dmarti

We hear a lot about ad blocking and ad fraud, but surprisingly little about data leakage, the underlying problem that feeds both. What is data leakage, anyway?

Sophia Cope at NAA: Data leakage is the collection and monetization of an online publisher’s audience data by a third party without the publisher’s knowledge and consent.

John McDermott at Digiday: Data leakage typically occurs when a brand, agency or ad tech company collects data about a website’s audience and subsequently uses that data without the initial publisher’s permission.

Note that these definitions include nothing about user permission. That’s a separate issue. Data leakage is a publisher/third-party problem, not a publisher/user problem. Both definitions are based on publisher permission.

That’s a good start. But “permission” is hard. The reality is that publishers often don’t understand all of the third-party resources on their own sites. And to make things worse, third parties bring along other third parties … with the result being that pages can end up with 50-70 third-party pixels, scripts, and iframes. (There’s even a zone on the Lumascape devoted to services that a publisher can use to figure out what other services they’re using.)
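
One rough way to start taking inventory is to count the third-party hosts referenced in a page's HTML. The sketch below is only an illustration, not a tool mentioned in this article: the names `ThirdPartyCounter` and `count_third_parties` are hypothetical, and it assumes that "third party" simply means any host other than the publisher's own domain or its subdomains. It reads static markup only, so it will miss the resources that third parties load at runtime, which is exactly where much of the pile-up comes from.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class ThirdPartyCounter(HTMLParser):
    """Counts script, img, and iframe sources served from other hosts."""

    TAGS = {"script", "img", "iframe"}

    def __init__(self, first_party_host):
        super().__init__()
        self.first_party_host = first_party_host
        self.hosts = Counter()

    def _is_first_party(self, host):
        # Assumption: subdomains of the publisher's host count as first party.
        fp = self.first_party_host
        return host == fp or host.endswith("." + fp)

    def handle_starttag(self, tag, attrs):
        if tag not in self.TAGS:
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).hostname
        # Relative URLs have no hostname, so they are treated as first party.
        if host and not self._is_first_party(host):
            self.hosts[host] += 1


def count_third_parties(html, first_party_host):
    """Return a Counter mapping third-party host -> number of resources."""
    parser = ThirdPartyCounter(first_party_host)
    parser.feed(html)
    return parser.hosts
```

Calling `count_third_parties(page_html, "publisher.example")` on a saved copy of a page gives a lower bound on the number of outside hosts involved. A fuller audit would need to observe the network requests a real browser makes while loading the page.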

Every one of those third parties comes with some kind of contractual “permission.” All of those permissions, though, are expressed in complex contracts. Even if a publisher can afford to comprehend the terms of one contract, the resources on a page, working in combination, can introduce unwanted data transfers that no single contract describes. And the combinations of technology and contracts change faster than anyone can evaluate them.

Learning from security software
It’s inadvisable for security companies to call malware “malware” when the malware developer has a lawyer. The solution has been to come up with a new category name, Potentially Unwanted Programs (PUPs). The definition is based on whether the software works against the user’s interests, not on whether the user has clicked on “I agree” after not reading a complex, deceptive contract.

Labeling programs as PUPs is a small win. It is possible to define a problem in terms of interests and norms, not in terms of what someone has “agreed” to. We will be able to have a more productive discussion of the data leakage problem if we can shift the definition from one based on “permission” to one based on publisher interests and business norms. Remember: Data leakage is not a problem about user permission (which is another story). Instead, user norms and expectations can be a helpful way to establish the definition of data leakage.

So here’s a new definition:
Data leakage is collection or transfer of publisher audience data in a manner that violates the norms and expectations of the members of the audience.

Why is it important to clearly define data leakage? Because we still have trouble collecting data on it, and we can't collect data until we know what to measure. We don’t know how much of newspapers’ ongoing difficulty in building online ad revenue is attributable to data leakage.

In order to measure data leakage, we have to define it based on a reasonable standard, not on a mess of contracts that results in publishers unknowingly “agreeing” to practices that work against their interests.

Finding data leakage
We can look for leaks at the top of the pipe and at the bottom:

Bad bots visiting good sites: Fraudulent ad inventory gains value from associated user data, which is why even the best-run sites still get some fraudbot traffic. Bots are collecting cookies on legit sites, then showing up as valuable “users” on fraud sites. If your site has a small bot problem, then any advertiser paying to reach your users on other sites will have a big bot problem.

Good audiences available in bad places: We don’t have access to the “user data chop shops” that sit between legit publishers and sales of targeted impressions based on publisher data. But we can look at demand-side platforms to find out where valuable impressions are showing up, and measure the impact of anti-data-leakage tests.

Data leakage is an important issue that must be better understood and addressed. We can’t deal with ad blocking, ad fraud, and low ad revenue without understanding the data leakage problem. The good news is that data leakage is measurable, if we work it from both the legit publisher end and the questionable inventory end.


Don Marti (@dmarti) is a contributor of code and documentation to the aloodo.org project, a low-friction way for sites and brands to reclaim the value of online advertising from fraud and ad blocking. He serves as a strategic advisor for Mozilla, and is the former editor of Linux Journal. Don is the subject of an out-of-date Wikipedia article which he will not edit himself, but wishes that someone would.
