MRMCD 2017

»Privacy by Design : Collecting data in a socially responsible manner without privacy side effects.«
2017-09-01, 21:00–21:50, Ziergarten

Data from users is needed to build great products. Google, Facebook, and even DoubleClick would not be able to offer their services without large amounts of data. No company or product is an exception: the industry's current modus operandi is "collect-all-you-can", and it is bound to produce privacy side effects.

We would like to share an alternative approach based on client-side aggregation. This talk will show that it is entirely possible to collect and send data anonymously, without any user identifiers or PII, and still build data-driven products such as a browser with a fully integrated search engine (Cliqz), anti-tracking, and anti-phishing, without compromising users' privacy at any time.

Privacy by Design means that the complete architecture, from collecting and processing to serving the data, is built with privacy and security in mind from the ground up. The term Big Data needs little introduction; it is the foundation of today's services, and without it most of the services we enjoy today could not exist. The industry modus operandi can be described as "collect-all-you-can"; this behavior is not only accepted but encouraged. This approach to data collection is dangerous, however, especially when the data captures our everyday actions: when the subject of data collection is human activity on the Web, side effects must be taken seriously. What is a side effect in the context of data collection? In simple terms, it is a piece of knowledge that can be learned from data analysis, that was neither intended nor expected, and that poses a risk to the privacy of the subjects of the data collection. During this talk we will provide evidence of side effects in popular systems such as Google Analytics that compromise the privacy of users, even to the point of revealing personally identifiable information (PII).

Unfortunately, the current approach to data collection, where aggregation happens only on the collector's backend, makes these side effects unavoidable. At Cliqz we faced the same problem when designing our data collection system. We needed data from our users to build our services: a browser with an integrated search engine, news recommendation, security services such as anti-phishing and anti-tracking, and so on. However, we were very troubled by the side effects. That is why we created the Human-Web framework [1], a novel approach to data collection that relies on client-side rather than server-side aggregation. Since server-side aggregation of users' data is bound to produce side effects, it is strictly forbidden. Our data collection backend receives signals from our users only if those signals require no further aggregation. If aggregation by user is required, it is carried out in the client itself. Thus no UID ever reaches our data collection backend. In addition to removing explicit UIDs, we use a complex set of heuristics to detect potential implicit UIDs in the content of the signals.

This talk, however, is not just about how we at Cliqz build privacy-preserving data collection systems, which have been running in production for over two years now. The motivation behind it is to showcase design choices and exciting new technologies that can help anyone collect and transport data in a privacy-preserving manner, for any number of use cases. For instance, the same approach can be used to build a privacy-preserving Google Analytics (as can be shown in a demo) and to solve problems like re-targeting, complex telemetry for business intelligence, goal conversions, or, as in our case, building a complete search engine.
The intended audience for this talk is anyone who deals with data. As a takeaway, they will learn how a slight change in methodology allows data to be collected in a much saner and safer way, ultimately protecting and respecting users' privacy. To demonstrate this further, we have built a demo Google Analytics-style service [2] that covers a wide range of metrics that would normally require tracking, for non-trivial use cases such as unique visitors, goal completion, return rate, and time on page. It relies instead on the principles of privacy by design, yet achieves what services like Google Analytics do.
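
As a rough illustration of the client-side aggregation idea described above, the following minimal Python sketch counts unique visitors without any user identifier reaching the backend. It is not the Human-Web or Cliqz implementation: the `Client` class, the `collect` function, and the signal fields are hypothetical names chosen for this example, and the sketch assumes that each client can keep a small amount of local state.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical browser-side state; it never leaves the device."""
    seen: set = field(default_factory=set)  # (site, day) pairs already reported

    def visit(self, site: str, day: str):
        """Return a signal to send, or None if this visit adds nothing new."""
        key = (site, day)
        if key in self.seen:
            return None  # repeat visit: aggregated away locally
        self.seen.add(key)
        # The signal carries no user identifier, only the already-aggregated fact.
        return {"site": site, "day": day, "event": "unique-visit"}


def collect(signals):
    """Backend side: count signals as they arrive, with no per-user grouping."""
    return Counter((s["site"], s["day"]) for s in signals)


alice, bob = Client(), Client()
visits = [
    alice.visit("example.org", "2017-09-01"),
    alice.visit("example.org", "2017-09-01"),  # repeat, suppressed on the client
    bob.visit("example.org", "2017-09-01"),
]
signals = [s for s in visits if s is not None]
unique_visitors = collect(signals)
print(unique_visitors[("example.org", "2017-09-01")])  # prints 2
```

Because deduplication happens on the client, the backend only has to count incoming signals and has nothing it could use to link two signals to the same person. A real system would also need to check the signal content for implicit identifiers, which this sketch does not attempt.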

Footnotes:
[1] https://cliqz.com/en/whycliqz/human-web
[2] http://josepmpujol.net/public/papers/big_green_tracker.pdf