Owning Your Data
In my last post, I briefly described my interest in working on the next evolution of the internet, which I’ll refer to as the decentralized web (also known as web 3.0). In this post, I’ll explore one area in particular that I find promising: the decentralization (and encryption) of personal data.
Despite its benefits, there are many problems with today’s internet. One of the biggest is that you have little control over what data is collected about you, where it’s stored, who has access to it, and when and how it’s used. The vast majority of information about you lives in centralized databases owned (or at least controlled) by companies behind the apps and websites you use.
This is no coincidence. Data is valuable stuff. The biggest companies in the world—Alphabet (i.e. Google), Amazon, Microsoft, Apple, Facebook—are also the companies with the most data about us. This data can be sold to advertisers, used to improve and personalize products, combed for information about competitors, and much more.
While the benefits to companies are many, this arrangement is not in the best interest of individual consumers. Both regulators and the general public are starting to take note. Flawed as they may be, the EU’s General Data Protection Regulation (GDPR, enforced as of 2018) and the California Consumer Privacy Act (CCPA, enforced as of 2020) are the biggest initiatives to date aimed at empowering individuals to control their own data. More governments are likely to follow suit (some already have).
However, while increased regulation may help stem the worst abuses, it’s ultimately a Band-Aid solution. Even if companies do their best to obey the law and do the right thing, they can (and often do) get hacked, hire dishonest or incompetent people, and make mistakes.
The root of the problem is that individuals don’t own and control their own data. The most direct solution is, therefore, to shift data ownership from companies to consumers.
When you use a website or app, you should grant explicit permission for it to access information about you from a collection of data—let’s call it a personal data store (PDS)—that is private, secure, and completely under your ownership and control. When data is generated as a byproduct of using that website or app (sometimes called “digital wake”), that data should be stored in your PDS—not in a database belonging to the company behind the product.
For example, imagine you’re creating an account on a new social media app, Faceblock (a fictional app). Instead of creating a username and password, you select the option to Log in with MyDataLocker (a fictional PDS).
Faceblock needs some basic personal info to populate your profile, so it asks if it can access your name, email, date of birth, short bio, and headshot from your PDS. The first two items are required to create an account, but the last three are optional. You see no value in others knowing your age, so you decline to provide your date of birth. You’re not entirely pleased with your short bio, but you know you can update or remove access to it at any time.
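To make the required/optional distinction concrete, here is a minimal sketch of how a PDS might evaluate such a request. Everything here is hypothetical: the field names, the `grant` function, and the idea that apps declare each field as required or optional are illustrative assumptions, not any real PDS API.

```python
# Hypothetical scoped permission request, as a PDS might receive it.
# Field names and the required/optional scheme are illustrative only.
REQUESTED = {
    "name": "required",
    "email": "required",
    "date_of_birth": "optional",
    "short_bio": "optional",
    "headshot": "optional",
}

def grant(requested, user_approved):
    """Return only the fields the user approved; fail if a required field is declined."""
    missing = [field for field, level in requested.items()
               if level == "required" and field not in user_approved]
    if missing:
        raise PermissionError(f"required fields declined: {missing}")
    return {field: True for field in user_approved if field in requested}

# The user declines date_of_birth but approves everything else:
scopes = grant(REQUESTED, {"name", "email", "short_bio", "headshot"})
print(sorted(scopes))  # date_of_birth is simply absent from the grant
```

The key design point is that the app never sees declined fields at all; declining an optional field just narrows the grant, while declining a required one blocks account creation.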
Over the next few weeks, you use the app regularly—sharing images, writing status updates, following various interests, and connecting with other Faceblock users who share the same interests. All of this data is stored in your PDS—Faceblock stores nothing other than a “pointer” to your PDS. Every time you log in, the app connects directly to your PDS to load the relevant data (images, posts, etc.) instead of loading the data from a centralized Faceblock database.
As the weeks pass, you find yourself using the app less frequently, to the point where you don’t think it makes sense to maintain your account. You log into your MyDataLocker account with a secure password and two-factor authentication (or perhaps even a physical key) and locate Faceblock in a searchable list of all apps that currently have access to your data.
Conveniently, you can see exactly what data each app in the list has access to, when you granted that access, and a log of every date and time the data was accessed. You click a single button to revoke access to Faceblock and automatically close your account.
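As a rough illustration, that access ledger and one-click revocation could look something like the following toy model. To be clear, this is my own sketch, not how any real PDS works: the class, its methods, and the log format are all invented for illustration.

```python
from datetime import datetime, timezone

# Toy model of the ledger a hypothetical PDS might keep:
# per-app scopes, when they were granted, and a log of every read.
class DataLocker:
    def __init__(self):
        self.grants = {}   # app -> {"scopes": set, "granted_at": datetime}
        self.log = []      # (app, field, timestamp) tuples

    def grant(self, app, scopes):
        self.grants[app] = {"scopes": set(scopes),
                            "granted_at": datetime.now(timezone.utc)}

    def read(self, app, field):
        if field not in self.grants.get(app, {}).get("scopes", set()):
            raise PermissionError(f"{app} has no access to {field}")
        self.log.append((app, field, datetime.now(timezone.utc)))
        return f"<{field} from your PDS>"

    def revoke(self, app):
        """One call removes the grant; any later read fails."""
        self.grants.pop(app, None)

locker = DataLocker()
locker.grant("Faceblock", {"name", "email"})
locker.read("Faceblock", "email")
locker.revoke("Faceblock")
# Further reads by Faceblock now raise PermissionError,
# but the audit log survives revocation.
```

Notice that the audit log lives with you, not with the app, which is what makes the “list of every access” in the scenario above possible.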
In this hypothetical, the data is decentralized from the perspective of Faceblock, since they no longer have everything in a single database they control. But from your perspective, all of your personal data appears to be “centralized”, albeit in a database you control.
But under the hood, this is not the case. Your data is actually decentralized, meaning copies (or pieces) of it are stored on many different computers across a worldwide, peer-to-peer storage network (like IPFS). Further, your data is encrypted so that even if someone gained access to your data on one or more of these computers, it would be completely unreadable (i.e. useless) to them. Additionally, since no one particular person or company is responsible for “hosting” your data, no one can tamper with or destroy it. It’s what techies call persistent.
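A minimal sketch of that storage model: data is encrypted on your device, then addressed by the hash of its ciphertext, which is roughly how IPFS-style content addressing works. The XOR “cipher” below is a deliberately toy stand-in for real encryption (a production system would use an authenticated cipher such as AES-GCM); only the overall shape of the scheme is the point.

```python
import hashlib
import secrets

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR stream: a placeholder for a real cipher. XOR is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

profile = b'{"name": "Alice", "bio": "hello"}'
key = secrets.token_bytes(32)          # only you hold this key

ciphertext = toy_encrypt(profile, key)
address = hashlib.sha256(ciphertext).hexdigest()  # where peers find the blob

# Peers storing the blob see only ciphertext -- unreadable without the key...
assert ciphertext != profile
# ...while you can always decrypt it back:
assert toy_encrypt(ciphertext, key) == profile
# Because the address is derived from the content itself, any tampering with
# the stored bytes changes the hash and is immediately detectable.
```

Content addressing is also what makes tampering hard: a modified blob simply has a different address, so it can never masquerade as your data.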
Several organizations/products are already working on this problem, though not all exactly as I’ve described it above. In no particular order:
Based on my limited research, Blockstack is by far the furthest along with about one million “verified users”. Users of Blockstack-enabled apps own and control their data, but that data is stored with a traditional cloud provider. However, the data is encrypted and there are plans to allow alternative storage providers in the future, which could (in theory) include a peer-to-peer network like IPFS.
Solid and the company behind it, Inrupt, are interesting because they were started by the inventor of the world wide web, Tim Berners-Lee. It’s less obvious to me how far along it and the other projects are, though I’d like to learn more.
Several of the projects listed on the Wikipedia page for personal data service (another term for PDS) appear to have shut down. I’d be interested to know why.
I think the concept of everyone owning their own data is exciting on its own, but there are several questions that immediately come to mind that warrant further exploration. If it’s possible for everyone to own their own data and grant access to it on an as-needed basis…
What’s the incentive for existing companies to shift from their centralized data architecture to a decentralized one? A few things immediately come to mind: reduced risk/liability, simpler infrastructure, and potential to access lots of data already stored in a user’s PDS.
Without unlimited access to user data, how will companies perform manual data analyses and build automated systems that require access to the data of many users? E.g. will users allow access to their anonymized data for research or product personalization for a small fee?
What will be the impact on AI research overall? AI research has flourished in recent years with the arrival of so-called “big data”. If companies stop hoarding data, will they still be able to innovate with respect to AI?
How would it change the relationship between consumers and advertisers? E.g. will advertisers offer to pay users for access to their anonymized data?
What would be the impact on companies that rely on advertising for the bulk of their revenue? Google, Facebook, Twitter, etc.
To what extent do digital identity and data storage need to be developed in tandem? Clearly they are related, since your personal data is associated with who you are (Blockstack has developed solutions for both). But it may be more feasible to decouple the two since digital identity is already becoming a crowded space and likely to become more crowded over time.
How can a website or app be prevented from saving data to a centralized database after accessing it from a PDS? E.g. is it possible to ensure the data is “read-only” or to automatically flag an app that appears to be storing data elsewhere?