I see the current economic trend as a combination of a growing sharing economy, where people rent out what they own for income, and big data applications, where people volunteer their data for free. This is not sustainable. The people renting out their possessions are not pricing the rents to reflect their capital investment and depreciation. The new income is not sufficient to assure future rentable property.
Meanwhile, the people enjoying the benefits are releasing their data for companies to create ever more disruptive businesses that will put many more people out of work. Many disruptive businesses enjoy high capitalization with very small staffs and virtually no infrastructure. Eventually the income flows will dry up: more people will have fewer possessions to rent out, and more people will have less income to subscribe to new apps or broadband plans.
I wrote much earlier of the trend toward increasing part-time employment, with less predictable income as a result. I argued that to survive, future employees are going to have to become more active in the data science that is impacting their lives. In that post, the advice was aimed at preserving income or being better able to predict it. It is unlikely to bring more income.
In later posts, I explored some ideas about data-driven government or politics. In these posts, I assumed a free flow of data to centralized data stores accessible to government. In particular, the futuristic dedomenocracy, a mix of libertarianism and authoritarianism, implicitly has absolute access to all data generated by everyone. Such a government would involve an absolute absence of privacy, but it would emerge gradually enough that people would accept it without complaint.
Missing from this discussion of the future of government by data is how people will actually earn a living. I envisioned a benevolent form of government that granted widespread liberties with the risk of occasional but brief government intervention. The actual economy under this government is likely to be more impoverished, with few opportunities for human expertise.
On the other hand, my vision of dedomenocracy is that it will make rules that strive to increase benefits to society. Such an optimistic view excuses me from thinking much about the economy: the data algorithms will figure something out. The one advantage of dedomenocracy is that it rejects human intellectual contributions: it excuses my laziness. Things will be just fine under a dedomenocracy. In data, we trust.
Yesterday, I saw this video interview about IBM’s plans for data analytics on a grand scale covering all medically relevant data. One key point I want to focus on is the speaker’s confidence in advanced anonymization algorithms to protect the privacy of individuals despite the system’s access to extensive health-related data about each individual. I recently wrote a post describing my confidence that anonymity is impossible when big data has access to a sufficient number of dimensions about the individual. In that article I divided the risks into two categories:
- less likely: targeted attacks, where someone deliberately seeks information about a specific individual
- inevitable: accidental disclosures, where multidimensional categories have a membership population of exactly one.
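The second category can be sketched concretely. In this hypothetical fragment (all records and field names are invented for illustration), simply grouping anonymized records by a few demographic dimensions is enough to produce cells whose membership is exactly one:

```python
from collections import Counter

# Hypothetical anonymized records: direct identifiers removed, but several
# demographic dimensions remain (all values invented for illustration).
records = [
    {"zip": "20170", "age_band": "40-49", "sex": "F", "diagnosis": "flu"},
    {"zip": "20170", "age_band": "40-49", "sex": "F", "diagnosis": "asthma"},
    {"zip": "20170", "age_band": "70-79", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "20171", "age_band": "40-49", "sex": "F", "diagnosis": "flu"},
]

# Count how many records share each combination of quasi-identifier values.
dims = ("zip", "age_band", "sex")
cells = Counter(tuple(r[d] for d in dims) for r in records)

# Any cell with a membership of exactly one pins the sensitive value
# (here, the diagnosis) to a single, potentially re-identifiable person.
singletons = [cell for cell, count in cells.items() if count == 1]
print(singletons)  # two cells of one: the 70-79 male and the lone 20171 record
```

No attacker is needed for this to happen; an ordinary aggregate query can surface these cells on its own, which is why I call this category inevitable.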
Compared to the deliberate targeting of an individual, the bigger risk is the insider analyst taking the opportunity to exploit accidentally disclosed identities for unethical purposes. The IBM Watson team seems to think this is nothing to be worried about.
I have a more pessimistic view of privacy protection once data becomes part of big data stores with even a tiny fraction of the dimensions discussed for the IBM Watson Health Cloud. I lack the authority or income of the IBM’ers, but I know what an alert analyst can spot when presented with large outputs of aggregated/categorized data. After all, the advantage of rich visualization is that it exploits the human capacity to recognize in an instant something that is out of place. Maybe the clever scientists at IBM have made Watson clever enough to spot it first and withhold it from the analyst’s view. But if Watson were so clever, there would be no point in investing so much in fancy visualization for humans. Watson could just jump straight to the answers and save the energy of processing and projecting visualizations.
The issues of privacy protection have come up only recently in these blog posts. My past experience did not directly involve personally identifiable information. Although I didn’t have direct experience with identity protection, I did have a lot of experience with discovering new hypotheses. An identity is just another hypothesis that could be discovered.
I have devoted this blog to generalizing the experiences that led me to discover hypotheses.
One of those generalizations was my attempt to describe my approach of processing data in chained stages, which I compared to industrial supply chains. In the supply chain, the data undergoes successive refinement with accompanying categorization and aggregation (mapping and reducing). At the final stage, all of the data comes together into a single multidimensional database available for querying with SQL or MDX. The final stage used very common and readily available database technologies.
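The chained-refinement idea can be illustrated with a minimal sketch. This is not my original system; the records and field names are invented, and each stage simply categorizes (maps) and aggregates (reduces) before handing its output downstream:

```python
from collections import defaultdict

# Invented sample records standing in for raw upstream data.
raw = [
    {"region": "east", "product": "A", "units": 3},
    {"region": "east", "product": "A", "units": 5},
    {"region": "west", "product": "B", "units": 2},
]

def stage(records, key_fields, measure):
    """One supply-chain stage: categorize (map) then aggregate (reduce)."""
    totals = defaultdict(int)
    for r in records:
        totals[tuple(r[f] for f in key_fields)] += r[measure]
    return [dict(zip(key_fields, k), units=v) for k, v in totals.items()]

# Successive refinement: first by region and product, then by region alone.
by_region_product = stage(raw, ("region", "product"), "units")
by_region = stage(by_region_product, ("region",), "units")
print(by_region)  # [{'region': 'east', 'units': 8}, {'region': 'west', 'units': 2}]
```

The output of the final stage is what would land in the single multidimensional database for SQL or MDX querying.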
Although this described my design, I became aware very early on of the need to defend the reputation and quality of the data source. This is distinct from the need to defend the reputation of a data warehouse project.
In well-run software development life cycles with code review and testing, it is natural for developers to trust their code more than they trust the source data. When anomalies come up in production, the developer’s first reaction is to blame the source data or operator error. After all, the code has been thoroughly tested and approved.
In contrast, my first suspect for any anomaly was my own software. After confirming the software didn’t introduce the anomaly, I worked backwards to find a transmission or handling error before I would consider blaming the source for faulty data. My project was one of many that used the same data sources. These different projects had many different objectives and reporting timelines. Blaming the source data to excuse some anomaly risks impacting all of those other projects. I’ve seen this happen many times. Someone will accuse the source data itself of being imperfect. The immediate implication is that everyone using that same data could be harmed by that imperfection. Such claims triggered many simultaneous investigations by the various projects to see how the problem affected them. Almost every time, it turned out the original claim was mistaken or at least unfairly exaggerated.
This experience alerted me to the fragility of the source’s reputation when the source releases its data in bulk for unpredictable consumers to use and interpret on their own. A less experienced or untrained user of that bulk data can tarnish the source’s reputation with a single blunder in how he handled the data or how he interpreted its meaning.
As bulk data moves further from the source and is used for higher-level analysis incorporating more data dimensions, the risk of the bulk data being abused increases. In my last post, I gave the example of a CEO-level dashboard using data from the numerous components of the business, such as the shipping department. The data scientists working on the CEO-level reports may not understand all of the nuances of shipment-processing data. They could blame shipment processing rather than recognize their own faulty conclusions. At a minimum, this can raise doubts about everything else that relies on that same data. Often it will result in costly reviews and restart the calendar on everyone’s effort to recover their original (and correct) confidence in that data.
All data sources take huge risks in releasing their data in bulk for downstream data warehouses. Once the data leaves its data stores, the source loses all control over how that data is used and interpreted. The risk to the source is the potential damage to its reputation. A damaged reputation could put the data source out of business.
In most cases, the downstream data warehouses want bulk data to enrich some other data they possess. For example, a marketing department may request a bulk transfer of data from the shipping department in order to enrich its own marketing data. It may not have a full understanding of the subtleties that can occur in shipping, but it will insist on bulk data so it can have maximum flexibility for future unforeseen needs.
My experience involved a completely unrelated project, but I’ll hazard to use this as an analogy. The shipping department’s data may have a field that confirms a shipment when the package is placed on a pallet, while marketing may interpret the same field to mean that the package is placed in a vehicle. The two interpretations may coincide nearly all the time, but there may be rare cases where the pallet contents need to be unpacked and repacked, causing a delay before the package is placed on a vehicle. This roughly corresponds to a real scenario I encountered. I needed a field to mean something important to my project, but the field in the source’s context had a different meaning. Eventually I learned that my field was model-generated data (dark data) based on the available observation data. The solution was to create a new field for what I meant and use algorithms to compute its value from the field provided to me. Before implementing that corrective solution, there was a period when I was misusing the source data, and this caused recurring confusion until I finally figured out there was a semantic error.
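The corrective solution can be sketched in terms of the shipping analogy. The field names, the repack adjustment, and the sample record below are all hypothetical; the point is only that the downstream project derives its own field with its own intended meaning, rather than silently reinterpreting the source’s field:

```python
from datetime import datetime, timedelta

def derive_loaded_on_vehicle(record):
    """Derive a new downstream field ('loaded on vehicle') from the
    source's own field ('confirmed on pallet') plus any repack delay,
    instead of reinterpreting the source field to mean something else."""
    loaded = record["confirmed_on_pallet"]
    # A repack delays loading; adjust by the recorded repack duration.
    if record.get("repack_hours"):
        loaded += timedelta(hours=record["repack_hours"])
    return loaded

# Invented example: a shipment that was repacked for six hours.
shipment = {
    "confirmed_on_pallet": datetime(2015, 4, 1, 9, 0),
    "repack_hours": 6,
}
print(derive_loaded_on_vehicle(shipment))  # 2015-04-01 15:00:00
```

The derived field belongs to the downstream project, so any mistake in the derivation is clearly the downstream project’s fault, not the source’s.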
The point of my little analogy is that someone working in a downstream data warehouse (such as my project) may have a misconception of what a field actually means in the context of the source. As a result, that downstream analyst may misuse data in a way that discredits the source through no fault of the source at all.
To address this problem, I proposed a different approach where the source never releases its data at all. This approach declines to contribute bulk data to a downstream (or master) data warehouse. The analysts working with the data warehouse may still have access to information from the source, but they will need to send their requests to the source instead of making the more convenient retrieval from the data warehouse.
In this approach, the data will stay at the source. The source will provide the data transaction service and the necessary processing to handle the requests. The source will retain full control over the request-processing design so that it is consistent with the source’s expert understanding of what the data means. The source will also control what types of transactions are permitted and what results to return. For example, the data source can run quality-control checks before delivering transaction data. In this model, the source reserves the right to delay delivery when errors are discovered, in order to correct those errors before the consumers ever see the results.
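A minimal sketch of such a source-side transaction service might look like the following. The data, the allowed transaction type, and the quality rule are all invented for illustration; the point is that the source answers narrowly defined requests with its own processing, rather than shipping bulk data:

```python
# Invented source records; one has a known data-quality problem.
SOURCE_DATA = [
    {"shipment_id": 1, "status": "on_pallet", "weight_kg": 12.0},
    {"shipment_id": 2, "status": "on_vehicle", "weight_kg": 7.5},
    {"shipment_id": 3, "status": "on_vehicle", "weight_kg": None},  # known error
]

# The source, not the consumer, decides which transactions are offered.
ALLOWED_REQUESTS = {"count_by_status"}

def handle_request(request_type):
    if request_type not in ALLOWED_REQUESTS:
        raise PermissionError("transaction type not offered by this source")
    # Quality control before delivery: the source withholds records it
    # knows to be faulty rather than let consumers misinterpret them.
    clean = [r for r in SOURCE_DATA if r["weight_kg"] is not None]
    result = {}
    for r in clean:
        result[r["status"]] = result.get(r["status"], 0) + 1
    return result

print(handle_request("count_by_status"))  # {'on_pallet': 1, 'on_vehicle': 1}
```

A request for anything outside the offered transaction types, such as a bulk dump, is simply refused, which keeps interpretation of the data under the source’s expert control.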
I proposed the model for enriching data at the source as a means for the source to protect its own data by performing all processing of that data itself. The source never releases its bulk data to where it may be potentially misused. Recently, I recognized that this may be a useful approach to protecting privacy.
Typically, big data stores work with private data in bulk, using some form of anonymization algorithm to protect the privacy of the data. Even though the data appears anonymous, delivering this data in bulk permits some future analyst to discover (perhaps accidentally) a way to process it that recovers private information. See this puzzle for a description of a process resembling how accidental discovery can happen in multidimensional queries when bulk data is available.
An alternative approach to privacy protection is for the data owner to keep his data in his own health vault (example) and retain full control over how to release that data for any purpose (also described here). For the purpose of healthcare, the provider would have temporary access to the vault and would record any new patient data in the vault. For the purpose of analytics, analysts would need to submit a narrowly defined request with justification and a commitment for how they will use the data.
I’ve been focusing on the healthcare example because that has been a theme of many of my recent ramblings. For blogging purposes, I feel more comfortable projecting my lessons learned in data onto a field I don’t know much about (health care data). I don’t think I am offering anything of value to health care. Using it as an analogy motivates me to think harder about my own experience. I could have used any other type of data to achieve the same objective, and likewise the discussion could apply to any type of data.
Instead of building a big-data economic ecosystem, we could be building a data vault type of economy. The technology for big data (essentially cheap limitless storage and matching query power) could as easily enable data sources or data owners to do what they could not do previously: perpetually save all of the data they generate and make the data efficiently retrievable.
Private data vaults could end up looking like big data, with volume, velocity, and variety. The distinction is that the vault is private and not immediately available for population-wide analytics. The data owner retains full control over his data. If he wishes, he can offer the data to external parties. He could decide to follow current practice and volunteer everything for free. He could instead offer a transaction service allowing limited, approved access to certain types of data, for example my concept of data enrichment services using the data owner’s facilities. In the latter option, he could offer data enrichment services for a fee, just like the commercial data providers in the Azure Marketplace.
Each person (generalized to include individuals or state-granted charters) can set up data vaults for all of the data they generate. They can then offer transaction services to everyone who wants that data. People wanting vault data may include journalists, researchers, marketing analysts, government bureaucracies, etc. In each case, they do not own the data unless they can produce a signed receipt showing that they obtained it from an approved transaction. There could be piracy laws requiring signed attestations that a vault contains no unauthorized data from other parties. The vaults would be subject to audits to assure that they contain only privately generated data or data legally obtained from others.
The data vault approach may generate a new source of wealth or income in the economy because the external transactions may involve a fee. If someone wants data from another’s private vault, they need to pay or barter for it. Unlike the present big data ideals, the data is not lying around for free and unlimited exploitation. The data owner retains ownership of the data for the lifetime of that data. Each new user of the data has to show he obtained it with the owner’s approval. That approval involves a receipt. That receipt may require a transfer of money.
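The signed-receipt idea can be sketched with ordinary cryptographic tools. In this hypothetical fragment (the key, the consumer name, the dataset name, and the fee are all invented), the vault owner signs a record of each approved transaction, and an auditor holding the owner’s key can verify any receipt a downstream party presents:

```python
import hashlib
import hmac
import json

# Hypothetical signing key held by the vault owner (and shared with auditors).
OWNER_KEY = b"owner-secret-key"

def issue_receipt(consumer, dataset, fee):
    """Owner approves a transaction and returns a signed receipt."""
    body = json.dumps({"consumer": consumer, "dataset": dataset, "fee": fee},
                      sort_keys=True).encode()
    sig = hmac.new(OWNER_KEY, body, hashlib.sha256).hexdigest()
    return {"body": body.decode(), "signature": sig}

def verify_receipt(receipt):
    """Audit: confirm the receipt was signed by the data owner."""
    expected = hmac.new(OWNER_KEY, receipt["body"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])

receipt = issue_receipt("marketing_dept", "opinions_2015Q1", fee=25)
print(verify_receipt(receipt))  # True
```

Any tampering with the receipt body (say, the fee) invalidates the signature, which is what makes the audits mentioned above enforceable. A real scheme would use public-key signatures so that anyone could verify without holding the owner’s secret; the shared-key version here is just the simplest sketch.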
In effect, the data vault approach creates monetary value out of private data. In a data-driven economy, this could be a significant source of wealth and income for all persons. It may be the only source of income for some people.
A present day analogy to receiving income from privately owned data is the intellectual property economy of patents and copyrights. The owners of intellectual property have legal claim to exclusive use of their intellectual property unless they expressly agree to transfer that privilege to someone else. Usually that transfer involves a payment. Sometimes, the payment can be the sole source of income for a person.
In the data vault approach, everyone owns his own data. That data may be anything about his person. Healthcare provides a vivid example because it is easy to list various healthcare related observations. But the data can include other things such as the person’s interests, opinions, preferences, achievements, physical appearance, possessions, etc. All of this information can be treated just like intellectual property. In effect, the person has immediate copyright of anything about his life. He would have exclusive right to use any data about himself. Anyone else using that data would need to obtain his permission through some type of transaction. That transaction may involve an exchange of funds.
Private data is not a commodity whose price market forces can drive down through over-supply. Each individual’s private data is unique to that individual. One person’s private data cannot be substituted for another’s. I alluded to this earlier in my discussion about the value of collecting first-person accounts to complete the picture of the community.
In that discussion, I gave the example of how first-person accounts enable big data technologies to help manage epidemics. The understanding of how epidemics emerge and spread provides a great example of the non-interchangeable value of individual private data. Each case of contracting the disease is unique. So is each case of exposure to the disease without contagion. There is no market competition for first-person stories. Each story is uniquely valuable to the project of understanding the spread of an infectious disease.
Each individual case represents a monopoly on that private data. Obtaining that data may command a high price even though the owner of that data is part of a large population of poor people. Each individual has a monopoly on information about his personal experiences involving the epidemic. For big data analytics to offer a valuable contribution to the fight against the epidemic, it needs a broad sample of individual experiences with the disease. There is a market demand for the monopolist’s product: an accounting of personal experience.
The epidemic example vividly illustrates all policy making that occurs in a dedomenocracy. Any data-driven policy needs access to individual data. If the individual inherently owns that data, the dedomenocracy will either have to buy that data or take it by force. I argue that the best option is to buy the data.
Dedomenocracy makes purely analytic decisions for optimizing overall public good. Historical evidence supports the conclusion that everyone benefits when there is a large population of well-compensated wage earners: a large middle class. My optimistic view is that a dedomenocracy will naturally choose a policy to compensate people for their data instead of taking that data by force. The citizens of a dedomenocracy will all receive a steady income from the government. That income is compensation for providing the government their private information.
A pessimistic view is that this compensation can spiral down to a culture of poverty. A government that pays most people primarily for access to their private data appears indistinguishable from a welfare state that pays people primarily for the fact that they are alive. What distinguishes a dedomenocracy from a welfare state is the explicit recognition of the economic value of personal private data.
Unlike the common poor of a welfare state, a citizen of a dedomenocracy has a monopoly on his private information. A dedomenocracy needs private data in order to operate. Consequently, it will pay people for access to their monopolies on their private data.
In terms of money transfer, both the welfare state and the dedomenocracy operate the same way: the state pays nearly everyone living. In a dedomenocracy, the rationale for the transfer of money is access to data instead of merely the humanitarian provision of necessities. This economic exchange model could allow the dedomenocracy to avoid creating a culture of poverty. People have an incentive to accumulate unique private data that makes their data more attractive for certain purposes.
Recently, I have been paying attention to academic publishing for various reasons. One of the areas I find interesting is how most of the work for publications (writing the article, peer review, and editing) is provided freely, usually volunteered from personal time outside of normal working hours. The work is volunteered freely as part of an academic culture that prizes careful publication to build the enduring body of knowledge. The reason I bring it up here is that publication and review are an analogy for giving away data (the knowledge of the writer and reviewers) for free when the contributors could instead require compensation for that same data. Recently I read this blog post discussing the controversy over publishers offering a premium service that assures faster publication by compensating peer reviewers to return reviews within a set time. That blog post argues against compensated peer review for faster turnaround because it may skew the reviewer pool toward younger, less experienced reviewers. I don’t have experience with the quality of the review process, but I think this controversy illustrates an opportunity, lost to centuries of tradition, to sell data (the knowledge and expertise of peer review). Peer reviewers could offer their service for direct compensation, paid by the word or by the hour. Likewise, the publishing author could instead receive compensation by word or page count.
It is easy to imagine the unique value of the privately held data (knowledge) of academic experts. For the specific project of publishing academic papers, they give this data away for free. I argue that this unique value is common to all people.
Everyone has private information for which they could demand compensation, but they usually give it away for free. For example, today I received a survey from my local government asking my opinion on various local services. As I filled out the form, I recognized that this was a transaction where I was giving away my monopoly on my opinions. I volunteered this information, as most people do, without thinking much about it. The fact is that the local government has no other option for finding my personal opinions. I have a monopoly on my personal opinions as well as on my skills and experience. Just as academics could demand payment for their publication services, I could demand payment for providing my opinions to a survey form.
The survey form is also illustrative of a data project that strives to compile responses across the entire community for some analysis that will prepare some pretty charts for a future presentation. I do not doubt the sincerity of the attempt, but I doubt that the survey will have much impact on the future direction of the local government. There are too many constraints limiting their options to do much of anything other than what they are currently doing. Another problem with the survey was that there is really no way to reduce my experiences to 5-point-scale evaluations of 2-3 word phrases describing certain government functions. For example, one question asked about my satisfaction with policing: my answer of satisfaction comes from the fact that I’ve had no interaction with police, but another person may express the same satisfaction as a result of a direct interaction with police. Missing from the question was how I came up with my answer. Ultimately, I imagine, the goal is to measure overall public satisfaction with the police department based on survey data instead of discussing the question directly with individuals. I commented in a prior post on this disconnect of seeking an answer from salvaged data (the survey is an example) instead of asking me the question directly. They are probably seeking an answer to a more specific question than what would fit on a printed survey form of dozens of questions.
In my fictional concept of an entire government run by data instead of politicians, a government I call a dedomenocracy, all policy decisions must be based on actual observed data. For my fictional version of a dedomenocracy, I chose a pure form that excludes any human theories: all decisions will trace to observed data processed through a select set of general-purpose, objective statistical algorithms. The concept is that this government will have access to a tremendous amount of data, far more than we have even today, so that the statistical algorithms will infer the laws of nature from the available observations instead of from hard-wired algorithms.
The key to dedomenocracy is access to data. The success of the government depends on its ability to obtain sufficient data to avoid making mistakes. One key part of the data is the opinions and human knowledge of the citizenry. The government needs the citizens to provide the government their opinions, especially those opinions that evoke strong emotions.
In an earlier post, I described the need for a dedomenocracy to measure urgency so that it can select the optimally few policies that can pragmatically be in force at any time. My fictional concept of a dedomenocracy as a benevolent government that is mostly libertarian and non-coercive requires an accurate and timely assessment of the full range of sentiments of individuals across the vast majority of the population. The government has the means to coerce people to give this information, but when that happens, I would claim it is no longer the ideal government I had in mind. On the other hand, if the government does not obtain this information, it will be vulnerable to a destabilizing revolt where it does not enjoy sufficient super-majority support to counter a rebellion by a minority.
The enduring stability of the dedomenocracy depends on its ability to collect nearly complete data about the attitudes of its population. One way it will obtain this sentiment data is through surveys similar to the one I mentioned above, but the surveys would occur far more frequently and have much more precisely worded statements to eliminate ambiguity. Unlike the democratic example, which can work with a statistical sample, the dedomenocracy needs near-universal participation in the surveys in order to come up with the best policies. The continued success of a dedomenocracy depends on its ability to make policies that avoid eroding super-majority consent and avoid inflaming minority dissent.
In a democracy, the goal of the survey is to establish plurality and simple-majority opinions. The vagueness of the questions is suitable for these goals because it makes it easier to identify the dominant opinions as they relate to democratic elections.
The goal is substantially different for a dedomenocracy. For one thing, the dedomenocracy is attempting to derive the actual policy from the data. The democratic survey gives direction to human policy makers, while the dedomenocracy survey will produce actual policy without human input. The other reason a dedomenocracy needs near-perfect participation is to maintain the super-majority approval that can overwhelm any minority of dissenters.
In our current democracy, the goal of opinion surveys is to measure the most common opinion of a plurality or of the majority. Democracies can get away with sampled polling and simple majorities. By contrast, a dedomenocracy needs comprehensive polling in order to obtain high confidence in its measurement of minority opinions. Minority opinions are essential for calculating the urgency of a policy that punctuates the libertarian status quo. A dedomenocracy needs accurate measurement of the attitudes of the minority groups capable of staging a destabilizing protest or rebellion.
This sets up the need for a new kind of economy in a dedomenocracy that is reluctant to coerce opinions. The dedomenocracy will need to compensate people for their opinions. In other words, the government will provide the population with a steady income for doing work indistinguishable from that of opinion columnists in magazines or story-tellers publishing books. However, unlike the current economy, where only a few celebrity opinion-leaders can make money, the dedomenocracy pays everyone for their opinions.
This looks like a welfare state. People get money from the government for the mere fact that they are alive. I argue that it is very unlike a welfare state because the dedomenocracy is getting something of high value from the population. That value is a comprehensive accounting of everyone’s opinions of current conditions and directions.
Within this opinion economy there are market forces. The more valued opinions will be the ones that are least conforming or most dissenting. A dedomenocracy needs the nonconforming or dissenting opinions in order to measure urgency for policy matters. If a population of dissenters grows significant enough (10-20% of the total population), they could stage destabilizing protests or rebellions that would be challenging for the state to quell. This means there will be more monetary value in nonconforming or dissenting thinking.
People will quickly learn that their dissenting opinions have monetary value, and they can bargain for higher compensation to release them. Similarly, people will learn that there is more money in dissenting than in conforming. A topic for a future post is to work out the implications of this economy, but my opinion for now is that it is beneficial. In exchange for the higher expense of paying for dissenting opinions, the government will be rewarded with a better overall economy through the broader contribution of disruptive ideas.
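One toy way to picture this market force is to pay each respondent in inverse proportion to how common their answer is, so that rare (dissenting) answers earn more. The opinions, base payment, and pricing rule below are all invented for illustration:

```python
from collections import Counter

# A toy population of survey answers: mostly conforming, a few dissenting.
opinions = ["approve"] * 8 + ["disapprove"] * 2

counts = Counter(opinions)
total = len(opinions)
BASE_PAYMENT = 10.0

def payment(opinion):
    # Rarer answers carry more information for measuring urgency,
    # so price them higher: pay inversely to the answer's share.
    share = counts[opinion] / total
    return BASE_PAYMENT / share

print(payment("approve"))     # 12.5  (8/10 of the population)
print(payment("disapprove"))  # 50.0  (2/10 of the population)
```

Under this toy rule, the 20% minority earns four times what the conforming majority earns, which captures the incentive I am describing: dissent is literally worth more to the state that needs to measure it.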
The original title for this post was to be about balkanized data. In contrast to big data, balkanized data stays sealed and protected at the source. Large-scale analytics or policy making needs to engage in specific transactions with data owners to obtain the data needed for each specific analysis. This type of economy denies centralized corporations or governments access to volunteered releases of bulk source data. Instead, the centralized entities need to negotiate terms for each transaction. The terms will describe precisely what they want so the source can deliver the most appropriate answer. That negotiation will involve a monetary exchange with amounts proportional to the requester’s eagerness for that data. The result is the creation of a new economy analogous to intellectual property, but where the property is simply personal data. Data is property. Owners will soon wise up about giving it away for free.