If someone wants to cause trouble for the big data owner, they can leverage the known missing data to raise accusations against which the owner will have no data to use in defense. The accusations can suggest cheating, fraud, criminal activities, etc., that can harm reputations or invoke costly and lengthy investigations that prevent the owner from realizing the potential benefits of the big data analytics.
Data deception is a concern for automated decision making based on data analytics (such as in my hypothetical dedomenocracy). I think it is already a concern with our current democracy. I fear the current enthusiasm for data technologies because I do not see much appreciation for the possibility of deception. There is a huge confidence in the combined power of large amounts of data and sophisticated statistical tools (such as machine learning). Missing from our consideration is how well the data actually captures the real world. The data is not necessarily an honest representation of what is happening in the real world. It is very possible that the data may include deliberate deception.
I’m describing this as the security of the datum instead of the data. Specific observations are vulnerable to exploitation instead of everything observed by sensors. The malware is in the population being observed instead of in the IT systems.
To combat this kind of problem, we are going to need an additional approach of datum governance to protect the observed population from deliberately inserted biases.
The enthusiasm for the benefits of big data comes from widely promoted reports of past successes. The promise of big data techniques is that they can provide similar successes in other contexts. Big data involves volume, velocity, and variety. The volume and velocity depend on automated queries and report building. The variety introduces the opportunity for new benefits. The combination of automation and opportunity from variety is what makes re-identification possible or even very likely.
Oral story telling was the original big data. The various oral stories were saved in persistent memory and captured a large volume and variety. The invention and adoption of written works displaced the oral tradition, and that brought an end to that earlier big data. In this sense, our current excitement about big data may be a rediscovery of a capability available to our ancient ancestors. Big data and the oral story telling tradition both offer inexpensive and durable means to manage a large number of distinct and very individualized stories. In the modern era, we are rediscovering the need to collect individual stories and thus granting them the ability to circulate, much as they did in the preliterate society of oral story tellers.
In addition to the classic challenge of new data potentially disproving an old theory, the modern reality of practical data technologies makes possible decision making based on data alone without any need for human cognitive theory to justify the decisions.
My recent posts have promoted the concept of dedomenocracy as a legitimate form of government. In many of those discussions, I assumed that it was a futuristic form of government because data technologies are not yet sufficient to automate normal government decisions. I’ve noted that we are already seeing the authoritarianism of data in health care. A similar case can be made concerning metropolitan area decisions based on weather forecasts. In these areas, we are experiencing automated decision making where humans are given no choice but to follow the recommendations from data (and in particular simulation models).
In modern data science projects with automated data collection and analytics, the hypothesis-discovery occurs at the beginning of the process. The modern decision maker participates at this early stage of the process to select discovered hypotheses that are self-evidently persuasive. The subsequent data collection and analysis that support such a hypothesis will lead to a simple decision that does not require any last-minute invention of a story to earn the decision maker's approval. After the decision, additional invented stories will serve only the purpose of illuminating the underlying non-fiction of the data and analysis.
This public release paper from MITRE describes a tool its authors developed that adjusts the windows of time allocated for certain operations on the ground and in the immediate airspace of an airport. In particular, this tool strives to reduce the already rare occurrences of near collisions of arriving and departing aircraft through the use…
To support the decision maker, the data scientist (the student of data itself) needs to anticipate the doubts of the decision maker. The data scientist needs to proactively challenge the data itself for possible doubts about its authenticity, accuracy, and relevance.
Entertaining doubts is indistinguishable from skepticism and pessimism. This is a virtue for data science.