In an earlier post, I suggested a new form of government by data instead of by people. In my attempt to coin a new word, I added the Greek word for data to democracy and accidentally came up with dedodemocracy, implying a different concept of democracy by data. My intent might have been better captured by dedomenocracy, where data is fully in control, but now I prefer the hybrid meaning.
In many other posts, I elaborated on the need for the general population to have data science skills (or technologies) so they can query big data themselves, maintain control over their lives, and contribute to democracy and society as a whole. Increasingly, our lives are influenced by data-driven analytics, often without our knowledge. To maintain their independence, the general population needs access to the same data and tools so they can see for themselves how data may be influencing their lives.
In particular, I complained about how big data is eroding the accountability of leaders, who are forced to follow incomprehensible big-data recommendations. Decisions impacting our lives increasingly have no human accountability: no one to whom we can address grievances, from whom we can obtain satisfying explanations, or with whom we can negotiate some modification for relief. As a result, people will increasingly demand more influence over the data, either by accessing the data themselves or by demanding that other data be considered. A peaceful progression toward data-driven decision-making involves an increasingly data-literate public. Without competent access to data, society may be more vulnerable to less peaceful reactions.
In my more recent post, I explored a different topic that came from my earlier frustration that big data is not offering much value in the current Ebola crisis, primarily because of a lack of data. In that post, I saw an opportunity to enlist journalists more directly into the big data project. We need journalists to uncover new data to add to data stores so that big data analytics has enough material to contribute meaningfully to an immediate issue like the Ebola crisis. Currently, we use journalists inefficiently for data projects because they are paid to write human-engaging narratives. To use their contributions in data projects, we need text-analytic algorithms to interpret their narratives for data nuggets. It would be far more beneficial for both big data and the journalists to skip the narrative part and directly enter structured data of their findings. Unfortunately, we currently do not have a way to compensate them for this effort.
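To make the "data nuggets" idea concrete, here is a minimal sketch of what a text-analytic step over a journalist's narrative might look like. The pattern, field names, and example sentence are all illustrative assumptions, not a real extraction system; production work would use far more robust methods.

```python
import re

# Hypothetical sketch: turn a journalist's sentence into a structured record.
# The regex and field names are assumptions for illustration only.
PATTERN = re.compile(
    r"(?P<count>\d+)\s+new\s+(?P<disease>\w+)\s+cases\s+in\s+(?P<place>[A-Z]\w+)"
)

def extract_record(sentence):
    """Return a dict of data nuggets found in a narrative sentence, or None."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None

record = extract_record("Officials reported 42 new Ebola cases in Monrovia on Tuesday.")
```

The point of the sketch is the asymmetry: a journalist entering `{"count": 42, "place": "Monrovia"}` directly would make this fragile interpretation step unnecessary.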
If we are indeed heading to a new form of government by data, then people will need more direct access to the data that is influencing their lives. At the same time, journalist narratives about leader opinions or policy-making are becoming irrelevant because of the erosion of human-accountability due to incomprehensible data-driven analytics. We need data, not narratives. We need to transform the concept of the morning paper from a bunch of stories to a bunch of data.
Journalism provides the fresh content for morning papers and other periodicals. Journalism would be more relevant if it provided structured or tagged data instead of entertaining narratives. The morning paper of the future is something that provides access to analytics and visualizations of the latest data, including the latest journalist observations.
My personal experience approached data from a technical background, so I interpret data science as a specialty of either computer science (developing data-intensive algorithms) or mathematical/statistical analysis. When I view the progress of computer science and mathematical analysis over the past several decades, the current state of data science seems like a natural consequence of earlier initiatives. Modern data science is the same project of mathematical analysis and computer science we’ve always been pursuing, but now with more data and faster technology. Although analysis and computer science always had an impact, they were mostly conducted out of sight of the popular media. For example, a data-informed decision typically appeared in popular media in a way that drew attention to the decision-maker, but now we are paying much more attention to the analysis.
This view of data science misses the fundamental recent shift that made data science a topic of popular discussion. This shift in public perception is a consequence of the growing appreciation that data will be increasingly important for everyone’s lives.
After writing the recent posts, I have come up with an alternative interpretation of modern data science as more properly belonging to journalism instead of computer science. The tools and technologies grew out of computer science and that growth was due to specific investments in computer science. But the way we use data is more like we use the products of journalism. We want data to tell us what just happened.
The computer-science origins of big data are similar to the way the original printing presses came out of advances in mechanical machinery. The availability of abundant and cheap published material fundamentally changed the way people governed themselves. The popular perception of the printing press was this abundant availability of written material, especially in the form of the newly introduced trade of journalism, tasked with discovering new material to publish in periodicals and newspapers. Printing press technology continued to improve, but the popular attention was on the new information being published, not on the machines doing the publishing.
Data science technology is the new printing press. While there remains a need to continue improving the technology, the popular demand is for fresh and relevant content. As I suggested in my posts about the Ebola crisis, data science needs someone to go out and collect fresh, relevant data so we can be better informed. Technical expertise in implementing new algorithms cannot substitute for missing data.
I would argue that all of data science should be seen as journalism rather than as a STEM field. In earlier posts, I outlined where labor is required in a data science project. Consider a division of labor for a data project into three parts:
- data collection and cleansing,
- data processing, analytics, and visualization,
- and storytelling.
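The three parts above can be sketched as a toy pipeline. The stage names and the tiny sample data are my own assumptions for illustration; real projects would spend the bulk of their effort inside the first function, as the next paragraph argues.

```python
# Illustrative three-stage pipeline matching the division of labor above.
# All names and the toy data are assumptions for this sketch.

def collect_and_cleanse(raw_rows):
    """Stage 1: drop malformed rows and normalize fields (most of the labor)."""
    cleaned = []
    for row in raw_rows:
        if row.get("value") is not None:
            cleaned.append({"source": row.get("source", "unknown"),
                            "value": float(row["value"])})
    return cleaned

def analyze(rows):
    """Stage 2: processing and analytics -- here, a one-number summary."""
    values = [r["value"] for r in rows]
    return {"count": len(values), "mean": sum(values) / len(values)}

def tell_story(summary):
    """Stage 3: storytelling -- a human-readable sentence from the numbers."""
    return f"Across {summary['count']} reports, the average value was {summary['mean']:.1f}."

raw = [{"source": "a", "value": "3"}, {"source": "b", "value": None}, {"value": "5"}]
story = tell_story(analyze(collect_and_cleanse(raw)))
```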
In my posts, I emphasized that the first part stubbornly consumes most of our labor. Even as we attempt to integrate new data sources, we struggle to maintain old data sources so they deliver timely, accurate, and relevant data. In other posts (such as here), I described the third part in terms of the role of storytelling in the success of data projects. In this breakdown of data science, the first and last steps overlap with journalist skills and practices. While the middle piece may be considered more exclusive to a STEM profession, it is really analogous to a printing press, subordinate to the journalistic tasks of collecting data and constructing stories.
When we consider the increasing importance of data for the general population to participate in the economy, their work, and their government, it makes sense to consider where they will get their data. Historically, these activities got information from journalism. By analogy, we can consider that they will continue to get information from journalism. From the perspective of the popular demand for information, data science can be seen as part of the journalism trade.
Currently, we think of data science as primarily a specialization within the STEM field (with very demanding qualifications, such as outlined here). However, many people employed in data science work almost exclusively with data about humans, such as noted here. Historically, we consider a journalist’s job to be primarily about getting information about humans or through humans. Although data science requires the complex machinery of software, this software is equivalent to a printing press (also complex and mechanically tedious equipment). The bulk of a data science project involves first collecting and scrutinizing the information and finally constructing a story that will appeal to the primary audience. These two essential skills are more readily found within journalism than within STEM.
STEM-trained data scientists are like the ancient printers. Despite abundant evidence that arbitrary printed material does not sell itself, and that certain authors or content sells better than others, they assume the primary value is coming from the printing press. Data is not inherently valuable. Data projects need someone to obtain the right data and someone to construct a comprehensible story to sell it. Relevant data projects need journalist talents.
The new reality of data-driven decision making makes it important for people to understand the data behind the decisions that are impacting their lives. In an earlier post, I described a specific case where just-in-time workplace scheduling causes chaotic lives for workers who cannot be certain of when their next work hours will be. One way to regain control over one’s life is to have access to the same data in order to predict for oneself how the future schedule might turn out. This will require employers to allow employees to access this data (something that currently is not readily permitted). But it will also require that employees be able to do the data-science equivalent of reading. Although this example was specific to the workplace (and retail work, in particular), I think similar innovations are occurring in all aspects of the economy and in government. We are entering an age where data recommendations will obligate both decision making and obedient participation. To cope in this data-driven world, we need to redesign our education system to emphasize data skills comparable to the current emphasis on reading, writing, and arithmetic skills.
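As a sketch of that "data-science equivalent of reading," consider an employee guessing next week's shifts from their own schedule history. The data format (a weekday-to-shift mapping per past week) and the frequency-based guess are assumptions chosen for simplicity; real scheduling software would be far more opaque, which is exactly the problem.

```python
from collections import Counter

# Hypothetical sketch: an employee predicting next week's schedule from
# their own history. Data format and method are illustrative assumptions.

def predict_next_week(history):
    """For each weekday seen in past weeks, guess the most frequent shift."""
    by_day = {}
    for week in history:
        for day, shift in week.items():
            by_day.setdefault(day, Counter())[shift] += 1
    return {day: counts.most_common(1)[0][0] for day, counts in by_day.items()}

past_weeks = [
    {"Mon": "9-5", "Fri": "4-10"},
    {"Mon": "9-5", "Fri": "off"},
    {"Mon": "9-5", "Fri": "4-10"},
]
prediction = predict_next_week(past_weeks)
```

Even this crude frequency count gives the worker a testable expectation, which is the beginning of accountability.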
Expanding upon the above analogy of the journalist collecting data and preparing data stories, there is a mass audience for this data. The general population will demand timely and comprehensive data with appropriate information to permit further research when a topic specifically interests an individual. Like newspaper readers, they will challenge the publisher on matters of fact-checking or of fair presentation of all relevant information on a particular topic. In the future I see coming, we may be obligated to follow directions from data analytics, but we will demand accountability for the adequacy of the data that went into those analytics. Instead of taking the case to an impotent decision maker, we will take our issues to the providers of the data. Again, this is more analogous to the journalist’s role in direct interaction with his audience than to the typically reclusive STEM-trained scientist.
Recently there has been a growing interest in the data lake concept as advanced (such as here) and debated (such as here and here). The concept seems to have the attention of planners of enterprise data systems. In contrast to the legacy practice of purpose-built databases or data warehouses that require extensive front-end data governance and cleansing, a data lake accepts all data immediately and makes it immediately available to all. The data lake includes both data and applications to interpret that data. The goal is to give end analysts direct access to all data-related applications in order to meet their immediate analytic needs.
In an earlier post, I discussed my own experience with data projects and with the commercial SIEM tools that appear to be precursors to the data lake concept. The lack of strict data governance and standards results in redundant and potentially conflicting data in the data store. In contrast to the data warehouse’s goal of providing a single version of truth, data lakes will provide multiple conflicting versions of the truth. Multiple versions of the truth can lead to contentious debates that we typically want to avoid. We believe there can be only one version of the truth: the one that accurately presents the real world. The problem is that data will always be a poor witness to reality. The quality of a particular data point depends a lot on the context that requires it. I’m inclined to believe that it can be beneficial to resolve the relative correctness of conflicting data in the context of a specific high-stakes decision. Our quest for a single version of truth runs the risk of prematurely discarding data that merely is less trusted than another piece of data at the time of ingest. We may find the discarded data to be more relevant when considering its impact on a high-stakes decision. Building the single version of truth can be an unintentional form of cherry-picking that happens to reinforce a decision that might not look so good if we instead had access to the discarded data.
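The argument about deferring conflict resolution can be sketched in a few lines. Everything here is an illustrative assumption (the records, the sources, the trust rankings): the point is only that keeping both conflicting records lets two different decisions rank the same sources differently, instead of one ranking being baked in at ingest.

```python
# Sketch of resolving conflicting data at decision time rather than ingest
# time. Records, sources, and trust orderings are illustrative assumptions.

lake = [
    {"field": "population", "value": 10200, "source": "census"},
    {"field": "population", "value": 9800,  "source": "survey"},
]

def resolve(lake, field, trust_order):
    """Pick the value from the most-trusted source for this decision."""
    candidates = [r for r in lake if r["field"] == field]
    candidates.sort(key=lambda r: trust_order.index(r["source"]))
    return candidates[0]["value"]

# Two decisions rank the same sources differently; neither record was
# discarded at ingest, so both answers remain available.
for_budget = resolve(lake, "population", ["census", "survey"])
for_outreach = resolve(lake, "population", ["survey", "census"])
```

A single-version-of-truth warehouse would have committed to one of these values up front, which is precisely the cherry-picking risk described above.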
At the enterprise level, there is a trend to develop what are called data lakes. The concept of the data lake is to have a central repository of all data for the entire enterprise. It is my impression that the data lake model will be most relevant for much larger and more complex organizations with many nearly independent operations. By extrapolation, it may be the best model for making data available in the data-driven future that depends on a data-skilled (data-literate?) population. When made available to the general public in the form of open data, the data lake concept appears analogous to our historic experience of having access to multiple newspapers and periodicals offering different assessments and opinions on issues of immediate importance or interest. In the context of describing data science as a form of journalism, the data lake is analogous to a newsstand or a library.
In the modern era of data-driven decision making, we need an equivalent to the daily newspaper, but one that presents data instead of human-interest stories. As with a daily paper, we want the latest data, specifically everything of interest that happened the previous day. Long ago, when I was working on a data project, I explicitly used journalist concepts to describe our daily tasks. Although our data was machine-generated and concerned only matters related to the optimal performance of machines, I had a daily requirement of reporting everything that happened the previous day for a worldwide distribution of machines. I often compared my task to the task of publishing a newspaper. For example, I had a specific deadline to distribute the published report of what happened the previous day. The concept of publishing was that once it is published, I can’t take it back; at best, all I can do is issue a correction. Once published, that version will remain available because someone may already be working with that data. Outside of the initial task of final-checking the previous day’s copy prior to publication, the bulk of the current day’s activity was to prepare the next day’s report. I explicitly invoked the imagery of a pulp-version newspaper operation to explain the sequence of tasks and to motivate the staff to perform the tasks at the times needed. Many other data projects probably use the same imagery for their tasks over different time intervals (such as hourly or monthly). My point in mentioning it here is that I explicitly used this newspaper-operation metaphor in delivering the data products this project demanded. With that background, it is not surprising that I would be inclined to see a parallel between a data science project and journalism. What is surprising is that it took this long to conceive of data science as a special form of journalism.
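The publishing discipline described above (deadlines, no retractions, only corrections) can be modeled in a few lines. The class, dates, and report text are invented for the sketch; the behavior it encodes is just the rule from the paragraph: a published report is immutable, and errors are fixed by appending a dated correction.

```python
import datetime

# Toy model of the newspaper-style discipline described above. All names
# and sample text are illustrative assumptions.

class DailyPublisher:
    def __init__(self):
        self.archive = {}   # date -> list of (kind, text); append-only history

    def publish(self, date, text):
        if date in self.archive:
            raise ValueError("already published; issue a correction instead")
        self.archive[date] = [("report", text)]

    def correct(self, date, text):
        # The original stays available -- someone may already be using it.
        self.archive[date].append(("correction", text))

paper = DailyPublisher()
day = datetime.date(2014, 11, 3)
paper.publish(day, "Machines in region A ran at 97% capacity.")
paper.correct(day, "Region A capacity was 94%, not 97%.")
```

The same shape works at any cadence: swap the daily key for an hour or a month and the publish/correct rules are unchanged.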
Data science emerged out of technologies, but the ultimate value of a project comes mostly from skills belonging to the journalist trades: collecting and fact-checking data, building comprehensible stories from that research, and finally publishing to a deadline so people can learn what happened yesterday (or the last hour, or the last month).