Much talk in data science concerns various analytics of existing data. Initially we called this concept data mining. Although analytics and visualizations have expanded far beyond the initial concepts of data mining, I still find the term data mining more informative than analytics or visualizations. Perhaps my preference comes from my upbringing in an area with a history of coal mining and from my consideration of entering mining engineering. When I think of data mining, I conjure a vivid image of a dust-covered miner emerging from a mine shaft carrying a load of rich ore separated from the surrounding earth. Data mining is mining into a pile of data in order to extract the narrow seam of rich ore.
There is also an implicit ambiguity in the phrase data mining. One way to think of mineral mining is as a project of entering the earth to separate one type of earth (valuable mineral) from the other types. This is analogous to how we generally approach data today: we enter existing data to extract the valuable data from the surrounding (and more voluminous) worthless data. Frequently, we mine into existing data to find what we want. An alternative view of the same effort, however, is to think of mining as seeking out the valued mineral and the best method for its extraction. In other words, the ambiguity may be described as the difference between mining for a mineral and mining into earth. The same ambiguity should apply to data, but most of our attention goes to mining into data instead of mining for data.
Just as our ancestors found abundant earth to mine into to find something (anything) of value in the lands they controlled, modern data scientists have abundant data to mine into to find something (anything) of value in data they already possess. Generally, the task of the modern data scientist is to use available data to find valuable data that we can exploit in new ways. The value is there all along in the data we already own. We just need the miners to go in and extract it for us.
I think of historic examples where, although there was a preference for precious minerals like gold, such minerals were practically available only in certain areas. Meanwhile, people looked at the lands they controlled and found other minerals like copper, tin, or obsidian. Many times, people looked under their lands and found something of value, especially as long-distance trade became available. Taking advantage of what is available locally is one sense of mining, and it is the sense most commonly followed by data scientists who find new ways to exploit data the corporation already owns.
The alternative historical examples are those of the mineral prospectors who sought out where desired minerals would occur. In my mind, the most vivid example is the oil industry, which studies the entire globe to identify underground conditions most likely to contain retrievable oil. This kind of exploration occurs for many minerals, but those efforts do not get as much attention as exploring for oil. This sense of mining implies seeking out new areas where the desired mineral exists. In contrast to the first sense of mining one's own property, the exploratory miner must acquire rights to land that he supposes will contain the mineral he wants.
This latter sense of mining, exploration or prospecting for minerals, receives much less attention in the analogous data science activities. Data science invests most heavily in processing existing data. There is enough available data to keep data scientists busy exploiting it without needing to invest in acquiring new data.
In the earlier years of data analytics, the volume and velocity of available data were already a major challenge that required custom optimized algorithms to extract the value in that data. Since then, the volume and velocity have exploded with the increased automation of operational systems that provide a steady stream of new data. Even though new data becomes available for exploitation, the data scientist's role in identifying this data is relatively passive. The data becomes available as an accident of automating business operations. The data scientist may have some input in identifying additional observations to collect, but only in the context of an already approved investment in automation based on business needs.
There is plenty of data to keep data scientists busy. There is so much existing data that we hear alarming announcements of a shortage of skilled data scientists to do just this work. I assume the people employed at the task are overwhelmed with the challenges of working with the data they already have.
Back to my mining analogy: I recall hearing stories from my grandparents' generation about the boom period when mines operated 24 hours a day and people regularly spent half of their days in the mines. At that time, there was a labor shortage for miners because the mines employed much of the locally available labor. In order to work in a mine, one had to live close to it. Like the overworked data scientists of today, those miners were overwhelmed by the abundance of coal and the high market demand for it.
What makes this analogy powerful from my personal experience is that these were stories from my grandparents' generation. I spent my youth living on the same ground they did, but all of the mines were abandoned. The coal in mid-state Illinois was inferior, sulfur-rich coal. I also assumed the mines had run out of coal, although that may have been due to the mining methods of the time, as this article describes abundant coal still available to modern techniques. The image I'm referring to here is the one presented to me during a childhood surrounded by abandoned relics of a long-gone boom in mining the black gold under my feet. The current obsession of data science with existing data reminds me of exactly that same childhood impression of coal mining. Although data scientists are not covered inside and out with dust, I suspect many spend more than 40 hours per week trying to keep up with the demand for their skills. Also, the market is changing to provide tools that will eliminate these jobs as suddenly as my community saw the collapse of coal mining in the mid-20th century.
This recent blog from Oracle predicts that new tools will end the current demand for data scientists. Implicit in this article is the description of the data scientist's task as mining into data, just as my ancestors were miners into the property they lived on. I agree with this prediction about this aspect of data science labor. Most of what we call data science is actually a branch of computer science involving the skills to implement efficient algorithms to achieve some statistical goal. Inevitably these computer science projects spiral outward to find more generalized solutions, so that software can perform what once required custom solutions. At the same time, the available operational data is improving in terms of its consistency, conformity, reliability, and comprehensiveness. Sensors are becoming much more affordable, and new designs incorporate the expectation that operational data will be exported for analytic purposes. These trends combine to reduce the need for local data science talent previously required because the data was dirtier and algorithms had to be implemented from scratch.
The bursting of this aspect of the data science bubble presents an opportunity to redirect these resources to the other interpretation of mining, the one that involves exploration and prospecting for desired data we do not yet have. While established enterprises have a wealth of in-house data to support analysis, this data is naturally limited to what they have in house. Mining this local data inherently limits what can be discovered. At some point, a company will have no place to grow because it has exhausted the value in the data it owns. The result is like what I saw growing up, where former mining towns struggle on top of now worthless property. In order to gain new value, companies will eventually have to find new sources of information outside of their operations.
In recent posts (such as here) I discussed this problem in the context of addressing the Ebola epidemic. Those posts emphasized the lack of observational data about healthy populations before the outbreak of an epidemic. This lack of healthy-population data hampers our ability to exploit big data analytics to help us learn ways to control the spread of the disease, to identify its modes of transmission, and to predict where it will appear next. The big challenge for data science here is not mining into data (in this case, there is very little data at all) but instead mining for data: exploring or prospecting for new sources of data that we can exploit.
The Ebola crisis is one of many current urgent issues that need data that does not yet exist. In contrast to the highly publicized climate change issue, which is well funded for a multitude of new data collection initiatives, most crises have no data-collection efforts at all. In addition to Ebola, we have difficult issues of refugees, illegal immigration, various regional conflicts, drug-trade violence, and a variety of ways that existing institutions and governments are fragmenting into smaller units. There are many other examples of similar cases where we are reacting to events that surprise us. Both the surprise and the clumsy reactions are symptoms of a common problem: a lack of data. As I discussed in my Ebola posts, we have no data where we most need it: in the populations before the bad stuff happens to them.
Instead of having data in advance, we have to work backwards from specific cases to uncover the evidence we need. This is very labor intensive and in the end describes only that one instance.
More troubling is that, in order to streamline this investigation, we use our preconceptions as an excuse to limit the collection of new data. In a previous post I mentioned the example of a journalist cameraman falling ill with Ebola. Despite the unlikelihood that a western-educated journalist, and especially a cameraman, would have direct contact with Ebola, we are quick to assume this must have happened. If there is a hint that it did occur, we halt our investigation, confident that this must explain the particular case. In these individual case investigations, our preconceptions confirm themselves because it is too expensive to consider other possibilities.
In contrast, if we had abundant information about what led to all infection cases, of which this is just one example, we might observe a situation common across a group, and that common situation might suggest another route of transmission, perhaps even the actual route that explains their infections. We cannot make this kind of discovery with individual investigations because the cost in human labor is too high. When it comes to individual investigations, we stop at the first sign of an already accepted explanation.
At the start of this post, I suggested that the term data mining has an ambiguity, meaning either mining into data or mining for data. The concepts are not interchangeable and may be fundamentally different. Mining into data is like the miner digging into his own property looking for something he can sell. Mining for data involves going outside of what is currently possessed to find or to create new sources of data. As computer science and sensor technologies mature to provide more automation for mining into data, the demand for labor for this aspect of data science will decrease. There will still remain the need to mine for new data.
Perhaps mining for data is not really a data science task. When I started this post, I was thinking about the future of journalism. Over my lifetime as an outsider to journalism, I have observed the changes occurring in that profession as it strives to remain relevant to a new digital age that values fresh data but has little interest in paying for it. Although I realize the economic conditions for a career in journalism are not promising, I am pleasantly surprised to see that there remains at least some investigative journalism to obtain new information where that information was not previously available. This aspect of journalism is similar to what I call mining for data: obtaining valuable information that was not previously available. In most cases, investigative journalism resembles the scenario I described about investigating individual cases of a disease: the investigation pursues a particular subject until it finds a recognizable condition (often some crime or scandal). However, there have also been more exhaustive studies across entire populations to identify new ways to measure something important about those populations. The latter example of investigative journalism is more like what I call mining for data.
The inspiration for this post is the possibility that mining for data is more of a journalism task than a data science task. At the least, one is more likely to find data-gathering skills among those trained as journalists than among those trained as data scientists (especially not computer scientists). Although the data science community includes statisticians who do have skills for designing experiments that include obtaining data, that data collection is focused on a particular goal set in advance. In contrast, the journalist is more likely to collect observations without a set objective and piece those observations together later. This sense of collecting new types of observations for later interpretation comes closer to the goal of having data analytics deliver some surprising insight about the world.
In my earlier posts, I described the under-appreciated role of labor in quality-checking the data and the assumptions behind the algorithms. I described this as a central competency of data science, although modern usage of data science emphasizes the implementation of software to process data. The labor-intensive skills I was referring to may be better described as fact-checking the data. Fact-checking is a core journalism skill. Another journalism skill is to uncover new facts.
Modern data projects need to go beyond mining into already existing data and instead explore for new data, either to fact-check existing data or to exploit it in new ways. My recent discussions of big data and Ebola touch on both senses. We could use more data to fact-check the official claims of a limited means of transmitting the disease. We need far more data for detailed observations of the spread of the disease. This is typically a journalism task.
Unfortunately, journalism is currently constituted around an old, now outdated model of selling narratives. Initially, I didn't think of journalism as a branch of data science because I considered it to be primarily a field for writing narratives. Certainly, journalists are trained in writing that can get published. However, the writing is secondary to the tasks of collecting new facts and fact-checking others. The narrative is necessary to fill out a periodical that people are willing to buy. The narrative brings in the funding for fact collection and checking.
In recent years (decades), the market for narratives has declined. Fewer people regularly subscribe to content, and when they do, they get that content from fewer sources. To compete, the market for journalism has largely transformed to obtain revenue from advertisers instead of readers. This transformation pushed journalism to place even more emphasis on writing in formats that attract the most page views. In recent years, articles have attracted ridicule for click-bait titles (having little to do with the content) and for presenting information as lists (or slides) to encourage multiple page views for the same article. Another complaint about modern online articles is that they present old content in new ways to attract viewers even though nothing new is uncovered.
I do not doubt there is a healthy market for writing such content, but this type of writing does not deliver the older expectation of providing new facts or follow-up fact-checks. Data science projects need more facts, and they need more fact-checking of their earlier assumptions. These facts need to be uncovered through investigations and on-the-ground exploration. The data world needs fact prospectors. Fact prospectors look much like old-fashioned investigative journalists.
This need for new facts is still there, but the market has changed. We need the data to fill in data stores for data-science analytics instead of filling in narratives for subscription periodicals.
The modern economy has a high demand for the skills of journalists. Our large data projects need the skills of uncovering new sources of data and of cross-checking existing sources. This market would prefer to consume the products of this journalism labor as structured data instead of human-friendly narratives. The data is more important than the story.
As I pointed out in my posts on Ebola, we need more data about what is happening on the ground where the epidemic is spreading. Journalists are there now in the risky hot-spots to collect some data, but we really need data in the broader cold spots. We need data about the practices and customs of uninfected populations that either avoid the epidemic or will become infected in the future.
Our data analytics could benefit greatly if we had more abundant data on routine life in all communities without the infection. When a new outbreak occurs, we can quickly compare existing data about that community with other communities, or about the infected individual with non-infected individuals. Having this data available in advance allows us to ask questions about how the disease spread and who may be vulnerable next.
Current data analytics does not have this data to mine into. We need prospectors to mine for this data. Those prospectors are closer to journalists than data scientists.
Data scientists command high salaries for their efforts to apply new algorithms to data. However, the cost of data prospecting is much higher because it is far more labor intensive. A single data scientist may develop an algorithm that evaluates many billions of records every day. In contrast, we need many teams of multiple people to fan out across a continent to collect data or, ideally, to identify recurring data sources to feed these algorithms. Analytics makes the best it can of available data, but that may not be good enough when confronting an urgent issue like Ebola. Making up data to fill in our ignorance may satisfy some analytics visualization, but it will not deliver relevant and effective solutions to make a difference in the crisis.
We need journalists. In my last post on Ebola, I pointed out the on-the-ground reporting that provided the key information that an initial victim of the disease was a house-bound pregnant woman who was least likely to acquire the disease through the standard explanations. She simply didn't have the opportunity for direct contact with another symptomatic victim of Ebola. This is data that was not available in any data store. It had to be obtained by some journalist asking questions in that neighborhood.
We need journalists to provide us data. From the perspective of data science, it is cumbersome to receive this data in the form of a published narrative. It is not clear that text-processing algorithms would recognize the significance of this observation embedded deep inside a news report. Certainly it would be easier to discover if it were provided as a structured data record of some sort. We need journalists for their data, not their narratives.
I suspect that dedicated investigative journalists find the narrative-writing part of their job to be a distraction. Many good investigative journalists may have difficulty in their careers, despite superb investigative skills, because they lack the engaging writing style that would assure their observations get published (and thus compensated). Many of the best may simply not have much time left for writing such articles. They would probably be just as happy to sell their notes.
If their notes were in the form of structured data, we could go straight from notes to data stores and bypass the narrative requirement entirely. In the modern age of personal electronics, it should be easy to build form-driven apps that collect observations in a structured way. Perhaps the content is free-form text, but that content could be captured in a field or with an accompanying tag-word that the data scientist can recognize as potentially containing a valuable observation.
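As a rough sketch of what such a structured field note might look like (every field name and value here is hypothetical, invented for illustration rather than taken from any real app or data store), a form-driven app could emit records along these lines:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class FieldObservation:
    """Hypothetical structured record a form-driven app might emit
    instead of a published narrative."""
    location: str                             # community or neighborhood observed
    observed_on: str                          # ISO date of the observation
    observer: str                             # journalist or field worker collecting it
    tags: list = field(default_factory=list)  # tag-words analysts can filter on later
    notes: str = ""                           # free-form text, still preserved for analysts

# A journalist's note captured as data rather than prose.
obs = FieldObservation(
    location="Monrovia, New Kru Town",
    observed_on="2014-10-01",
    observer="field-reporter-01",
    tags=["ebola", "uninfected-community", "burial-customs"],
    notes="Households report a shared water source; no recent illness observed.",
)

# Serialize straight into a data store, bypassing the narrative entirely.
record = json.dumps(asdict(obs))
```

The tag-words do the work I describe above: the free-form note survives intact, but a data scientist can find it by filtering on tags instead of text-processing a published article.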
Currently, publishers buy the products of journalists. These products take the form of compelling verbal or visual narratives that publishers can sell for subscriptions or ad revenue. Although data science has a high demand for journalist products in the form of structured data records instead of human-engaging narratives, there is currently no method to compensate journalists directly for their valuable contributions to data projects. Instead, data projects parasitically feed off of the published narratives of news reports. Data projects attempt to use text analytics to extract the desired data, such as the example that I observed manually by reading a published article.
For data science purposes, it is very inefficient to mine data by text-processing published news narratives meant for popular consumption. It is also inefficient from an economic perspective because it fails to compensate the journalists for their contributions to the data science project. The economic inefficiency is that we are not providing the incentive for more journalistic activity to uncover much more data at a faster pace.
The next stage of the data-science evolution is to provide economic incentives for people to uncover new data sources to feed data projects. This mining for data is very much like the work of the field or investigative journalist. The next step for data science is to enlist journalism and to transform that field to produce data products instead of narrative products.
Update 10/7/2014: this article illustrates the point I am trying to make about collecting information on the progression of Ebola, including before the diagnosis and the community response. It would be great if this were available for all cases, and extended back for weeks prior to the infection. Even this one case study provides intriguing details.