The title of this post is a response to the title of a Gartner release “Beware of the data lake fallacy”. What strikes me is the curious choice of the word fallacy for a concept that apparently does exist and can work in some scenarios. I can see their point that the concept has some limitations, but I don’t think those are universal or insurmountable.
To be honest, I had never heard of the concept of a data lake or of exactly how it is promoted. But their definition expressed in the second paragraph sounds like a very basic concept and an obvious first step in many big data projects. The quote from Gartner Research Director Nick Heudecker:
“The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
To be clear, my sentiments are on their side that this has a lot of issues. The data starts off in individual data stores that are frequently carefully controlled to meet the needs of specific missions and to restrict access to authorized (often trained) staff. Even then, those staff may be restricted to specific parts of the data or to specific query capabilities. Moving this data to a different location doesn’t magically transform the data into something that is easier to comprehend or less prone to misuse (misinterpretation) or to abuse.
Again, I had not heard of the term data lake, but it seems at least related to the concept of open data. Open data leverages the same free access concepts that have benefited the software community in the form of the open-source movement. Making source code freely available provided the benefit of having more people able to improve the code. It coincidentally also opened the opportunity for misuse and abuse, but the community became robust enough to tolerate this inconvenience. For example, if an open-source capability results in frequent difficulties in its use, the community responds by making that capability easier to use or less prone to errors. It is hoped that open data can offer the same benefits, especially when paired with open-source tools for investigating that data.
A data lake may be another term for open data. Certainly mixing bare data in a common storage area can present challenges, but those challenges may be partially addressed by the accompanying open-source query and analytic tools for each type of data. Even if the problems are not solved by the initial sharing of these tools, the history of success of the open-source community promises that they will come up with improved software solutions to make this work. It may be a difficult task and it may be especially difficult to manage, but I don’t see how it qualifies as a fallacy or a fantasy.
It might work.
More specifically, the above definition reminds me of the very real and growing market of SIEM (security information and event management) tools within the specific domain of IT (information technology) log files. Each of the log files is specific to a particular type of equipment and is often unique to each vendor and even to each product of that vendor. The most basic description of SIEM data handling seems identical to the above stated definition of a data lake. Get all the log files in a single place, as close to their native format as practicable, and then open that data to a broad community to investigate. That community will need to recognize that an event called “over utilized at time T” has different definitions of “over”, “utilized” and “T” for different devices and vendors. Despite these challenges, the SIEM market is very large. It appears to be a successful market, at least in terms of selling products and of customers upgrading their licenses to build bigger stores. I assume the customers are finding value in these data lakes that probably appear quite swampy.
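To make the “over utilized at time T” point concrete, here is a minimal sketch of what reconciling two vendors’ log events might look like. The vendor names, field names, units, and thresholds are all made up for illustration; real SIEM products and device logs will differ.

```python
from datetime import datetime, timezone

# Hypothetical raw events from two imaginary vendors. The field names,
# units, and thresholds are assumptions invented for this illustration.
RAW_EVENTS = [
    {"vendor": "acme", "cpu_pct": 93, "ts": "2014-08-01T12:00:00+00:00"},
    {"vendor": "globex", "load_fraction": 0.72, "epoch": 1406894400},
]

# Each vendor defines "over" at a different level of utilization.
OVER_THRESHOLD = {"acme": 0.90, "globex": 0.70}


def normalize(event):
    """Map a vendor-specific event to a common record."""
    vendor = event["vendor"]
    if vendor == "acme":
        # This vendor reports a percentage and an ISO 8601 timestamp.
        utilization = event["cpu_pct"] / 100.0
        when = datetime.fromisoformat(event["ts"])
    elif vendor == "globex":
        # This vendor reports a fraction and seconds since the Unix epoch.
        utilization = event["load_fraction"]
        when = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    else:
        raise ValueError(f"no parser for vendor {vendor!r}")
    return {
        "vendor": vendor,
        "time": when,
        "utilization": utilization,
        "over_utilized": utilization > OVER_THRESHOLD[vendor],
    }


normalized = [normalize(e) for e in RAW_EVENTS]
for record in normalized:
    print(record["vendor"], record["time"].isoformat(), record["over_utilized"])
```

The code itself is short. The hard part is knowing that the two vendors mean different things by the same event name, and that knowledge comes from working with the data.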
I have mentioned SIEM tools several times in the past. I think they have their place but that place is along a data supply chain. SIEM tools (that I assume can be generalized as data lakes) reside in the middle of a data supply chain. The Gartner release suggests the same thing where the data is available for everyone in an organization to perform their individual analysis.
At least if I’m one of those analysts, I would have in mind a project that manages this data appropriately to meet my mission with as little risk as possible. I want my analysis products to be high quality to assure success of my mission. I also want to avoid violating any governance rules. While I didn’t have a data lake to draw my data from, my project operated pretty much the same as it would have if it had drawn data from a data lake. I negotiated with each data source to provide me the data they had in their native format. I took whatever they had and then discussed with them any questions I had about what the data format does and doesn’t mean. Although each source was a specifically negotiated data feed, it would have been the same if the data were in a common area for all to access. The only difference would be that retrieving data from the common area would be far easier than negotiating different data transfer protocols for each source.
I don’t find this concept to be that challenging. It is a lot of hard work. It is frustrating work because none of the data sources provides exactly what I need, either in terms of format or content (their definitions don’t match my own). However, it is not impossible. The effort requires skills, but the effort is also continuous because the data sources can and will change over time. There is something to work on every day.
The following quote from the release I think captures the primary concern of the approach:
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information,” said Mr. Heudecker. “It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without ‘a priori knowledge’ and that they understand the incomplete nature of datasets, regardless of structure.”
“While these assumptions may be true for users working with data, such as data scientists, the majority of business users lack this level of sophistication or support from operational information governance routines. Developing or acquiring these skills or obtaining such support on an individual basis, is both time-consuming and expensive, or impossible.”
The first quoted paragraph describes the hard work I discussed in the project I worked on. I agree that we demand that users understand the data they are working with. The second quoted paragraph argues that this demand is impractical. Data scientists are a rare breed and in short supply. Also, data scientists are so specialized that we need to import them into the organization. Developing data scientists from internal staff is “time-consuming and expensive, or impossible”.
Any form of development activity of staff to take on new challenges falls into the category of “time-consuming and expensive, or impossible”. However, any organization that refuses to develop its human capital is short changing its future. Humans offer long term value to an organization precisely because they are growing living beings who are intensely motivated learners eager to advance their careers to greater responsibility or at least continued relevance. Smart businesses do not freeze people’s capabilities to be what they qualify for on their date of hire as if people were products. Smart businesses invest in time-consuming and expensive efforts to develop staff to handle new challenges. Sometimes, that effort is impossible for some staff but the smart businesses will instead swap the career paths to match the challenge.
The argument is that training internal staff to become data scientists is a fallacy because such training is time-consuming, expensive, or impossible. If that is a fallacy, then any form of staff development is a fallacy. In my cynical viewpoint, I think many modern businesses do adopt this philosophy so perhaps it is a de facto fallacy because these businesses can’t tolerate the investment to develop staff.
I think in most cases, businesses do invest in their staff to keep their staff relevant to the business needs and to grow staff with potential to take on greater responsibilities or more difficult challenges. In particular, businesses want to keep their most highly evaluated staff who show commitment, enthusiasm, and effective work behaviors to get jobs done well. They will invest the expense, the time, and the risk of impossibility to develop these staff for just about any area of business.
But we are asked to believe that data science is fundamentally different from, say, advancing a junior accountant to become fully certified, such as becoming an Associate of the Society of Actuaries (ASA). I’m not sure where the term data science originates, but it sure has the word science in it, which seems to elevate the job description to PhD-level status. Data lakes must be a fallacy because there aren’t enough PhDs to staff them.
The same job description could have been named “data clerks”. It is just a word to describe the specific tasks in handling data. Certainly data tasks require training and special aptitudes, but so do other clerk positions. Changing the name of the field from “scientist” to “clerks” changes how we can view staffing the positions. Businesses develop their clerks internally with a goal of advancing young talent internally to meet the more challenging work expected later. In contrast, businesses hire only accredited scientists by importing them after they have been developed.
A simple change in the name of the field would answer the fallacy issue. To make data lakes feasible, we need data clerks instead of data scientists. The same basic job can have different names.
I have a hard time seeing what makes data science different from other fields like accounting, actuarial, or financial work. These fields all have graduated levels of capability where the top enjoys the full trust in being able to operate independently. However, these fields have mechanisms to bring in junior staff and develop them into senior positions later. In these other fields, the certifications demand years of preparatory professional experience. A simple college education does not suffice.
If anything, we are more demanding of these other skills than we are of data scientists. Although we reject developing data science skills in our experienced staff as too expensive or time-consuming, we readily hire recent college graduates as fully capable data scientists.
There is a reason why the other fields demand experience on top of challenging certification exams. That reason doesn’t disappear because the job title has the word “scientist”. Experience confronts reality, where things require good judgment as well as good skills. Much of the demand for data governance, for example, is the demand for good judgment and ethics. Judgment and ethics come with experience.
The objection may be that data science is still impossible in most cases because it involves specific skills with abstract concepts. My answer is that the very nature of data is abstract, but there is a fair majority of humans who seem quite capable of handling data.
For example, the concrete concept of my car is what I experience when I go out and touch it, open the door, and start driving it. However, I have no problem equating that to the printed statement from the county saying I owe property tax for the same car based only on a few printed characters on a piece of paper. They want tax for the physical car, not the ink on the paper.
Many disciplines deal exclusively with data and have a wide range of skills to work with that data. In many of these disciplines most of the practical skills are learned on the job. The certification exams tend to focus on broader practices and restraints.
The ability, for example, to operate a particular spreadsheet program efficiently does not come from the certification but instead from on-the-job experience. I’ve seen many people who are very skilled at performing their jobs using Microsoft’s Excel program, and what I’ve noticed is that many of their most valued skills derive from understanding the data they are working with. Excel provides only a convenient tool box for tackling that data. Our most prized Excel experts are in fact data experts experienced with the data they must deal with.
Most of the reason for the presumed shortage of data science skills comes from expectations set by calling it a science that requires specialized outside-of-work training. To fill data science positions, we accept as fully qualified those who have data-science related degrees from colleges, and we accept nothing else.
At one time we had an appreciation for a specific skill known as typing. Although typists were not as highly paid as data scientists are today, we required applicants to pass typing tests of skills that could only be learned outside of the job. To get a job, the typist had to be able to type a large number of words per minute with no errors. I consider myself a decent typist, but I suspect I wouldn’t have qualified for a typist position in the 1960s or 1970s. At that time also, we delegated certain tasks that could only be performed by typists. A simple inter-office memo had to be typed by the typist even if we supplied them a type-written draft. We recognized that they leveraged their typing speed and accuracy to better employ their knowledge of appropriate standards for how a memo should look, how the particular audience should be addressed, and other considerations that remained mysterious to outsiders like me. When I started my career, I had to queue up my requests for typed copy in the typist’s inbox. Typists had broader responsibility titles, but that’s the job I needed them to perform.
In the early 1980s, I recall the introduction of personal computers with word processing software. The word processing software had lots of mysterious commands and required training to learn how to use. The commands didn’t seem that mysterious to me, but still we accepted that we needed to hire or train our typists to specialize in these tools. There was justification for this approach: although anyone may have been able to learn the commands, only the typists were able to type at high speeds without errors, and so make the most efficient use of the expensive computer and the rare type-quality printer that required more service in proportion to the pages printed. There was still a premium on printing only high quality documents.
The typists of the 1970s and before had a rare and prized skill of typing very fast without errors. These skills were an entry requirement to get the job in the first place. Once on the job, they alone had the opportunity to develop those skills and to leverage those skills to learn business governance rules involving paper correspondence and documentation. We had a shortage of typists. A room of 20 engineers doing daily tasks may need to wait their turn to get a paper typed for them to review perhaps days later.
I don’t hear much about a typist shortage today although clearly there is no way to compare the daily output of typed material of today compared to the 1970s. We’ve all learned to become typists. There may be some who still type using one finger at a time, but most people I see on computers are touch-typists just like the specialist typists of the 1970s. We may not have their speed and accuracy, but we have auto-correct and word processors. Since each person is their own typist we don’t really need the speed required by the poor typist who needed to type the work of dozens of typing-challenged thinkers.
My point is that we solved the typing problem by redefining it as a task everyone needs to perform instead of making it a separate job description with a specialized skill set. Today it does not seem controversial to expect everyone to learn touch typing on a QWERTY keyboard, but it was a lot more controversial in the early 1980s. Learning to touch type is very hard, especially for older adults and especially for those whose day is filled with other job demands.
I was among those who learned touch typing late. I recall thinking I could get by typing with just one finger per hand and finding keys with the desired letters printed helpfully on top. Then I encountered a situation where I had to do something on a terminal in front of someone whose time was very valuable. I knew then I had to accept the fact that I had to obtain the skills of a typist.
In contrast, data science involves learning the names of relevant algorithms and how to invoke their library in a language like R or Python (or Java, C# for those who insist on purity). No matter how abstract the concepts are, it only takes a few minutes of reading or perhaps a week of training to get familiar with the tools required to get the job done, and what exceptions to catch.
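As a rough illustration of how little code it takes to invoke a library algorithm, here is a sketch of a one-feature Gaussian Bayes classifier built on `NormalDist` from Python’s standard `statistics` module. The training numbers and labels are invented, and the sketch assumes equal prior probability for every label, so treat it as a toy rather than a recommended method.

```python
from statistics import NormalDist, StatisticsError

# Toy, made-up training data: one numeric feature (response time in ms)
# labeled with a hypothetical server state.
TRAINING = {
    "healthy": [102.0, 98.0, 110.0, 95.0, 105.0],
    "overloaded": [240.0, 310.0, 275.0, 290.0, 260.0],
}


def fit(samples_by_label):
    """Fit one normal distribution per label from its samples."""
    model = {}
    for label, samples in samples_by_label.items():
        try:
            model[label] = NormalDist.from_samples(samples)
        except StatisticsError:
            # Raised when a label has fewer than two samples; skip it.
            continue
    return model


def classify(model, value):
    """Return the label whose fitted distribution gives `value` the
    highest density. Assumes equal priors for all labels."""
    return max(model, key=lambda label: model[label].pdf(value))


model = fit(TRAINING)
print(classify(model, 100.0))
print(classify(model, 280.0))
```

The one exception caught here, `StatisticsError`, is the kind of detail a short training course covers. Knowing whether response time is even the right thing to measure is the part that takes experience with the data.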
Learning to touch-type when you previously only knew hunt-and-peck is a far harder training experience. It took many months of daily practice to even begin to take my eyes off the keyboard, and much more time to get to the point where I could type as fast as I can write. In some ways, I had an advantage over my peers because I had a hobby of creative writing. Typing skill requires lots of practice typing never-before-seen material.
In the 1980s, we could not expect to send a senior manager to a one-week boot camp to turn him from a hunt-and-peck typist into a touch typist. In many cases, the goal was impossible, but we confronted the reality that in order to get the modern office work done we needed all modern workers to become touch typists. This was expensive, time-consuming, and sometimes impossible. But the modern reality is proof that this was not a fallacy.
As I discussed in earlier posts, the future reality is one where everyone will have to be what we today call data scientists. In response to the question of what the future job title will be for what we today call a data scientist, my answer is citizen and employee. To survive and thrive in the future, we need to have data skills just like today we expect touch-typing skills.
Returning to the caution about the fallacy of data lakes, the fallacy itself is a fallacy. It assumes there is an inherent specialty in data science that cannot be entrusted to the general population of an organization. It is making the same assumption we made in the early 1980s that typing must be done by typists because the rest of the staff can never learn to touch type as well. We did learn to touch type. We will learn to find the “analyze” menu in some software package and then select “Gaussian Bayes classifier”. That software user interface will be an applet on our smart phones that we’ll get from a link shared with us in a Twitter message.
Not only will it be easy, but it will be essential. Our day to day lives will demand that we understand how to get data and how to interpret it without making embarrassing mistakes. With those skills, we will demand data lakes to fish in.