One recurring theme of my questioning of practices with big data is the lack of provisioning for human labor to routinely scrutinize the data, not just for operational problems but also for underlying assumptions that are or have become invalid. I call this scrutiny data science, taking the science seriously in the same sense as traditional disciplines, which are characterized by continuous skepticism about the information, its interpretation, and the underlying assumptions.
Data science also has a trendy meaning in job descriptions, where the emphasis is placed mostly at the back end, on the consumer of big data rather than the producer. This field values the data scientist who is skilled at quickly excavating treasures out of the given data.
My meaning of the term would be disappointing to such employers because I’m more interested in making sure the data is something worth mining in the first place. My meaning is more likely to slow things down.
Today, I was reading about law and about Supreme Court cases. I should note that I’m in the tribe of individuals who are fascinated by law but disinclined to make it a career or even a serious hobby. My interest rises above the casual only when controversial cases hit the news or opinion circuits. Then I can get interested in cases that only law experts would discuss. I’m intrigued by the dynamics in law. It is very slow and interconnected.
In an earlier post, I discussed how some professions are inherently low productivity and how this inevitably makes them low paid. I focused on medicine and education. I could have added law. As best as I can tell, the bulk of those in the legal profession probably make about as much as those in education and medicine. If so, I wouldn’t be surprised. The practice of law is inherently low productivity.
Law involves engaging in an argument where the actual time in the argument is dwarfed by the amount of time required to prepare for that event. Higher level court cases move even more slowly. In addition, law tends to compound its workload by creating new legal problems when resolving older ones.
As I mentioned, my interest rarely gets very deep or past the Supreme Court level. But that limited depth is relevant to what I want to discuss. The challenge of making decisions at that level is to come to a good resolution of the specific case using an argument that is least disruptive to the entire rest of the body of law. The presented arguments and the resulting decisions reach deep into all relevant cases and law.
We allow the process to move slowly. We expect careful deliberation to be certain the results are best not only for the outcome of one case but for all of the country’s investment in prior law and legal decisions.
Allow me to surrender at this point any attempt to be knowledgeable about law. All I’m trying to say is that I admire the profession’s tedious and never-ending project.
I want to point to this as an example that data scientists should follow in their tackling of big data projects. I discussed this in a post where I asserted that there is a need for intense labor in working with data. In contrast to the hot fields of data science of assembling ever larger data centers with the latest technologies, I’m discussing a very low productivity side of the project. The project of making sure the data is any good for what it is supposed to be used for.
In some ways, we need to develop a lawyer-like mentality in dealing with data. In an earlier post I talked about treating data like witnesses in a court case. I used the court analogy to describe the labor part of data science that is often overlooked. In today’s post, I’m making a stronger assertion: data science should aspire to follow the example of the practice of law.
Historical observation data is very different from laws, legal opinions, and case histories. Big data is not the equivalent of law. However, what could be more similar is the obligation of diligence placed on the practitioners.
What really impresses me about law is the absolute obligation it places on its practitioners to be diligent, thorough, and honest. I am not always impressed by how consistently or uniformly this obligation is enforced. But there is a strong, recognized sense of obligation to do the best possible job arguing either side or weighing both arguments. This obligation is so expected that we don’t complain that it is a big part of what slows the process down so much.
There is no corresponding obligation on big data. Most of the emphasis is on making big even bigger. There is a lot invested both in hardware and in software. In software, there is an obligation to get the algorithms correct and scaled to the size of the data. What is missing, in my opinion, is the obligation to make sure the data remains right.
There is a lack of commitment or obligation at the back-end to be sure the data means what we hope it means. This lack takes the form of limited budgets and staffing. There is no one around to look for what might be going wrong.
In my experience, big data projects do take any discovered problems seriously and immediately take necessary steps to fix the problem. The problem is not that problems are ignored. The problem is that the projects would prefer that the problems not be discovered in the first place.
Big data projects follow life cycles with an up-front engineering phase followed by implementation and a period of operation. The assertion of correctness occurs during the front-end engineering effort. After that point, the implementation and operation are expected to be stable. The up-front engineering is expected to characterize the data’s properties completely, and this characterization is not expected to change throughout the remainder of the project life cycle. Consequently, the project can plan on a very low level of budgeting or tools for routine careful scrutiny of each new batch of data. A measure of a good project design is its low operational costs. Low operational costs mean there is not much in place to catch emerging problems and then quickly address them within the operational phase of the cycle.
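To make the idea of routine scrutiny of each new batch a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of any particular project or tool. The names `FieldAssumption` and `scrutinize_batch`, the field `latency_ms`, and the thresholds are all hypothetical. The sketch assumes a batch is a list of records and that the up-front engineering recorded an expected range and a tolerable fraction of missing values for each field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldAssumption:
    """One assumption recorded during up-front engineering (hypothetical)."""
    name: str
    min_value: float
    max_value: float
    max_missing_fraction: float


def scrutinize_batch(batch, assumptions):
    """Return human-readable findings for one batch of records.

    `batch` is a list of dicts; a missing key or a None value counts as
    missing. An empty result means no recorded assumption was visibly
    violated. It does not mean the data is correct.
    """
    if not batch:
        return ["batch is empty"]
    findings = []
    for a in assumptions:
        values = [row.get(a.name) for row in batch]
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > a.max_missing_fraction:
            findings.append(
                f"{a.name}: {missing_fraction:.0%} missing, "
                f"more than the assumed {a.max_missing_fraction:.0%}"
            )
        out_of_range = [
            v for v in present if not a.min_value <= v <= a.max_value
        ]
        if out_of_range:
            findings.append(
                f"{a.name}: {len(out_of_range)} value(s) outside "
                f"[{a.min_value}, {a.max_value}]"
            )
    return findings


# Hypothetical usage: one assumption, one small batch with two problems.
assumptions = [FieldAssumption("latency_ms", 0, 10_000, 0.05)]
batch = [{"latency_ms": 120}, {"latency_ms": None}, {"latency_ms": 250_000}]
for finding in scrutinize_batch(batch, assumptions):
    print(finding)
```

A check like this only flags where the data has drifted from what was assumed at the start. Deciding whether the data or the assumption is wrong is still the human labor I am arguing for.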
This concept of project life cycle derives from the life cycle practices of smaller projects involving much more specific tasks. Typically these smaller projects involve some real-time or operational need. In these projects, it is important to freeze the requirements at some point and have the project enter a stable period of operation. Given their limited operational scope, it is reasonable to expect stability for this period.
We want to treat a big data project as just another operational project, only with a larger data center and different software. We haven’t yet come to terms with the fact that it is a very different undertaking from smaller operational projects. The difference lies in the nature of big data itself. Big data is historical data that combines data from multiple, often unrelated operational systems across multiple generations of each of these systems, often with major changes between generations.
Another way to contrast the two is that smaller operational projects can expect a world that will not change for the life of the project. We choose a life span for a project in part based on our expectations of that stability. In contrast, a big data project must deal with an uncertain future where every moment is vulnerable to a surprise.
Big data projects are less like operational software projects and more like legal projects.
If the practice of law followed the software life cycle concept, it would be like synchronizing law with election cycles: an election would set the rules for the next two or four years, and after the election computerized automation would handle all legal cases.
Instead, every court case is a new challenge and has no well-defined schedule. It is a tedious process involving constant work.
In recent years the two disciplines have been colliding. The practice of law and government regulation are relying on products from big data. And big data is introducing its own legal questions. In both cases, big data is receiving more legal scrutiny.
This can be a good learning opportunity for both sides although the lessons may be especially painful for big data.