I saw a recent report forecasting that in five years the demand for data science skills will outpace the supply of labor by some 200,000 jobs. Conveniently, that’s enough time for today’s students to choose a college major accordingly. It will be interesting to see how richly those new data scientists will be compensated for their scarce, highly sought skills.
At the same time I observe the mainstreaming of the core technologies that define the technical challenges for most of the jobs under the grandiose title of data science. Well-capitalized and industry-respected commercial vendors are converting what once was difficult, specialized programming against open source code into office products with analyst-accessible user interfaces. This trend is helped along by the parallel commoditization of virtual machines, which allows companies that resist commercial cloud services to build their own clouds using standard practices that will support these increasingly available products.
It looks like the early 1980s all over again.
In an earlier post, I equated the modern usage of the term data science to the way we understood the most challenging parts of computer science in the early 1980s. Both were devoted to solving exactly the same problem: running complex algorithms over more data than the existing hardware could handle easily. The emphasis of computer science as a college discipline at that time was on building algorithms that were efficient and could solve the then-difficult challenges of data analysis.
My personal exposure to that era was in the field of signal processing, a field that seems indistinguishable from a big data problem. Signal processing in the 1980s had to deal with the three V’s: velocity (signals, almost by definition, are streaming), variety (multiple sensors plus the need for enriching data), and volume (a consequence of accumulating signals within the time frame of a decision). It was the same problem.
I do not agree with the argument that today’s algorithms are different from signal processing. The defining characteristic of computer science is innovation, not the reproduction of something already developed. Just as new machine learning algorithms need to be invented now, we needed to invent new signal processing algorithms in the 1980s. I specifically recall the excitement of finding alternatives to the fast Fourier transform despite the fact that it was already remarkably fast. It had the same feel as today’s debates contrasting different machine learning algorithms.
Big data is signal processing. In the early 1980s we had to invent these algorithms. Many solutions involved customized implementations based on narratives published in academic journals. Even when researchers had written working software, they translated their code back into mathematical symbols in order to be approved for publication. At that time, open source meant a published mathematical formula that required quite a bit of skill to translate into a computer program. A single published paper would result in thousands of independent implementations in a variety of languages, often machine language or even hardware logic arrays.
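To make this concrete: the discrete Fourier transform, X_k = Σ_n x_n e^(−2πikn/N), was exactly the kind of "open source" a journal offered, and turning it into a program was the reader's job. Here is a minimal sketch of that direct translation; the function name and test signal are my own illustration, not code from that era.

```python
import cmath

def dft(x):
    """A direct, literal reading of the published formula
    X_k = sum_n x_n * exp(-2*pi*i*k*n/N).

    This naive translation runs in O(N^2); restructuring it into
    the O(N log N) FFT was the kind of implementation skill the
    journals left as an exercise for the reader."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# A constant signal concentrates all its energy in bin 0.
X = dft([1.0, 1.0, 1.0, 1.0])
print(abs(X[0]))  # 4.0
print(abs(X[1]))  # close to 0.0 (floating-point residue)
```

Each independent implementation of that one formula, in Fortran, assembly, or logic arrays, repeated exactly this translation work.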
Signal processing remains an important challenge for computer science. Moreover, much of modern machine learning appears to be a direct descendant of earlier signal processing challenges. Throughout the 1980s and 1990s, processing radar, sonar, or photographic images meant quickly sifting huge volumes of varied, inherently noisy data to find and track items of interest fast enough to make decisions about arranging a rendezvous with those items.
In numeric terms today’s data is measured in bigger quantities, but the problem is essentially the same: more data than existing algorithms can handle on existing hardware. The job was to find a better algorithm or to build better hardware.
By the early 1990s, signal processing had become widely available and was used throughout industry and even in consumer products. We witnessed extremely rapid growth in the market for new implementations embedded in different products or provided in software packages. What we did not witness was an equal growth in demand for signal processing jobs. Signal processing jobs may have been well paid because of the expertise they required, but overall the economy didn’t need that many of them filled.
There is a simple explanation for why a field of technology like signal processing can experience rapid and diverse adoption without requiring the economy to produce new jobs. The technology became commercialized. A few core vendors emerged who could efficiently employ a small number of specialists to build marketable products that could be mass produced to satisfy a particular market.
Although I call this a simple explanation, it was not apparent in the early 1980s. I recall at the time being impressed with the intricacies of the algorithms that needed to be tuned for different circumstances. The very complexity of implementing an algorithm and getting it to perform well on a particular technology seemed to defy any possibility of mass production.
Actually, I recall thinking it was possible to create a higher level of abstraction: a user interface supplying parameters to reusable code that could be tuned for different mission challenges and different available hardware. I vaguely recall the experts dismissing that concept because such flexibility would be more difficult to write and the performance would be degraded. It turned out that, compared to writing effective signal processing code, the additional code to make it tunable was simple. While there may have been some degradation in performance, that degradation was often hard to measure and certainly hard to weigh against the extreme additional cost of custom implementations.
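The abstraction I had in mind can be sketched in a few lines: write the inner loop once, and let the mission-specific customization live entirely in the parameters. This is a hypothetical illustration in modern Python, not code from that era.

```python
def fir_filter(signal, taps):
    """One reusable convolution routine. The filter's entire
    mission-specific behavior is carried by the `taps` parameter,
    which an analyst could supply through a user interface instead
    of rewriting the inner loop for each deployment."""
    n = len(taps)
    return [sum(taps[j] * signal[i - j]
                for j in range(n) if 0 <= i - j < len(signal))
            for i in range(len(signal))]

# Same code path, different tuning: a smoother vs. an edge detector.
smooth = fir_filter([1, 2, 3, 4], [0.5, 0.5])  # 2-tap moving average
edges  = fir_filter([1, 2, 3, 4], [1, -1])     # first difference
```

The point the experts missed is visible in the sketch: the tunable version is barely longer than a hard-coded one, and the cost of the extra indirection is negligible next to rewriting the routine for every mission.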
I enjoyed learning signal processing algorithms and their implementations. I didn’t specialize in it, but if the market for signal processing skills had grown as fast as the availability of signal-processing-powered products, the market would have begged me to take a job in it. I never experienced that kind of urgency.
I suspect the same dynamic is occurring right now in so-called data science. In the past couple of years, there has been a deficit of skills because companies rushed to adopt solutions based largely on open source software. The earlier successes that drove this interest in the technologies benefited from cheap, enthusiastic labor to customize the open source software. Much of this labor was not compensated at all. To replicate this effort broadly across the entire market, that labor has to be multiplied for each implementation.
One problem with open source based on volunteered labor is that we never get a good measure of how much labor actually goes into it. By volunteered labor, I am referring to people who are employed, often as programmers, but who put in extra hours to get the project to work, or spend their free time contributing to the project as a hobby. Such enthusiastic volunteered labor can be very productive and effective. The problem is the very limited supply of such enthusiastic workers willing to work for free (outside of their day job).
The projections say that we will need tremendous growth in jobs for specialized algorithm developers to customize open source software to meet each particular business’s needs. To me this is like saying every submarine will require its own set of programmers to customize mathematically-symbolized sonar processing algorithms for the local ocean environment and the targets being tracked. The submarine may carry a signal processing specialist who occasionally needs to write some code, but I doubt submarines have 12-member software development life-cycle teams operating in two-week scrum sprints with a scrum master to isolate the team from the rest of the crew.
What is driving the labor demand is the cost of replicating the software development life cycle to customize source code from open source projects. From a macroeconomic viewpoint this does not make any sense. Companies competing for the same market will each employ their own staff to do the same tedious work of quality software development, producing essentially the same capability. In the same market, the main thing that separates competing companies is the data they have access to. What they need to do with that data will be essentially the same. They may make different choices about which algorithms to employ, but in general they are going to select from the same library of possibilities.
Inevitably there will be companies that supply exhaustive libraries of capabilities, where each capability is built to high software development quality standards and has a simple tunable interface. The customer of these libraries will need only to select an algorithm from a list and supply the parameters that tune it for the particular task.
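The interface of such a library might amount to nothing more than a name and a handful of parameters. A toy sketch of the idea follows; the registry, algorithm names, and implementations here are entirely hypothetical, standing in for a vendor's vetted, pre-built catalog.

```python
# Hypothetical catalog: each entry stands in for a vendor-built,
# quality-assured implementation the customer never has to write.
ALGORITHMS = {
    "moving_average": lambda xs, window=3: [
        sum(xs[max(0, i - window + 1): i + 1])
        / len(xs[max(0, i - window + 1): i + 1])
        for i in range(len(xs))
    ],
    "min_max_scale": lambda xs: [
        (x - min(xs)) / (max(xs) - min(xs)) for x in xs
    ],
}

def run(name, data, **params):
    """Select an algorithm from the list, tune it by parameters."""
    return ALGORITHMS[name](data, **params)

# The customer's whole job: pick a name, supply the tuning.
result = run("moving_average", [2.0, 4.0, 6.0, 8.0], window=2)
```

Everything below the `run` call is the vendor's problem, which is precisely why the customer does not need an in-house algorithm team.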
In my experience, I continue to be amazed at the longevity of computer spreadsheet products. These tools remain very popular and very accessible to non-programmers. In fact, these tools continue to enjoy enthusiastic support from their users because of all the capability they make available without the need to employ a software development team. This office software continues to offer sophisticated algorithms that are as easy to use as basic mathematical functions.
I recognize these algorithms as the ones I was once eager to code from scratch in whatever language I was using at the time. They are so fascinating they would be fun to implement even today. I felt cheated out of similar opportunities in the past because some vendor had mass produced the algorithm in a very reasonable implementation available to virtually anyone.
The mass produced implementation performs adequately for most people. Anyone who is justifiably unsatisfied with it will be challenged to produce a competing version with enough time left to enjoy the benefits before the vendor releases its improved version. In the end he may enjoy some bragging rights, but everyone will be using the vendor’s implementation.
In recent months, there has been an increasing pace of major product announcements from vendors who have created commodity implementations of what started as open source software. These vendors generally cooperate with the open source community, so customers still have the option of producing their own implementations from the open source. But the vendors have a huge advantage in removing the need for in-house software development. Compared to customer companies, where the software development project is an under-funded overhead effort staffed with whatever talent they can find, the vendors have the capital to fund much more diligent software practices and to employ specialists for each component or step in the software life cycle.
Even for a company willing to spend the money to employ its own software teams to develop its own implementations, it is unlikely they will do as good a job as the vendors, either in the comprehensiveness of the solution for their specific needs or in the performance of a particular chosen algorithm.
I think some of the projected labor growth comes from recent examples of start-up companies suddenly disrupting a market of established players. Although these start-ups could benefit from the vendor implementations, they are likely to be attracted to the lower entry cost of using the open source. Being start-ups, they probably have access to enthusiastic and cheap labor willing to work every day and for long hours. If these start-ups succeed in disrupting a market, they will need to continue to supply labor to maintain their open source roots. Their commitment to a proprietary implementation forces them to keep hiring to support that effort.
It certainly seems possible that we will see many more start-up companies offering services based on disruptive technologies. Certainly the next one will probably come as some surprise. But I doubt the established, enduring companies will fall to start-ups. Unlike start-ups, established companies understand what it takes to stay around for the long run. Also, the fate of most successful start-ups is to be bought by an established company, which will then impose a more realistic choice of supporting technology.
Even in the case of a disruptive start-up, it is becoming less likely that in-house development on open source will be cheaper than leveraging existing vendor implementations. The software vendors are capable of learning lessons from the past. They are being creative with alternative pricing models that could appeal to a start-up. Essentially the offer is to share in the fortune of the start-up: if nothing happens, the vendor gets nothing, but as the start-up succeeds, the vendor gets more revenue.
An example is a data product that is free to use as long as the daily data volume stays below a certain threshold. In the example I recall, the vendor had the foresight to allow a set number of days per month on which the product may exceed the threshold. The customer needs to pay only after seeing good evidence of sustained success.
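The pricing rule described above reduces to a few lines of logic. The threshold and allowance figures below are invented for illustration; the recalled vendor's actual numbers are not stated here.

```python
def owes_payment(daily_volumes, threshold=1_000_000, free_overage_days=3):
    """Free tier with a burst allowance: payment is owed only once
    the number of days in the month exceeding the volume threshold
    goes beyond the allowed overage days -- that is, only after
    sustained success, not a one-off spike."""
    days_over = sum(1 for volume in daily_volumes if volume > threshold)
    return days_over > free_overage_days

# A single spike in a month stays free; sustained growth triggers billing.
quiet_month = [500_000] * 29 + [2_000_000]   # one burst day
busy_month  = [1_500_000] * 10 + [900_000] * 20  # ten days over
print(owes_payment(quiet_month))  # False
print(owes_payment(busy_month))   # True
```

The appeal to a start-up is that the billing trigger is itself a measure of the start-up's traction, which is exactly the fortune-sharing arrangement described above.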
Overall, I see today’s data science as essentially identical to the 1980s software engineering challenges such as signal processing. Clearly there is a need for sophisticated algorithms. These algorithms need to be carefully implemented to achieve the desired performance. There is huge enthusiasm about the amount of work that will be required. Yet despite the success of the signal processing market (the algorithms are everywhere, but we hardly think about them any more), there never was a huge demand for signal processing programmers. The reason was that vendors found ways to consolidate the capabilities, efficiently employing a few specialists to satisfy nearly the entire market.
There was a computer scientist labor bubble in the 1980s, but luckily it wasn’t noticed because the labor trained for complex algorithms was able to find good jobs in less challenging software fields. They may have been more competitive for those jobs because of the extra diligence required to accurately implement algorithms that ran fast, though I doubt that advantage was very large given some of the poor software products that enjoyed commercial success. In any event, the people who studied to be signal processing experts found gainful employment elsewhere. Some may have been disappointed that they never had the opportunity to use the skills they took pride in learning, but they were not out of work either.
I see a similar bubble building for data science (today’s signal processing), but I see less likelihood of a safety net to catch the excess labor when the bubble bursts. In five years, there will be jobs dedicated to the careful implementation of deeply analytic algorithms. There just won’t be many of them. There will be dominant vendors who will optimize their teams to produce the necessary breadth of product offerings to satisfy the majority of the market. The vendors will be more clever in their licensing arrangements to compete with free options.
The evidence for a bubble appears in the comments I’m hearing from data science enthusiasts as they delight in new product offerings by major vendors. They are imagining an even more elevated role for the elite group known as data scientists. The availability of cheap and easy-to-use tools, they say, will free them to do truer and more valuable forms of data science.
The comments I read are fantasies about what these new elite data scientists will do now that they don’t have to be burdened with software development. No one doubts they will still be employed.
The evidence of the bubble is their justifications for their continued relevance after the vendors have turned data science into an office application similar to a spreadsheet. Data scientists, the argument goes, inherently offer more value than typical office software users. One suggested example of this additional value is that data scientists are storytellers.
Office workers will be able to select from menus the appropriate algorithm to apply to the available data sets and immediately receive rich, immersive, and interactive visualizations that turn the data into something approximating a computer game or even a 3-D CGI animated film. Unfortunately, the typical office worker will be unable to tell the story presented to them. Companies will need to hire data scientists to tell the story being told by the visualization of the data.
The future of data scientists is to become the equivalent of Hollywood film critics.