In a recent post, I discussed the importance of shopping skills for a career in the STEM fields of science, technology, engineering, and math. Most STEM projects involve a small amount of stuff built specifically for a particular purpose and a much larger amount of stuff that has to be obtained pretty much like a retail product. Effective shopping skills mean not picking the first item that might work, and not picking things without paying attention to the unexpected details the vendor points out as important. For the practitioner and the practitioner's employer, good shopping skills can result in finding a stronger option that is more likely to get the job done, or to get that job done with less work or less skill.
The cost of poor shopping skills plagues the software industry today, where lots of computer scientists are very comfortable with their programming language and its associated development environments. When a new project comes up, they eagerly and confidently apply their past experiences. Even within their own environments, they may uncomplainingly proceed to work tediously with some minimalist library to get the job done, without checking for a more appropriate library that is more specialized to the task and would require less work to produce or maintain a solution. More troubling is the reluctance to look outside their programming language to see whether some other package already exists to solve the problem.
I recall a recent argument where the stated need was that the data was so challenging that the files had to be stored on disk in binary format and extracted using algorithms specifically designed for those formats. The argument hinged on the belief that this kind of capability was unavailable from any commercially developed product and had to be invented by ourselves, in the 21st century.
Back when memory was counted in bits instead of terabytes, there was some concern about efficient data storage and retrieval strategies. Programmers from 30 years ago were skilled at efficiency in their own work, and I would imagine that if they were employed to build a product, they applied those same skills to that product. In this particular case, the argument concerned avoiding database technologies. There are clearly opportunities to out-perform a top-end database for certain kinds of operations, but in practice this is not easy, because top-end database engines implement a lot of capability accumulated over decades of experience with real-world data.
Shopping is important to the employer, although curiously the employer usually defers to the developers on whether to shop or to use what they already have. The modern practice of agile programming has the unfortunate side effect of building a barrier between so-called stakeholders (employers) and the developers. The stakeholders define what capabilities they want. The developers choose how to deliver those capabilities. Generally, the developers will use what they have. They are not going shopping just because the job turns out to be hard; the mark of excellence is making a hard job look easy. A job has to hit a dead stop before shopping becomes an option, or before the missed opportunity of shopping is regretted.
In agile practice (using some form of a scrum model), the choice of implementation is largely in the developers' hands. Frequently developers are free to deliver a capability even when they end up reinventing databases, complete with their own unique query languages. What goes unconsidered is the possibility of buying a commercial product that does the same thing competitively, plus brings the employer the luxury of access to a large pool of existing skilled labor and abundant training and value-added resources. Even when a database is explicitly available, developers will choose to retrieve the contents of entire schemas to their local environment so they can write their own algorithms, instead of leveraging the algorithms inherently available in the database's SQL query engine.
It takes skill to write efficient algorithms, plus the skill to handle the technical difficulties of moving large amounts of data to a location the algorithms can access. It also takes skill to write effective SQL. But SQL not only avoids the data-movement problem, it also allows the database to be more selective about which data needs inspection in the first place. It takes an employer's perspective to recognize the benefits: SQL skills are more readily obtained, and the compact structure of SQL, specialized as it is for data problems, enables higher productivity in producing queries.
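To make the contrast concrete, here is a minimal sketch in Python (the sales.db file, the orders table, and the column names are hypothetical, and sqlite3 merely stands in for whatever database engine is actually in use). The first approach drags every row into the application and aggregates it there; the second asks the query engine to do the aggregation and ship back only the small summary.

```python
import sqlite3

# Hypothetical analytic database; the table and columns are illustrative only.
conn = sqlite3.connect("sales.db")

# Approach 1: move all the data out of the database, then aggregate it locally.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    totals[region] = totals.get(region, 0) + amount   # every row crosses the wire

# Approach 2: let the SQL engine aggregate and return only the summary rows.
totals_sql = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))

print(totals == totals_sql)   # same answer, far less data movement
```

The second form also leaves the engine free to use its own indexes and statistics to decide which data actually needs inspection.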
It is interesting that this argument appears to be more or less settled, with the so-called NoSQL (Not only SQL) approaches being adopted by database vendors. This clever marketing concedes that SQL wasn't good enough for some tasks, but the result is a commercial product that uses a high-level data-manipulation language that looks a lot like what SQL always looked like. Using the commercial products often, and increasingly, means there is no need for software developers to code up custom algorithms in a general-purpose object language. The commercial vendors reassure developers that they continue to have the opportunity to do this custom coding if they absolutely need to. The cleverness is in not reminding everyone that they always had this opportunity with databases, but it turned out that few people needed to use it.
I complained earlier about my disappointment that my definition of data science is a minority view. I took the word science as it is used in the natural sciences, so that data science is the study of data and information, just as chemical science is the study of molecules and reactions. My understanding of the majority view is that data science is a specialty of computer science devoted to the challenge of more data than the available resources can accommodate. To me this is a curious definition, because before the popularity of the Internet and graphical user interfaces, the core of computer science was focused on precisely this problem.
Data science is computer science. Computer science is the skill to implement an algorithm in a general purpose language so that the implementation will accomplish its task despite the limited resources available. Conversely, if a data problem can be solved with a commercial product and without needing a general purpose language (and its associated practices) then that project is not data science. Users of a commercial product may be domain experts, but they are not data scientists.
A recent Teradata blog post described this point in a way that seems to agree with my point of view. That view says that data science is about understanding data. Sometimes this requires writing custom software with rigorous computer-science discipline, but this is quickly becoming unnecessary with the rapid improvement of commercial products that bring data directly to the end user, the domain expert.
A software team (with its data scientists) often chooses to reinvent its own technology instead of shopping for a commercial technology that is ready to do the job immediately. The result is slower productivity as they build and test each algorithm one at a time in carefully scheduled two-week sprint cycles. This slower pace is justified by the diligence of the processes they follow to test and refactor the code to meet modern standards. Countless stakeholders with their unique query requests need to take their place in the product backlog and argue their priority in the periodic sprint-planning sessions that will select the one or two top priorities to focus on in the next cycle.
If an employer finds out that a commercial product could give these countless stakeholders the ability to build their own queries with confidence and get the data they need, he may conclude that the data-science sub-discipline of computer science was a mistake.
Finally, I get to the point of this blog post: mistakes are common. In particular, a direct consequence of the point of my previous STEM post is that poor shopping skills inevitably lead to mistakes. My impression is that much of STEM labor is devoted to struggling through mistakes that result from poor or non-existent shopping skills.
In the case of data science, we repeatedly make the mistake of assigning a task as a project for software developers to decide how to implement. The software developers know how to write software in their general-purpose languages with the proprietary reusable code modules they have accumulated over the years. They accept the task with confidence that they can get the job done without the need to buy anything (other than their labor). They may run into trouble so that they need more sprints to get the job done, but they are smart people who are obviously working hard, so we accept the delay and the labor costs. After all, there is no other option, because this is a job for data science and they are the data scientists.
A large part of STEM labor is spent on the team struggling through a problem of its own making, because either they dismissed the possibility that something may exist to make the job easier, or they dismissed such a possibility as too costly and unlikely to perform as well as what they could build themselves. Once the decision is made to reinvent what could have been licensed, they will struggle to meet the project's goals with their collective smarts, sweat, and blood pressure.
I recall a scenario early in my career where I implemented a database schema for an analytic data project. Because the data did not need to be updated with individual transactions, I opted for a highly de-normalized data design. A database expert scoffed at this de-normalization and demanded a redesign with a properly normalized design, despite the fact that there would never be record-level updates or deletes of the data. Because my project was already enjoying productive use, it was allowed to continue while another team was built to do the job properly. Actually, I was part of that team as well, and I cooperated with their design as a junior subordinate to a more senior database developer. Months passed, and my project remained the only viable solution in productive use. I pointed out that the new project was languishing because of the continuous refactoring of the normalization as we discovered new data sources demanding their own atomic representations. I convinced the senior developer that a denormalized approach made more sense, but at that point it was too late. We (again, I was part of that team) had to make the new project work. Although management was upset at the continual delays, the participants in the project were generally rewarded for their hard efforts and demonstrated skills. Ultimately, that project was shuttered and my project was retained because there was no other choice.
We made a mistake by failing to shop. In this case, there is a distinct difference between analytic and transactional databases. Transactions need normalization to protect the integrity of data during updates and deletes: the affected data needs to reside in a single location so that a change is immediately visible everywhere it is needed. The problem is that a normalized design takes a lot of effort to build in the first place, and it takes a lot of effort to extract the data by joining the right tables to reach the desired values. The denormalized approach accepts redundantly stored data because that data will never be changed, and the queries are simpler because all of the required data is in a few tables (often just one). Analytic databases are approached differently than transactional databases, as the sketch below illustrates.
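As a rough illustration of the difference (the warehouse.db file and the table and column names are hypothetical, with sqlite3 again standing in for a real analytic engine), the same question requires a chain of joins against a normalized schema but only a single-table scan against a denormalized one.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical; schema names are illustrative

# Normalized (transactional) design: facts are split across tables, so every
# analytic question pays for the joins needed to reassemble them.
normalized = """
    SELECT c.region, p.category, SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id
    GROUP BY c.region, p.category
"""

# Denormalized (analytic) design: region and category are stored redundantly on
# each row of one wide table, acceptable because the rows are never updated.
denormalized = """
    SELECT region, category, SUM(amount)
    FROM order_facts
    GROUP BY region, category
"""

for query in (normalized, denormalized):
    print(conn.execute(query).fetchall())
```

The redundancy costs storage, but for read-only analytic data the simpler queries and the avoided refactoring are usually worth it.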
We made a second shopping mistake by not adopting specialized multi-dimensional data-warehouse technologies. At the time, these technologies were not as well developed as they are today and did not appear to be up to the challenge. But compared to the fate of the competing project (it was completely scrapped), using these technologies would have been the better choice. The shopping mistake was that we ignored the vendor's message, which we thought was irrelevant: that these technologies were poised for rapid improvement in capabilities and would be far easier and cheaper to operate than home-grown solutions. It turned out that the technology improvements exceeded even the vendor's expectations.
The reality of STEM practice is that we often find ourselves struggling with the consequences of mistakes. The fact that we could have been better shoppers doesn't change the fact that we are stuck with what we chose to buy. We need to work with what we have. Often the only alternative (as in my example) is to scrap the entire project.
The practice of science, math, and engineering takes on a different flavor once the technology is determined. As an analogy from ancient architecture, if we choose to build with mud brick in a humid environment, we either solve the problem of humidity or scrap the structure and start all over with stone. The problem of humidity required continuous maintenance that would not have been required for stone.
My engineering education is decades old, but I think it is probably much the same today. The focus of the course work and training was on avoiding mistakes (curiously, without much discussion of shopping). A proper STEM practitioner never makes mistakes. An excellent STEM student gets straight A's.
Real-world STEM projects are usually struggling through the consequences of mistakes. Yes, a job might have been easier with a commercial product, but we have committed investment and staffing to building it ourselves. The commercial product may benefit from hundreds or thousands of experts in many specialties, but we have to muddle through with a handful of staff we expect to pick up all those specialties. In employed situations, STEM job positions prize the practitioner who excels at improvising around past mistakes. The ancient landlords would recognize today's valued STEM employees as their continually employed mud-brick specialists, who kept implementing new solutions for different conditions of sagging walls.
Of course, we would like to see STEM as never making a mistake. But the reality is that STEM makes mistakes (again, because STEM'ers tend to be lousy shoppers). The education of STEM should focus more on the mistakes. Such an education may involve homework problems that include both a problem statement and a mistake. Traditionally, a student would be required to find and correct the mistake. But what employers really need is the practitioner who can make the mistake work. Most STEM people are employed making earlier mistakes work instead of replacing the mistake with a correct solution (as they were traditionally trained to do).
This reminds me of the mistake we collectively made by adopting personal operation of motor road vehicles (cars) with minimal standards for driver certification or vehicle integrity. I contrast cars with private pilot licenses, which require more rigorous testing and certification, and privately owned aircraft, which require annual airworthiness maintenance and inspection. In my mind, the reason we don't have the same rigorous regulation for automobiles is that they derive from an older tradition in which we tolerated unskilled drivers operating vehicles that were nearly inoperable and unsafe to the driver, passengers, and others on the road. We see these drivers on the roads today: very old vehicles with rusted mufflers dragging on the pavement, and with insecurely fastened cargo. We tolerate this because we made a mistake, and it is too late to fix the mistake by imposing stricter regulations on operators or their vehicles.
Instead we invest our efforts in making the mistake work. One of the make-the-mistake-work approaches is to demand safer standards for newly manufactured vehicles, replacement parts, or licensed mechanics who fix old vehicles. Eventually the old unsafe vehicles will disappear, although they stubbornly persist longer than we imagine. Another make-the-mistake-work approach is to demand automobile designs that are easier to operate by minimally trained drivers, and that make the vehicles less likely to injure the driver, occupants, or bystanders when the operator encounters an untrained scenario. For example, anti-lock brakes in part answer the problem of drivers who were never trained to pump or stutter the brake to get an effective stopping distance while retaining the ability to steer. We solve the mistake of allowing poorly trained drivers by introducing technology to make the mistake work.
In many projects, the team finds itself blocked because earlier decisions failed to anticipate the current situation. The phrase "painting oneself into a corner" comes to mind. At this point, we can see where we made a mistake earlier, but there is no option to go back and fix it. Instead we have to find some solution that makes the mistake work. We have to find a way out of the corner.
Perhaps the largest IT mistake of the modern era is what we call the Internet, and in particular the various application protocols that were codified as RFC (request for comments) specifications. The initial success of the Internet was based on its approach of keeping things simple so that everyone could understand it. But that approach caused huge problems later on, when people began to exploit the simplicity for malicious purposes.
For example, we continue to use the original simple mail transfer protocol (SMTP) despite its exposed weaknesses for spoofing and spam generation. Despite its problems, e-mail continues to be very useful because of all the additional innovations we added to make the original mistake work. A simpler solution might have been to ban the old protocol and force everyone onto a more robust implementation with tight security for authentication, non-repudiation, and so on. Instead we found add-on fixes that allow the original mistake to work.
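Those add-on fixes have names: SPF, DKIM, and DMARC are authentication policies that a sending domain publishes as DNS TXT records so that receiving mail servers can judge whether a message really came from where it claims. As a rough sketch of how a receiver consults them (assuming the third-party dnspython package and a hypothetical example.com sending domain):

```python
import dns.resolver   # third-party dnspython package

def txt_records(name):
    """Return the TXT records published at a DNS name, or [] if there are none."""
    try:
        return [r.to_text() for r in dns.resolver.resolve(name, "TXT")]
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []

domain = "example.com"   # hypothetical sending domain

# SPF: which servers are allowed to send mail on behalf of the domain.
print("SPF:  ", [r for r in txt_records(domain) if "v=spf1" in r])

# DMARC: what a receiver should do with mail that fails the SPF/DKIM checks.
print("DMARC:", txt_records("_dmarc." + domain))

# DKIM public keys live under <selector>._domainkey.<domain>; the selector is
# chosen by the sender, so it is only discoverable from a signed message.
```

None of this changes SMTP itself; it layers verification on top of the original protocol, which is exactly the make-the-mistake-work pattern.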
There is a definite art to making a mistake work, and the practice of making mistakes work is in high demand in many projects.
It is a skill that could become part of STEM training, redefining the M to mean Mistakes, or Making Mistakes Work.