In my earlier posts, I have described a role for data science that involves careful scrutiny of various types of data in light of the intended purposes for that data. In this vision, a person filling a data science role is actively involved in all areas where data is used in the system. The data scientist scrutinizes the retrieval and ingestion of data, the intermediate handling of the data, and the final version of the data available to analysts. Following the data as it propagates through the system has the advantage of reusing knowledge of particular data. This focus implies a staffing model that divides work by type of data: one person scrutinizes one kind of data while another scrutinizes a different kind. Both apply their understanding of the same data throughout the stages of the system to verify that the data meets the expectations for each stage.
In contrast to the above vision, the more common practice is to allocate staff based on the system itself instead of based on the data. As a result, we dedicate some staff to the input side of the system, different staff to the middle of the system, and still different staff to the analyst-facing part. Each of these positions specializes in the particular software and hardware issues involved, and each looks at the data only in the context of its specific duties.
I disagree with this approach because it implies that the data is perfect as it moves through the process. The assumption is that it is sufficient to pay full attention to the hardware and software: as long as they are working, the data will be fine. Conversely, if there is a problem, that problem must be some kind of hardware or software issue to fix. While I agree that a problem found by either approach will ultimately need some type of hardware or software fix, the focus on the software instead of the data may mislead us into fixing the wrong problem. A hardware and software focus may result in modifying hardware or software to perpetuate a tolerance for misbehaving data.
This is most clearly apparent at the analyst-facing end, where reports may be criticized for poorly communicating some concept. Working strictly with reports of data residing in the data store, the analyst adapts the report to better present the data. An example may be an occasional data point that stands out as unrealistic, so the report is modified to automatically smooth or reject that point. The analysis specialist assumes that the data cannot be changed and that the only solution is to improve the reports (or the queries). With this division of labor, treating the data as unchangeable makes sense because the report must report on data already in the store; at that point it is too late to do anything about the data anyway. The result is that the analyst considers the data available in a query to be the best possible representation of that part of the history of the real world.
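To make this concrete, here is a minimal sketch of that report-side workaround; the function name and the plausibility thresholds are hypothetical, not from any particular project. Rather than fixing the stored data, the report simply filters out values it deems unrealistic:

```python
def report_readings(readings, low=0.0, high=100.0):
    """Return readings for the report, silently dropping 'unrealistic' points.

    The stored data is left untouched; the workaround lives in the report.
    """
    return [r for r in readings if low <= r <= high]

daily = [12.5, 13.1, 999.0, 12.9]  # 999.0 remains in the data store as-is
print(report_readings(daily))      # -> [12.5, 13.1, 12.9]
```

The glitch never reaches the reader of the report, but it also never gets investigated at the source, which is exactly the pattern criticized above.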
The people who work at the middle (intermediate processing of data already in the data store) and the end (presentation of data to the analysts) primarily focus on software or hardware issues. Their jobs involve dealing with the data that is already available to the system.
In contrast, the initial stage of data handling confronts the nature of the data itself more directly. This stage involves receiving new data, parsing the received content into data of interest to the project, preparing this extracted data to be compatible with the data store, and finally loading the data into the appropriate data structures in the data store. We call this the ETL (extract, transform, load) process. Most projects seek out staff who specialize in this particular area.
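A toy sketch of those steps, assuming a hypothetical CSV feed and using an in-memory SQLite table to stand in for the data store:

```python
import csv
import io
import sqlite3

def etl(raw_csv, conn):
    # Extract: parse the received content into records
    rows = csv.DictReader(io.StringIO(raw_csv))
    # Transform: keep the fields of interest, coerce types to match the store
    cleaned = [(r["id"], float(r["value"])) for r in rows]
    # Load: insert into the appropriate data structure in the store
    conn.executemany("INSERT INTO readings (id, value) VALUES (?, ?)", cleaned)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id TEXT, value REAL)")
etl("id,value\na,1.5\nb,2.0\n", conn)
print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # -> 2
```

Real ETL pipelines are, of course, far messier; the point of the sketch is only that each of the named stages is a distinct place where the data can be scrutinized.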
ETL positions appear to be the hardest of the three to fill, even though it seems like they should be the easiest.
In the title of this post, I am comparing the ETL position to the fireman (fire stoker) of a steam locomotive or ship. Instead of continuously feeding the fire under the boiler to maintain the speed of the locomotive, the ETL staff must keep feeding data into the data store to maintain the analysis efforts. In both jobs, keeping things running at just the right level can become very challenging and stressful. Both positions are under-appreciated for how much the system’s success depends on their constant diligence behind the scenes.
For example, consider a locomotive fireman trying to keep the fire at just the right level. The available coal may be burning hotter or colder than normal, and he has to adjust the feeding to accommodate. This variation may occur within the same trip, so he has to keep checking and adjusting how much to feed the fire. The fireman also occasionally needs to retrieve more coal to replenish the stock that is readily available to feed the fire.
Although the work is not manual, the data warehouse ETL staff is responsible for feeding data to the system, in analogy to the fireman stoking a fire. Most of the usual work is automated, but there is still the element of data not being of the usual quality (like coal that burns too hot) or of data not arriving on time (similar to the fireman running out of local stock). The ETL task is either to revise the automation to handle the new conditions or to intervene manually to recover when the automation fails.
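One common pattern for catching both kinds of trouble is a guard in front of the automated load; the sketch below is hypothetical (the function name, thresholds, and record shape are mine, not from any specific tool), but it shows the idea of failing loudly so a person can intervene, rather than quietly loading bad data:

```python
def load_batch(records, expected_min=1, value_range=(0.0, 100.0)):
    """Guard an automated feed: stop and signal for manual intervention
    instead of quietly loading missing or out-of-range data."""
    if len(records) < expected_min:
        # The feed is short or late: the "fireman" has run out of coal
        raise RuntimeError("feed short or late: manual intervention needed")
    low, high = value_range
    bad = [r for r in records if not (low <= r["value"] <= high)]
    if bad:
        # The data is misbehaving: coal that burns too hot
        raise RuntimeError(f"{len(bad)} record(s) out of range: check the source")
    return records  # safe to hand to the loader
```

Each raised error is a prompt either to fix the automation so it tolerates the new condition, or to go upstream and find out what changed at the source.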
This role is under-appreciated because it is the role that has to confront the bulk of the data science issues involving data quality. As noted above, the later stages assume that their available data is already as good as it will ever get, so they can concentrate on the software for handling that data. This puts much of the data quality assurance burden on the ETL staff. In addition to making sure that the data feeds are working properly, the ETL staff must assure that the information in the data is what is expected.
Just as a locomotive requires both the driver and the fireman, the data project requires both the ETL staff and the analysts. Like the train driver, or engineer, of the locomotive, the analysts are the more visible position on data projects. The ETL position is less visible, like the fireman. In a real sense, the ETL position does the dirty work while the analyst does the cleaner work.
For large data projects, the ETL processes are very difficult. Large data warehouse projects must retrieve data from a large number of different data sources, or even a large number of variations of the same kind of source. Each of these sources may require its own custom ETL implementation to adapt to the peculiarities of the source and prepare consistent data to feed the data warehouse. Data sources can change abruptly, and the ETL task is to adapt to those changes in such a way that the analysts will not notice the change.
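A small sketch of what "custom ETL per source, consistent data out" can look like; the two feed formats and field names here are invented for illustration. Each adapter absorbs its source's quirks so that everything downstream sees one record shape:

```python
# Hypothetical per-source adapters: each source has its own delimiter and
# units, but all adapters emit the same record shape for the warehouse.
def parse_source_a(line):
    sid, val = line.split(",")          # source A: comma-separated, Celsius
    return {"station": sid, "temp_c": float(val)}

def parse_source_b(line):
    sid, val = line.split("|")          # source B: pipe-separated, Fahrenheit
    return {"station": sid, "temp_c": (float(val) - 32) * 5 / 9}

PARSERS = {"a": parse_source_a, "b": parse_source_b}

def ingest(source, line):
    return PARSERS[source](line)

print(ingest("a", "S1,20.0"))   # -> {'station': 'S1', 'temp_c': 20.0}
print(ingest("b", "S1|68.0"))   # the same reading, from a different feed
```

When a source changes its format, only its adapter changes; the analysts querying the warehouse never notice.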
Another aspect of the grittier work of the ETL staff concerns the diversity of tools required. Staff at the other stages usually have the luxury of working within a common platform of data structures and software tools. In contrast, the ETL position often requires a different set of tools for each particular type of data source. Typically each upstream source has its own interface requirements, demanding its own technologies for retrieving and preparing the data. Also, compared with the later stages, the ETL tools are more likely to change due to a change in one or more data sources.
The ETL job is very challenging because the duties are constantly changing due to changes outside the control of the project. Changes in the remote data sources dictate the changes the ETL staff must accommodate. Even if existing sources don’t change, new sources are constantly being added. Often the ETL staff incorporate new technologies to build solutions for new sources that ultimately populate existing data structures in the data store, so that there is no impact on the later stages at all. New data sources may place a high burden on the ETL staff while the only change the other stages see is that their queries return more results.
From my own experience, it is easy to see this work as getting dirtier in the details than the later stages of the data warehouse, where most of the working environment is stable and under strict control of the project. At the end of the day, it is easy to relate to the dirt-covered and exhausted fireman of the steam engine.