The elastic database concepts of sharding remind me of what I was doing in the early 2000s for my data project. In that project, I used SQL Server as an RDBMS for a single project involving a lot of data. Instead of a single integrated schema containing all of the data, I created a separate database for each day of data and then used a master database and tables to direct scripts to the appropriate database for particular queries. Each of these databases had a consistent structure so that the same queries would work in every database, but combining results over multiple days required scheduling separate queries in each database and then aggregating the results either in memory or in a temporary table in the master database. This design looks a lot like the modern practice of sharding, but the motivation for the design was not performance. In my design, all of the databases resided within the same RDBMS on a single but large server.
The Azure SQL platform operates similarly, presenting a logical database management layer over a collection of databases sharing the same security and management policies. The advantage of the Azure platform is that the databases within the collection do not need to reside on the same underlying hardware; the individual databases can be on separate machines. Sharding a database is the practice of copying the schema onto multiple separate databases and then using a shard map manager to route each query to the appropriate database. In an ideal design, each shard database will need a complete schema so that it can run its queries without relying on information from its peers. I can see this being a challenge in large transactional databases involving independently large tables (such as customers, sales, and employees) that are updated at the same time. The shard assigned to manage sales transactions may need employee information managed on a different shard. This probably requires some kind of job to cache recent but independently managed data from other shards.
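To make the routing idea concrete, here is a minimal sketch of what a shard map manager does: maintain a lookup from a sharding key to the database that owns it. The class, database names, and key ranges are all hypothetical; a real shard map manager (such as the one in the Azure elastic database tools) also handles metadata, caching, and connections.

```python
# Minimal sketch of a shard map: a lookup table mapping a sharding key
# (here, an ID range) to the database that owns it. Names and ranges
# are hypothetical.
from bisect import bisect_right

class ShardMap:
    def __init__(self, boundaries, databases):
        # boundaries[i] is the first key served by databases[i+1];
        # keys below boundaries[0] go to databases[0].
        self.boundaries = boundaries
        self.databases = databases

    def route(self, key):
        """Return the database that owns this sharding key."""
        return self.databases[bisect_right(self.boundaries, key)]

shard_map = ShardMap([1000, 2000], ["shard_db_0", "shard_db_1", "shard_db_2"])
print(shard_map.route(500))    # shard_db_0
print(shard_map.route(1500))   # shard_db_1
print(shard_map.route(2500))   # shard_db_2
```

The application asks the map where a key lives and then sends the query to that one database, which is why each shard needs a schema complete enough to answer on its own.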
The motivation for sharding is performance: it allows a database to scale horizontally by adding more compute platforms to the data project instead of scaling up by acquiring larger servers.
My experience taught me other benefits of sharding even when performance is not a driving concern. In any database scenario, most of the data needed for read queries (SELECT statements) is data that has not changed in a long time. This is especially true for reporting. Even when reports include the most recent data, most of the information in the report will come from older data. Usually the older data is the primary goal of the reporting, but at the very least it provides context for interpreting the most recent data.
It is not unreasonable for such reports to combine two queries: one from a system with the most immediate information and another from a system holding older data. Such a scheme avoids imposing reporting burdens on the operational system, which should be devoting its efforts to insert, update, and delete operations. This scheme is easy to implement in cloud databases. For example, in the Azure SQL service, a daily operation using the “create database as copy of” command would present an application with two databases to work with. The primary database can be used to obtain the current day’s data, and the copied database can be used to complete the reporting results from all of the previous days. Many reports may not need the current day’s data at all. For example, a top-level report can report on data in the copied database but provide an option to include the more recent data when specifically requested.
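The reporting split described above can be sketched as follows. The in-memory lists stand in for the two real databases (the daily copy would be produced by Azure SQL's `CREATE DATABASE ... AS COPY OF` statement); the table contents and function name are hypothetical.

```python
# Sketch of the two-database reporting split: history comes from a daily
# copy, and the current day's rows are appended from the primary database
# only on request. In-memory lists stand in for real connections; rows
# are hypothetical.
history_copy = [  # yesterday's copy: all rows through the prior day
    {"day": "2016-03-01", "sales": 120},
    {"day": "2016-03-02", "sales": 135},
]
primary_db = history_copy + [  # operational database: history plus today
    {"day": "2016-03-03", "sales": 98},
]

def sales_report(include_today=False):
    rows = list(history_copy)          # reporting load hits the copy
    if include_today:                  # touch the primary only when asked
        today = primary_db[-1]["day"]
        rows += [r for r in primary_db if r["day"] == today]
    return rows

print(len(sales_report()))                    # 2 rows, copy only
print(len(sales_report(include_today=True)))  # 3 rows, with today appended
```

The operational database is touched only when the report explicitly asks for the current day, which is the point of the scheme.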
There are other advantages of sharding history this way. Each day’s copy can be retained online, and a multi-shard query can query each day separately on separate databases. In the cloud scenario, these databases may run on different resources and thus provide some performance boost. There is a greater benefit, though: each copy preserves the related tables, capturing the full context of information available during that day. A query on the copied database can return complete results using data only within that one database, so the query will see the related information that was present during the same day.
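A multi-shard query of the kind just described amounts to running the same aggregate in each daily database and combining the partial results. In this sketch the shards are in-memory lists standing in for separate databases, and all names and values are hypothetical.

```python
# Sketch of a multi-shard query: run the same aggregate in each daily
# shard, then combine the partial results. Shards are in-memory lists
# standing in for separate databases; values are hypothetical.
daily_shards = {
    "db_2016_03_01": [{"sales": 50}, {"sales": 70}],
    "db_2016_03_02": [{"sales": 60}, {"sales": 75}],
}

def shard_total(rows):
    """The per-shard query: a simple sum over that day's rows."""
    return sum(r["sales"] for r in rows)

# Each per-shard query could run on separate resources, even in parallel.
partials = {name: shard_total(rows) for name, rows in daily_shards.items()}
grand_total = sum(partials.values())
print(partials)      # per-day subtotals
print(grand_total)   # 255
```

Because each per-shard query is self-contained, the fan-out step is embarrassingly parallel, which is where the performance boost from separate resources comes from.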
For knowledge or hypothesis discovery projects, this sharding of history is more valuable than attempting a historical report against the operational database, because the sharded history retains the context of the data. For a business example, assume a report for a previous period involved some action by an employee who has since been promoted to a different position. Using the operational database for this historical information will naturally return the erroneous result that the new position was responsible for the prior action, when in fact the action was done in the capacity of the older position. Certainly it is possible that the database design has duty-transfer time-stamps on employee records, or that the report can include cleanup steps to resolve this kind of ambiguity. But unless those extra efforts go into the query (probably slowing the report), the natural result from the operational database will present historic information referenced against the most recent data in the database: a historical report will read as if the events happened in today’s environment. I discussed this problem earlier as the veracity problem of matching a variety of non-contemporary data.
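The promotion example can be shown in a few lines. Joining a historical sale to the current employee table attributes the old sale to the employee's new position, while the daily shard preserves the position actually held at the time. All names, IDs, and rows below are hypothetical.

```python
# Sketch of the promotion problem: the same join against the current
# employee table vs. the day's shard gives different positions for the
# same historical sale. All rows are hypothetical.
sale = {"sale_id": 1, "employee_id": 7, "day": "2015-06-01"}

# Employee table as it looked inside the 2015-06-01 shard.
employees_2015_06_01 = {7: {"name": "Pat", "position": "Sales Associate"}}
# Employee table in today's operational database, after a promotion.
employees_today = {7: {"name": "Pat", "position": "Regional Manager"}}

# Operational join: wrong for 2015, the fiction of a present-tense world.
print(employees_today[sale["employee_id"]]["position"])
# Shard join: the contemporaneous answer.
print(employees_2015_06_01[sale["employee_id"]]["position"])
```

Both joins are syntactically identical; only the snapshot being joined against determines whether the report tells the truth about 2015.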
In my case, I had copies of databases for each day over a year and a half. I can’t think of any such period where we didn’t experience some major change in the environment that would have rendered reports meaningless if historic data were relationally matched with more recent data. In order to make sense of what happened a year ago, we needed a copy of all of the relational tables that existed at the time. My sharded-history design naturally preserved this context. Each database was self-contained to support any query relevant to that day, and any report was guaranteed to have contemporaneously consistent data.
I suspect many big data analytic projects are vulnerable to errors from non-contemporary data residing in a common data lake (such as a Hadoop file system). The projects will succeed in matching different varieties of data to prepare the report while matching data that belongs to different time epochs. A common demonstration of analytics involves real-time pulling of the most recent supporting data to match historical measures. The report inherently assumes that the current supporting information was equally relevant at that time, or more subtly, it assumes the information available today was available at the time. Unless specific efforts (investment in human labor) are employed to check that the data is contemporaneously consistent, the resulting report will create a fictional world where the historic measurements occurred in an environment of a different time. Usually the fiction is that the historic event occurred in the present-tense world.
Often big data projects do not invest in this kind of scrutiny of data. Certainly the promotional demonstrations emphasize the ease of connecting to third-party data sources to retrieve current information to match with historic data. Even when the third-party data sources offer time-stamps for historic records (for example, annual census reports), there remains an ambiguity in how the dates are measured. For example, the date stamp for a particular year may represent an end-of-year summary while the historic data treats that year as the current year, so that the report will match data from different (but adjacent) years.
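The adjacent-year mismatch looks like this in miniature. A third-party summary stamped "2014" may actually describe the year ending in 2014 (mostly 2013 activity), while a naive join treats the stamp as the year itself. The figures and labels are hypothetical.

```python
# Sketch of the date-stamp ambiguity: a summary stamped with a year may
# describe the *year ending* then, so a naive join by stamp silently
# pairs adjacent years. Figures are hypothetical.
historic_measures = {2013: 100.0, 2014: 110.0}           # project's own data
annual_summary = {2014: "summary for year ending 2014"}  # stamp = year end

year = 2014
naive_match = (historic_measures[year], annual_summary[year])
# The summary stamped 2014 mostly describes 2013 conditions:
corrected_match = (historic_measures[year - 1], annual_summary[year])
print(naive_match)
print(corrected_match)
```

Both joins run without error, which is exactly the danger: nothing in the mechanics flags that the naive match has paired data from different years.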
The concern I have is that these types of query reports typically sit deep within a larger project of predictive analytics. The query report data feeds some type of machine learning or statistical algorithm that treats the data as if it were an accurate representation of what actually happened at the time. At the strategic planning decision-making level, the bulk conclusion may be reasonable enough to act on, especially after a final editor removes any obvious outlier information. The trend, though, is for big data analytics to inform more tactical decision-making involving items that can be at the outlier level. The high-velocity opportunity is to take advantage of the outliers identified by the end-to-end analytics. But the outlier may be a fiction created by matching non-contemporary information.
In an agile decision-making environment where decisions occur at high frequency, this historical fiction may not be a big problem as long as, on average, the decisions result in more benefits than disappointments. But although many projects are adopting big-data predictive analytic capabilities, they are applying them to less agile and less frequent decision-making. For example, there is great interest in exploiting big data analytics to construct individualized health care policies or treatments. The promise comes from matching the big data of historic cases with related information relevant to a particular patient. These scenarios will often result in one-time decisions with no opportunity to average costs and benefits over multiple decisions. The risk is matching the patient to a fictional creation produced by joining historic cases with non-contemporary matching information.
For example, there is much interest in using voluntary data collection of individuals’ daily activities with activity or exercise trackers. An example may be a GPS-enabled tracker that monitors blood pressure and heart rate during a jog. When combined over a large population, this information may be matched with later health records such as reported heart attacks or diagnoses of diabetes, and the result can be useful for preventive advice or treatments for new individuals. Because this is preventive advice, these individuals have not yet experienced the bad outcomes recorded in the data, so there needs to be a different way to match the records to a specific individual. That matching may involve prior health visits or physicals for the historic case that later developed a condition. Almost certainly that supporting data will have been collected far earlier than the activity-tracker recording. The matching data may be annual physicals that did not capture some hidden condition that emerged later. The result is that the treatment is matched to a fictional construction of non-contemporaneous data.
There is an inherent bias toward obtaining more accurate information for patients who later develop medical conditions requiring attention. In addition to the patient undergoing more extensive testing, there will be more careful collection of patient history information. This more in-depth patient history will not be sought for the individuals who do not develop the condition. A new participant in a preventive medicine program will be more like the individuals who do not develop medical conditions than the ones who do. In contrast, the bias in the data will tend to recommend treating the new participant more like the historical examples who did experience medical conditions. The bias is the fact that there is more later-discovered information about the sick examples than about the healthy ones.
For broader healthcare policy-making, the data-driven approach may provide benefits when averaged over the larger population, as long as the individualized recommendations are helpful more often than harmful. The question is how much we are willing to tolerate errors at the individual patient level, especially when those errors are the result of incorrect matching of diverse data sources. Similar challenges occur in all areas of public policy. Even when the aggregate performance benefits society as a whole, there will be individual instances where the policies fail due to over-confidence in data that overlooks relative time-stamp information.
A historical sharding approach may provide a better training set for predictive analytics or machine learning because it preserves a snapshot of all information available at the same time. The historical shard is self-contained, resolving all of the references known at the time, including errors and omissions. Training an algorithm on this data is more realistic for present-day cases because it takes into account only what was known at the time of the observation. The alternative of admitting later information can lead to the incorrect inference that future data will be available at the time decisions need to be made.
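A sketch of the training-set idea: each example is built from the features present in that day's shard only, with the outcome label drawn from a later shard, so the model never sees information that did not exist at observation time. The shard contents, feature names, and helper function are hypothetical.

```python
# Sketch of building training examples from historical shards: features
# come only from the day's shard; the outcome label comes from a later
# shard. Shard contents are hypothetical.
shards = {
    "2015-06-01": {"blood_pressure": 130, "diagnosis": None},
    "2015-09-01": {"blood_pressure": 142, "diagnosis": None},
    "2016-01-01": {"blood_pressure": 150, "diagnosis": "hypertension"},
}

def training_examples(shards):
    """Pair each earlier day's features with the outcome known only later."""
    days = sorted(shards)                      # ISO dates sort chronologically
    outcome = shards[days[-1]]["diagnosis"]    # label from the final shard
    examples = []
    for day in days[:-1]:
        # Keep only what the shard knew that day; exclude the label field.
        features = {k: v for k, v in shards[day].items() if k != "diagnosis"}
        examples.append((features, outcome))
    return examples

print(training_examples(shards))
```

The alternative, joining each day's features against the final shard's full record, would leak later-discovered values into the features and train the model on information that decision time will not have.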
Sharding historical data results in a lot of redundancy. In order for each shard to be self-contained and able to resolve any references within the shard, the same slowly changing information will need to be replicated in many shards. With modern data technologies, this is affordable. Storage is inexpensive, and queries within a shard require fewer resources than queries that span all of the shards. Because the shards retain the information known within the time frame of each shard’s core or unique observations, the shard approach becomes a more faithful record of history.
The sharded-history approach can still support analysis where later-discovered data enriches the historical data. The enrichment occurs in new queries or new shards that combine the historic shard information with the newer data. This is how I approached my project. In my project, the day-old shards became read-only. Although queries within a shard could only access the information within that shard, I created other queries, in memory or in temporary tables, to match the historical data with enriching data from outside the shard. The reports of this enriched data may be more valuable than the reports constrained to use only the information within a shard. However, the retention of the read-only shards remains useful for validation, to show what information was known at the time. For example, later data may suggest that something could have been caught earlier. Having the sharded history capture the contemporary information of the shard allows us to understand why the hindsight discovery was not possible at the time of the shard. This is useful information for tempering the confidence of predictive analytics: the shard preserves the uncertainty that results when the available supporting data may have been out of date.
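The enrichment step described above can be sketched as a join performed outside the shard: the read-only shard is never modified, and the later-arriving data is merged in memory to produce a new, enriched result. The record names and fields are hypothetical.

```python
# Sketch of enrichment outside a read-only shard: the historic shard
# stays untouched, and a separate in-memory join attaches data that
# arrived later. Names and rows are hypothetical.
historic_shard = [{"case_id": 1, "measure": 5.0}]        # read-only snapshot
later_data = {1: {"follow_up": "condition confirmed"}}   # discovered later

# Merge each shard row with any later-discovered fields for its case.
enriched = [
    {**row, **later_data.get(row["case_id"], {})}
    for row in historic_shard
]
print(enriched)
print(historic_shard)  # unchanged: still shows only what was known then
```

Keeping the shard immutable means the enriched view and the contemporaneous view coexist: one supports hindsight analysis, the other shows why the hindsight discovery was not possible at the time.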