Historical data shards divorced from the methods, post object-oriented strategy

Object-oriented design combines data and the methods that operate on that data into a single unit. These objects permit design patterns of inheritance and polymorphism. Most introductions to the object-oriented concept involve analogy to everyday objects that have properties and well-defined procedures for using those properties. Object-oriented approaches allow humans to visualize the problem in terms that are compatible with everyday experiences. The approach has been very successful in bringing computing within grasp of a larger population than was possible in the era before object-oriented design became popular. This is especially true where humans must interact with the computer: user interfaces and software development processes themselves.

While acknowledging the benefits of object-oriented design, I have always had a reactionary attitude toward the concept when the data involves observations of the real world.    Object-oriented design is well-suited for design and use patterns of human-made artifacts like computing where the concepts are abstract.   It is not as well suited for the observations of the natural world (everything outside of human control).

In my last post, I reflected on the unintended consequences of a pragmatic design choice I made when working with large amounts of data. I ended up breaking up historical data into database shards, where each day of data went into a distinct database that itself contained all of the supporting information available on that day. This strategy is very similar to the scale-out strategy for increasing database performance by dividing the database into shards. At the time, I described the strategy in terms of publication. I imagined the daily databases to be analogous to daily newspapers (the kind distributed in newsprint). Once the paper is distributed, it can no longer be edited. This is beneficial because it captures the best information available at the time of publication. Whether future information confirms or contradicts the published paper, the published paper offers a learning opportunity in terms of understanding the thinking processes of the journalists and editors: what data was available to them and what constrained their thinking.

The publication analogy works in other forms, including peer-reviewed journals and academic press books. At some point, the document achieves a publication status that freezes its content from then on. The publication captures the data and thinking at the time leading up to the publication.

This publication model has served mankind for centuries and even in today’s big-data environment, ancient texts still offer lessons worthy of study.   The value in the published document goes beyond the relevance, accuracy, or wisdom of the content within the text.   The value includes the capture of the context of the writing of the document: what information was available at the time and how did the thinker assemble that information.

In my mind, it is merely coincidental that shard strategies for big-data performance goals resemble the publication model. In order to achieve high performance, each shard needs all of the relevant data that matches the core measurement data. In the context of relational diagrams, the core data needs a local copy of all of the relational references. On the other hand, shard strategies are unlike the publication metaphor in that shards permit incorporation of more recent related data. In big data practice, the shard has access to the most recent version of trusted data. A query of some historic event will return different results later in time because the related information about that historic event will continue to change: improving with increased validation, or degrading due to neglect or data loss.

In big data shards, there is no guarantee that subsequent readings of the historic record will return the same material content.  This distinguishes shards from traditional publication where the original editions reveal the same material content repeatedly over time.
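The contrast between a published (frozen) daily shard and a shard that joins against live reference data can be sketched in a few lines of Python. This is only an illustration under assumed names: the `reference` table, the `sensor-7` device, and the helper functions are hypothetical and do not come from any particular system.

```python
import copy

# Hypothetical enrichment table: device id -> location label.
reference = {"sensor-7": "Building A"}

# Core measurements recorded on one day.
measurements = [{"device": "sensor-7", "value": 42}]


def publish_shard(measurements, reference):
    """Freeze a daily shard: copy the reference data as it stood that day."""
    return {"rows": copy.deepcopy(measurements),
            "reference": copy.deepcopy(reference)}


def read_published(shard):
    """Join rows against the frozen copy stored inside the shard."""
    return [(r["device"], shard["reference"][r["device"]], r["value"])
            for r in shard["rows"]]


def read_live(measurements, reference):
    """Join rows against whatever the reference table says right now."""
    return [(r["device"], reference[r["device"]], r["value"])
            for r in measurements]


shard = publish_shard(measurements, reference)
before = read_live(measurements, reference)

# The enrichment data is revised some time after the day was recorded.
reference["sensor-7"] = "Building B"

frozen_result = read_published(shard)            # still reports Building A
live_result = read_live(measurements, reference)  # now reports Building B
```

The frozen shard behaves like the newspaper: it repeats what was known on the day. The live join behaves like the big-data shard: the same historical query changes when the related data changes.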

The big-data NoSQL concept that enables the shard strategy includes the concept of imposing a schema on the data at the time of reading the data. This schema-on-read concept contrasts with the traditional relational database concept of imposing the schema on the write. The implications of the two approaches are both significant and subtle.

When we impose a schema at the time of the write, we have confidence that a query will return the same results no matter when we query the data. The relational database ACID properties are motivated by this goal of consistent reads. Another description of this assured consistency is the concept of a single version of truth: the query will return the same result at all times and to all readers. As an aside, my first introduction to the ACID concept involved an analogy with paper publication’s emphasis on acid-free paper for long-term preservation. This introduction probably seared into my mind the objective equivalence of ACID tests of databases and of paper.

In contrast, the big data concept of imposing a schema at the time of the read presents the inevitability of returning different results for each read. Data scientists take advantage of this acceptance of different read results to optimize performance. When the reader accepts that different reads can result in different versions of truth, the data scientist can optimize performance goals by deliberately pushing the reader’s tolerance for inconsistency. On the positive side, the data scientist is free to alter history by including newer information so that repetitions of the same query will return different results that happen to be improvements upon older results. The prospect of improved data rewards the readers for their acceptance of the inconsistency that comes from imposing a schema at read time on repeated queries.
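A minimal Python sketch may make the two ingest styles concrete. The function names, the JSON record layout, and the two schemas are assumptions made for illustration; they are not taken from any specific database or NoSQL product.

```python
import json


def ingest_on_write(record, table):
    """Schema-on-write: coerce the record to the schema before storing it."""
    row = {"device": str(record["device"]), "value": float(record["value"])}
    table.append(row)


def ingest_raw(line, store):
    """Schema-on-read: store the observation exactly as it arrived."""
    store.append(line)


def read_with_schema(store, schema):
    """Apply a schema (field name -> converter) at query time.

    Assumes every stored record contains the fields the schema names.
    """
    rows = []
    for line in store:
        raw = json.loads(line)
        rows.append({name: convert(raw[name]) for name, convert in schema.items()})
    return rows


# Schema-on-write rejects an observation that does not fit.
strict_table = []
rejected = False
try:
    ingest_on_write({"device": "sensor-7", "value": "n/a"}, strict_table)
except ValueError:
    rejected = True

# Schema-on-read keeps the raw text and lets each read choose its schema.
raw_store = []
ingest_raw('{"device": "sensor-7", "value": "42", "note": "recalibrated"}', raw_store)

schema_v1 = {"device": str, "value": float}
schema_v2 = {"device": str, "value": float, "note": str}

first_read = read_with_schema(raw_store, schema_v1)
second_read = read_with_schema(raw_store, schema_v2)
```

The two reads see the same stored bytes but return different rows, because the interpretation is chosen at read time rather than fixed at ingest.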

Another aside concerns recent controversies within the legacy publication industry (ranging from journalism to academic journals) involving the publication of poorly researched or irreproducible findings.  One of these controversies concerns the overuse and abuse of the statistical significance test of the p-value.  It seems to me that these controversies share a common element of an attempt to impose a schema onto data at the time of read instead of write.

Instead of reporting on carefully researched data (imposing schema-on-write), the reporting involves story-telling (imposing schema-on-read). For example, the p-value test is a form of story-telling. The schema-on-read for p-values is to impose a statistical model in order to interpret available data. The construction of the p-value occurs nearer to the time of publication than the time of data collection. As I mentioned earlier, the problem with schema-on-read strategies is that there is no guarantee that any two reads will come up with the same result. This is the problem with p-values, where different studies of the same topic can come up with contradictory conclusions. We expect inconsistency when we deliberately impose schema-on-read. We accept this inconsistency on the expectation that the most recent read will likely be the most reliable because the underlying data will have had a chance to improve. In big data, we make this choice explicit by calling it a NoSQL approach. My point here is that this NoSQL concept may be a specific example of a broader trend in academia toward postponing rigor to make it closer to the time of publication than the time of data collection. NoSQL in data is part of a far broader acceptance of inconsistency with the expectation that the most recent read of the data will be the most authoritative. What matters in publication is not the rigor of the research but the recency of the publication. NoSQL is just one example of this modern attitude.

NoSQL or schema-on-read approaches come with a big risk that comes from the combination of expecting inconsistent read results and expecting that the most recent read will be the most authoritative. The risk is that the data, and in particular the enrichment data, can become tainted or faked. The conditioning to accept schema-on-read inconsistency and the authority of recency will prepare us to overlook creeping, and potentially deliberate, corruption of the data.

There remains an advantage to postponing the schema when we focus our attention on the schema used to read the data. The disadvantage of schema-on-write is that the attention on the schema occurs only during data ingest. Schema-on-write encourages an attitude of ignoring the implications of the schema after it has done its job of ingesting the data. Over time, our inattention to the ingest schema will cause us to forget some of the lessons learned while designing the schema. This amnesia can lead us to make other errors in judgement, and in particular to commit a fallacy of authority of tradition. The tradition may no longer be applicable.

The advantage of the schema-on-read strategy is that it separates the processes of data collection from the processes of applying a schema in order to interpret the results. Imposing a schema on the data collection process inevitably imposes a biased world view onto fresh observations of the real world. Consequently, schema-on-write contaminates historical data with the world views of the data scientists, or more likely computer scientists, or even unaware coders.

One way to compare schema-on-write (traditional RDBMS database approach) with schema-on-read (modern big data strategy) is to consider the case where ideal competence is available to both. In the ideal, the schema-on-write approach demands prior design and validation of concepts before data ingest. The RDBMS forces the data into the schema or activates some remedial action when the data does not fit. This assures the single version of truth, but this truth is very consistent with the intentions and biases of the database designer. In contrast, schema-on-read exploits new data and new concepts, which results in the multiple versions of truth that are characteristic of schema-on-read. The ideal schema-on-read expert will scrutinize the evidence (both data and algorithms) to assure that the most recent query is more reliable than any prior query despite the inconsistencies. Assuming that neither expert makes an error in their respective practices, the schema-on-read approach is more likely to learn new things about the real world because the data behind the schema-on-read are unaltered observations of reality.

Realistically, we don’t have ideal practitioners in either approach.  Also, in real data projects there are many data practitioners involved, each with varied and inconsistent skills.   Although we associate schema-on-read strategy to modern big data approaches, humans have always had access to this option.  Indeed, schema-on-read seems to be the default strategy in nature: when confronting a new problem we reason through our experiences in light of the immediate observations and interpretations.   Part of the reason for humanity’s success in the natural world is the ability to regulate the new observations in light of past experience.  The human reaction to new observations is to first force the observation into compliance with world views.  (I suspect all animals do this, but human access to literature gives them a deeper foundation for world views.)

Also, the schema-on-read approach is the instinct or “common sense” that unskilled or unlearned people use to interpret new information. Schema-on-read is the default strategy. The schema-on-write approach requires acceptance of the authority of more learned others to select the most trustworthy observations. We designed traditional databases around notions of schema-on-write that existed long before databases existed. Through centuries of experience, we learned that schema-on-write is more reliable and more effective in engaging with the real world. Humans are naturally more competent in schema-on-write than schema-on-read. Alternatively, the human advantage is our skill in building schema-on-write systems. Our success is related to our ability to build a single version of truth by being selective in terms of what data to accept.

Schema-on-read approaches are more difficult for humans to manage. Alternatively, competence in schema-on-read skills is far less common than competence in schema-on-write. We see this today in the phenomenon of a shortage of data scientists to work on big data. The true shortage is of people who can competently manage the inconsistencies inherent in the schema-on-read approaches necessary to enable large data processing. Schema-on-read requires skills more standard deviations above average because the queried data lacks the conformity to a single version of truth that comes from the schema-on-write of traditional databases.

Back to the topic of object-oriented design approaches: the object model is another form of schema-on-write. Object-oriented approaches typically impose a requirement to assign observations to objects with some level of constraint on their properties. Also, these objects immediately impose a set of methods that can operate on the object’s properties that captured the observations. Frequently, the object design makes the observations (the properties) private so that the only access to the data is through the methods. This restriction and then hiding of observations is the same as schema-on-write. The goal is also the same: to present a single version of truth for consistent results for reads at any time. The object’s methods and data will provide the same behavior in any number of queries about that object.
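As a rough illustration of an object acting as a schema-on-write, consider the small Python class below. The class name `Reading`, the temperature range, and the single conversion method are all hypothetical choices made for this sketch.

```python
class Reading:
    """Hypothetical object that fixes its schema at construction time."""

    def __init__(self, device, celsius):
        # The designer's expectations act as the schema: an observation
        # outside this assumed range never makes it into the object.
        if not -90.0 <= celsius <= 60.0:
            raise ValueError("temperature outside the range the designer expected")
        # Leading underscores mark the observations as private by convention.
        self._device = device
        self._celsius = float(celsius)

    def fahrenheit(self):
        """The only sanctioned view of the observation."""
        return self._celsius * 9.0 / 5.0 + 32.0


accepted = Reading("sensor-7", 20)

surprising_observation_kept = True
try:
    Reading("sensor-7", 75)  # a real but unexpected observation
except ValueError:
    surprising_observation_kept = False
```

The accepted object answers every later query the same way, which is the single version of truth. The surprising observation, which might have been the interesting one, is gone before anyone can read it.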

Another line of evidence of the correspondence between object-oriented design and schema-on-write is the tight association of objects with relational databases through object-relational mappers (ORMs). The relational databases offer a path for objects to persist their properties, and this path offers the ACID assurances that the restored properties will be identical to the ones saved. Object-oriented software is closely related to relational databases. They have a common ancestor in the human strategy of schema-on-write.

Moving toward schema-on-read approaches necessitates not only relaxing the ingest of data into object properties but also divorcing the data from the methods. While object methods can evolve (and hopefully improve) over time, this evolution is tightly bound to the concept of versions, where we still expect the same version of the object to behave consistently for each new read.

The schema-on-read approach has a much narrower concept of a version.  A pure schema-on-read approach effectively involves a new version each time the data is read.   No two reads will give the same result because imposing the schema at the time of read allows for each read to see different data.   We expect the most recent read will be the most authoritative because more time has elapsed to improve data or to admit slow-to-arrive data.   As a result, each read is a different version, or a different release of the query.

Object-oriented design approaches do not scale to this level of version control where every invocation of the object is treated as a new release.   I think the recent popularity of agile programming and two-week sprint cycles is an attempt to get objects closer to this ideal for schema-on-read.   Two weeks is an eternity compared with reads of the data.   Typically there will be countless read operations between two-week releases.  Even in the agile model for object-oriented releases, the reads within a release cycle will be made consistent.

The ideal for schema-on-read is to allow each successive read operation to be inconsistent with its predecessors if improved information has arrived during the interim.  Every read operation is a separate version of the system.  There is no single version of truth.  The most recent observation is our most recent guess at the single version of truth, but this has a problem where large-scale decisions will inevitably involve multiple queries that have different most-recently-read dates.    Refreshing any single component query of a large-scale recommendation will result in a different recommendation, or at least a change in the confidence of the same recommendation.
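The idea that every read is its own version, and that a decision built from several reads can flip when one of them is refreshed, can be sketched as follows. The `Read` record, the counter, and the toy `recommend` rule are assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical global counter: the version advances on every read,
# not on every software release.
_read_counter = count(1)


@dataclass(frozen=True)
class Read:
    version: int
    name: str
    value: float


def read(name, store):
    """Return the current value together with a fresh read version."""
    return Read(next(_read_counter), name, store[name])


def recommend(demand, supply):
    """Toy decision rule that combines two component reads."""
    return "expand" if demand.value > supply.value else "hold"


store = {"demand": 100.0, "supply": 120.0}

demand = read("demand", store)
supply = read("supply", store)
first = recommend(demand, supply)

# Late-arriving data revises one of the inputs.
store["demand"] = 130.0

# Refresh only one component; the other keeps its older read version.
demand_refreshed = read("demand", store)
second = recommend(demand_refreshed, supply)
```

The second recommendation mixes a newer read of demand with an older read of supply, which is exactly the situation described above: the component queries carry different most-recently-read dates.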

Pure schema-on-read approaches that permit every read to be inconsistent with its predecessors require alternatives to the object-oriented strategy, which freezes data with methods during release (or production) cycles and thus imposes a schema-on-write constraint on the data.

One alternative approach is the emerging concept of notebooks such as Zeppelin and Jupyter. The primary goal for notebooks is to provide an effective means to document ad hoc queries and reports by analysts, particularly those working with large data sets. The notebooks record the query (in SQL, or in MapReduce or Pig code) along with the results in the form of tables or visualizations. The notebooks are documents that stand apart from the data and production-release software. These notebooks provide a durable and transferable record of ad hoc schema-on-read query activities.

One of the goals of the notebooks is to provide a means to replicate steps in prior analysis, but with an expectation that the repeated steps will return different results. Because the notebooks are self-contained documents of the analysis steps taken and the results obtained at the time, the documents can be checked into repositories. Using repositories (such as GitHub) to check in prior work achieves the desired goal of making each read operation a separate version. When an analyst refreshes the notebook data, they can check out the notebook from the repository, refresh the data, and commit the changes as a new node in the repository. In this way, notebooks are well suited for schema-on-read analysis. It is not surprising that they are becoming popular.
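The check-out, refresh, and commit cycle can be imitated with a toy content-addressed history in Python. This is a stand-in for a real repository, not a description of how Git, Jupyter, or Zeppelin store anything; the `snapshot` helper, the query text, and the result values are all made up for the sketch.

```python
import hashlib
import json


def snapshot(query, result, parent=None):
    """Record one notebook refresh as an immutable, content-addressed entry."""
    body = {"query": query, "result": result, "parent": parent}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(encoded).hexdigest(), **body}


history = []
query = "SELECT device, AVG(value) FROM readings GROUP BY device"

# First run of the notebook: the query and what it returned at the time.
first = snapshot(query, [["sensor-7", 41.5]])
history.append(first)

# Later refresh: same query, different result, linked to its predecessor.
second = snapshot(query, [["sensor-7", 42.0]], parent=first["id"])
history.append(second)
```

Each entry keeps the query together with the result it produced at that moment, so the history preserves every read as its own version while the query text stays the same.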

The ultimate goal for data analysis is to learn from the data something new about the real world. This new knowledge involves new ways of thinking about data. To learn, we need to liberate our thinking to respond to the cues coming from the data absent any preconceived model of nature. Obviously, competent thinking about data will include an understanding of nature, but the opportunity to learn from the data comes from allowing analysts to confront data that challenges their world view. The schema-on-read concept presents the data as it actually was recorded, with minimal if any human influence to get the data to conform to a prior world view.

Too often, our present fascination with data misses the true goal of discovering new knowledge. Knowledge resides in the methods, not the data. We are aware of this in the historical sciences, where new interpretations of old data regularly give us new insights even though the data hasn’t changed. The same happens in data stores. The object-oriented practice of attaching pre-defined methods to data and the database ETL schema-on-write practice both impose prior knowledge on the data. In both cases, prior knowledge regulates the access to the data of real-world observations. While there is merit to defending these practices for their ability to deliver a single version of truth, that truth is inevitably old knowledge refreshed with newer data.

We strive to obtain more data because we wish to find new knowledge. Implicit in this pursuit is our dissatisfaction with our current knowledge. Also implicit is the expectation that future reads of the data will contradict earlier reads (as in the metaphor of the historian finding a new interpretation of an ancient record). If we truly want to replace current knowledge, then we should not allow that prior knowledge to restrict our access to the observed data. We should instead embrace the object-free and schema-free approaches to data stores and postpone the schema application until each time we retrieve the data.

We can learn more easily that our prior knowledge was wrong when we get prior knowledge out of the data store.
