The advantage of the data-on-read strategy is that it separates the process of data collection from the process of applying a schema to interpret the results. We can learn more easily that our prior knowledge was wrong when we keep that prior knowledge out of the data store.
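A minimal sketch of that separation, using hypothetical bikeshare-style records (the field names and values are invented for illustration): the store keeps raw text exactly as collected, and a schema is only a read-time interpretation that can be replaced later when our prior knowledge turns out to be wrong.

```python
import json

# Hypothetical raw events captured as-is; no schema is imposed at write time,
# so inconsistencies (string vs numeric minutes, extra fields) are preserved.
raw_store = [
    '{"ts": "2013-04-01", "rider": "casual", "minutes": "27"}',
    '{"ts": "2013-04-01", "rider": "member", "minutes": 12, "bike": "W0042"}',
]

def read_with_schema(raw, schema):
    """Apply an interpretation (a schema) only when the data is read."""
    record = json.loads(raw)
    return {field: cast(record.get(field)) for field, cast in schema.items()}

# Today's interpretation; a revised theory tomorrow can reinterpret the
# same raw rows with a different schema, without re-collecting anything.
schema = {
    "ts": str,
    "rider": str,
    "minutes": lambda v: int(v) if v is not None else None,
}

rows = [read_with_schema(r, schema) for r in raw_store]
```

Because the raw text is never rewritten to fit the schema, discarding a wrong schema costs nothing: the evidence that it was wrong is still sitting in the store.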
For the project of knowledge or hypothesis discovery, this sharding of history is more valuable than attempting a historical report from the operational database, because the sharded history retains the context of the data. For a business example, assume a report for the previous period involves an action by an employee who has since been promoted to a different position. Querying the operational database for this historical information will naturally return the erroneous result that the new position was responsible for the prior action, when in fact the action was performed in the capacity of the older position.
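The promotion example can be made concrete with a small sketch. The table names and dates here are hypothetical: a current-state lookup answers with today's context, while a history table that shards each fact with the period it was true answers with the context the action actually had.

```python
from datetime import date

# Hypothetical operational table: keeps only the current state.
# The employee was promoted on 2024-01-01.
current_position = {"e1": "Manager"}

# Hypothetical sharded history: each fact carries its validity period.
position_history = [
    # (employee, position, valid_from, valid_to)
    ("e1", "Clerk",   date(2020, 1, 1), date(2023, 12, 31)),
    ("e1", "Manager", date(2024, 1, 1), date(9999, 12, 31)),
]

def position_as_of(emp, as_of):
    """Answer with the context the data had at the time of the action."""
    for e, pos, start, end in position_history:
        if e == emp and start <= as_of <= end:
            return pos

# A report on an action taken during 2023:
wrong = current_position["e1"]                   # attributes it to "Manager"
right = position_as_of("e1", date(2023, 6, 1))   # correctly finds "Clerk"
```

The operational lookup is not buggy; it is simply answering a different question (who holds the position now) than the historical report is asking.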
In an earlier post, I presented some interactive reporting based on custom categorization and aggregation of data available from Capital Bikeshare. Those reports used Excel pivot tools and SQL Server Reporting Services, using both relational T-SQL and an Analysis Services cube I constructed to make the desired navigation and aggregation easier to report. My eventual…
With the modern speed of data retrieval, analysis, and visualization, we may be encountering a new form of the logical fallacy of appeal to authority, where the authority comes from the speed at which we can present affirming data for our theses. Assuming that human behavior is a product of evolution, there has not been enough time for evolution to adapt to the new reality of nearly instant affirmation of some consequent. Historically, we learned a pattern that affirming data can be trusted if it arrives quickly: before modern data technologies, the speed of finding affirming data was an indication that such data was abundant around us, so it didn’t take long to find. That mode of thinking is no longer valid, because instant access to a wide variety of data makes it possible to find affirming data very quickly no matter how scarce it is. It will take generations for evolution to catch up and teach us not to trust speed of affirmation as proof of a hypothesis.
I’m describing this as security of the datum rather than of the data: it is specific observations, not everything observed by the sensors, that are vulnerable to exploitation. The malware resides in the population being observed rather than in the IT systems.
To combat this kind of problem, we are going to need an additional approach of datum governance to protect the observed population from deliberately inserted biases.
The enthusiasm for the benefits of big data comes from widely promoted reports of past successes. The promise of big data techniques is that they can provide similar successes in other contexts. Big data involves volume, velocity, and variety. The volume and velocity depend on automated queries and report building, while the variety introduces the opportunity for new benefits. This combination of automation and variety is what makes re-identification possible, or even very likely.
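The mechanics of that risk can be sketched in a few lines. The datasets and field names below are entirely hypothetical: two releases, each harmless on its own, share quasi-identifiers, and an automated join across the variety is all re-identification takes.

```python
# Hypothetical "anonymized" health extract: names removed, but
# quasi-identifiers (zip, birth year, sex) retained for analysis.
health = [
    {"zip": "20001", "birth_year": 1980, "sex": "F", "diagnosis": "X"},
    {"zip": "20001", "birth_year": 1975, "sex": "M", "diagnosis": "Y"},
]

# Hypothetical public roster that happens to share those same fields.
roster = [
    {"name": "A. Smith", "zip": "20001", "birth_year": 1980, "sex": "F"},
]

def reidentify(anon_rows, public_rows, keys=("zip", "birth_year", "sex")):
    """Automated join on quasi-identifiers across two data sources."""
    matches = []
    for a in anon_rows:
        for p in public_rows:
            if all(a[k] == p[k] for k in keys):
                matches.append((p["name"], a["diagnosis"]))
    return matches

linked = reidentify(health, roster)  # links a name to a diagnosis
```

Neither release violates its own rules; the exposure comes from the automated combination, which is precisely what big data tooling makes cheap.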
Having model data explicitly materialized into tables gives the data clerk the opportunity to recognize the deficiency that this data is not observed data, and to ask whether there can be another source for it. Perhaps, for example, some new sensor technology has become available that provides observations that previously required models to estimate. The analyst can then revise the analysis to use that new data instead of the model-generated data.
In addition to the classic challenge of new data potentially disproving an old theory, the modern reality of practical data technologies makes it possible to base decisions on data alone, without any human cognitive theory to justify those decisions.
Sharing this model-generated data is not the same as sharing the models themselves. The source code for the models can still be hidden from the production system; the population will have access only to the generated data captured in persistent tables instead of in temporary memory. The population can compare the model-generated data with their own calculations to show that they can reproduce the results. Reproducing these intermediate model-generated results will provide confidence that the models are correct; alternatively, the population can demand reconciliation of any discrepancies they find.
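A sketch of that reconciliation, with hypothetical published rows and a hypothetical independent calculation: the population never sees the model's source code, only its persisted outputs, yet can still flag every row it cannot reproduce.

```python
# Hypothetical published table of model-generated results.
published = [
    {"id": 1, "input": 10.0, "score": 20.0},
    {"id": 2, "input": 7.0,  "score": 14.5},
]

def my_calculation(x):
    """The population's independent implementation of the claimed method."""
    return 2.0 * x

def reconcile(rows, tolerance=1e-9):
    """Return ids of rows the independent calculation cannot reproduce."""
    return [
        r["id"] for r in rows
        if abs(my_calculation(r["input"]) - r["score"]) > tolerance
    ]

discrepancies = reconcile(published)  # rows to demand an explanation for
```

An empty discrepancy list builds confidence that the hidden models are correct; a non-empty one is the concrete artifact the population can bring to the table when demanding reconciliation.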
Data should meet tests against fallacies that apply to data, just as errors in grammar, logic, or reasoning are fallacies in arguments. The above example of a medical health record of a birth with same-sex parents and the mother identifying as a male is analogous to a grammatical error, even though the data itself meets the business rules for the form. We should be able to object to this data as valid for some purposes, such as determining eligibility or medical necessity for health services, just as we would reject a grammatically incorrect sentence in a formal argument.
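The distinction between the two tests can be sketched directly. The record fields and rule functions here are hypothetical: one check enforces the form's business rules, and a separate, purpose-specific check plays the role of the grammar test for one particular use of the data.

```python
# Hypothetical record that satisfies the form's business rules.
record = {
    "event": "birth",
    "parent_sexes": ("M", "M"),
    "mother_gender": "male",
}

def valid_for_form(r):
    """Business rules: required fields are present and well-formed."""
    return r["event"] == "birth" and len(r["parent_sexes"]) == 2

def valid_for_purpose(r):
    """Purpose-specific test (e.g., determining medical necessity of
    birth-related services): a recorded birth implies a biologically
    female birth parent among the recorded parents."""
    return r["event"] != "birth" or "F" in r["parent_sexes"]

# valid_for_form(record) passes, yet valid_for_purpose(record) fails:
# the datum is acceptable to the form but objectionable for this use.
```

The point is not that the record is wrong in general, but that validity is relative to purpose: a sentence can be well-spelled yet ungrammatical, and a datum can meet every business rule yet be fallacious for a particular decision.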