Exposing model generated information for public scrutiny

In my imagined dedomenocracy, all decisions will come from algorithms working on all available data. This is a futuristic vision because it will involve access to far more data than we have now. However, the algorithms are not futuristic. They are relatively standard applications of statistical tests to identify clusters and establish trends that can be extrapolated into the near future. For this form of rule making, the trends do not need to make sense to be useful. The rules have short lives and will be successful if the resulting benefits outweigh the harms, as will usually be the case with trends. In other words, the algorithms have no intelligence beyond the implementation of statistical concepts. This places all of the intelligence in the data. In my opinion, this is where it belongs. Intelligence is data, not algorithms.

I describe dedomenocracy as an authoritarian type of government that requires strict enforcement of rules from these non-intelligent algorithms. Because the rules have short lives and change frequently, there is a practical limit on how many rules can be enforced at any time. As a result, most of the time most people will not be governed by any rules. Populations will attract rules when they offer some measurable opportunity. Often this opportunity will be to restore peace that has been disturbed within the group. The rules will often be punitive in nature to restore public order. However, rules may also impose an obligation on a group in situations where there is an especially high potential for benefits or hazards even when the group itself is well behaved. In either case, the rules will be brief so that the population will soon resume its liberties. I anticipate a dedomenocracy to be mostly libertarian in nature, perhaps the most practical form of libertarian government with an authoritarian oversight.

Similarly, I imagine democratic participation in government. Although dedomenocracy necessarily excludes humans from the process of making decisions (in contrast to what happens in modern democracy), it welcomes public participation in the scrutiny of its data. The ideal future citizen will become a data scientist rather than a political scientist. The data science will focus on the content of the data: dedomenology instead of computer science. The population will be able to observe the data, to challenge the data’s validity and relevance, and to seek out new data sources.

In most of my discussions about dedomenocracy, I emphasize the superiority of what I call bright data. This is data that is well governed to accurately match the specifications of its documentation. Bright data are the best observations of what is actually occurring in the natural world. Bright data is part of a taxonomy of data that includes less-bright data (dim data) and model-generated data (which I call dark data). In many of my posts, I have been very critical of dark data because it can distract us from actual observations from the real world.

I much prefer to have rules based strictly on observations with no influence of human cognitive theories about the world. With sufficient data including historical data, the statistical trends of the data will have predictive power that is competitive with human theories. Even with the potential of acting on spurious information, the purely non-intelligent statistical treatment of observation data will likely outperform human theories when it comes to making decisions for future actions.

Because my concept of dedomenocracy is futuristic, I can envision a pure form of dedomenocracy that operates exclusively on observations with no interference by human theories (other than statistics). It is that pure form that I anticipate will provide superior governing of the population compared to human forms of government.

However, I concede that no population will suddenly adopt a purely data-driven decision-making scheme where the only permissible data is observations. Even if sufficient data technologies for sensing and storage existed today, we would demand human theories to be included in the algorithms. For example, we will demand a distinction between correlation and causation. We will also demand that known laws of nature are included in any decision. We will reject anything that appears spurious or nonsensical. For an initial version of governing by data, we will demand that the conclusions conform to human theories.

It is inescapable that early experiments with dedomenocracy (such as what is occurring in medicine) will include abundant dark data. Algorithms used to generate data will include code that captures certain theories of causation or of various human cognitive laws. Experts will review the algorithms to confirm that these theories are present in the code itself. The algorithms will include some form of embedded intelligence in the sense of encapsulating subject-matter expertise.

In many modern implementations, the dark data is computed through algorithms that will operate on the data. The algorithms will be encapsulated in code and the source code will not be available in the production system. Also, the dark-data algorithms will produce temporary data that also will not be accessible to users. This is another reason to call the data dark: it is not available to the end users. Because dark data comes from accepted human knowledge about the world, there is no need to expose this data to the user. In particular, the added cost for making this data available is considered to be unnecessary.

In a dedomenocracy, the final decisions ideally come from simple statistical results working on data and all of that data should be available to the population.

The models could become just another data source even though they derive data internally instead of obtaining observations from nature. The intermediate results from models and calculations should be captured in persistent tables that users will be able to scrutinize. This data will identify its source as coming from these models.

Sharing this model-generated data is not the same as sharing the models themselves. The source code for the models can still be hidden from the production system. The population will only have access to the generated data captured in persistent tables instead of in temporary memory. The population can compare the model-generated data with their own calculations to show that they can reproduce the results. Reproducing these intermediate model-generated results will provide confidence that the models are correct. Alternatively, the population can demand reconciliation of any discrepancies they find.
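As a minimal sketch of this arrangement, assume a small SQLite database in which sensor observations sit in one persistent table and the model's derived values sit in another, each row tagged with its source. The table names, the `hidden_model` function, and the `model:hypothetical_v1` tag are all illustrative assumptions, not part of any real system. The point is only that a reader who cannot see the model code can still compare the published values against an independent calculation.

```python
import sqlite3


def build_store():
    """Create an in-memory database holding sensor observations."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (id INTEGER PRIMARY KEY, observed REAL, source TEXT)"
    )
    # Observations arrive from an external sensor (made-up values).
    conn.executemany(
        "INSERT INTO readings (id, observed, source) VALUES (?, ?, 'sensor')",
        [(1, 10.0), (2, 12.0), (3, 11.0)],
    )
    return conn


def hidden_model(observed):
    """Stand-in for production model code that the public cannot read."""
    return 2.0 * observed + 1.0


def materialize_model_output(conn):
    """Persist the model's intermediate results, tagged with their source."""
    conn.execute(
        "CREATE TABLE model_output (reading_id INTEGER, derived REAL, source TEXT)"
    )
    rows = conn.execute("SELECT id, observed FROM readings").fetchall()
    conn.executemany(
        "INSERT INTO model_output (reading_id, derived, source) "
        "VALUES (?, ?, 'model:hypothetical_v1')",
        [(reading_id, hidden_model(observed)) for reading_id, observed in rows],
    )


def find_discrepancies(conn, citizen_model, tolerance=1e-9):
    """Compare published model output against an independent calculation."""
    rows = conn.execute(
        "SELECT r.id, r.observed, m.derived "
        "FROM readings r JOIN model_output m ON m.reading_id = r.id "
        "ORDER BY r.id"
    ).fetchall()
    return [
        (reading_id, published, citizen_model(observed))
        for reading_id, observed, published in rows
        if abs(published - citizen_model(observed)) > tolerance
    ]
```

An empty list from `find_discrepancies` means the independent calculation reproduces the published table. A non-empty list is the concrete set of rows the population could ask to have reconciled.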

Even when the model-generated data is reproduced, the data itself will be present alongside the observation data. The model-generated data is simply data that comes from a different source, a source that happens to be internal to the data systems instead of an external sensor.

Part of the population’s scrutiny of the data sources is to identify better sources of data. Having the model-generated data accessible like observational data invites the population to find alternative natural-world observations that can supplement or even replace the models.

Stored in persistent tables, the model-generated data is exposed to competition from other sources. Again, I imagine early forms of dedomenocracy to have abundant model-generated data sources. However, once in place, the system will evolve to gradually accept more observational data sources that will eventually replace the models. Within the data, an evolutionary survival of the fittest will take place where the endangered species is the human-theorized model.

With sufficient data sources observing the real world, the statistical algorithms will eventually reconstruct the predictive power of human theories and even outperform those theories. The results will be better predictions, perhaps even without a human-comprehensible theory to explain the predictions. There is no need for human stories because the statistical algorithm can derive good predictions based only on the data.

I will continue this discussion in a future post. The motivation of this thinking is my discovery of the unexpected value of retaining temporary tables that I created for use in making complex queries. Initially, I created the tables in order to break the problem down into more manageable parts or to obtain better performance. But when I used permanent tables to hold the results, I found that I could use those results for new types of analysis. In particular, access to these tables of intermediate results provided an opportunity for verifying the results. In contrast to a view that may give different results when the underlying data changes, the intermediate tables recorded the results of the query that went into the final analysis. That provided the opportunity to reconstruct what went into a higher-level summary result. It also provided an opportunity to build alternative summary results off of the exact same intermediate result so that those summaries are mutually compatible. In addition, the intermediate tables provided unique and unexpected information of their own, and that resulted in new and valuable information for the analysts.
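The contrast between a view and a persistent intermediate table can be sketched with a toy SQLite example. The `events` table and its values are made up for illustration. The view is recomputed on every read, while the table created with `CREATE TABLE ... AS SELECT` keeps the result as it stood when the analysis ran.

```python
import sqlite3


def demo_view_vs_snapshot():
    """Show that a view drifts with the source data while a table does not."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("north", 5.0), ("north", 7.0), ("south", 3.0)],
    )

    # A view stores only the query, so it is re-evaluated on every read.
    conn.execute(
        "CREATE VIEW totals_view AS "
        "SELECT region, SUM(amount) AS total FROM events GROUP BY region"
    )
    # A persistent table stores the rows the query produced at this moment.
    conn.execute(
        "CREATE TABLE totals_snapshot AS "
        "SELECT region, SUM(amount) AS total FROM events GROUP BY region"
    )

    # The underlying data changes after the analysis was done.
    conn.execute("INSERT INTO events VALUES ('north', 100.0)")

    view_now = dict(conn.execute("SELECT region, total FROM totals_view"))
    snapshot = dict(conn.execute("SELECT region, total FROM totals_snapshot"))
    return view_now, snapshot
```

After the extra row arrives, the view reports a different total for the north region, while the snapshot still holds the figures that fed the original analysis. Any alternative summary built from the snapshot therefore starts from exactly the same intermediate result.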

My point is that the intermediate tables (like model-generated tables) initially seem unnecessary because their content can always be reconstructed from queries or model calculations. However, making these results persistent provides a snapshot of the results available at the time of the analysis and a valuable resource to explore for its own sake. This lesson is what motivates this topic of materializing modeling results into tables to be used in queries, instead of taking query results and processing those results using code that is necessarily hidden from production. Materializing this data into tables exposes this valuable information to the user so that he can challenge the modeling assumptions even though he cannot see the code for the model.

In a future post, I will try to clarify this idea of materializing models into persistent tables. (Update 2/19/2015) I also discussed these concepts in earlier posts where I describe data supply chains that incrementally make data more intelligent. The supply chain adds value to source data and this value is exposed to the successive steps. That value would include modeling results that apply to the specific set of data being processed. I intend to elaborate more about the need to expose the intermediate products (model computations) for scrutiny in the context of democratic participation in a dedomenocracy.

This post provides an example of hiding model data. In this case, the observation data comes from temperature sensors with a variety of situations that can lead to error. The raw actual measurements are available but there are known error contributors to each one due to issues such as measurement timing, local terrain, calibration frequency, etc. The published data for analysis is corrected for known errors in advance so that the raw data for analysis incorporates the model data of the correction offsets.

A better approach would be to publish two tables: one of the actual as-measured data and another of the corresponding offsets to apply to each individual measurement. The instruction to the analysts will be that the official measurement should be the sum (or product) of the two numbers.
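A minimal sketch of the two-table layout, using SQLite and made-up readings, could look like the following. The table names, column names, and correction reasons are hypothetical. The sketch covers only the additive case and assumes that a measurement with no row in the correction table has an offset of zero.

```python
import sqlite3


def build_tables():
    """Publish as-measured data and correction offsets as separate tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measured "
        "(reading_id INTEGER PRIMARY KEY, station TEXT, temp_c REAL)"
    )
    conn.execute(
        "CREATE TABLE correction "
        "(reading_id INTEGER PRIMARY KEY, offset_c REAL, reason TEXT)"
    )
    # Unaltered sensor values (illustrative numbers only).
    conn.executemany(
        "INSERT INTO measured VALUES (?, ?, ?)",
        [(1, "A", 14.0), (2, "A", 15.5), (3, "B", 9.0)],
    )
    # Model-generated offsets, kept apart from the observations.
    conn.executemany(
        "INSERT INTO correction VALUES (?, ?, ?)",
        [(1, 0.5, "measurement timing"), (3, -0.25, "calibration")],
    )
    return conn


def official_values(conn):
    """Combine the two tables: official value = measured + offset."""
    return conn.execute(
        "SELECT m.reading_id, m.temp_c + COALESCE(c.offset_c, 0.0) "
        "FROM measured m LEFT JOIN correction c ON c.reading_id = m.reading_id "
        "ORDER BY m.reading_id"
    ).fetchall()
```

Because the join happens at query time, an analyst can read `measured` alone for the untouched observations, read `correction` alone to study the model's adjustments, or call `official_values` to get the corrected series.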

This will leave exposed the original unaltered measurement data that came from the sensor. There may be important statistical analysis of this unaltered data that may discover new insights into nature unrelated to the goals of the correction algorithms (the goal is to test the hypothesis of global warming). By correcting the data to make global warming calculations easier, they may be erasing clues that may lead to unrelated discoveries.

More importantly, exposing the correction data separately subjects this correction data to analysis of its own to observe potential bias in the correction that may tell us more about the model than about nature. The above post derives a portion of this correction by comparing the same data set from two different publications (downloads). This required extra luck on his part to have access to both publications and extra work to derive the difference to observe a suspicious pattern that should be investigated. Many analysts may not have access to the data to derive this shift in measurements and others may miss the opportunity to explore this possibility because the offset data was not provided separately.
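The extra work described above amounts to differencing two downloads of the same data set. As a rough sketch, assume each publication is a mapping from a record key (here a year) to a published value. The numbers below are invented purely to show the mechanics and do not come from any real data set.

```python
def implied_shift(earlier, later):
    """Difference two publications of the same data set, key by key.

    Only keys present in both publications are compared, since the
    shift cannot be derived for records that appear in just one.
    """
    shared = sorted(earlier.keys() & later.keys())
    return {key: later[key] - earlier[key] for key in shared}


# Invented values for illustration only.
earlier_download = {1950: 10.0, 1980: 10.5, 2010: 11.0}
later_download = {1950: 9.75, 1980: 10.5, 2010: 11.25, 2020: 11.5}

shift = implied_shift(earlier_download, later_download)
```

If the offsets were published as their own table, this reconstruction would be unnecessary, and analysts without access to both downloads could still examine the corrections for patterns.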

There is justification for correcting data when there is confidence about the value of the error. The offset should be provided separately for analysts to combine to get the corrected results instead of preemptively correcting the observed data that is published.

Having a separate table of corrections is an example of exposing model-generated data.

2 thoughts on “Exposing model generated information for public scrutiny”

kenneumeister says:

2015/03/10 at 13:58

Pingback: Materialize the model to level the competition with observations | kenneumeister
