I offer a new word to inspire this post: Dedodemocracy: government by data. In the very recent past, there has been a rapid shift toward exploiting large data collections to change the way our government makes policies. We are increasingly using large historical data stores to replace statistical approaches based on smaller samples. We are also fine-tuning policies to apply to smaller subgroups, moving closer to much more individualized policies. My recent posts suggested how this may be happening in health care policy, in particular in the provisioning of health care to control costs.
We are increasingly becoming a government based on data instead of a government by the people. This is accelerated even more by the ongoing transfer of policy making from democratically elected legislatures to bureaucracies that are largely out of reach of democratic control. The two trends reinforce each other: policy relies on data, and for various practical reasons that data is available only to the agencies that make the policy.
Data technologies are improving at a rapid pace. We are quickly removing the technical limitations that once made it impractical to share data outside small groups of analysts. I mentioned in another post that any data the government collects under the justification that there is no expectation of privacy for it should be opened to the public. For example, any bulk collection of metadata that is justified as not requiring a court order or warrant should be available to the public at large as quickly as it is available to an agency’s analysts. Earlier objections that this was not technically feasible are increasingly indefensible. Technology can now allow a large base of users to access the same data with their own queries.
The remaining barriers are political in nature. Those barriers will remain in place for some time because we haven’t really discussed the implications of dedodemocracy. The government effectively admits this is public data because it doesn’t have to get a specific warrant for its collection. Yet that data is accessible only by citizens who happen to be employed by the government. There is little qualitative difference between a government analyst and a similarly trained analyst who is not employed by the government. And only training distinguishes trained analysts from the general population.
As we increasingly rely on data to make and enforce policy, we will recognize that we need more democratic participation in that data project. Eventually we will object to the government’s exclusive holding of, and access to, the data it uses to govern our lives. Eventually, we will want open data: data available to all.
Open data initiatives are already under way and some are gaining momentum. The government itself has begun to make its data more open. These are small starts because the released government data is generally post-analysis data. The currently inaccessible bulk pre-analysis data should also be open data. Assuming we remove the technical barriers that prevent the entire population from accessing this data, only policy remains to stand in the way.
To return to the comparison of analysts: there is little difference between trained data analysts employed within government (as civil servants or contractors) and trained data analysts outside of government employment. Only training distinguishes trained data analysts from the general population.
If the future of government is government by data, then we should be training everyone to become data analysts. Data analysis skills should be taught as part of the basic school curriculum. Data skills differ from other primary or secondary subjects. They include scrutinizing the qualities of the data itself. By data qualities, I am referring to the distinctions I have been discussing in earlier posts, partially summarized in my proposed taxonomy. Those concepts should be expanded to include other uncertainties (such as precision, accuracy, or missing data) and good practices for interpreting query results.
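As a minimal sketch of the kind of data-quality scrutiny described above, consider checking a small data set for missing values and mixed precision before interpreting any query against it. The records and field names here are invented for illustration:

```python
# Hypothetical sample records; "rate" stands in for any measured quantity.
records = [
    {"county": "A", "rate": 0.12},
    {"county": "B", "rate": None},   # a missing observation
    {"county": "C", "rate": 0.118},
]

# Missing data: how many records lack the value we want to analyze?
missing = sum(1 for r in records if r["rate"] is None)

# The usable values that remain. Note the mixed precision (0.12 vs 0.118),
# which is itself a data quality worth reporting alongside any result.
present = [r["rate"] for r in records if r["rate"] is not None]

print(f"records: {len(records)}, missing: {missing}")
print(f"usable values: {present}")
```

The point is not the arithmetic but the habit: a trained reader asks how much of the data is absent and how precisely it was recorded before trusting a query result.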
It is easy to see that we will eventually demand much more citizen participation in querying the available data in order to participate properly in modern democratic government. Even today, many political debates depend on data to which access is highly restricted, whether limited to one side of the argument or closed off from the general population entirely. As citizens who need to be persuaded by these arguments, we need better access to that data.
The primary barrier to that access is training. The training is not hard; it could be done as part of elementary or secondary education. Unfortunately, the current focus of education appears uninterested in this type of training. We are more concerned with the skills of earlier generations: math and science (both of which are mostly about memorizing historical mathematical or scientific discoveries). We may be better off discarding some of that emphasis in favor of building the skills necessary to make new discoveries from analyzing data.
In my taxonomy, I identify the highest quality data as recorded observations obtained through very well documented and well controlled methods. Ideal observations are accurate and precise. Ideal observations are completely free of any influence imposed by prior theories or hypotheses. I described how dark data (data generated by models to replace observations) and forbidden data (observed data that models reject) can bias the data toward confirming old ideas instead of discovering new ones. These model-influenced types of data are necessary for quality control reasons, but we should at least be able to distinguish them from the more valuable direct observations.
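One way to make that distinction operational is to label each record with its provenance, so that direct observations can be separated from model-influenced data without silently discarding the latter. This is only a sketch under the post's taxonomy; the labels, field names, and values are invented:

```python
# Hypothetical provenance labels following the taxonomy discussed above.
OBSERVED = "observed"     # directly recorded measurement
DARK = "dark"             # model-generated value substituted for an observation
FORBIDDEN = "forbidden"   # observation rejected because a model disallows it

dataset = [
    {"value": 3.1, "provenance": OBSERVED},
    {"value": 2.9, "provenance": DARK},
    {"value": 9.7, "provenance": FORBIDDEN},
    {"value": 3.0, "provenance": OBSERVED},
]

# For discovering what is true today, prefer the direct observations...
direct = [d["value"] for d in dataset if d["provenance"] == OBSERVED]

# ...but keep the model-influenced records visible rather than dropped,
# so their role in quality control can still be audited.
model_influenced = [d for d in dataset if d["provenance"] != OBSERVED]

print(f"direct observations: {direct}")
print(f"model-influenced records: {len(model_influenced)}")
```

The design choice is simply to tag rather than delete: an analyst (citizen or government-employed) can then decide for each question whether model-influenced records belong in the answer.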
The goal of public policy making is that new policies be relevant to current conditions. This demands avoiding the propagation of obsolete notions into present data. We need to find in the data what reality is today.
The trend in public policy making is toward a much tighter focus on smaller subgroups. Policy is becoming more specific to particular categories of populations, locations, or circumstances. My recent posts explored some of that specificity occurring in health care policies. In order to debate these policies, we need the skills to recognize which data is relevant and to recognize the relative value provided by the different parts of that data.
These are not hard skills to learn, but they are best learned through repeated practice. The ideal time to learn them is during primary and secondary education. Starting data science training in the third grade provides nearly a decade of practice working with data. And in about a decade, adults will need to be able to analyze data for themselves if they want to participate in policy debates.
Two things must happen immediately to prepare the next generation for dedodemocracy. The first, as mentioned, is to introduce data-analysis or data science training into primary and secondary education, with continuous practice throughout the entire period, perhaps starting in the third grade.
The second is to give these students access to real data to practice their skills. We need to encourage faster development of open data projects that make available real and current data for students to use in their training.
Most primary and secondary topics can be learned from information printed in books; specific lessons usually fit on a single page or even in a single paragraph. Data skills are completely different. The volume of data is impractical to print in books, and printed data is impractical to analyze. Data skill training requires access to databases, and those databases need real-world data relevant to the exercises. Ideally that data will come from the government’s open data initiatives, and ideally those initiatives will continue to expand until they expose all the data available to government-employed analysts.