Data Mining For Children

The Politico article “Data mining your children” provoked a lot of thoughts for me.

I appreciate and share the concerns raised by data collection of the learning process. It does present some serious privacy concerns, especially as this data will be attached to the child for the rest of his life. No matter what the original, short-term justification for such detailed data collection, this data will last forever and will certainly be reused for new purposes as the child goes through life. For example, it will not be long before we see detailed learning data being exploited by opponents in political or other high-profile competitions. If something is recorded, it can be retrieved. If it can be retrieved, it will be retrieved for purposes not always in the person’s best interest and contrary to the intentions of the initial collection. In addition, recent experience demonstrates that there is no assurance of secure handling of this data. Eventually unauthorized agents (generically called hackers) will compromise some or all of this data. While data security technology continues to improve, experience tells us we should expect security breaches and abuse to occur.

We will not be able to stop the collection of detailed data that will be stored indefinitely and exploited in unexpected ways in the future. Instead, the trend will be met by behavioral and cultural changes that adapt to the reality of everything being recorded. I share the concern that this can be a frightening concept, where our every move can come back to haunt us in the arena of public opinion. I also have great faith in human ingenuity to find new ways to turn this to each person’s own advantage. People will continue to change behaviors to qualify for the good categories and to avoid the bad ones. Our long history of trying to figure out human sociology and psychology is evidence that as soon as we think we have something figured out, people will change to negate that finding.

In earlier posts, I advocated introducing the concepts of data science into early education. In another post, I suggested this can occur very early on, where even the most basic learning can be reconstructed as database query exercises. In these earlier discussions, I stressed the importance of building the habit of data science (working with databases) above the skills. Given the above reality that every aspect of your life will increasingly be affected by external data, a successfully independent life will demand ease in seeking out and scrutinizing data to make better informed decisions.
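As a rough illustration of what a very early lesson recast as a database query exercise might look like, here is a minimal sketch. The table of animals, its columns, and the helper names are all hypothetical, invented only for this example; the point is that "which animals have four legs?" becomes a question the child answers by querying data.

```python
import sqlite3

def build_animals_db():
    """Create a tiny in-memory database (hypothetical lesson data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, legs INTEGER, can_fly INTEGER)")
    conn.executemany(
        "INSERT INTO animals VALUES (?, ?, ?)",
        [
            ("dog", 4, 0),
            ("sparrow", 2, 1),
            ("spider", 8, 0),
            ("cat", 4, 0),
            ("bee", 6, 1),
        ],
    )
    return conn

def animals_with_legs(conn, legs):
    """Answer the lesson question: which animals have this many legs?"""
    rows = conn.execute(
        "SELECT name FROM animals WHERE legs = ? ORDER BY name", (legs,)
    )
    return [name for (name,) in rows]

conn = build_animals_db()
print(animals_with_legs(conn, 4))
```

The habit being practiced is the same one an adult needs: state a question, select the matching records, and check the answer against what you already know.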

With this in mind, what I find disappointing in the Politico piece is that the child’s education itself seems to be devoid of data-mining skills and conditioning. Despite the explicitly recognized wisdom of massive data collection for improving education, the education fails to impart this wisdom to the child. Instead, the coursework appears to be computerized versions of material we have always taught in schools. The actual exercises are simply translated to a computer delivery system. I assert that the learning material itself needs radical changes to incorporate real data investigation. The proper replacement for paper material is database material. The education should focus on data instead of the current focus on the computer. We need to start recognizing that the computer itself (processor, video screen, user input) is the modern equivalent of the wood pulp that makes paper. What is important is the information printed on the paper, and that is the data.

In this particular example, the child (and parents) should be able to access the data being collected about him. While this may involve reporting tools that summarize the underlying data, the tools should permit full access to every dimension of data that is being collected. In addition to providing feedback, this provides a motivated learning and conditioning experience in preparation for an adult life that demands and expects access to data about himself. Each student should have the opportunity (at least theoretically) to observe anything any third party could observe with that student’s data.

In addition, the overall data can be summarized and categorized in ways that make it available for all students to investigate, as realistic exercises in the historical science of selecting, scrutinizing, and interpreting data. The broad categorization can capture useful information in an aggregated way that protects the privacy of individuals. The value of this data is that it is relevant to the student’s immediate experience and offers real opportunities for verification and cross-checking. The collected data about students, broadly categorized, can be ideal learning opportunities that will:

  • Illustrate the practices of diligent data inquiries
  • Practice the skills of data analysis, with the opportunity to learn more advanced methods
  • Reinforce the conditioning of making data investigation a habit, which will be essential for a successful life in the age of data
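To make the idea of broad, aggregated categories concrete, here is a minimal sketch of how per-student records might be summarized for classroom use. The record layout, the category labels, and the `min_group_size` threshold are assumptions made for illustration. Dropping small groups is only a simple stand-in for privacy protection and does not by itself guarantee anonymity.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_category(records, min_group_size=3):
    """Aggregate (category, score) pairs into per-category summaries.

    Categories with fewer than min_group_size records are left out,
    so that a summary never describes just one or two students.
    """
    groups = defaultdict(list)
    for category, score in records:
        groups[category].append(score)
    return {
        category: {"count": len(scores), "mean_score": mean(scores)}
        for category, scores in groups.items()
        if len(scores) >= min_group_size
    }

# Hypothetical records: (broad category, exercise score)
records = [
    ("grade 3", 70),
    ("grade 3", 80),
    ("grade 3", 90),
    ("grade 4", 85),  # only one record, so this group is suppressed
]
print(summarize_by_category(records))
```

Students could then be asked to scrutinize the summary itself: which groups are missing, why, and what the averages can and cannot tell them.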

On a different topic from the observations about education, the article also provides an example for my discussions distinguishing historical science (data science) from present-tense science.

Instead of differentiating hard (STEM) sciences from other soft sciences (social sciences), I have been developing my own differentiation of present-tense science and past-tense science.

My focus is primarily on past-tense science, which I equate to the data sciences, where the subject of study is records about the past. Past-tense science is limited by the fact that the past is forever lost for making a new observation. We have to work with what we have available. I have discussed many challenges about data, challenges that I described using various analogies for data in terms of light (data that is bright, dim, dark, unlit, etc.). Most of what we call science is historical science that involves selecting data, scrutinizing the data for accuracy and relevance, and building arguments based on that data. Ultimately, the strength of the arguments depends on the reliability and relevance of the selected data. My concept of historical science aligns with the popular notion of science, which rests on the meaningful fruits of science in presenting and defending theories that support decision making.

From this perspective, I define present-tense science as the activity responsible for obtaining the observations that will become part of the historical record of data. Past-tense science motivates present-tense science to obtain new observations to support or refute hypotheses. Past-tense science also demands high standards for collecting well-documented and well-controlled observations, in order to avoid challenges that the data is not relevant or not trustworthy. I see the present-tense science of collecting observations into data as the harder of the two sciences for the simple reason that the present exists only once. Even a repeatable experiment cannot repeat the time-stamp of the original. For any particular time, there is only one opportunity to make observations. Once that moment is past, we are stuck with what we have recorded.

The following is an illustration of this viewpoint. One of my interests in the historical sciences is archaeology, where digs are inherently destructive. Once a new artifact is found, every action to recover that artifact destroys the surrounding evidence. Early in archaeology, the focus was on the artifacts themselves as fascinating evidence of lost but advanced cultures. Later, after obtaining better technologies, we regretted the poor practices, because those technologies could have provided more information had the digs been done more carefully. For example, microscopic inspection of the surface of artifacts can suggest how the artifacts were formed, used, or affected by natural processes. This later investigation could be challenged on the grounds that the observed features may have been introduced by the uncertain or clumsy excavation process. As a result, later digs are done much more carefully, with tediously extensive documentation of all aspects of the progress. Each year demands more, and more careful, observations.

I would characterize the present-tense sciences as having an obligation to record and document everything that can be recorded. Even when an experiment is focused on a specific topic, everything about that experiment should be documented and made part of the record for historical science to ponder.

With this in mind, I return to the Politico article. From this viewpoint, the extensive data collection that the article criticizes is very beneficial. There is value in recognizing that the learning experience of a particular child on any particular day is a once-in-a-lifetime opportunity for measurement. We should take the opportunity to measure everything we can about that experience, including the response time for individual actions, time between keystrokes, intervening activities, etc. The task of present-tense science is to preserve as much as possible for future analysis in not-yet-imagined studies. I recognize and share the privacy concerns about the intrusiveness of extensive data collection, while at the same time I appreciate the fact that this data is being collected.
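As a small sketch of the kind of fine-grained measurement mentioned above, the following derives the time between keystrokes from a list of recorded timestamps. The millisecond timestamps and function names are assumptions for illustration, not a description of what any actual education product records.

```python
def keystroke_intervals(timestamps_ms):
    """Return the gaps, in milliseconds, between consecutive keystrokes."""
    return [
        later - earlier
        for earlier, later in zip(timestamps_ms, timestamps_ms[1:])
    ]

def longest_pause(timestamps_ms):
    """Return the longest gap between keystrokes, or None if fewer than two."""
    intervals = keystroke_intervals(timestamps_ms)
    return max(intervals) if intervals else None

# Hypothetical timestamps for one short answer typed by a student
timestamps = [0, 120, 310, 1500]
print(keystroke_intervals(timestamps))
print(longest_pause(timestamps))
```

Note that only the raw timestamps are the present-tense observation. The intervals and the longest pause are derived later, which is why preserving the raw record matters: a future study may want a different derivation.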

The article presents a specific case study of childhood education, but we are subjected to intrusive data collection in all aspects of our lives. Most of these other collections share with childhood education a mission to improve our understanding of the processes involved. We want to learn how to improve outcomes in education and to advance education for the challenges of modern life. This learning comes from analysis of historically collected data. This activity benefits from extensive collection of fine-grained data that is easily understood and documented.

A final observation concerns the competitive pressure to grow the amount of data collected. The article identifies several commercial companies competing for education spending on their products. Because of the value they obtain from large data collections, the more competitive companies will be the ones with the larger populations of students. The larger the population from which data is collected, the better the opportunities will be to use that data. Inevitably there will be some kind of dominance, either by one vendor or through a standardized method of sharing data among different vendors. Consequently, the population will eventually include nearly all students.

Big data tends to make its own demands for more data. Having access to data for a larger population creates a demand for more varied observations about the individuals in that data. In other words, the larger the population in the data, the bigger the demand for more intrusive information. This demand is to support more abstract categorized dimensions that help segregate the population into more comparable subgroups or sub-populations. Unavoidably, that is also more intrusive on an individual basis.

There is an inescapable demand for big data to include the largest possible population and to include the largest possible dimensions of observations within that population.  While we can constrain this appetite to some extent, we should recognize its inevitability and begin preparing ourselves to live in a world governed by data.   That preparation involves developing skills and habits of data science.

