A cautionary take on Predictive Analytics: it is a data operating system

In my last post, I argued that it is legitimate to question predictive analytics on big data because of the vulnerabilities it can introduce, in particular by exposing the adopter to legal or civil challenges.  The very scope of such a project is to use algorithms to derive new information from large and complex data.  That information can become the basis for decisions that result in bad outcomes, and those bad outcomes can motivate challenges alleging negligent or even fraudulent misuse of predictive analytics on the data available.

In that post, I was thinking about unintentional fraud or negligence.  That unintentional quality is inherent in how predictive analytics is promoted.  Predictive analytics consists of simple-to-use implementations of complex algorithms, so the typical user cannot be expected to understand the algorithms.  It operates on data sets so large and complex that humans cannot independently interpret them.  It selects and rejects competing options automatically, and these choices are unlikely to be noticed by the users.

In short, predictive analytics can make serious errors of the kind that would invoke disciplinary action if made by humans.  It is therefore realistic to expect the same accountability from the algorithms that we would expect from human staff.

For this post, I want to explore a separate concern: the deliberate abuse of predictive analytics for fraud, or its exploitation for personal gain at the community's expense.  I have several examples in mind that I have heard about.  I am generalizing them because I do not know all of the details, but they include the following:

  1. Popular internet search engines being manipulated by deliberate practices that push certain web content higher in the relevance ranking than it deserves to be
  2. Popular sites for social rating of companies or services, with reviews and star ratings, being manipulated to give an advantage to paying subscribers (sorting good reviews to the top, preventing competitors' ads from appearing on the same page, etc.)
  3. Staff creatively manipulating data-entry practices to optimize, for their personal gain, the performance scores generated by automated metrics that were designed to benefit the company

It would be very instructive to go into the details of each of these cases, but I want to take a broader view.  As I look at these examples, I see a similarity with our earlier experience with new operating systems.  I think of the early personal computer operating systems that automated the management of complex motherboard hardware, in particular the operation of hard drives.  I think of network operating systems that automated the complex algorithms for reliably delivering messages to the correct recipient and routing them across increasingly complex networks.

In each of these cases, we welcomed new easy-to-use software that allowed everyone to enjoy the benefits of these technologies because the software took care of details too tedious or complex for humans to handle.  But shortly after adopting these simple-to-use technologies, we were shocked to learn that people were deliberately taking advantage of them for personal gain.  In some cases, such as the early internet protocol standards and early operating systems, the designers deliberately chose simple approaches with the expectation that no one would abuse them.  The simple implementations enabled faster innovation and good performance.  This would have worked out fine if everyone had behaved themselves.

At the introduction of these oversimplified implementations, there were many warnings that human nature is such that people exploit opportunities for their own advantage at the expense of the rest of the community.  I recall hearing those voices, but they were overruled by enthusiasm for the potential advantages these technologies would bring.  Both sides were right.  The simple approaches did result in rapid and widespread innovation and adoption.  But people also took advantage of the opportunities to gain personal benefit by causing harm to the community of users.  Sometimes that benefit was merely the entertainment of watching the chaos of dealing with a new virus, worm, email storm, and so on.

Today, we are much more mature about our expectations of human behavior when it comes to computer and network operations.  There is a huge and ongoing investment in security across all aspects of these systems (security in depth) to ensure that only authorized operations occur.  We take deliberate steps to prevent people from misusing these technologies to harm an organization for their personal gain.  Despite the maturity of these technologies and the investments already made to protect them, we continually discover new vulnerabilities that need new solutions.

I see a parallel in which predictive analytics is like an operating system for data.  I assume that secure operating systems and networks host the huge data repositories available to predictive analytics.  The complex data itself becomes a new kind of resource.  The data's multiple sources, and its varying levels of confidence and relevance, become resources analogous to a computer operating system's motherboard components or hard drives.

The data itself is the new resource for delivering value to the user.  Predictive analytics provides the automated operating system that simplifies the exploitation of this resource for the user's benefit.  Just as an operating system handles the tedious and complex tasks of managing a computer's components, predictive analytics performs the tedious and complex tasks of handling the data.  Also like the operating system, predictive analytics reassures the user that he does not need to understand what the software is doing.  The software is designed to benefit the user.

Operating systems for computers and networks were also designed to benefit the user.  The problem is that "the user" includes anyone who can gain access to the computer.

In the case of predictive analytics, the problem is more abstract.  Predictive analytics operates on data stored on computer and network systems.  I assume we can trust that these systems are fully secure in the sense of limiting access to authorized users.  That underlying security is focused on access to physical or logical resources; each resource can carry its own restrictions on who may access it.

The resources available to predictive analytics are distinct from hardware or database objects.   For predictive analytics, the resources are information within the data.    The potential for abuse comes from carefully manipulating the content of the data.

A historical analogy is the possibility of overwhelming email systems with a message carrying a story compelling enough that everyone forwards it to everyone in their address book.  The scheme works even though every participant has full authorization to their email and performs only actions they are fully authorized to perform (forwarding an email to whomever they choose).  I consider this an example of using information to gain unauthorized control of a system, in this case by overwhelming the mail servers.
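
To make the scale of that kind of abuse concrete, here is a minimal back-of-the-envelope sketch in Python.  The contact count and number of forwarding generations are assumed numbers chosen only for illustration, not figures from any real incident.

```python
# Minimal sketch (hypothetical numbers): exponential fan-out of a chain email.
# Every action below is fully authorized -- each user forwards a message only to
# contacts they are allowed to email -- yet the aggregate load overwhelms servers.

CONTACTS_PER_USER = 10      # assumed average number of forwards per recipient
HOPS = 8                    # assumed number of forwarding generations

messages_per_hop = [CONTACTS_PER_USER ** hop for hop in range(1, HOPS + 1)]
total_messages = sum(messages_per_hop)

print(f"messages generated at each hop: {messages_per_hop}")
print(f"total messages after {HOPS} hops: {total_messages:,}")
# A single seed email produces over a hundred million messages after 8 hops.
```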

In my earlier examples of abuses of big data analytics, the abuse came from manipulating the information in the data, where the manipulation was done by fully authorized users performing actions they were fully permitted to perform.  Even in the case of creative record keeping to manipulate performance-monitoring results, the operators' actions were fully permitted by the system.

Information resources originate in the minds of people.  Mental activity has nothing comparable to the access controls we can implement on computers, networks, or databases.  We are free to choose what information we enter as data into the data stores we are granted access to.

Consider the example of manipulating search engine results through clever design of web page content.  The web designer is fully within his rights to design his page's content in a way that deliberately manipulates the results of a third-party search engine.  The search engine company can respond by improving its algorithms to counter the manipulation.  This is not an issue of unauthorized access: the external agent manipulated the results for his own benefit using rights he fully possesses.
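
As an illustration of the mechanism, here is a toy sketch of a deliberately naive relevance scorer being gamed by keyword stuffing.  It is not any real search engine's algorithm; the pages and the scoring rule are invented for the example.

```python
# Toy sketch, not any real search engine's algorithm: a naive ranker that scores
# pages by raw term frequency, and how keyword stuffing games that ranking.

def naive_relevance(page_text: str, query: str) -> int:
    """Score a page by counting occurrences of the query term (a deliberately weak rule)."""
    return page_text.lower().split().count(query.lower())

honest_page = "Practical advice on predictive analytics for hospital scheduling."
stuffed_page = "analytics " * 50 + "buy our unrelated product today"

for name, text in [("honest", honest_page), ("stuffed", stuffed_page)]:
    print(name, naive_relevance(text, "analytics"))

# The stuffed page wins the ranking even though its content is less relevant.
# The search provider's only recourse is to keep refining the scoring algorithm.
```

The designer of the stuffed page broke no access rule; he simply published content he fully controls, knowing how the remote algorithm would react to it.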

The search engine example is a good one because it is often held up as the model of successful big data analytics.  It is undoubtedly successful, but that success requires a continued investment in in-house experts who continually refine the analytic algorithms to protect the value of the service.  Someone who successfully manipulates the service to win an unfairly high relevance rating for a particular search term degrades the service's reputation by delivering poorly relevant results to general users.  Those users may stop using the service if they keep seeing irrelevant results.

A similar scenario can happen with other data analytics projects.  In the recent news about the hiding of wait times in patient scheduling at the Veterans Administration, the operators were performing deceptive actions that they were permitted to perform.  In this case the deception was another step removed, because the operators were legitimately following instructions that were themselves designed to be deceptive.  The operators were doing their jobs; it was the job duty itself that was deceptive.  The deception worked as intended to manipulate the performance metrics to show better performance than what was actually occurring.

In the above example, although the performance metrics fed individual performance appraisals (at least at the hospital level), the goal of the metrics was to manage the overall system of hospitals.  The algorithms worked properly on properly loaded data.  The deception was inserted through the deliberate choice of what information to record, with an understanding of how that information would influence the measures.
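
A hypothetical sketch of that dynamic: the metric below computes wait times correctly from whatever dates it is given, so the gaming happens entirely in the choice of what gets recorded as the patient's desired date.  The dates and field names are invented for illustration and are not drawn from the actual case.

```python
# Hypothetical sketch: a wait-time metric that works correctly on the data it is
# given, but is defeated by the choice of what gets recorded as the "desired" date.
from datetime import date

def average_wait_days(appointments):
    """Average of (scheduled date - recorded desired date); correct given honest data."""
    waits = [(appt["scheduled"] - appt["desired"]).days for appt in appointments]
    return sum(waits) / len(waits)

# The patient actually asked for care on June 1 and was seen August 30.
honest_record = {"desired": date(2014, 6, 1), "scheduled": date(2014, 8, 30)}

# A scheduler instead records the "desired" date as the first available slot,
# so the same 90-day wait is reported as zero.
gamed_record = {"desired": date(2014, 8, 30), "scheduled": date(2014, 8, 30)}

print(average_wait_days([honest_record]))  # 90.0 days
print(average_wait_days([gamed_record]))   # 0.0 days -- the metric shows no wait
```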

This should be considered a real risk for predictive analytics.  As with the earlier examples of newly introduced operating systems, there may be some initial big successes with this technology, but they will inevitably and quickly be followed by embarrassments.  Human nature is to observe and adapt.  People will figure out how to game the information to gain advantage for themselves at the expense of the community as a whole.  It will not take long for them to work out strategies for injecting carefully crafted information to influence the remote predictive analytics.

Just as remote computer operating systems were attacked by people who owned their own computers, remote predictive analytics can be attacked by people who own their own predictive analytics.  We should learn from the earlier examples that the value of these new technologies inevitably comes with the cost of ongoing investment to adapt.

The lesson from past technology experience is that a static investment in predictive analytics is unlikely to suffice.  What works initially will not remain successful for very long.  There is more cost to predictive analytics than licensing and dedicated hardware.  The nature of the information in the data will inevitably change as the subjects become aware of how their behaviors influence the analytics used to support decision making.  There must be a recurring data science cost of monitoring the quality of the analytics and adjusting them as the information changes.
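
As a rough illustration of what that recurring cost looks like, here is a minimal sketch of monitoring an analytic's incoming data for drift away from the baseline it was tuned on.  The scores, the alert threshold, and the simple z-score test are assumptions chosen for the example, not a prescribed method.

```python
# Minimal sketch (assumed scores and threshold): recurring monitoring of an
# analytic's inputs, flagging when the incoming data drifts from the baseline
# the model was tuned on -- a cue that the subjects may have adapted.
from statistics import mean, stdev

def drift_alert(baseline, current, z_threshold=3.0):
    """Flag when the current batch mean drifts beyond z_threshold standard errors."""
    se = stdev(baseline) / (len(baseline) ** 0.5)
    z = abs(mean(current) - mean(baseline)) / se
    return z > z_threshold, z

baseline_scores = [0.52, 0.48, 0.50, 0.51, 0.49, 0.50, 0.53, 0.47]   # historical behavior
current_scores  = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68, 0.73, 0.70]   # after subjects adapt

alert, z = drift_alert(baseline_scores, current_scores)
print(f"drift detected: {alert} (z = {z:.1f})")
```

In practice the monitoring would be more elaborate than this, but the point stands: the check has to keep running, and someone has to keep tuning the analytics, for as long as its outputs support decisions.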
