In several recent posts, I presented my concern that people, for private gain at the expense of others, can game the machine-learning or predictive-analytics algorithms currently hyped as part of the benefits of big data. The mechanism for manipulating the algorithms lies outside the security perimeter of the IT supporting big data, including controls such as data-governance standards. The way to game the algorithms is to supply or manipulate the observed data itself.
The entire point of observed data is to record what is happening outside of the security perimeter of the big data system. These observations are subject to whatever nature presents at the time.
By manipulating those observations, a perpetrator can cause the algorithms to make decisions in his favor.
I’ll illustrate with a fictionalized scenario based on the recent trend of flash robberies. In these cases, a large number of people suddenly converge on a store and then, on some instruction, grab as many items as they can and leave. The crowd is too large for store security to stop, and too large and confused for police to mount an effective investigation. The news reports I have seen presented the cases as if everyone involved intended to rob the store, and perhaps that is true. I want to propose a possible alternative explanation.
Prior to the appearance of flash robberies there was a popular trend of flash mobs, where people would receive coordinating text messages on their cell phones telling them to meet at a certain place. Once they arrived, they would receive additional instructions for what to do next. The event would be filmed and posted on YouTube, often with huge success. The resulting popularity motivated a larger population to join the next opportunity to make an even bigger and more impressive mob video.
At this point, suppose someone wants to shoplift from a store but realizes that security cameras would probably catch him. So he uses the flash mob technique to innocently invite a crowd to show up at a store, where they receive an instruction to grab what they can and then run out. The participants may think it is an ordinary flash mob event with some humorous ending, perhaps an instruction to return to the store and restock what was taken. That final instruction never arrives, and they are left confused about what to do next. Meanwhile, one or a few in the mob got what they really wanted without paying for it.
The phenomenon of a flash mob is a crowd implementation of an algorithm: the crowd has been conditioned to behave in a certain benign (or even beneficial) way, but the algorithm is then exploited for a completely different and malevolent purpose. The reprogramming of the algorithm consisted only of changing the content of the observations (the text messages).
The field of data science is heavily influenced by professional software developers. As I described earlier, the most common definition of data science appears to be a sub-discipline of computer science that specializes in implementing efficient algorithms for data whose volume and velocity challenge available resources.
Within the broader computer science discipline there is an expectation that a project involves code and data. With this expectation, code captures algorithms, while data carries the inputs and outputs of those algorithms. This dichotomy explains why many computer science projects prefer to think of the database as outside of their project: the database should only be a repository of data, not an implementation of algorithms.
Code requires very rigorous standards for a development life cycle involving separate phases of maturity, such as development, integration, testing, pre-production, and production. Each phase has its specific policies for the types of tests and modifications that are permitted or required. The culmination of the process delivers a trusted algorithm that can take data and present results.
Meanwhile, in this computer science point of view, the data leads a different life: a cleaning and ingest process assures the high quality of the data accessed by the algorithms. Once the data is available to the algorithms, it is trustworthy but otherwise carries no intelligence about its purposes.
This simplified view helps in the project management of large software projects. Unfortunately, it is not realistic. For almost the entire history of applied computer science, data has been able to implement key behavior of algorithms.
Consider an algorithm that converts words for different colors into specific numbers where some words are synonyms for the same number.
One approach to building the algorithm is to write code with some compound statement (such as a case, switch, or nested if blocks) so that each possible option is explicitly listed in the source code. Although this code can get lengthy, a developer or his peers can look at the code and understand the algorithm well enough to predict the numeric value returned for each possible word.
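A minimal sketch of this explicit-branching approach (the specific color words and numbers here are hypothetical, chosen only for illustration):

```python
def color_to_number(word):
    """Map a color word to a number with explicit branching.

    Every option appears directly in the source code, so a reviewer
    can predict the result for any word just by reading the function.
    """
    word = word.lower()
    if word in ("red", "crimson", "scarlet"):   # synonyms share one value
        return 1
    elif word in ("green", "emerald"):
        return 2
    elif word in ("blue", "azure"):
        return 3
    else:
        raise ValueError(f"unknown color word: {word}")

print(color_to_number("crimson"))  # prints 1
```

The entire behavior is frozen into the code; changing it requires a new release through the full development life cycle.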
A more common approach is to populate a data structure such as a dictionary, where the word is the key and the number is the value returned for that key. Now what appears in the code may be a simple statement referencing a particular key in the dictionary. Usually, the population of the dictionary occurs elsewhere, so a glance at the code will not provide enough information to predict what value will result for any particular word. A key part of the algorithm is captured in the data of the dictionary. Often that dictionary can change during runtime in production.
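The same mapping in dictionary form might look like this sketch (again with hypothetical words and values):

```python
# The mapping lives in data, not code; reading the lookup line alone
# does not reveal which number any word produces.
COLOR_CODES = {}  # typically populated elsewhere (config file, database, ...)

def load_color_codes(entries):
    """Populate the dictionary; in production this might read a file or table."""
    COLOR_CODES.update(entries)

def color_to_number(word):
    return COLOR_CODES[word.lower()]  # the algorithm's behavior is in the data

load_color_codes({"red": 1, "crimson": 1, "green": 2, "blue": 3})
print(color_to_number("crimson"))  # prints 1

# The dictionary can change during runtime in production:
COLOR_CODES["crimson"] = 99        # the same code now behaves differently
print(color_to_number("crimson"))  # prints 99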
A very simplistic generalization of predictive analytics or machine learning is that they involve the dynamic and automated generation of dictionaries that use a large number of keys to return particular values. The software algorithm only defines the approach for populating the dictionary. The data determines the actual behavior of the dictionary.
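Under that simplistic view, a toy "learner" is just a rule for populating a dictionary from observations; a minimal sketch, using a majority vote over hypothetical (word, number) observations, shows how manipulated observations rewrite the behavior directly:

```python
from collections import Counter, defaultdict

def learn_mapping(observations):
    """Build a word->number dictionary automatically from observed
    (word, number) pairs, taking the most frequent number per word.
    The code fixes only the learning rule; the data fixes the behavior."""
    votes = defaultdict(Counter)
    for word, number in observations:
        votes[word][number] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}

honest = [("red", 1), ("red", 1), ("blue", 3)]
print(learn_mapping(honest))      # {'red': 1, 'blue': 3}

# Someone who controls enough observations rewrites the dictionary:
poisoned = honest + [("red", 9), ("red", 9), ("red", 9)]
print(learn_mapping(poisoned))    # {'red': 9, 'blue': 3}
```

No code changed between the two runs; only the observations did.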
Other cases of data providing algorithms include the match patterns in regular expressions, script or macro languages, and databases. Databases in particular offer a wealth of opportunities to produce useful algorithms with SQL select statements that join multiple tables, including tables populated like the above dictionary to provide replacement data for a matching key value. A very compact implementation of an algorithm can be expressed as a select query, while the behavior of that algorithm depends on the contents of those tables.
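A small sketch of the database case, using Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE observations (id INTEGER, color TEXT);
    CREATE TABLE color_codes  (color TEXT, code INTEGER);  -- the 'dictionary'
    INSERT INTO observations VALUES (1, 'red'), (2, 'crimson'), (3, 'blue');
    INSERT INTO color_codes  VALUES ('red', 1), ('crimson', 1), ('blue', 3);
""")

# A compact 'algorithm' expressed as a join; its behavior lives in color_codes.
query = """
    SELECT o.id, c.code
    FROM observations o
    JOIN color_codes c ON c.color = o.color
    ORDER BY o.id
"""
print(con.execute(query).fetchall())   # [(1, 1), (2, 1), (3, 3)]

# Changing the table data changes the result of the very same query:
con.execute("UPDATE color_codes SET code = 99 WHERE color = 'red'")
print(con.execute(query).fetchall())   # [(1, 99), (2, 1), (3, 3)]
```

The select statement is the fixed, reviewable part; the lookup table is the part that can change underneath it.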
The value of these dictionary approaches is that the dictionary is available for review and modification at run time in production. We can build privileged user interfaces so that trusted operators can review the contents of the dictionaries and make changes where needed. The fundamental behavior of the algorithm is adjustable during production because the behavior is captured in the data.
Data can capture algorithms, but data is outside of the control of the data scientist, whose tools are limited to software code. The best the code can do is enforce access privileges on dictionaries or apply constraints to data modifications in an attempt to prevent certain kinds of mischief. The behavior remains captured in the data. As a result, manipulating the data can produce new behaviors.
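A sketch of what that "best the code can do" might look like; the class name, the privilege flag, and the constraint are all hypothetical, standing in for whatever access control a real system would use:

```python
class GuardedDictionary:
    """A lookup table whose updates pass through code-enforced checks.
    The checks are illustrative; the behavior still lives in the data."""

    def __init__(self, allowed_values):
        self._data = {}
        self._allowed = set(allowed_values)  # constraint on modifications

    def update_entry(self, key, value, operator_is_privileged=False):
        if not operator_is_privileged:
            raise PermissionError("only trusted operators may modify entries")
        if value not in self._allowed:
            raise ValueError(f"value {value!r} violates the constraint")
        self._data[key] = value

    def lookup(self, key):
        return self._data[key]

d = GuardedDictionary(allowed_values={1, 2, 3})
d.update_entry("red", 1, operator_is_privileged=True)
print(d.lookup("red"))  # prints 1
```

The code can gate who changes the data and what values are legal, but any entry that passes the gate still becomes the algorithm's behavior.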
Often the use of dictionary-type constructs involves some kind of privileged access control over the actual dictionaries. While dictionaries offer the possibility of updates, the expectation is that they will be stable for long periods of time. Often operators must follow specific policies for modifying and testing dictionaries before introducing them into production.
The game appears to change with the recent deployment of predictive analytics and machine learning. These algorithms are expected to react quickly to large volumes of data arriving at high velocity. To accomplish their objectives, the algorithms have to automate the construction and population of dictionaries (using the term very abstractly). Ultimately, the success or risk of these algorithms depends entirely on the trustworthiness of the observations.
Trust in observations comes in two forms. One is whether the observation is an accurate measure of the real world. In my previous post, I described ways that someone can place something in front of the sensor to present a simulation of reality. The other more profound trust issue is whether the accurate observations are conforming to our expectations of how the world is supposed to work.
In the above flash robbery scenario, the appearance of a crowd at the store is not completely unrealistic to the store, and the texted message for a flash mob is not unrealistic to the participants. What went wrong (at least in my fictional scenario) is that both groups were fooled by their assumed model of how the world was working at that moment. The data was manipulated to change the algorithms for the benefit of someone who was neither a member of the store's staff nor one of the volunteers participating in the flash mob.
In my last post, I noted how impressed I was that the relatively obscure crime of ATM-card skimming could be sophisticated and skilled enough to build and hide an industry that produces high-quality replica veneers for ATM interfaces to conceal skimming electronics. I do not doubt that well-financed people can come up with equally elaborate strategies to manipulate crowds or trends to trick predictive analytics into making decisions in their favor, at the expense of both the owner of the predictive analytics and the customers that owner is normally trying to serve.
Such schemes have not appeared yet. They will surprise us, just as the sudden introduction of flash robberies did. Unfortunately, the scale made possible by large-scale implementations of big data analytics will make the surprise that much more shocking.