My last two posts (here and here) were quick responses to recent observations that share a common theme: the problem of deliberate deception in data. I tied both back to longer posts I wrote about a problem lurking in the recent enthusiasm for the promise of big data analytics. That lurking problem is deliberate, informed deception.
Deliberate informed deception using data is much like computer and network hacking, where people use their knowledge of computing or networking technology to construct special code that causes remote systems to behave according to their plans, often for malicious purposes.
In the case of code-based deceptions, it took the computer and networking industry decades to build a culture of security to protect its infrastructure from attacks. Initially, we expected that security could be assured with better designs that removed previously exploited vulnerabilities. Alternatively, we assumed that we could fight malicious code with policing code such as virus checkers and penetration-testing scanners. While these strategies improved the security posture of computers and networks, they were not enough. Eventually we learned that we need continuous investment in a human-labor-intensive industry now known as cyber-security. No fixed implementation can assure complete security. We need continuous human monitoring and forensic investigation to discover new attacks before they can cause much damage. Today, we devote a significant portion of IT budgets to the overhead of protecting IT.
IT security exists to combat deception. This battle is never-ending, with a rough balance between two competing forces. The forces of deception have at their disposal a near-infinite number of not-yet-discovered vulnerabilities; the universe of missing data is vastly larger than the data we have on known exploits. On the other hand, the forces of cyber-security have the advantage of financial support for continuous security operations, including significant funding for full-time staffing. The cyber-security forces outnumber and outspend the forces of deception. The relative advantages of the two sides are roughly even, so we are at a steady state where neither will ever go away.
Cyber-security is a net burden on the economy. It is the overhead we must pay to protect our infrastructure from the ever-present risk of attack by deceptive actors who have at their disposal an endless supply of not-yet-discovered (zero day) vulnerabilities.
Ultimately, all IT is about data. Cyber-security investments also protect data technologies, but they focus primarily on technology vulnerabilities that are physical or software in nature. Cyber-security is a necessary component of protecting data, but it is not sufficient. Even with satisfactory cyber-security, there should be concerns about the security of the data itself.
An extra level of security needed for data is data governance. Data governance is about establishing rules to assure trust in the data. Trust includes confidence that
- data came from the proper source
- the source prepared the data properly according to agreements
- the data is secure at the destination
- the destination will handle the data according to agreements
Data governance involves negotiated contracts for how data is handled from one end to the other. These contracts will require periodic audits to detect violations and periodic reviews to establish that the contracts remain fully relevant to the current data streams.
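Parts of these contracts can be enforced mechanically. As a minimal sketch (the shared key and record format here are hypothetical, not taken from any real system), a shared-secret HMAC lets the destination verify that data came from the agreed source and was not altered in transit:

```python
import hmac
import hashlib

# Hypothetical shared secret negotiated as part of the data-handling contract.
SHARED_KEY = b"contract-negotiated-secret"

def sign_payload(payload: bytes) -> str:
    """Source side: attach a signature attesting to origin and integrity."""
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()

def verify_payload(payload: bytes, signature: str) -> bool:
    """Destination side: check that the data came from the agreed source."""
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

record = b'{"sensor": "poll-42", "value": 7}'
sig = sign_payload(record)
assert verify_payload(record, sig)             # untampered data passes
assert not verify_payload(record + b"x", sig)  # altered data fails
```

A check like this covers only the first trust item, that data came from the proper source; auditing whether the source prepared the data properly still requires the human reviews the contracts call for.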
Data governance adds security value beyond what IT security provides. One way to distinguish the two concepts is that IT security defends the hardware and software infrastructure, while data governance oversees the owners and operators of that infrastructure. Both have the intention of policing against deception, deliberate or accidental.
The combination of data governance and IT security is still insufficient to protect against deception. The problem I illustrated in the past two posts occurs where deception takes place outside of the surface containing the IT infrastructure and its owners and operators. The attack occurs outside of what we think of as the attack surface for IT security.
We invest in data technologies to obtain information about the real world, often about human activities within it. The data we seek are observations of the real world and of the real people residing in it. The interface between the real world and the data system is the field of view of the initial sensor. That sensor is a physical and software instrument that collects observations of the real world. Data governance and IT security protect against deception that may exist within that sensor or in any of the subsequent operations on the data.
Even with perfect protection from data governance and IT security, there remains the opportunity for deception in the real world that lies within the field of view of the sensors. We need a different level of overhead to protect against deception in the real world in view of the data sensors.
I am distinguishing data governance from IT security even though both could be thought of as part of some broader concept of cyber-security. Distinguishing security into different levels of hardware, software, owner/operator, etc., allows us to recognize the different strategies we need to employ to achieve that security. A universal covering concept like cyber-security can blind us to what we are missing. In particular, I think the common usage of all-encompassing cyber-security overlooks or under-appreciates the problem of datum governance.
By datum governance, I am referring to the problem of securing the sensors’ field of view of the natural world (or of human behavior) from deliberate deception. A fundamental assumption in data analytics is that the data represents natural and unperturbed behaviors or observations. The success of the analytics depends on its not being fed deceptive data.
For example, this talk presents the synergy between data science and social science. One of the points raised is the social scientists’ unique contribution of carefully designing surveys so as not to give away the study’s intention, because people tend to answer in a way that confirms what they think is the goal of the study. The goal of data collection is to get data that is as unbiased as possible.
The careful design of polling experiments for social science is an example of what I call datum governance. The actual observations (poll responses) are outside of the direct governance of the data system. IT security and operational policies have no control over the field of view of the sensors; those policies protect the data only once it is received. Datum governance is about ensuring that the data accurately reflects the natural state of the world (or the human social environment). The social scientist’s datum governance is the care he takes to design the experiment, such as the wording of poll questions, so that he does not bias the data.
As I illustrated in a comment to my previous post, there are many ways that third parties can deliberately introduce bias without the experimenter’s knowledge. This technique exploits exactly what the social scientist attempts to avoid. The attacker can interact with the study subject in a way that biases his subsequent responses. Conceptually, an attacker aware of the polling can contact the polled individual first and provoke a strong emotional response (anger or cheer) just before the experimenter contacts the same person. That emotion would then influence or bias responses that the experimenter assumes will be unbiased, or at least random in a way that cancels out over a large population. Although this conceptual example is unlikely, much of the social-science-oriented data science involves sampling with the assumption that the randomness of the sampling will avoid biases.
Mining social media feeds for social science experiments implicitly assumes that the population is unaware of the actual study. A more fundamental assumption is that the population is not biased to behave in a way that will affect the social-science analysis. Datum governance challenges this assumption.
Modern social networking services (Twitter, Facebook, etc.) offer APIs that allow developers to build bots to automate certain interactions. These bots can automatically like/unlike, follow/unfollow, or even create reply messages. These actions can have an impact on the active user. The bots could exaggerate some social trend, making it grow faster or slower than it would if only humans were involved. More specifically, the bots could deliberately manipulate the sentiments of active human participants in a way that affects their future actions on that same social media. People will be talking about something the bots want them to talk about. Alternatively, their discussions of a topic may be responses to emotions provoked by the bots. The bots can change the participants’ general emotions in a way that will influence their future interactions on unrelated topics. A study of social media for evaluating some marketing strategy can be invalidated by a bot attack that emotionally disturbs a significant number of the targeted group.
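To make the amplification concrete, here is a toy simulation (every number in it is invented for illustration, not drawn from any real platform) of how a uniform emotional nudge from bots shifts the aggregate sentiment an analyst would mine, even though no genuine human opinion has changed:

```python
import random

random.seed(0)

def measured_sentiment(n_humans: int, bot_nudge: float) -> float:
    """Average sentiment a naive social-media miner would record.

    Each human starts with a random sentiment in [-1, 1]; `bot_nudge`
    models the emotional push from automated likes and replies
    (0 = no bots involved).
    """
    scores = []
    for _ in range(n_humans):
        base = random.uniform(-1, 1)
        scores.append(max(-1.0, min(1.0, base + bot_nudge)))
    return sum(scores) / n_humans

baseline = measured_sentiment(10_000, bot_nudge=0.0)
attacked = measured_sentiment(10_000, bot_nudge=0.3)
# The analyst sees a sentiment shift with no real change in human opinion.
print(f"baseline={baseline:+.2f} attacked={attacked:+.2f}")
```

The analyst looking only at the mined feed has no way to tell the nudged population from a genuine swing in opinion, which is the heart of the datum-governance problem.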
A future bot may use technology such as described here, where an app reads the facial expressions of a user during entertainment in order to optimize the entertainment experience. The facial cues correlate with physiological changes. The feedback loop of a machine modifying entertainment content based on expressions may be used to emphasize a certain state of mind that persists when the user contributes to social media. Even as different users access different entertainment content, the common emotion-recognition software can provoke consistent reactions at a time when the audience is expected to be subject to social-media data mining (such as the lead-up to national elections). This could provide a means to confound the analysts trying to optimize the messages for their campaigns.
The strategy for confounding polling may be subtle. Political polling usually involves a randomized sample over a small pool, with questions (or other data) used to categorize people by age, political leanings, sex, or ethnicity. The analysts then report results for different cross-tabs and weight them according to their prevalence in the actual population.
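The weighting step can be illustrated with a small sketch (the subgroups, respondent counts, and support rates below are all made up):

```python
# Raw poll responses broken out by a single demographic cross-tab.
# `support` is the fraction answering "yes" within each sampled subgroup.
sample = {
    "18-34": {"respondents": 120, "support": 0.60},
    "35-54": {"respondents": 200, "support": 0.50},
    "55+":   {"respondents": 180, "support": 0.40},
}

# Known prevalence of each group in the actual population.
population_weight = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Weight each subgroup's result by its population share rather than
# its (possibly unrepresentative) share of the sample.
weighted_support = sum(
    sample[g]["support"] * population_weight[g] for g in sample
)
print(f"weighted support: {weighted_support:.3f}")  # 0.60*0.30 + 0.50*0.35 + 0.40*0.35
```

The weighting corrects for sampling imbalance across groups, but it offers no protection at all if one group's underlying answers have been primed; it simply propagates the bias at that group's population share.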
A strategy for confounding political polling could be to target just one of the subgroups. With the example of the above emotion-reading software that mediates an entertainment experience, the software could operate selectively for a particular production that attracts a certain age group or political group. The software could then emphasize an emotional response to certain topics that are likely to occur in an opinion survey (based on current events, for instance). This effect may be short-lived, lasting only a few days, but if a poll happens to sample from this population, the respondents are likely to answer differently than if they hadn’t been primed by the entertainment software.
Certainly this effect is nothing new: people respond to polls not just according to the immediate questions but also according to the recent history of whatever held their attention.
Historically, a very popular new movie may bias the audience’s mood in one way or another that influences how they answer survey questions. A recent example is the movie American Sniper, which quickly became controversial along political lines. During the more heated period of the controversy, it is not hard to imagine that strongly liberal and strongly conservative people’s responses to any political survey may have been biased in part by the controversy ginned up by the movie rather than by the actual content of the polling question.
Such a concurrent controversy is usually not detected or recognized by the analysts. The survey questions, for example, were probably prepared long before the controversy arose, and the blind nature of the polling means the analysts may not realize that the controversy is biasing answers to unrelated polling questions.
The future scenario is one where the content of entertainment is more fluid, supporting an experience that changes according to the watcher’s reactions. The entertainment becomes less like a single cut of a movie and more like a video game. This flexibility allows for adjustments in the algorithms that could take into account not only the watcher’s emotions but also details about the watcher’s demographics. For example, the algorithms could deliberately prime a middle-aged conservative audience to experience (as a group) the entertainment differently than other groups. That difference in particular may be to emphasize some point in the entertainment that has a strong association with a current event or political question that is likely to be polled. If any of this population is polled, they may be influenced by the prior preparation.
My point is that this prior preparation could be deliberate on the part of some group that is aware that certain topics are likely to be polled in the near future. They have a specific goal of influencing the poll, either to bias it in their favor or to discredit it with suspicious cross-tab results.
Although my imagined scenarios may not be very convincing, I’m confident that people will find innovative ways to bias attitudes in specific directions to influence social-science polling that assumes the population is not influenced by the goal of the study. I mentioned that a polling experiment may strive to avoid introducing biases in the wording and sequence of questions. As entertainment becomes more personalized and people start interacting with bots as if they were human peers, the bots could deliberately introduce the very bias the social scientist running the poll is trying to avoid.
For this post, I want to point out the security vulnerability that exists in the living space outside of the IT infrastructure. IT systems may have excellent security of hardware and software, and excellent governance of operations, but still be vulnerable to manipulation by antagonistic actors who can influence the environment being observed by the sensors. The same sensors can receive good clean information for some subjects and very biased information for others. In the above example, the non-targeted subgroups may provide unbiased answers while the targeted subgroup is biased. This deliberate and selective bias introduces a security concern.
Because the security exploit occurs outside of the IT infrastructure and instead depends on the subject being observed, I describe this as the security of the datum instead of the data. Specific observations are vulnerable to exploitation rather than everything observed by the sensors. The malware is in the population being observed instead of in the IT systems.
To combat this kind of problem, we are going to need the additional approach of datum governance to protect the observed population from deliberately inserted biases. The existing concept of data governance is very difficult to implement, but it is at least possible to get strong commitments for how to operate the systems so that they are free of systematic problems. Datum governance is far more difficult because the problem comes from outside of the entire data system. The problem occurs in individual observations. Datum governance requires accountability and scrutiny for each individual observation instead of for all observations from a particular type of sensor.
Data governance may involve a periodic review or audit of practices a few times each year. In contrast, datum governance involves continuous scrutiny of each individual observation to verify that it is free of bias. Alternatively, datum governance may require controls on the population, such as sequestering the population to prevent outside biases.
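What continuous per-observation scrutiny might look like in practice is an open question. One crude sketch (a hypothetical monitor with invented window size and threshold) flags each incoming observation that deviates sharply from a rolling baseline, a stand-in for "this subject may have been primed":

```python
from collections import deque
import statistics

class ObservationMonitor:
    """Hypothetical datum-governance check: flag each incoming
    observation that deviates sharply from a rolling baseline
    built from recent observations."""

    def __init__(self, window: int = 50, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def check(self, value: float) -> bool:
        """Return True if the observation looks suspicious."""
        suspicious = False
        if len(self.history) >= 10:  # need a minimal baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.stdev(self.history)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                suspicious = True
        self.history.append(value)
        return suspicious

monitor = ObservationMonitor()
for v in [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 1.05]:
    monitor.check(v)       # build an unperturbed baseline
print(monitor.check(5.0))  # a wildly divergent response stands out
```

A real primed response need not be an outlier at all, of course, since a skilled attacker would shift answers by plausible amounts; that is exactly why datum governance is so much harder than this sketch suggests.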
I do not know how we can obtain datum governance or whether it is even possible. Nonetheless, I have no doubt there is an increasing risk of actors exploiting the lack of datum governance to manipulate the observed subjects in a way that misleads the analytics into thinking the observations come from unbiased subjects. The manipulations may be clever enough to invalidate the analytic conclusions or to cause costly harm to the people who act on those analytic results.