In earlier posts (such as Workforce participation in activity tracking: addressing frustrations with micromanagement), I discussed using big data technologies to observe and analyze internal activities. In contrast to the frequent examples of big data analysis of external factors (such as market research), this application applies the same techniques to internal operations. In those posts in particular, I discussed using the data as a means to restore more democratic participation in government when most functions of government are performed by unaccountable bureaucracies. My concept was to collect activity data within the agencies so that the general population could at least observe the amount and category of labor the bureaucracy expended in making particular policies or decisions. Even if this information does not expose the actual internal deliberations, it at least provides some means of grading the extent of internal activity that went into decisions. My concept was to make this meta-data available as open data for the public to query through their own tools, finding patterns that may add to political discussions at the legislative or judicial level.
Recently, I encountered a different problem where this kind of detailed data collection may also be helpful. The problem concerns the maintenance and design of user interfaces for data entry into a database. The database itself may be extensively designed to assure the reliability and accuracy of the data (for example, using ACID transactions over a well-normalized schema). The application will also have middleware that enforces business logic, and front-end validation to protect the data. All of this data is directly relevant to the mission, and it is stored in a relational database.
The challenge, especially for very mature applications (ones that have been in service for a long time), is to improve the design to add new functionality without disrupting users’ existing experience with the interfaces. Designing an update to the user interface requires extensive libraries of user stories and test scenarios to cover the range of situations that actual users of the system are likely to encounter.
While the database records the successful transactions, there is no record of the attempted transactions that did not succeed, either because the user canceled the transaction or because it failed validation. As a result, designers often have to work from anticipated normal user stories plus an arbitrary collection of edge cases drawn either from well-known historical incidents or from the imagination of the requirements analysts.
Although the application processes each request, both to sequence the data-entry process and to validate the data, the only persisted data from the entry is the final validated result entered into the database. Until fairly recently, it was important to optimize both the data stored and the performance of the user interface, and that optimization minimized the collection of data not essential to the mission. In particular, there is no record of the rejected mistakes, the multiple revisions before a submission, or the canceled entries.
As a result, the persisted data records only the successful entries that fit the design of the process. It does not capture the edge cases that do not easily match the existing data models or user interfaces, and it offers little help in informing designers of what the data-entry person was experiencing between the successful submissions.
Here is where a big data approach can be useful. Frequently, there are discussions of using new technologies to replace older ones. For example, NoSQL databases are often promoted as replacements for legacy RDBMS databases. Compared with an RDBMS, NoSQL can meet new performance goals by scaling out (adding more computers to a cluster) rather than scaling up (obtaining a larger server). Also, NoSQL databases are quickly maturing to incorporate consistency (and other aspects of the ACID goals of databases) while also supporting a familiar SQL language to access the data.
Mission-essential applications, especially ones with confidence earned over many years of use, are unlikely to move to NoSQL for performance reasons unless performance becomes the main impediment to the continued mission. Organizations have a lot invested in tiered applications with business-logic code optimized to leverage RDBMS protections, and they have the confidence that comes from years of surviving many challenging data scenarios. These applications will be around for a long time.
Even as these applications continue in their current form, they face increasing demands for newer and more extensive capabilities. The modern data environment presents an explosion of new sources of data to incorporate into the model to advance the application’s mission. These may include new types of data, or new details that allow relationships with existing data.
Often the internal architecture can handle the increased burden of a more extensive schema design or larger databases. The main challenge facing these applications is delivering new capabilities fast enough to keep up with new data opportunities. While some slowness may result from careful development life-cycles and change-control procedures, a significant impediment to adding new types of data or relationships is the mystery of what kinds of problems the new data will present.
The new data may introduce new integrity issues. Perhaps more importantly, changes to the user or machine interfaces may make previously difficult transactions even more difficult, or even impractical, to complete. Adding a new source of information may result in the loss of a previous information capability: the data may arrive later or not at all. To protect against this possibility, the requirements analysts need information about what current users are actually experiencing with the current application: information about the user’s experience that does not get recorded in the mission-specific database.
NoSQL and other big data capabilities can help these applications even as their architectures remain RDBMS-based. The opportunity is similar to my earlier discussions of activity tracking for bureaucratic work products. It is also similar to the current popularization of the “Internet of Things” (IoT), which introduces sensors into objects that previously left no data trails. In this context, the user interface itself could become a thing that incorporates an IoT-style sensor, transmitting activity data to a big-data store using unstructured or schema-on-read NoSQL databases and associated scale-out processing tools such as those found in the Hadoop ecosystem.
In addition to the occasional transactional traffic for CRUD operations (Create, Retrieve, Update, Delete), the application could generate user-sensing data such as typing speed, time spent on a form, time spent within a particular field, the sequence of fields entered, time intervals between fields or between returns to the same field, multiple navigations through menu options between data-entry actions, and so on. The actual session activity can produce extensive information about what the user is encountering for a particular task. This tracking information will be associated with the successful submission of a transaction, but it will also include information on the abandoned transactions.
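As a rough sketch of what such a sensor might look like (the class and event names here are hypothetical, not from any particular framework), the key idea is that each session emits a schema-light document, including sessions that never produce a database row:

```python
import json
import time

class ActivityTracker:
    """Hypothetical session tracker: records UI events alongside normal CRUD traffic."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.events = []

    def record(self, event_type, field=None):
        # Each event is schema-light, suited to a schema-on-read NoSQL store.
        self.events.append({
            "ts": time.time(),
            "event": event_type,   # e.g. "focus", "blur", "submit", "cancel"
            "field": field,
        })

    def to_document(self, outcome):
        # One document per session, persisted even for abandoned transactions.
        return json.dumps({
            "session": self.session_id,
            "outcome": outcome,    # "submitted" or "abandoned"
            "events": self.events,
        })

# Example: a user edits two fields, then cancels without submitting.
t = ActivityTracker("sess-001")
t.record("focus", "last_name")
t.record("blur", "last_name")
t.record("focus", "case_number")
t.record("cancel")
doc = json.loads(t.to_document("abandoned"))
print(len(doc["events"]))  # 4 events captured, despite no database row
```

The point of the design is that the mission database remains untouched: the tracker writes to a separate store, so the RDBMS schema and its validation logic need no changes.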
This can provide a wealth of useful information for requirements analysis. For successful transactions, the tracking record can reveal the varying degrees of difficulty for that particular type of transaction. The additional information from uncompleted transactions can reveal frustrations within a particular team or office using the tool for their local case load. With the extensive tracking described above, it may be possible to infer edge cases for the requirements analysis. Alternatively, mining this data will distinguish the teams having an easier time with the application from those having more difficulty. When the types of cases a team works with match the topic of a new feature, the requirements team can research that team more carefully to identify edge cases to include in the future design.
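A minimal illustration of this kind of mining, using made-up team names and session summaries: comparing abandonment rates across teams surfaces which team is struggling with the tool.

```python
from collections import defaultdict

# Hypothetical session summaries mined from the activity store:
# each record notes the team and whether the transaction was submitted.
sessions = [
    {"team": "intake", "outcome": "submitted"},
    {"team": "intake", "outcome": "abandoned"},
    {"team": "intake", "outcome": "submitted"},
    {"team": "appeals", "outcome": "abandoned"},
    {"team": "appeals", "outcome": "abandoned"},
    {"team": "appeals", "outcome": "submitted"},
]

def abandonment_rates(records):
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for r in records:
        totals[r["team"]] += 1
        if r["outcome"] == "abandoned":
            abandoned[r["team"]] += 1
    return {team: abandoned[team] / totals[team] for team in totals}

rates = abandonment_rates(sessions)
print(rates)  # intake ≈ 0.33, appeals ≈ 0.67: appeals merits a closer look
```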
I was thinking of this possibility as I watched a pair of demos (here and here) promoting the Apache Spark capability on Azure. Although I recognize that these are demonstrations of the technology and not of the data science, I noticed that some of the conclusions in the demos are not unusual in practice. Personally, I am cautiously enthusiastic about big data technology: enthusiastic because it makes my work easier, but cautious because of the frequent attempt to make a single leap from raw data to actionable intelligence.
The demos illustrate the latter concern when discussing conclusions about crime data in Chicago. I am not taking the conclusions seriously, but I can imagine someone in a more accountable position coming to similar conclusions using similar approaches. The interactive exploration of data is so easy that it encourages laziness.
One of the conclusions was that, in this data of reported crimes, narcotics crimes were almost certain to result in arrest while thefts were much less likely to. The story-telling explanation is that narcotics laws are enforced more aggressively. This is an appealing story that matches current politics, but I think it is very misleading. The source data is reported crimes. Thefts are reported by people other than police officers, and the public reports a theft long after it occurs, leaving very little opportunity for the police to identify and arrest the thief. On the other hand, most narcotics crimes are likely discovered by the police themselves: they are right there when they observe the crime, so they can arrest immediately. Both crimes may be equally enforced, but the statistics favor narcotics arrests because the police are the ones who report the crime in the first place.
Another illustration of story-telling with this data concerns the observation that daily counts of reported crimes peak on January 1, the story being that New Year’s Day is obviously a busy time for criminals. However, the brief demo clearly stated that although the data has time stamps at the day level, the reports themselves were until recently updated once a month, so that all reports in the file would fall on the first of the month. This makes sense for reports of police activity, which account for police workload and effectiveness. The modern analyst attempting to exploit this historic data to observe something about crime is using the data for something it was not originally intended for. Another ambiguity concerns the timing of the reports: does the time stamp of a record represent the start of the month that experienced the crime, or the time the prior month’s report was prepared? My guess is the latter: a report of police workload would likely be dated when it was prepared and thus represent activity for the prior month. This also fits the likely peak of theft crimes during the holiday season in December. The demonstrated conclusions may be useful as discovered hypotheses, but the irrelevance of the data places the burden on the analyst to find more relevant data to test those hypotheses. Testing with the same police data is a form of data fallacy, even if that fallacy is a common practice.
Another conclusion comes from a machine-learning algorithm that predicts the likelihood of arrest for different combinations of features (type of crime, date, location). This bothers me because I can imagine a similarly direct approach of feeding raw data straight into a learning algorithm. Modern ML strategy involves supplying training data to multiple ML algorithms and then selecting the one that makes the best predictions on the testing data (a fraction of the data withheld from training). With a sufficient number of experiments, it is likely that some algorithm will outperform the others, and this out-performance is taken to imply veracity with the real world. There may be no fundamental reason why the chosen model is more realistic; it just happens to get the best score with the test data. I am very reluctant to make any bold conclusions about the future with machine learning. It is useful for predicting past observations, of which there is a finite population. Predicting the future involves predicting missing data, and that cannot be inferred from observations alone.
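The selection effect is easy to demonstrate with a toy experiment (entirely synthetic, not the demo's actual data or algorithms): even when the labels contain no learnable signal at all, picking the best of many candidate "models" on a fixed test set produces a score well above chance.

```python
import random

random.seed(42)

# Labels with no learnable signal: any model's true accuracy is 50%.
test_labels = [random.randint(0, 1) for _ in range(100)]

def random_model_accuracy():
    # A "model" that guesses at random stands in for one candidate algorithm.
    preds = [random.randint(0, 1) for _ in test_labels]
    return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

# Try many candidates and keep the best score on the same test set.
best = max(random_model_accuracy() for _ in range(50))
print(best)  # noticeably above 0.5, yet it implies nothing about the real world
```

The winning "model" looks predictive only because it was selected for its test score, which is exactly the concern raised above about trusting the best scorer among many experiments.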
The demonstration of multidimensional reporting of the data exposed a problem with corrupted data in the source. The data source consists of comma-separated value (CSV) files, where the schema-on-read approach infers fields based on the number of commas preceding them. There are many reasons for this to fail. Commas may appear within fields, disrupting the parsing algorithm. A missing field value may cause adjacent commas to coalesce into a single comma. The file format may change over the period (over a decade of monthly reports), with some months adding or removing fields. These are issues that should be addressed by the schema-reading (data-ingest) algorithm. In the demonstration, the problem instead becomes apparent to the end analyst attempting to interpret the data. This scenario may be justified in a demonstration meant to show the speed of getting from raw data to an actionable conclusion, since it shows how the end analyst can dispose of the dirty data with a filter.
While reasonable for a demonstration, this is not a reasonable practice for an actual analyst. Under good data governance, analysts should not be burdened with the task of cleaning data. They should instead report any anomalies to the teams responsible for designing the data ingest or the schema-on-read algorithm. We should prefer that the analyst request a review of the data processing to correct the corrupted data before proceeding with the report, or at least report a lower confidence in the result that acknowledges the dirty data. In this case of missing or excess commas, the problem can harm the analysis in multiple ways: a discarded record may carry the desired field information in a different part of the record, or an accepted record may carry a value intended for a different field. However, the appeal of modern data tools is that they make it easy and fast for analysts to clean the data themselves, and thus to come up with potentially wrong, or at least inconsistent, results compared with peers who chose to clean the data in other ways.
A decision maker should be wary of accepting a report based on data known to be prone to corruption of some of its records. Inevitably, there will be cases where the analyst underestimates the potential problems with this type of corruption, or deliberately withholds the observation in order to meet a delivery deadline. A common fallacy in big data is the belief that the sheer size of the data overwhelms the occasional errors. That confidence should immediately decline when performing multidimensional analysis that subdivides the data into smaller categories. Eventually the categories may have such small populations of records that a few anomalies can have a major impact on the conclusion.
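The arithmetic behind this decline in confidence is worth spelling out. With invented but plausible numbers (the record counts and dimension sizes below are illustrative only):

```python
# 1,000,000 records with 1,000 corrupt ones: a 0.1% error rate overall.
total, corrupt = 1_000_000, 1_000
overall_rate = corrupt / total
print(overall_rate)  # 0.001

# Now slice by crime type, district, and month (say 30 * 25 * 120 cells).
cells = 30 * 25 * 120
avg_cell = total / cells  # roughly 11 records per cell
# If even 3 corrupt records land in one such cell, that cell's error
# rate is about 27%, easily enough to flip its conclusion.
cell_rate = 3 / avg_cell
print(round(cell_rate, 2))  # 0.27
```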
Another concern is that data cleaning by the final data analyst can lead to inconsistent approaches to cleaning. Different analysts may come up with their own solutions to the same problem. As a result, different analyses of the same source data can present conflicting versions of the truth, a conflict arising from different interpretations of dirty data rather than from anything inherent in the data.
A decision maker needs an additional source of data that reports on the analyst’s efforts. The above scenario presents an opportunity for collecting activity data about the tool user (in this case, the data analyst). The activity tracker would record the data operations, including the actual time-stamped sequence in which new operations were introduced. That sequence can expose the analyst’s disappointment in the initial choice of analysis techniques. The record of multiple, later abandoned, attempts at different solutions can inform a decision maker that the analyst encountered problems with the data that should have been addressed before the analyst worked with it. A regular report on analyst activities can identify patterns showing that some reports are easier to produce than others. Increased difficulty (measured by excess activity) is a clue that the analyst is confronting problems with the data, and that some reports deserve more scrutiny or suspicion from the decision maker.
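One simple way such a pattern could be surfaced (the report names and counts below are fabricated for illustration) is to compare the number of logged operations behind each delivered report and flag the outliers:

```python
from statistics import median

# Hypothetical counts of analyst operations logged per delivered report.
ops_per_report = {
    "monthly_summary": 42,
    "arrest_trends": 38,
    "district_breakdown": 45,
    "weekly_digest": 40,
    "narcotics_vs_theft": 131,  # far more rework than its peers
}

def flag_difficult(counts, factor=2.0):
    m = median(counts.values())
    # A report needing far more operations than the median suggests the
    # analyst was fighting the data, not just writing the report.
    return sorted(name for name, n in counts.items() if n > factor * m)

print(flag_difficult(ops_per_report))  # ['narcotics_vs_theft']
```

A flagged report is not necessarily wrong; it is simply the one that deserves the extra scrutiny discussed above.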
This scenario is an example of how logging the front-end users of mission-essential applications can provide value in advancing the goals of the application. The user logging occurs in parallel with the mainline mission application (such as managing data in a relational database). The logging will include a wide variety of measurements (the sequence of different operations, what was performed in each operation, how much time was spent in each, and so on), so the resulting data will have higher volume, variety, and velocity (the 3 Vs) than the user’s work product of managing relational database records. The data about the analyst’s activities will be bigger data (as measured by the 3 Vs) than the user’s delivered work product, whether that is a report or an update to the relational database.
Legacy applications can benefit from big data approaches without replacing the legacy architecture with new technologies. Instead, big data can augment the application by collecting higher-volume, higher-variety, higher-velocity data about the user’s activity within the application. Analysis of this data can alert decision makers to possible problems with the work products. Correspondingly, it can provide requirements analysts with information about where improvements are needed, or with a more complete library of edge cases to consider for new designs.