The title of this post uses the term accessory to refer to superfluous data: something that is easily observed but has no import. In particular, I return to my imagined doctor’s visit, where the doctor takes in a lot of subjective clues before reviewing the objective data. There is a lot to observe in that initial greeting, including the patient’s clothes and jewelry. For this post I’m using accessory in that sense: the clothing and jewelry of a patient coming in for a routine doctor’s visit.
In an earlier post I talked about BMI (Body Mass Index) as an attractive data point because its measurement is easy and very consistent. I describe this as bright data, which I consider to be closer to the real world than dim or dark data. If a query of certain health conditions or outcomes shows a strong relationship to BMI, then it deserves attention simply because of the high quality of the BMI measurement. As I suggested in that post, there are substantial objections to the findings that relate BMI to certain health problems. That is acceptable because the role of historical-data pattern analysis is to propose new hypotheses; the testing of those hypotheses is a separate activity.
Taking a step back to look at that earlier discussion, I recognize that I like BMI because its observation is easy and repeatable. Such an observation is more valuable than more difficult or costly measurements that are prone to large variations.
The patient’s clothing and jewelry mentioned above are also easily and consistently observed. We could include them in the data store along with BMI. I suspect they would serve a very similar purpose. Such observations might include how well tailored the clothing is; whether the dress is business formal, business casual, or leisure; whether the clothing is freshly pressed; what color each piece of clothing is; what kinds of jewelry are worn; and what type of hairstyle the patient has. It should be possible to come up with some widely recognized categories for these observations and then enter them into the data store.
It is not hard to obtain this data. It is feasible to have some type of accessory recognition software to read this information off of a video image. It is possible to add this data about the patient to accompany the BMI score and vital measurements.
With this data, we have the opportunity to group patients by different categories of the accessories the patient presents during routine doctor’s visits, or follow-up visits when there is some ongoing concern or treatment. With a large enough data store of all patients, we may find some patterns about certain color or accessory combinations that can be associated with certain conditions or with certain procedure outcomes.
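To make the grouping idea concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the field names (`bmi_band`, `dress`, `outcome`), the category labels, and the handful of records are assumptions, not real patient data or a real schema. It only shows the shape of a query that uses accessory categories as dimensions.

```python
from collections import defaultdict

# Hypothetical visit records. Field names, category labels, and values are
# all made up for this sketch; "outcome" is 1 if some condition was recorded.
visits = [
    {"bmi_band": "high", "dress": "business_formal", "outcome": 0},
    {"bmi_band": "high", "dress": "business_formal", "outcome": 0},
    {"bmi_band": "high", "dress": "leisure", "outcome": 1},
    {"bmi_band": "high", "dress": "leisure", "outcome": 0},
    {"bmi_band": "normal", "dress": "business_formal", "outcome": 0},
    {"bmi_band": "normal", "dress": "leisure", "outcome": 1},
]


def outcome_rate_by(records, *dimensions):
    """Group records by the given dimensions.

    Returns {group key tuple: (record count, share of records with outcome 1)}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [count, positive outcomes]
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key][0] += 1
        totals[key][1] += record["outcome"]
    return {key: (n, positives / n) for key, (n, positives) in totals.items()}


# Query by an accessory dimension alone, then combined with the BMI band.
for key, (n, rate) in sorted(outcome_rate_by(visits, "dress").items()):
    print(key, n, round(rate, 2))
for key, (n, rate) in sorted(outcome_rate_by(visits, "bmi_band", "dress").items()):
    print(key, n, round(rate, 2))
```

With only a few records per group, any difference in rates is noise. A pattern from a query like this is at most a candidate hypothesis, and the group counts are returned alongside the rates so that thin groups are easy to spot.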
On finding a pattern related to a certain combination of accessory traits, we can propose a new hypothesis that may be clinically useful. I emphasize my view that analysis of historical data can only suggest hypotheses. These newly discovered hypotheses always require testing through experimentation.
I can imagine that there may be information hidden in these accessories. After all, the patient chose the particular presentation from their available wardrobe with the understanding that at some point in the day it would be seen by the doctor. The available wardrobe and the specific choice of the day reflect something about the inner life of that person. Perhaps over a large group of people, we may find similar choices by people who share similar health risks or similar outcomes for certain types of treatments. From a hypothesis discovery point of view, we can imagine there may be some hidden truth to the patterns we observe when these accessories are dimensions of our queries. We could recommend testing such a hypothesis just as we would recommend testing any other similarly convincing hypothesis.
Accessories could be more than just deliberate choices of the day such as clothing, jewelry, or hairstyle. The concept can be expanded to include conditioned behaviors such as speaking quality, vocabulary, and walking stride: traits that are individualistic expressions that take time to develop. It could be extended further to unconscious individualistic qualities such as bone structure, facial features, and cranial features.
Many of these can be easily measured, at least well enough to place them into broad categories. We can use them in our data analysis of all patients to find patterns. Go back to the BMI example. Perhaps the health outcome predictions based on BMI would become more accurate if we included accessory data. A patient with a high BMI who dresses in high-quality, well-tailored, well-maintained clothing may tend to be healthier than high-BMI patients with other accessory combinations. Perhaps including the color of the clothing could improve the accuracy even more.
Allow me to pause here for the benefit of a reader who may find this post without the context of other posts. I have no expertise in the medical field. I’m just using the medical example from a layman’s perspective because it makes it easy to communicate my thoughts about types of data. The accessory data is essentially extraneous data that is nonetheless easy to observe and categorize.
Many big data stores have a lot of accessory data. Often it is the only information available. Consider, for example, marketing that relies on customer surveys where people report their magazine subscriptions, number of automobiles, number of bicycles, number of annual vacations, etc. What does this information have to do with the customer’s decision to buy another appliance? An analysis could suggest some patterns that could be rationalized as a measure of readiness to spend income, and that suggestion may be tested by some new marketing plan. Perhaps in this case, a more direct piece of information would be how much money per month a person normally spent on products at the time he bought this particular product. Such information probably would never be volunteered by the consumer. This is especially true for the consumer who pays close enough attention to his budget to be able to provide an accurate answer. The information has to be gleaned indirectly.
The promise of big data is that it can give us insight into what is going on in the real world. We can discover hypotheses that we can use to make decisions. I assert that any decision based on a discovered hypothesis is by definition a test of that hypothesis: an experiment. But in practice, the decision involves some actual definitive choice of action.
Allow me to clarify by returning to the health care metaphor. Prior to the Affordable Care Act, insurance companies could use discovered hypotheses to deny coverage in low-rate plans. After the Affordable Care Act, policies can use these hypotheses to deny access to care. In both cases, the choice is an experiment to test the discovered hypothesis.
The promise of big data is that it could give us answers. The problem is that big data is constrained by available data. Available data lacks information that we ideally want. We may have bright data (well documented and controlled, or highly accurate and precise) that measures something very remote from the causal relationships we seek. Or we may have dim (inaccurate, imprecise) or dark (model-generated instead of observed) data that is closer to the desired causal relationship. It is rare to have bright data right where we need it: at the causal relationship to the desired measure.
One of the health care cost problems is the detection and treatment of cancers. Even assuming an ideal detection process, there remains an uncertainty as to whether the detected cancer (or tumor) will progress to become more serious or will progress to the point of becoming the cause of the patient’s eventual death. Current practice is to treat all (or the vast majority) of detected cancers aggressively. A cost-saving approach is to be more selective about treatment. Today we can’t do this very well, and we recognize that we are over-treating many cancers (how much over and how many is debatable).
In the future, we hope that more extensive data collection and storage can result in queries that will discover new hypotheses about which cancers to treat aggressively and which to ignore. The problem is that most of this data may be accessory data (analogs of what type of clothing the patient is wearing) that is very precise, or the data is remotely relevant and very imprecise.
I note that we have similarly high expectations for data collections for security purposes, law or regulation enforcement, or the production of new regulations. There is no doubt that the available data will qualify for the word “big”. The problem is that most of the available data is probably either relevant but very imprecise, or very precise but irrelevant accessory data.
By its very nature, big data will permit discovery of patterns in available data. Analysts can then rationalize those patterns to suggest a discovered hypothesis. Decision makers may act on the discovered hypothesis to make decisions. Those decisions test the hypothesis, but the experiment is carried out as actual actions with real consequences.
Another way to describe the big data problem is that we really don’t know what data we should include. The project of big data is to discover new hypotheses. We want big data to surprise us. One way data analysis can surprise us is by having access to surprising data. I mentioned the ideal of having bright data that is highly relevant in a causal way to the desired measurement. But the starting point is that we have yet to discover the hypothesis that proposes this causal relationship. The hypothesis is likely to come from data we may initially not recognize as important. That is the message of my analogy of accessory data about a patient’s choice of clothing. We don’t include it because we can’t imagine it being relevant, and because it is not included, it can never figure in future discoveries.
I suggest we may be better off including accessory data, especially if it can be categorized in well documented and controlled ways to produce accurate and precise categories. We may eventually find psychological factors that may impact health outcomes and those factors may be revealed in certain dressing behaviors. Alternatively, and just as importantly, we can use these measures as control groups to show that an otherwise strong pattern can be replicated using extraneous information. We may find that BMI’s relationship to a certain type of cancer is no stronger than a preference for the color red for scarves or ties.
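The control-group idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the binary flags (`high_bmi`, `wears_red`), the `diagnosis` outcome, and all eight data rows are invented, and the phi coefficient (the correlation between two 0/1 variables) is my choice of association measure here, not something a real analysis would be obliged to use.

```python
import math


def phi_coefficient(xs, ys):
    """Phi coefficient between two equal-length sequences of 0/1 values."""
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If either variable never varies, the coefficient is undefined; report 0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def rank_by_association(features, outcome):
    """Return (feature name, phi) pairs, strongest absolute association first."""
    scored = [(name, phi_coefficient(vals, outcome)) for name, vals in features.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Invented data: one row per patient, 1 = trait or diagnosis present.
diagnosis = [1, 1, 0, 1, 0, 0, 1, 0]
features = {
    "high_bmi": [1, 1, 1, 0, 0, 0, 1, 0],   # the measure we take seriously
    "wears_red": [1, 0, 0, 1, 0, 1, 1, 0],  # the accessory used as a control
}

for name, phi in rank_by_association(features, diagnosis):
    print(name, round(phi, 2))
```

In this made-up data the accessory flag is associated with the diagnosis exactly as strongly as the BMI flag. That is the situation where the accessory earns its place in the data store: it warns us that the BMI pattern, at this sample size, is no more convincing than a pattern in scarf color.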
Bright data like accessory data or semi-diagnostic BMI values are very useful specifically because they are so precise and repeatable. Most of the relevant data we have is imprecise and costly or difficult to repeat. As we discover new patterns or hypotheses based on this relevant data, it may be useful to have access to more precise but irrelevant data. That could help us avoid making mistakes. It might also lead to stronger decisions, where the additional information may come in the form of subconscious messages the patient is sending.
I admit that clothing and jewelry choices are highly unlikely to be useful clinically. However, by admitting it as allowable data to store, we loosen the restrictions for other types of information that may eventually prove to be clinically useful. With proper documentation of the nature of the data, analysts should have access to easily available and readily repeatable observations.