Databases motivate philosophy with multi-valued logic anticipated by Buddhist thinkers

This article describes the concept of multi-valued logic (having values other than true or false) as a philosophical debate between western and eastern philosophies: Aristotelian and Buddhist. Western philosophy reveres the principle of excluded middle (that allows only two possibilities of true or false) and the principle of non-contradiction (something can not be both true and false). In contrast, Buddhist philosophy allows for possibilities of both true and false, and of neither true nor false. The article even describes a fifth possibility of ineffability: something that we can not even discuss. The article's observation is that modern western philosophy is more accepting of these possibilities and thus converging on Buddhist thought.

If there is a convergence of western logic to eastern logic, it may be coincidental.   I think the recent importance of multi-valued logic is a direct consequence of database technologies.  In particular, the normalization process that strives to remove redundancy in data records began to produce more ambiguous results.   In data design, we have a variety of relationships described as n-to-m where n and m can be different ranges of numbers.

Data design usually involves tables of records where each record can have multiple columns and each column can have many values. The process of normalization removes redundant values by creating a new table that lists all of the possibilities, and that table offers primary keys for other tables to reference with a foreign key. The normalization goal is to produce a relationship of many records referencing the same primary key so that the common value only needs to be stored once. Usually this involves data such as text data that is far richer than a simple true or false value, but the reference table could have just two values: “true” has a primary key of 1, and “false” has a primary key of 0.

The article describes a way to express the Buddhist concept of catuskoti (four corners) in terms of sets. True and False may be described as single element sets {T} and {F} respectively. The use of sets allows for other possibilities such as {T, F} that is both true and false, and {} empty set of neither true nor false.
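As a minimal sketch of this set representation, the four corners can be enumerated as the subsets of {T, F}. The names `CORNERS` and `describe` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# The four corners (catuskoti) as the subsets of {"T", "F"}.
VALUES = ("T", "F")
CORNERS = [frozenset(c) for r in range(len(VALUES) + 1)
           for c in combinations(VALUES, r)]

def describe(value: frozenset) -> str:
    """Name the corner that a set of truth values represents."""
    if value == {"T"}:
        return "true"
    if value == {"F"}:
        return "false"
    if value == {"T", "F"}:
        return "both true and false"
    if not value:
        return "neither true nor false"
    raise ValueError(f"not a subset of {{T, F}}: {set(value)}")

for corner in CORNERS:
    print(sorted(corner), "->", describe(corner))
```

Ineffability does not appear here: it is not a subset of {T, F}, which is one way to see why the article treats it as a separate, fifth possibility.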

The article goes on to describe another concept of ineffability, something that can not be talked about at all. That concept itself is a contradiction because mentioning that something exists that we can not talk about requires talking about that something. See the article for a better explanation of the contradiction of acknowledging ineffable concepts.

The article discusses the problems of contradiction and ineffability in terms of the goals of logic to evaluate the truth of a statement.

Once someone mentions sets, I think of databases.  Even if Buddhist philosophy did not exist, database normalization presents the real problems of multi-values and orphan data.   Database design does offer tools to enforce constraints to prevent these from occurring especially in normalized data so that we end up with every foreign-key referencing a single actual record in another table.   The problem is that we often encounter data that doesn’t fit right either due to poor design in maintaining the data, or due to poor design in acquiring data.

The goal of normalization is that a foreign-key reference will always be present, it will always find its matching primary key in the other table, and it will find exactly one primary key.    The database constrains any primary key to be unique and this eliminates the possibility of multiple results.   The database can also constrain any foreign key to match one of these primary keys and this eliminates the possibility of lack of matching values.
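The two constraints can be sketched with Python's built-in `sqlite3` module. The table names `truth_value` and `claim` and the helper `rejected` are assumptions made for this illustration; the reference table reuses the two-value example of “true” as 1 and “false” as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE truth_value (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE claim ("
    " id INTEGER PRIMARY KEY,"
    " statement TEXT NOT NULL,"
    " truth_id INTEGER NOT NULL REFERENCES truth_value(id))"
)
conn.executemany("INSERT INTO truth_value VALUES (?, ?)", [(0, "false"), (1, "true")])
conn.execute("INSERT INTO claim (statement, truth_id) VALUES ('sky is blue', 1)")

def rejected(sql: str) -> bool:
    """Return True when the database refuses the statement as a constraint violation."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A second primary key of 1 would allow multiple matches, so it is refused.
duplicate_key = rejected("INSERT INTO truth_value VALUES (1, 'also true')")
# A foreign key of 2 would match nothing, so it is refused as well.
missing_parent = rejected("INSERT INTO claim (statement, truth_id) VALUES ('x', 2)")
print(duplicate_key, missing_parent)
```

With both constraints in place, every `claim` row resolves to exactly one `truth_value` row, which is the database equivalent of insisting on exactly one of true or false.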

The problem is that through real-world operation, we encounter records that can no longer meet these constraints when records are added or deleted. In databases, we recognize that there are times when records representing real information can not fit the constraints.

Routine operation of the primary table can result in the need to eliminate a value in the primary table and this can orphan all of the foreign keys that reference that entry. An example may be a table of orders where each order has multiple items. When we attempt to remove the order (such as when an order is canceled) we have to deal with the items that reference this order or else the items will have a foreign key that has no matching primary key. Until we clean up these records within a transaction that accomplishes multiple operations at the same time, the records will have a status similar to the ineffable statement discussed in the article. We can not talk about these records because their foreign keys do not match valid primary keys. To solve this problem we block readers or writers from accessing these records until the records are clean again. During this period of blocking within a transaction, we are dealing with a real data problem with something that does not fit the ideal concepts of having a valid exclusive relationship. In this scenario we hide the data during the transaction until all of the data is clean. While the data is hidden, no one can talk about it. While not exactly the same concept as philosophical ineffability, it does show that a different rule of logic applies to information locked inside a transaction.
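A minimal sketch of the canceled-order case, assuming hypothetical `orders` and `items` tables in SQLite. The foreign key is declared `DEFERRABLE INITIALLY DEFERRED` so that the check is postponed until commit, which lets the orphaned state exist briefly inside the transaction. The sketch uses a single connection, so it shows the temporary state but not the blocking of other readers, which depends on the isolation mechanism of the particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " order_id INTEGER NOT NULL"
    "   REFERENCES orders(id) DEFERRABLE INITIALLY DEFERRED)"
)
conn.execute("INSERT INTO orders VALUES (1)")
conn.executemany("INSERT INTO items (order_id) VALUES (?)", [(1,), (1,)])

# Counts items whose foreign key has no matching primary key.
ORPHANS = (
    "SELECT COUNT(*) FROM items"
    " WHERE order_id NOT IN (SELECT id FROM orders)"
)

conn.execute("BEGIN")
conn.execute("DELETE FROM orders WHERE id = 1")       # cancel the order
orphans_inside = conn.execute(ORPHANS).fetchone()[0]  # items now point at nothing
conn.execute("DELETE FROM items WHERE order_id = 1")  # clean up the children
conn.execute("COMMIT")                                # the deferred check passes here

orphans_after = conn.execute(ORPHANS).fetchone()[0]
print(orphans_inside, orphans_after)
```

Inside the transaction the two items are orphans; once the transaction commits, no trace of that state remains for anyone outside it to observe.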

The ineffable concept may apply within the transaction when the child records exist with foreign keys that match a primary key that we are attempting to eliminate (the primary key is no longer meaningful). This problem only exists behind the scenes inside stored procedures with transactions that hide the data being processed. However, within the transaction the problem exists that we have records that reference a no-longer valid primary key. Because we need to confront this problem, we need to be able to talk about it. Philosophical epistemology needs to explain how database technologists are able to talk about what is going on within such a transaction.

Although the alternative reality within a transaction may present an example of an ineffable concept, I don’t think it matches the concept described in the above article. A transaction involves temporary ineffability that is destined to be resolved, while philosophical ineffability is more permanent and something that we must accept. At the end of this post, I will describe where I think data projects confront permanent ineffability.

Contradictions characteristic of Buddhist thought occur in data.  Usually these contradictions arise when we attempt to match records from multiple data sources.

I had a recent experience that illustrates this problem. I used one of the online person-search services to find my own name. I ended up finding 4 copies of myself with the same name and age but with three of these persons living at addresses I had in the distant past. The four records were referencing the same person but located this person in four places. The reason there were four results is that I used a search service that pulls data from multiple sources. The sources offered contradictory data.

This is a real-world example of the problems that database engineers and data scientists confront all the time. We need a way to talk about them, to understand them. Although data in databases lacks the physical reality that is frequently the topic of philosophers, the data does represent a practical topic that we have to confront and solve. We need a way to talk about data relationships that do not fit neatly into two choices of true or false.

Data relationships resulting in multiple values such as the above example are analogous to the philosophical problem of a statement being both true and false. The above examples can not be easily cleaned up because they represent different facts from different sources that claim the data to be valid. In the example, I had in fact previously occupied those addresses. If this query came from a data warehouse, we could blame poor database design for not using an appropriately unique primary key that recognizes a valid time period as well as a person’s identity. Instead the query came from a heterogeneous search engine that combined results from multiple sources. A similar result can occur in modern concepts of data lakes as opposed to data warehouses. In these systems, the analyst has to confront multiple conflicting values. Blaming the poor design does not excuse the practitioner from dealing with the problem.

Data relationships can have no values.   In databases, this is the null value: a reference that is explicitly empty.  Although ideally any foreign key must be constrained to match an existing primary key, there are reasons to allow the reference to be null.   The relationships between tables may be ad hoc relationships that are not governed by any constraints.   We allow for the possibility that something can lack an external reference.   This occurs frequently when we attempt to match data from multiple sources.

An example may be when we attempt to match people who register motor-vehicles to people who are registered as occupants of a property.   The vehicle registrant may list an address where he is a guest of the property where he is staying.   Demanding that the two records match is similar to the demand that something be either true or false: we must allow the option of “neither” in order to not lose the record of a real registration.

In database queries (SQL), we use outer-joins to tolerate these conditions, and we handle the nulls.   This null-handling in outer-joins is a real world cognitive experience.   To be relevant to explain this experience, philosophy needs to explain null-handling: accepting nulls amounts to accepting that something can be neither true nor false.
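A minimal sketch of the vehicle-registration case, using invented sample rows in hypothetical `vehicle` and `occupant` tables. The left outer join keeps the registration that has no matching occupant, and `COALESCE` plays the role of the null handler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occupant (name TEXT, address TEXT)")
conn.execute("CREATE TABLE vehicle (plate TEXT, registrant TEXT, address TEXT)")
conn.executemany("INSERT INTO occupant VALUES (?, ?)",
                 [("Ann", "1 Elm St"), ("Bob", "2 Oak Ave")])
conn.executemany("INSERT INTO vehicle VALUES (?, ?, ?)",
                 [("AAA111", "Ann", "1 Elm St"),
                  ("BBB222", "Cal", "2 Oak Ave")])  # Cal is a guest, not an occupant

rows = conn.execute(
    "SELECT v.plate, o.name,"
    "       COALESCE(o.name, 'no registered occupant') AS handled"
    " FROM vehicle AS v"
    " LEFT OUTER JOIN occupant AS o"
    "   ON o.name = v.registrant AND o.address = v.address"
    " ORDER BY v.plate"
).fetchall()
for row in rows:
    print(row)
```

An inner join would silently drop the second registration; the outer join keeps it with a null in place of the missing match, which is the “neither” option in query form.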

In data projects that involve the accumulation of data from multiple sources, the various records of data have some claim of validity.  The challenge for the data scientist comes in matching the data records from multiple sources that may conflict in some way and in checking that the aggregate of data actually matches reality.

Last year during the introduction of the federal exchanges for the Affordable Care Act, there were numerous reports of frustrations of subscribers being unable to verify their own identity. The website back-end required people to verify their status with various independent entities within the government. The identity required matching the person’s information in credit-rating agency data, IRS data, Social Security data, etc. The problem occurred when the records from these various entities did not match. The consequence was that the candidate subscriber’s identity could not be verified. As a logic problem, we could describe this as a statement “this person is who he claims to be” with the expectation that this statement is either true or false. In some strict sense, the statement is either true or false. However, from a practical perspective of information databases where all the information we have exists in databases, we need to confront the real logical conclusions that the statement is neither true nor false, or that the statement is both true and false but more of one than the other.

Philosophical multi-valued logic is inherent in data science. Thinking in data science must include other options than true or false. The above examples present a promise that there is an underlying truth or falsity of the statement so that there is only a temporary period of uncertainty that we always expect in determining the ultimate truth conditions of the statement. However, within the data we have about the statement, there may never be a satisfactory conclusion that the statement is either true or false. We have to allow for other options. Human cognition is required to work through these problems. Philosophy of knowledge needs to explain this type of cognition. Database work can not exist in a world governed by the principle of the excluded middle or the principle of non-contradiction. Databases have contradictory data, and data that may forever be neither true nor false, or be both true and false.

In earlier posts, I described data science as a supply chain of information. At the beginning of this chain, there is an attempt to find a single version of truth for some operational need. Later steps in the supply chain need to combine data from multiple sources and this can result in multiple versions of truth. Usual practice in the later stages involves a process to exclude the dirty data in order to arrive at a single version for that stage. Each stage of the information supply chain makes a claim of the truth of its data. However, each stage also must challenge the claim of truth by its suppliers. Within the data supply chain the same information can be true, false, neither or both depending on where the information resides. One step in the chain claims its data product information is true while the next step can determine that same information to be false due to conflicts with other data.

In general, subsequent steps may combine a large number of data sources and this leads to an inconclusive result. The goal becomes identifying the best or most trusted data instead of identifying truth. Because it is possible that the less trusted data may actually be true, the later steps in the supply chain are motivated to retain these data instead of discarding them. The data store will have some data being less trusted than some other data but the less trusted data may still be closer to the truth. We may distinguish the retained data with additional columns to identify our trust in this data. Later stages of the information supply chain may have multiple versions of the truth with no real way to determine definitively what is true. This is part of the challenge for philosophy of knowledge to address. We have data that we can not exclude and yet is neither true nor false or it is both true and false.

Return to the personal example of querying my own name in online people-search engines. In that example, I found copies of myself still living at addresses I have long since left. Those are records that contain some true information such as the fact that someone with my name has my age. The records also have some true information about my history of residences. The falsehood was the interpretation of the query result as stating my current address. In this case, it should be easy to exclude the obsolete information such as by finding no matching recent records of my activities at old addresses, or by finding more recent records of some other person occupying the same address I no longer occupy. However, within the data, the information persists in a form that can not be validated based only on information within that record.

Consider the introduction of a new record that contains columns to identify a name, an address, and a birthdate.   This record came from some data-entry process with an initial claim that it is true.  This is not randomly generated data.  We accept the data record into a data store.   However, we may add a column to the data to allow us to flag the data as being true or false.

This column can be a simple field with a constraint of having just two values: true or false.   For a new record that we have not yet validated, the value is neither true nor false.   To accommodate new records, we allow the column to have a third value of null or empty.   The null value informs us that the record has not yet been validated and so the value is neither true nor false.   However, we may still access this record in a query and depending on circumstances introduce a null-handler that replaces the null with true or false.  Thus in the context of queries, this null value is both true and false.
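A minimal sketch of such a column, assuming a hypothetical `person_record` table with an `is_valid` flag. The `CHECK` constraint restricts the flag to 0 or 1 while still admitting null, and `COALESCE` is the null handler that lets a query treat the unvalidated record as either true or false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person_record ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " is_valid INTEGER CHECK (is_valid IN (0, 1)))"  # NULL passes a CHECK constraint
)
conn.executemany("INSERT INTO person_record (name, is_valid) VALUES (?, ?)",
                 [("checked ok", 1), ("checked bad", 0), ("not yet checked", None)])

def usable(default: int) -> list:
    """Names treated as valid when unchecked records take the given default."""
    rows = conn.execute(
        "SELECT name FROM person_record"
        " WHERE COALESCE(is_valid, ?) = 1 ORDER BY id", (default,))
    return [name for (name,) in rows]

trusting = usable(1)   # null handler treats unverified records as true
cautious = usable(0)   # null handler treats unverified records as false
print(trusting, cautious)
```

The same stored null yields a record that is included by one query and excluded by another, which is the sense in which the null is both true and false in the context of queries.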

Database technologies recognize this special logical status of null. For example, the database technologies may prevent matching two null values: in other words, null does not equal another null. Null is thus not a value in the same sense as true or false or other values. Nulls can not match other nulls. The existence of nulls is a necessity in data, especially during the data ingest process. Database technology has a philosophy of how to handle nulls, and some details of this philosophy continue to be debated. The philosophy of null handling appears similar to the eastern philosophical constructs and contrary to the western or Aristotelian philosophies. Null handling is not an abstract concept like what happens to a person after they die, but instead is a very practical problem that must be solved in order to interpret data.
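This behavior can be observed directly in SQLite, used here only as one convenient example of a SQL engine. The helper `evaluate` is a hypothetical name; Python's `sqlite3` module returns a SQL null as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expression: str):
    """Evaluate a SQL expression; SQLite returns 1, 0, or NULL (None in Python)."""
    return conn.execute(f"SELECT {expression}").fetchone()[0]

null_equals_null = evaluate("NULL = NULL")    # unknown, not true
null_is_null = evaluate("NULL IS NULL")       # the dedicated test for null
true_and_null = evaluate("1 AND NULL")        # unknown
false_and_null = evaluate("0 AND NULL")       # false, regardless of the unknown
print(null_equals_null, null_is_null, true_and_null, false_and_null)
```

Comparing null with null yields null rather than true, so a separate `IS NULL` test is needed, and logical operators follow a three-valued scheme in which an unknown operand only matters when the other operand does not already settle the result.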

We receive a record from another source that the source claims is true but that we have not yet independently verified. The existence of the new record is itself a claim of truth, but we are withholding judgement until we can match it with other data. This is similar to the project of fact-checking in journalism as illustrated in the recent controversies such as the events in Ferguson MO, or the rape allegations at UVa. In these stories, the journalists receive reports that are claimed to be true but journalistic practice is to independently verify the details. The problem of knowledge is how to assess these intermediate records with claims of truth that are not yet verified (or perhaps can not be verified). These records do exist, and we need to include them in our cognitive reasoning but the records are claims of truth that can be false: at the same time neither or both.

Typically, journalism records this preliminary information in personal and private notes hidden from outside view. The publication of the article is the presentation of the journalist’s claim that he has successfully been able to verify the information to some reasonable extent.

In data systems, this preliminary information (before verification) will exist in the data store.  This data may be in tables with limited permissions so that only certain groups or processes can access the data, but the data is in the schema of the database.  We need a way to talk about how we are handling this data that is neither or both true or false.   One approach is to allow this determination to be null, or empty of any value.

We need to populate this null column for recording whether the record is true or false. When we do this, we need to know what the options are. In this case where the options are just true and false, the options are trivial. But in general, we will have multiple values where the list of options may change over time. We can manage this by making the column a reference to another table that provides the currently available options. When it is time to populate this information, the process will consult this new table for the possible options available in order to select the appropriate reference. In the true or false example, this record of data ends up selecting one of the values of a reference table with two values: true or false. The record may start with a null reference but the foreign-key constraint limits the options to be one of the values in the table. This is effectively the same result as the initial design of a simple column with a constraint of two values that also allows a null.

The problem with records is that they are often very complex.  In my personal example, the record contains a name, an age, and an address.  We may need to verify each of these columns separately.  Alternatively, we may need to validate this combination multiple times as we attempt to match it with other records.

There may be different fact checkers involved in validating this data. Instead of having separate columns for each fact checker, we may introduce an intermediate table that matches the record to each fact-checker’s conclusion about the truth of the record. This intermediate table has two foreign keys: one to match the primary key of the information record and the other to match the primary key of the table of available choices (in this case the choices are true or false). The intermediate table may include additional information such as the time of the check, who did the checking, what type of check occurred, etc. With this table, we now have an opportunity to have multiple firm conclusions about the truth of the record. When we query the truth of the record, we can obtain no result (a null reference), a single result, or multiple results. In general, we will obtain multiple results.
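The intermediate table can be sketched as follows. All table names, column names, and rows (`truth_option`, `info_record`, `fact_check`, the checkers and dates) are invented for illustration and are not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE truth_option (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE info_record (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE fact_check (
        record_id INTEGER NOT NULL REFERENCES info_record(id),
        result_id INTEGER REFERENCES truth_option(id),  -- NULL until the check is done
        checker   TEXT NOT NULL,
        checked_at TEXT
    );
    INSERT INTO truth_option VALUES (0, 'false'), (1, 'true');
    INSERT INTO info_record VALUES (1, 'A. Person', 'old address'),
                                   (2, 'A. Person', 'current address'),
                                   (3, 'B. Person', 'unknown');
    INSERT INTO fact_check VALUES (1, 1, 'checker A', '2014-12-01'),
                                  (1, 0, 'checker B', '2014-12-02'),
                                  (2, 1, 'checker A', '2014-12-01');
""")

def verdicts(record_id: int) -> list:
    """All recorded conclusions about one record: none, one, or several."""
    rows = conn.execute(
        "SELECT t.label FROM fact_check AS f"
        " JOIN truth_option AS t ON t.id = f.result_id"
        " WHERE f.record_id = ? ORDER BY f.checker", (record_id,))
    return [label for (label,) in rows]

print(verdicts(1), verdicts(2), verdicts(3))
```

The three sample records show the three outcomes named above: record 1 has two conflicting conclusions, record 2 has a single one, and record 3 has none.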

Fact checking results in both or neither values of true or false

With this example, we have to confront the conditions of multiple fact checkers having conflicting determinations of truth.  In the case of three fact checks, we can assign 3 records to the intermediate table each time we add a new information record.   Each of the 3 records designates a different approach to a fact check.   Initially, these fact checks are assigned null values because the check has not yet occurred.   This presents a truth result of {N,N,N} where N is null.   The goal is to eventually declare the record as true in context of the current needs.    However, the initial record of information exists with a truth value of {N,N,N} and we need to talk about the truth value of this record.

In a down-stream data warehouse, data mart, or data lake context, the new record came from a source that assures us that the record is accurate.  Even though we have done no independent checks of our own, the existence of the record is an indication of its having some validity.   We can make use of this record in scenarios where we are satisfied with the assurance of the source so that we do not require independent checks.  This is similar to what happens in journalism or social-science field work where the investigator records stories in a field notebook.   These stories are not fit for widespread publication, but may be shared with other investigators.   Because the story came from a reputable source, there is some value in the record of that story even before we fact check it.   Using this story is a tentative acceptance that the story is true, but the lack of fact checks informs us that we lack confidence in the truth.  The story is both true and false, depending on the context.

Ideally, we will obtain the desired number of fact checks and they all confirm what they were assigned to check.   This adds to the source’s claim of truth a fact-check truth value of {T,T,T} in my hypothetical of something requiring just 3 checks.  With this confirmation, we can confidently make this data available to our consumers, analysts, or down-stream data processes.  In the journalism context, we can publish the story.

Before this happens we may obtain the fact check results at different times. As each fact check arrives, we update the truth value. In the ideal case of a record that passes all tests, we will progressively observe truth values of {N,N,N}, {T,N,N}, and {T,T,N} before observing {T,T,T} that meets our definition of true. Until the final check is confirmed, we have access to a record that is partially checked. We need a way to talk about the truth of that data record. In high-velocity data processes, we may observe high confidence that subsequent checks will be true. For example, we may observe our independent fact checks consistently confirming the source’s claim of truth. Alternatively, we may observe that if the first fact check is confirmed, then usually the remaining ones will be confirmed. We may use these observations to permit us to use a record that is not fully confirmed to be true. That choice to use the data indicates our acceptance of the record being true, or at least true enough.
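One way to talk about a partially checked record is to keep its truth value as a tuple in which each position is true, false, or not yet known. This is only a sketch; the labels returned by `status` ("confirmed", "contradicted", "pending") are hypothetical names and not terms from the article.

```python
from typing import Optional, Tuple

Check = Optional[bool]          # True, False, or None for a check not yet done
TruthValue = Tuple[Check, ...]

def notation(value: TruthValue) -> str:
    """Render a truth value in the {T,F,N} notation used in the text."""
    letters = {True: "T", False: "F", None: "N"}
    return "{" + ",".join(letters[c] for c in value) + "}"

def status(value: TruthValue) -> str:
    """Classify a record by its fact checks without discarding anything."""
    if all(c is True for c in value):
        return "confirmed"
    if any(c is False for c in value):
        return "contradicted"
    return "pending"

# The ideal progression of a record that passes all three checks.
history = [(None, None, None), (True, None, None),
           (True, True, None), (True, True, True)]
for value in history:
    print(notation(value), status(value))
```

A process that is willing to use {T,N,N} or {T,T,N} records is accepting "pending" as true enough for its purpose, while the full tuple remains available for anyone who needs the stricter definition.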

In this scenario, there is a possibility that the fact checks are not confirmed. For example, following the initial fact check, we may end up with a truth value of {F,N,N}. Western logic requires us to declare the entire record as false if something about the record is found to be false. Many programming languages implement a lazy algorithm that will stop evaluating a conjunction as soon as one conjunct is found to be false. However, in data, the record came from a trusted source that assured us that the record is true. The first fact check contradicts the source’s assertion and this confronts us with the contradiction that the record is both true and false. Instead of immediately discarding the record, we invest more effort to resolve this contradiction. For example, we may wait for the remaining fact checks to arrive. While we wait for resolution of the contradiction we need a way to talk about this record that is both true and false.
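The difference between the two strategies can be sketched in Python, whose built-in `all` stops at the first false result. The `check` factory and the three named checks are hypothetical stand-ins for real fact checks.

```python
from typing import Callable, List, Optional

calls: List[str] = []

def check(name: str, result: Optional[bool]) -> Callable[[], Optional[bool]]:
    """Build a fact check that logs when it actually runs."""
    def run() -> Optional[bool]:
        calls.append(name)
        return result
    return run

checks = [check("first", False), check("second", True), check("third", True)]

# Lazy conjunction: all() stops at the first falsy result.
lazy_verdict = all(run() for run in checks)
lazy_calls = list(calls)

# Data-store approach: run every check and keep all of the results.
calls.clear()
retained = tuple(run() for run in checks)
full_calls = list(calls)

print(lazy_verdict, lazy_calls)
print(retained, full_calls)
```

The lazy evaluation reports false after running only the first check, while the retained tuple records that two of the three checks still agree with the source.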

The same occurs when we have contradictory fact checks such as {T,F,N}.  In this case the successful fact check confirms the source’s assertion and the second fact check contradicts it.   The record is likely to be true but we have more doubts than we would have if the second fact check had confirmed the record.   Other combinations of truth values may be {T,T,F} resulting in more confidence or {T,F,F} resulting in less confidence.   In each case, we may continue to accept the record while also disclosing the results of our independent fact checks.   Our continued use of the record requires a way for us to talk about its truth value.   Because we continue to use the record, we are accepting its truth but the accepted truth is the compound statement: we have received a record that a trusted source assures us is true but our independent checks did not fully confirm it.  The consumers of this record need to comprehend the record as being both true and false.

There are other combinations that we will confront such as {F,N,N}, {F,F,N}, and {F,F,F} where each successively decreases our confidence in the truth of the record. Again, in down-stream data systems (such as data warehouses) we do not implement a lazy strategy of discarding a record on the first false finding. We may retain the record even if all checks failed because we still have the assurance of the source that the record is true. We need a way to talk about such records as we seek out other records with the same information but with better fact-check results, or we seek out explanations for why our independent fact checks contradict the source’s assertion. We may even choose to use a record with all fact checks failing because it remains the best data available about that particular observation. We use that record with the disclosure that we consistently failed to confirm it.

In these scenarios, the data processes do not follow the western-philosophy recommendation of efficient or lazy evaluation that discards the record on the first evidence of falsity.  In contrast to computer languages that often implement a lazy approach to evaluating a conjunction, data projects will continue to invest in data despite a false finding.   There is a need to talk about the truth of the record during this period of continued investment despite the unsuccessful fact checks.

In early years of database design we treated these problems as a purely technical issue but we strove to conform to the western philosophy of non-contradiction and excluded middle: something can not be both true and false and something must be one or the other.   This design philosophy persists today with the concepts of building data warehouses with a single version of truth.

Recently the concepts of big data systems of unstructured data have successfully challenged this approach. We now accept concepts such as data lakes where the end analyst must confront multiple versions of the truth. These approaches force us to understand data as being neither or both when it comes to deciding whether something is true or false. This is not a conquest of Buddhist thought over western philosophy. Instead it is a real-world challenge we face in dealing with conflicting data from multiple sources where each is confident they are providing the truth. Even when we do not explicitly invoke philosophical concepts in the practical consideration of data, our thinking about data is closer to the eastern way of thinking about truth than it is to the western philosophy.

The ineffable

Much of the above discusses the problem of the excluded middle or the non-contradiction in that we do encounter data that is both or neither value of true or false. The article also discusses the concept of ineffability as another important concept from eastern thought. As the article states, talking about something that is ineffable makes it a subject of discussion and that makes it no longer ineffable. We contradict the ineffable state of something when we are able to talk about it. In the above discussion I alluded to the idea of the null value or locked records inside transactions as being practical manifestations of ineffable conditions. However, these examples do not approach the philosophical concept of ineffability because we expect that the null will eventually be populated with a real value and we have an idea where to find the possible values it can have, or we expect the transaction to eventually end. Null values may be empty but we can at least still talk about what they might become when they are no longer null.

I think data practice demands that we consider the truly ineffable. This is a more recent challenge as we move toward the 3-V (volume, velocity, variety) of big data with the need for rapid if not automated decision making based on analytics of data. In recent posts, I described how this is leading to degrading the role of human decision makers and leading to a new form of government by data. To make decisions rapidly with available data, we run the risk of making decisions on spurious correlations that suggest relationships that defy any cognitive justification. The goal of automated decision making based on data is hampered by our demand for cognitive justification of some relation that may appear in data. This demand requires deliberation, debate, and consensus that takes time and results in human decisions instead of data decisions. This demand for time for deliberation is contrary to the needs for high velocity decision making.

Pure data decisions have to accept the possibility that an automated recommendation may be supported by a spurious correlation that lacks any cognitive justification. In my discussions about government by data, I argued that we may accept the authority of conclusions based on the most recent data because of our observations that such correlations tend to continue for a short time or to degrade slowly. I also argued that even inexplicable relations will more likely result in beneficial outcomes and that on balance the rewards of their successful predictions will outweigh the costs of failed predictions. We can choose to act on inexplicable relations that appear in data.

When we implement a system based on 3-V big data, we can benefit from immediate decisions as long as we frequently replace decisions with newer decisions based on even more recent data. Evidence of this benefit comes from experience in exploiting high velocity data-driven automated decision making in marketing systems. This experience provides a clue that we may experience similar benefits when we extend this automation to government as a whole.

This new opportunity made possible by data (big data in particular) presents a philosophical challenge in justifying automated decisions that lack even the attempt to cognitively justify the decision. The data patterns alone justify the decision even when the relationships are laughably nonsensical. Here is where I think ineffability applies. The justification of following recommendations based on data alone involves accepting the ineffable. We accept a recommendation without being able to talk about it. Increasingly we are doing this as we adopt big data technologies. When we do so, we accept the logical state of ineffable as an alternative to true, false, neither, or both.
