06 December 2012

Data Integrity, GIGO, and Analysis

I consider it axiomatic that the quality of the underlying data is the primary limiter of analysis quality. If we think the data are solid and they are not, then no matter how good a job we do of analyzing them, the analysis will be flawed. Even if, by good luck or some other mechanism, we get correct answers, the analysis is still bad.

In fact, the worst case scenario is that the analyst is asked to answer several queries, applies valid analytic techniques, and produces correct answers. If the underlying data are flawed, and the outcome was good simply because the questions happened to miss the areas where the bad data could influence it, this scenario is a disaster. The outcome sends exactly the wrong message -- the underlying data are fine.

This is not the only integrity issue the consumer of data faces. Those who compile the data have an inevitable interest in presenting them as solid. I am not suggesting they cheat, but if you spend a substantial amount of money producing a database for others to query, applying the best data integrity checks you can find, you will want to defend the data. This effect multiplies in a bureaucratic setting, as those who consume the analysis and make recommendations or decisions based on it now have a vested interest in defending the data as well. I am not suggesting corruption here (except in the case of Illinois government), simply human nature.

The way bureaucracies deal with whistleblowers is instructive here. The temptation to go along, to "groupthink" as Irving Janis called the disease, is almost overwhelming. Once the data appear good, they are, or at least the users will act as though they are.

Over the decades, systems folks have developed an entire vocabulary of data protection. We call it "data integrity." Chief among those defenses are the rules of relational data tables and the locking mechanisms built into SQL databases. Oracle, SAS, and Sybase invested enormous amounts of money and work hours in assuring that incoming data are properly placed and keyed, that there are no partial transactions, and that when a data element is retrieved, it is retrieved with proper references across all the tables and servers onto which it was loaded.

This is a big task, and the systems that can do it -- MySQL, Oracle 11, SAS, and Sybase -- do it well. There are not a lot of them.
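Two of those guarantees -- no partial transactions, and proper references across tables -- can be illustrated with a small sketch in Python's built-in sqlite3 module. The table and column names here are hypothetical, and SQLite stands in only because it ships with Python, not because it is one of the systems named above:

```python
import sqlite3

# In-memory database with foreign-key enforcement turned on.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE transfers (id INTEGER PRIMARY KEY, "
    "account_id INTEGER NOT NULL REFERENCES accounts(id), amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    with conn:  # one atomic transaction: all statements commit, or none do
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
        # This row references account 99, which does not exist, so the
        # database rejects it rather than store a dangling reference ...
        conn.execute("INSERT INTO transfers VALUES (1, 99, 40)")
except sqlite3.IntegrityError:
    pass  # ... and the whole transaction is rolled back.

# The debit never happened: no partial transaction survives.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 100
```

The point is not the particular schema but the contract: either every statement in the transaction takes effect, or the database is left exactly as it was.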

Data, and the questions we ask of them, have a way of exploding. SQL was conceived before anyone thought it likely that a production database might contain pictures or whole transaction documents. The sheer size of today's data, and its storage in clouds whose servers are essentially out of sight of the analyst and the developer, were not concepts then either.

So now we have all this data, and we have an aware business culture. We do not want data; we want information derived from solid analysis that can show managers what is going on, what the trends are, and where they should concentrate their efforts. And we want queries based on things like tags on Facebook pictures. We also have a huge volume of data headed for the clouds and requiring placement in the database.

The response is the "NoSQL" database. "No" in this case stands for "not only," so the name does not mean there is no access to SQL, but rather that SQL is not the dominant method of control. These new databases are designed not for data integrity but for high-speed cloud processing. The rule the designers use is that data integrity should be reached at some point in the processing of a transaction -- what is often called eventual consistency -- not necessarily, as in an SQL database, at the instant the primary element is visible.
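That deferred-integrity rule can be sketched in a few lines of Python. This is a toy model, not any real NoSQL product: a write is acknowledged by the "primary" copy immediately, and a background thread carries it to a "replica" after a delay, so a read from the replica can briefly see stale data before the copies converge:

```python
import queue
import threading
import time

# A toy replicated store: writes land on the primary at once, and a
# background thread applies them to the replica after a simulated lag.
primary, replica = {}, {}
replication_log = queue.Queue()

def replicate():
    while True:
        key, value = replication_log.get()
        time.sleep(0.1)          # simulated replication lag
        replica[key] = value

threading.Thread(target=replicate, daemon=True).start()

def write(key, value):
    primary[key] = value         # acknowledged before the replica sees it
    replication_log.put((key, value))

write("order:42", "shipped")
stale = replica.get("order:42")  # read too soon: the replica still lags
time.sleep(0.3)                  # wait out the lag
fresh = replica.get("order:42")  # the copies have now converged
print(stale, fresh)
```

The analyst's question from earlier in this post follows directly: a query that hits the replica during the lag window returns an answer that is internally plausible but wrong, which is exactly the kind of flaw no amount of good analytic technique can repair.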

The NoSQL solution sacrifices some integrity for major advantages in flexibility, compatibility with cloud services, and retrieval speed. In many cases, that is a good trade. Identifying when it is not a good trade, and when analysis from a NoSQL database is at least subject to scrutiny, is a major analysis issue. Determining the need for scalable data structures, and the risks associated with a particular solution, is a major task for an organization. Often it is one entrusted to consultants. As long as those consultants are not an arm of a sales organization, that is a fine choice.

The data questions we asked ourselves a very few years ago no longer probe deeply enough to support the needs of users. Users need consultants able to evaluate the risks of cloud-based NoSQL solutions, and the risks of traditional SQL systems. There are always data integrity risks, regardless of the system chosen. An investigation should identify them for user management and present alternatives.

IT is never dull!
