"One man’s “magic” is another man’s engineering. “Supernatural” is a null word." – Robert A Heinlein

"If you torture data long enough, it will tell you anything you want!" – Unknown


Tuesday, November 18, 2008

Why Data Cleansing is Not Rational Enough

Many of us use the term “data cleansing.” I have never liked this term because it says nothing about the state of the data before cleansing (Was it dirty? A little bit dirty?) or after cleansing (Less dirty? Totally clean? Something else?). Any improvement in data quality can be called cleansing, even if the quality remains low. What term reflects the state of the data before and after, and the value it ultimately brings?

In the product data realm, some use the term “rationalizing.” This is much better. It implies that the data was irrational ("not in accordance with reason; utterly illogical." Dictionary.com) and that after processing it became rational ("proceeding or derived from reason or based on reasoning; agreeable to reason; reasonable." Dictionary.com). But rational is very subjective: what is rational for one person may be irrational for another. Furthermore, the term “rationalized” doesn’t even hint at the potential value.

In generic terms, what we do is take raw, crude data, run it through several processes, and produce high-value data that can be considered a "single version of truth" and, as such, can be used across the organization. This process can easily be described as refinement ("to bring to a finer state or form by purifying." Dictionary.com). Bingo! Data Refinement has it all. It captures the initial state of the data (raw), the final state (pure), and the value: refined, pure things are considered more valuable.

2 comments:

Steve Sarsfield said...

In cleansing, I'd say that what we do is make the data fit for business purpose. If you want to send a mailing, we make the name and address data fit for use by the post office. However, that same exact data may not be an exact fit for the 20-year-old CRM system. If you want to fix duplicates in an ERP system, we cleanse it and make the data fit for the ERP system so that we're not carrying too much inventory, and we're able to manage our partners more effectively.
In data governance we're stepping it up a level and trying to get the CRM systems, mailings and the ERP system all working together to have a better understanding about everything in our business. Data governance is more big picture.

Alexander said...

Great post. I now have a better understanding of data cleansing.