"One man’s “magic” is another man’s engineering. “Supernatural” is a null word." – Robert A Heinlein

"If you torture data long enough, it will tell you anything you want!" – Unknown


Tuesday, November 18, 2008

Why Data Cleansing is Not Rational Enough

Many of us use the term “data cleansing.” I have never liked this term because it says nothing about the state of the data before cleansing (Was it dirty? A little bit dirty?) or after cleansing (Less dirty? Totally clean? Something else?). Any improvement in data quality can be called cleansing, even if the quality remains low. What term reflects the state of the data before and after, and the value it ultimately brings?

In the product data realm, some use the term “rationalizing.” This is much better. It means that there was irrational data ("not in accordance with reason; utterly illogical." Dictionary.com) and after processing, it was rationalized ("proceeding or derived from reason or based on reasoning; agreeable to reason; reasonable." Dictionary.com). But rational is very subjective. What is rational for one person may be irrational for others. Furthermore, the term “rationalized” doesn’t even hint at the potential value.

In generic terms, what we do is take raw, crude data, run it through several processes, and produce high-value data that can be considered a "single version of truth" and, as such, can be used across the organization. This process can easily be considered refinement ("to bring to a finer state or form by purifying." Dictionary.com). Bingo! Data Refinement has it all. It embodies the initial state of the data – raw; the final state – pure; and the value – refined, pure objects are considered to have higher value.

Monday, November 10, 2008

The Beginning of Wisdom is To Call Things by Their Right Names (Chinese proverb)

A few days ago, I had an interesting meeting with the CFO of a multinational company. He had recently tried to "optimize" his supply chain, or in other words, to cut costs. He knew that some products were causing him a major headache (and a hole in his pocket), and that he had to do something about them. But which ones? The next step was to locate the products that account for most of the expenses (a kind of Pareto analysis) and then to find out, for each of them, the stock level, stock policy, average consumption, number of suppliers, annual volume with each supplier, logistics (storage and transport costs), and so on. By doing so, he thought that he would be able to reduce the inventory and the number of suppliers, negotiate better purchasing conditions, and reduce the logistics costs. A good plan, indeed! After all, it is the core of Spending Data Management (SDM) and other supply chain optimization practices.
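For readers who want to see the Pareto step spelled out, here is a minimal sketch. It assumes spend has already been totaled per product; the function name `pareto_items`, the 80% threshold, and all the figures are made up for illustration and are not the CFO's data.

```python
def pareto_items(spend_by_product, threshold=0.8):
    """Return the top-spend products whose cumulative share of total
    spend first reaches `threshold` (0.8 = the classic 80% cut-off)."""
    total = sum(spend_by_product.values())
    if total <= 0:
        return []
    selected = []
    running = 0.0
    # Walk products from the most expensive to the least expensive.
    ranked = sorted(spend_by_product.items(), key=lambda kv: kv[1], reverse=True)
    for product, spend in ranked:
        selected.append(product)
        running += spend
        if running / total >= threshold:
            break
    return selected


# Hypothetical annual spend per product group (illustrative numbers only).
spend = {"valves": 520, "pumps": 310, "gaskets": 90, "bolts": 50, "labels": 30}
top = pareto_items(spend)
print(top)
```

The catch is that this only works if spend per product can be totaled reliably in the first place.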

Unfortunately, in spite of the Oracle Application ERP, data warehouses, BI software, and other goodies that the company had invested in during the last few years, he couldn't get a reliable picture of the company’s spend. I asked him to send us his product data (in a text file, Excel, or something similar), so we could analyze and evaluate the data quality.

I had a pretty good idea of what to expect, since I’ve seen it many times before. But we needed to put the evidence on the table, so to speak.

Let's take valves as a typical example. We found that they were classified under more than 20 different categories. Here are just a few examples:


  • Industrial Safety – Breathing Equipment – Valve/Diaphragm
  • Control – Control Equipment – Valve
  • Lifting – Winch spares – Engine/Clutch/Relay Spares
  • Liquid/Gas – Brass/Copper/Bronze Parts – Safety Valve
  • Liquid/Gas – Stainless Steel – Pneumatic Valve
  • Control – Control/Tubing Equipment – Electrical Valve
  • Control – Control/Tubing Equipment – Pneumatic Valve
  • Vacuum – Vacuum Installations – Right Angle Valve

In this scenario, ascertaining annual spend on valves, the valve inventory level, valve inventory turnover, and the number of valve suppliers is almost impossible. But, if all valves (irrespective of their usage) were classified under Valve, getting the required information could take a single click.
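To make the "single click" concrete, here is a rough sketch of what a single Valve class buys you. Everything in it is an assumption for illustration: the record fields, the supplier codes, the spend figures, and the toy keyword rule in `canonical_class`, which stands in for a real taxonomy and is far too naive for production data. Only the legacy category names come from the list above.

```python
from collections import defaultdict


def canonical_class(description):
    """Toy rule (hypothetical): anything described as a valve is a 'Valve'."""
    return "Valve" if "valve" in description.lower() else "Unclassified"


def summarize(records):
    """Total annual spend and count distinct suppliers per canonical class."""
    spend = defaultdict(float)
    suppliers = defaultdict(set)
    for rec in records:
        cls = canonical_class(rec["description"])
        spend[cls] += rec["annual_spend"]
        suppliers[cls].add(rec["supplier"])
    return {
        cls: {"annual_spend": spend[cls], "suppliers": len(suppliers[cls])}
        for cls in spend
    }


# Made-up records; the legacy categories are taken from the examples above.
records = [
    {"legacy_category": "Control – Control Equipment – Valve",
     "description": "Control valve", "supplier": "S1", "annual_spend": 1200.0},
    {"legacy_category": "Liquid/Gas – Stainless Steel – Pneumatic Valve",
     "description": "Pneumatic valve, stainless", "supplier": "S2", "annual_spend": 800.0},
    {"legacy_category": "Lifting – Winch spares – Engine/Clutch/Relay Spares",
     "description": "Relief valve for winch", "supplier": "S1", "annual_spend": 300.0},
]

summary = summarize(records)
print(summary["Valve"])
```

Note that the third record would never be found by searching the legacy categories for "valve", which is exactly the problem: the classification has to follow the product itself, not the place where somebody happened to file it.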

Most companies have no suitable taxonomy and, as a result, all their product data quality efforts are built on shaky foundations. If exactly the same product is classified under several different categories, decision making regarding spending and supply chain efficiency becomes guesswork.

I’ll talk more about taxonomy in future posts.