"One man’s “magic” is another man’s engineering. “Supernatural” is a null word." – Robert A Heinlein

"If you torture data long enough, it will tell you anything you want!" – Unknown


Monday, October 13, 2008

The Proof of The Pudding is In The Eating

Recently, I met with a potential customer, a small-to-medium multinational enterprise, who (like many of our customers and prospects) claimed there was no way the database had duplicate products. He had established a dedicated team responsible for creating new item records (SKUs). The team aimed to keep product descriptions as consistent as possible, entering the product features in the same order and manner. Furthermore, the company had kept the same team for many years to ensure consistency.

Well, it’s a nice approach, but I was skeptical. I don't believe that any human being, however talented, can manually maintain master data at the quality level a suitable computerized system can achieve. The proof of the pudding is in the eating, so I asked him to send us some of the company's data and let our domain experts evaluate its quality. He did. At first sight, the data looked really good, relatively speaking – the best I had seen so far. But a deeper analysis by an expert quickly revealed the problems.

Here’s a typical example. The company standardizes its product descriptions to eliminate duplications, listing each product’s diameter, steel code, length, and hardness:

Rod Dia 1" SAE4340 Lng 18' 420BH
Bar Round Dia 25.4MM SNCM8 Lng 6M 45Rc

There’s just one problem. These seemingly different descriptions refer to the same product, written with different units of measurement and technical standards.

· Rod is Bar Round
· Diameter of 1" is 25.4mm
· US standard SAE4340 is JIS standard SNCM8
· Length of 18' is 6m
· Hardness 45Rc is 420BH
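
To make the equivalences above concrete, here is a minimal Python sketch of how such expert knowledge might be encoded so that both descriptions reduce to the same canonical key. Everything in it is an assumption made for illustration: the lookup tables, the regular expression, and the `normalize` function are hypothetical and contain only the entries from the example above. Length is deliberately left out of the key, because 18 ft converts to about 5.49 m, so matching it to 6 m would need a nominal-length rule that this post does not describe.

```python
import re

# Hypothetical knowledge tables holding only the entries from the example.
# A real system would carry far more, curated by domain experts.
NOUN_SYNONYMS = {"rod": "ROUND BAR", "bar round": "ROUND BAR"}
STEEL_EQUIVALENTS = {"SAE4340": "SAE4340", "SNCM8": "SAE4340"}
# Approximate hardness equivalence taken from the example (45 Rc ~ 420 BH).
ROCKWELL_C_TO_BRINELL = {45: 420}

# Matches descriptions shaped like the two examples in the post.
PATTERN = re.compile(
    r'^(?P<noun>.+?)\s+Dia\s+(?P<dia>[\d.]+)(?P<dia_unit>"|MM)\s+'
    r'(?P<steel>\S+)\s+Lng\s+\S+\s+(?P<hard>\d+)(?P<hard_unit>BH|Rc)$',
    re.IGNORECASE,
)

INCH_TO_MM = 25.4


def normalize(description):
    """Reduce a description to (item type, diameter in mm, steel, Brinell hardness)."""
    m = PATTERN.match(description.strip())
    if m is None:
        raise ValueError(f"unrecognized description: {description!r}")

    noun = NOUN_SYNONYMS[m["noun"].lower()]

    diameter_mm = float(m["dia"])
    if m["dia_unit"] == '"':
        diameter_mm *= INCH_TO_MM

    steel = STEEL_EQUIVALENTS[m["steel"].upper()]

    hardness = int(m["hard"])
    if m["hard_unit"].lower() == "rc":
        hardness = ROCKWELL_C_TO_BRINELL[hardness]

    return (noun, round(diameter_mm, 1), steel, hardness)


a = normalize('Rod Dia 1" SAE4340 Lng 18\' 420BH')
b = normalize("Bar Round Dia 25.4MM SNCM8 Lng 6M 45Rc")
print(a == b)
```

The two descriptions compare equal once they are reduced to canonical values, which is exactly the comparison that is hard to perform by eye.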

No one can expect that a team responsible for creating new item records can be an expert in all domains, know all the standards and common abbreviations used in each domain, be able to correctly classify and understand the various technical features relevant to each domain, or even distinguish between the varied descriptions of the same product used by different suppliers. Not to mention the huge obstacle of different languages.

The ultimate advantage of a computerized data quality system is the ability to harness and reuse the domain experts’ knowledge, creating a data quality firewall that prevents the creation of duplicate records.
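
As a rough illustration of the firewall idea, the sketch below checks every new record against the canonical keys of existing records before allowing it in. The class `DuplicateFirewall`, its method `try_create`, and the stand-in `simple_key` normalizer are all hypothetical names invented for this example. In practice, the normalizer would be where the domain experts' knowledge lives.

```python
class DuplicateFirewall:
    """Rejects a new item record when its canonical key already exists."""

    def __init__(self, normalize):
        # normalize: function mapping a raw description to a canonical key.
        self._normalize = normalize
        self._sku_by_key = {}

    def try_create(self, sku, description):
        """Register the record and return None, or return the SKU it duplicates."""
        key = self._normalize(description)
        existing = self._sku_by_key.get(key)
        if existing is not None:
            return existing  # duplicate: the new record is not created
        self._sku_by_key[key] = sku
        return None


def simple_key(description):
    # Stand-in normalizer: lowercases and collapses whitespace only.
    return " ".join(description.lower().split())


firewall = DuplicateFirewall(simple_key)
first = firewall.try_create("SKU-001", "Rod Dia 1 inch")
second = firewall.try_create("SKU-002", "ROD  dia 1 Inch")
print(first, second)
```

Keeping the normalizer as a parameter separates the gatekeeping logic from the domain knowledge, so the same gate can serve every product domain as the experts' rules grow.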