“The goal is to transform data into information, and information into insight.” – Carly Fiorina
I like to compare data quality to an oil refinery. The goal of a refinery is to transform crude oil into refined, clean, and usable products — after all, no one would even consider putting dirty oil in their car. It’s the same with crude data: our goal is to transform it into cleansed, rationalized, and usable information. Using crude (or dirty) data causes a whole range of short- and long-term problems. Dirty data prevents the organizational engine from reaching its energy potential, making it slow and ineffective – quite a problem in the economic race against other organizations.
Back to the refinery. A refinery consists of two main entities: infrastructure and oil products. The infrastructure enables production (e.g., reactors), transport (e.g., pipes, valves, and pumps), and storage (e.g., silos and containers). Oil products are the raw crude oil and the wide array of products and byproducts we all use every day, or that are used by petrochemical industries. In a refinery, life is fairly “simple.” Some people define what products are required, when, and where; some are responsible for process design; some are responsible for producing high-quality products; and others are responsible for the infrastructure that supports all these activities. Clearly, the maintenance or engineering department is not responsible for process design, production, or marketing and distribution. Its job is to provide the required facilities for production, storage, and transportation, and to prevent contamination and loss of product quality caused by poorly maintained infrastructure. It is not responsible for the quality of the content.
So here’s my point: IT in a modern organization plays the role of maintenance and engineering in a modern refinery. IT provides us with facilities to produce data (e.g., forms); it is responsible for data storage (e.g., databases); it is responsible for data transportation and delivery (e.g., interfaces, reports, queries); but it definitely cannot be responsible for the quality of the data produced – the content.
Suppose a young engineer in a factory needs a particular screw for a production process he is working on. He will dip into the ERP system, scan many product descriptions, and find a large number of products classified as screws. Among them, he finds the following four descriptions:
· DIN 912 10x1x30-2.9 mat304
· ALLEN SCR M10x30 stainless steel
· SOCKET BOLT M10x1 LG30 SS
· M10x1mmx30mm SHCS-SS
Well, none of them seems to be the one he is looking for, according to the data in the technical catalog he is using. The next logical action will be to generate a new product in the catalog and to order this desired screw. A more expert engineer might be able to see that all the above screws are actually the same, in spite of the totally different descriptions. The outcome is a new product number, a new order, and inventory of a product already in stock. The next time an engineer needs this particular screw, he will conduct another search, fail to find exactly what he’s looking for, and probably generate yet another new product number with a different description. (By the way, the above product descriptions were taken from a real customer catalog!)
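To make the point concrete, here is a minimal Python sketch of the kind of normalization such a catalog would need. The synonym tables, regex, and canonical key are my own illustrative assumptions, not part of any real ERP or catalog tool; a production system would need far richer dictionaries and attribute models. The sketch reduces each free-text description to a (type, diameter, length, material) key and shows that the four "different" products collapse into one:

```python
import re

# Hypothetical synonym tables (illustrative assumptions, not a real catalog).
MATERIAL_SYNONYMS = {
    "mat304": "stainless",
    "stainless steel": "stainless",
    "stainless": "stainless",
    "ss": "stainless",
}
TYPE_SYNONYMS = {  # all of these denote a hex-socket-head cap screw
    "din 912": "SHCS",
    "allen scr": "SHCS",
    "socket bolt": "SHCS",
    "shcs": "SHCS",
}

def normalize(desc):
    """Reduce a free-text screw description to a canonical key."""
    text = desc.lower()
    stype = next((v for k, v in TYPE_SYNONYMS.items() if k in text), None)
    material = next((v for k, v in MATERIAL_SYNONYMS.items() if k in text), None)

    # Dimensions appear as "10x1x30", "m10x30", "m10x1 lg30", or "m10x1mmx30mm":
    # diameter, then either pitch-and-length or just length.
    dims = re.search(r"m?(\d+)x(\d+(?:\.\d+)?)(?:mm)?(?:x(\d+(?:\.\d+)?))?", text)
    lg = re.search(r"lg\s*(\d+)", text)          # "LG30"-style length suffix
    diameter = int(dims.group(1))
    if dims.group(3):                            # diameter x pitch x length
        length = float(dims.group(3))
    elif lg:                                     # length given separately
        length = float(lg.group(1))
    else:                                        # diameter x length, no pitch
        length = float(dims.group(2))
    # Thread pitch is deliberately left out of the key: not every description
    # states it, so matching on it would need a coarse-pitch default table.
    return (stype, diameter, length, material)

descriptions = [
    "DIN 912 10x1x30-2.9 mat304",
    "ALLEN SCR M10x30 stainless steel",
    "SOCKET BOLT M10x1 LG30 SS",
    "M10x1mmx30mm SHCS-SS",
]

keys = {normalize(d) for d in descriptions}
print(keys)  # a single key: all four descriptions are the same screw
```

Even this toy version shows why the problem is hard: the matching logic encodes domain knowledge (DIN 912 means socket head cap screw, "mat304" means stainless) that no generic tool ships with.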
Can we expect IT to be responsible for the content of the product data? They provided us with the entry form and the relevant fields, stored the data in the database, and allowed us to retrieve the product data on demand. But IT cannot be responsible for the content stored in those systems. The main data quality problems lie in the quality of the content stored in the IT systems. That is why we cannot transform data into information, and information into insight.
Wednesday, November 14, 2007
Thursday, November 1, 2007
The "install, run and ... poof" magic
I spent many years on the enterprise software side, hardly aware of the existence of something called “data quality.” In fact, like many people in the ERP realm, I probably contributed to the problem, because I was focused on slotting data into the right fields without stopping to consider the actual content. Today, we’re hearing a lot more about data quality, particularly when it comes to customer data, and increasingly when it comes to product data.
It’s not really surprising that the term means different things to different people and is used for varying purposes. Data quality has become more of a marketing slogan than a well-structured and defined concept, with many consultants and software companies jumping onto this amorphous bandwagon.
I’ve spent the last three years developing computerized systems and best practices to solve the mess we helped to generate over many years. It is tough, complex work that requires a lot of experience and know-how, as well as a profound understanding of taxonomy and of many technical domains.
Here’s the bad news: there are no really simple solutions to this very complex problem; furthermore, crappy data created over the years can’t be fixed automatically by a magic tool: “Install, run and… poof!”
But here’s the good news: experience and know-how, best practices and methods, suitable software tools and hard work can solve the problem and bring the quality of the data to the right level.
Lately I’ve been seeing more and more promises of magic wands and tools that automatically and painlessly fix all the data quality problems, so everyone can live happily ever after. Well, I too am looking for such a magical spell book!
In the meantime, I thought it would be more practical to share my thoughts with those involved in the data quality realm, bring the complex issue of product data quality (PDQ) down to earth, and maybe save some growing pains.
I do not pretend to be objective – I am biased. I develop systems and practices, run projects all over the world and am confronted with new challenges every day. But I am going to write about the real world, without marketing hot air and without delving into the realm of theoretical concepts.
I will be happy to receive your comments and to publish your thoughts regarding the data quality domain.