"One man’s “magic” is another man’s engineering. “Supernatural” is a null word." – Robert A Heinlein

"If you torture data long enough, it will tell you anything you want!" – Unknown


Thursday, January 29, 2009

Taxonomy, Divide and Merge (Part 2)

Following on my last post, I’d like to focus on the question: What can be considered as a good taxonomy? Given the fact that taxonomy is something between art and science, a consensus will be hard to achieve. Luckily, this discussion is focused on the product data realm and based on practical aspects gained through many years of experience working with product data.

A product taxonomy should be practical

We need to have a taxonomy that enables us to search, compare, group, or analyze products quickly and easily. The pure 'academic' approach to defining categories is to group products that share exactly the same attributes, so that each group will constitute a category. This bottom-up approach will result in a long, flat list of categories. This kind of list will not serve us efficiently in searching and navigating through products and many categories will only contain a few products.
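To make the bottom-up approach concrete, here is a minimal sketch in Python. The product names and attribute sets are made up for illustration; the point is only that grouping by identical attribute sets quickly produces many small groups.

```python
from collections import defaultdict

# Hypothetical products: name -> set of technical attribute names.
products = {
    "hex bolt": {"thread size", "length", "material"},
    "wood screw": {"thread size", "length", "material", "head type"},
    "machine screw": {"thread size", "length", "material", "head type"},
    "hex nut": {"thread size", "material"},
}


def group_by_attributes(products):
    """Bottom-up step: products with identical attribute sets share a group."""
    groups = defaultdict(list)
    for name, attrs in products.items():
        # frozenset makes the attribute set usable as a dictionary key.
        groups[frozenset(attrs)].append(name)
    return dict(groups)


groups = group_by_attributes(products)
for attrs, names in groups.items():
    print(sorted(attrs), "->", names)
```

Even with four products, this yields three separate groups, two of which hold a single product each.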

The alternative, top-down approach is based on logically dividing the product world into groups (e.g. hand, power, and machine tools as one group, fasteners as another group) and then continuing to divide those worlds into sub-worlds (e.g. fasteners is sub-divided into screws/bolts, nuts, nails, etc.) and so on. This approach results in a subjective structure and is prone to errors. The best practical method is a combination of the two approaches, resulting in an optimized taxonomy whose categories share the same technical attributes and include as many products as possible.

But let’s go back to the beginning. The first and most important factor in creating a taxonomy is the definition of the categories. A category should always reflect the essence or the nature of the classified object. It sounds trivial, but unfortunately, most of the categories I have come across are based on the usage of the object. For example, we had a customer (an academic research institute) that classified sugar (yes, sugar!) under several categories:

  • Animal food
  • Chemical materials
  • Refreshments
  • Office supplies

But sugar is sugar, whatever you do with it. Another common example is capacitors. A capacitor is a capacitor, but in many organizations they are divided into Electrical Capacitors and Electronic Capacitors, while the products in both categories share the same technical attributes.

(By the way, the main reason current taxonomies are usage-oriented is that maintenance departments were among the first to define and build classification systems. For them, the best way to classify was according to the usage/facility/machine. They preferred to say "a ball bearing for X machine" rather than "a ball bearing with diameter D, material M, etc.")

Notes on the relationship between categories and attributes

Another interesting aspect of categories is the ability to transform attributes into categories and vice versa. This feature enables us to optimize our taxonomy for a given organization or situation. For example, in the public taxonomy UNSPSC, there are several categories of RAM memory: Random Dynamic RAMs, Random RAMs, and Static Random RAMs. The three categories share most of the same attributes, so if your main business is not RAMs, you may prefer to have a single category of Random Access Memory and define a technical attribute called Type with three values: Random, Dynamic, and Static.
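As a rough sketch of this category-to-attribute transformation (the SKUs, the `capacity_mb` field, and the function name are hypothetical), the three categories can be folded into one, with the distinction carried by a Type attribute:

```python
# Hypothetical item records classified under three separate categories.
items = [
    {"sku": "A1", "category": "Random Dynamic RAMs", "capacity_mb": 512},
    {"sku": "A2", "category": "Random RAMs", "capacity_mb": 256},
    {"sku": "A3", "category": "Static Random RAMs", "capacity_mb": 64},
]

# Assumed mapping: old category name -> value of the new "type" attribute.
CATEGORY_TO_TYPE = {
    "Random Dynamic RAMs": "Dynamic",
    "Random RAMs": "Random",
    "Static Random RAMs": "Static",
}


def merge_categories(items, mapping, merged_name):
    """Return new records where the mapped categories become one category
    plus a "type" attribute. The input records are not modified."""
    merged = []
    for item in items:
        if item["category"] in mapping:
            item = {
                **item,
                "type": mapping[item["category"]],
                "category": merged_name,
            }
        merged.append(item)
    return merged


merged_items = merge_categories(items, CATEGORY_TO_TYPE, "Random Access Memory")
```

The reverse direction (splitting one category by the values of an attribute) is the same idea with the mapping inverted.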

A plate by any other name

Another factor is the name of the category. It’s important to bear in mind that we all have our own perceptions, so if category names are not simple, clear, and unambiguous, many products will be wrongly classified. Think, for example, of the word “plate.” It may refer to a coating, a dish, or a board, and there may be more meanings. It is definitely a bad category name!

In my next post, I’ll take a closer look at the category hierarchy — the second most important factor in creating a good taxonomy.

Wednesday, December 10, 2008

Taxonomy, The Constitution Of Product Data Quality (part 1)

I think the most important, yet undervalued, factor in the product data realm is the taxonomy. The taxonomy is the skeleton and foundation of any reasonable product data quality (PDQ) strategy. A complete, well-done taxonomy (and I will elaborate on this later) serves as the core domain knowledge for PDQ and increases its level of quality. Of course, having a good taxonomy cannot, by itself, guarantee high product data quality, but without it, high quality product data cannot be achieved at all – it’s as simple as that.

A taxonomy can be easily compared with a constitution. A good constitution is comprehensive, clear, consistent, balanced, practical, updated, and sets limits and borders while also allowing for ad hoc judgments and decision making. A good constitution embodies and accumulates values, positions, culture, experience, and common sense, and serves as a guide for the society that creates it. But having a good constitution is not enough – it should be followed, enforced, and continuously maintained. The same is true of a taxonomy.

To give a bit of background, taxonomy, the study of classification, is the basis for all science. We use taxonomy to structurally group similar things into categories, based on a set of common, category-specific characteristics. Aristotle made one of the earliest attempts to classify two major groups: plants and animals. Plants were separated according to size (structure)–herbs, shrubs, and trees, and animals were grouped according to where they lived–land, sea, or air.

Carolus Linnaeus (1707-1778) was a Swedish naturalist who is considered the "Father of Taxonomy." He set out to examine, describe, classify and name every living species on earth and developed the system by which we name organisms today, which groups species according to shared physical characteristics. His task was to make sense out of chaos, and to devise an organizational system that would sort out any confusion. He grouped species together into genera based upon physical similarities, and then grouped genera into families based upon broader physical similarities, etc.

In essence, taxonomy consists of four elements:

  • Categories (e.g., Manual Wrenches, Screwdrivers)
  • Category hierarchy (e.g., Assembly & Fastening -> Wrenches -> Manual Wrenches)
  • Attributes related to each category (e.g., Manual Wrenches: Opening Size, Overall Length, Handle Type, Jaw Material, etc.)
  • Values related to each category attribute (e.g., Manual Wrenches: Jaw Material: Alloy steel, Cast bronze, Aluminum-Magnesium, etc.)
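The four elements map naturally onto a small data structure. The following is only a sketch under my own naming assumptions (the `Category` class, its fields, and the `accepts` check are illustrative, not taken from any particular product):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Category:
    """One node in the category hierarchy (names are illustrative)."""
    name: str
    parent: Optional["Category"] = None
    # Attribute name -> allowed values; an empty list means "not enumerated here".
    attributes: Dict[str, List[str]] = field(default_factory=dict)

    def path(self) -> str:
        """Return the hierarchy from the root down to this category."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " -> ".join(reversed(names))

    def accepts(self, attribute: str, value: str) -> bool:
        """True if the attribute exists and the value is allowed (or not enumerated)."""
        if attribute not in self.attributes:
            return False
        allowed = self.attributes[attribute]
        return not allowed or value in allowed


assembly = Category("Assembly & Fastening")
wrenches = Category("Wrenches", parent=assembly)
manual = Category(
    "Manual Wrenches",
    parent=wrenches,
    attributes={
        "Opening Size": [],
        "Overall Length": [],
        "Handle Type": [],
        "Jaw Material": ["Alloy steel", "Cast bronze", "Aluminum-Magnesium"],
    },
)

print(manual.path())
```

Here `manual.accepts("Jaw Material", "Cast bronze")` passes, while a value outside the listed ones, or an attribute the category does not define, is rejected.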

In my next post, I will discuss "What constitutes a good taxonomy?"

Tuesday, November 18, 2008

Why Data Cleansing is Not Rational Enough

Many of us use the term “data cleansing.” I have never liked this term because it says nothing about the state of the data before cleansing (Was it dirty? A little bit dirty?) or the state of the data after cleansing (Less dirty? Totally clean? Or what?). Any improvement in data quality can be considered cleansing, even if the quality remains low. What term reflects the status of the data before, after, and the value it ultimately brings?

In the product data realm, some use the term “rationalizing.” This is much better. It means that there was irrational data ("not in accordance with reason; utterly illogical." Dictionary.com) and after processing, it was rationalized ("proceeding or derived from reason or based on reasoning; agreeable to reason; reasonable." Dictionary.com). But “rational” is very subjective. What is rational for one person may be irrational for others. Furthermore, the term “rationalized” doesn’t even hint at the potential value.

In generic terms, what we do is take raw, crude data, run it through several processes, and produce high value data that can be considered as a "single version of truth" and as such, can be used across the organization. This process can be easily considered as refinement ("to bring to a finer state or form by purifying." Dictionary.com). Bingo! Data Refinement has it all. It embodies the initial state of the data – raw; the final state – pure; and the value – refined, pure objects are considered to have higher value.

Monday, November 10, 2008

The Beginning of Wisdom is To Call Things by Their Right Names (Chinese proverb)

A few days ago, I had an interesting meeting with the CFO of a multinational company. He had recently tried to "optimize" his supply chain, or in other words, to cut costs. He knew that some products were causing him a major headache (and a hole in his pocket), and that he had to do something about them. But which ones? The next step was to locate the products that account for most of the expenses (a kind of Pareto analysis) and then to find out the stock level, stock policy, average consumption, number of suppliers, the annual volume with each supplier, logistics (storage and transport costs) and so on. By doing so, he thought that he would be able to reduce the inventory and number of suppliers, to negotiate and get better purchasing conditions, and reduce the logistics costs. A good plan, indeed! Well, it is the core of Spending Data Management (SDM) and other supply chain optimization practices.

Unfortunately, in spite of the Oracle Application ERP, data warehouses, BI software, and other goodies that the company had invested in during the last few years, he couldn't get a reliable picture of the company’s spend. I asked him to send us his product data (in a text file, Excel, or something similar), so we could analyze and evaluate the data quality.

I had a pretty good idea of what to expect, since I’ve seen it many times before. But we needed to put the evidence on the table, so to speak.

Let's take valves as a typical example. We found that they were classified under more than 20 different categories. Here are just a few examples:


  • Industrial Safety – Breathing Equipment – Valve/Diaphragm
  • Control – Control Equipment – Valve
  • Lifting – Winch spares – Engine/Clutch/Relay Spares
  • Liquid/Gas – Brass/Copper/Bronze Parts – Safety Valve
  • Liquid/Gas – Stainless Steel – Pneumatic Valve
  • Control – Control/Tubing Equipment – Electrical Valve
  • Control – Control/Tubing Equipment – Pneumatic Valve
  • Vacuum – Vacuum Installations – Right Angle Valve

In this scenario, ascertaining annual spend on valves, the valve inventory level, valve inventory turnover, and the number of valve suppliers is almost impossible. But, if all valves (irrespective of their usage) were classified under Valve, getting the required information could take a single click.
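To illustrate the difference, here is a small sketch. The suppliers and amounts are invented, and the last purchase line is a made-up non-valve item; only the three valve category paths come from the list above. Once each legacy path is mapped to a single essence-based category, the spend and supplier count fall out of a simple roll-up:

```python
from collections import defaultdict

# Hypothetical purchase lines: (legacy category path, supplier, annual spend).
purchases = [
    (("Control", "Control Equipment", "Valve"), "Supplier A", 12000.0),
    (("Liquid/Gas", "Stainless Steel", "Pneumatic Valve"), "Supplier B", 8000.0),
    (("Vacuum", "Vacuum Installations", "Right Angle Valve"), "Supplier A", 5000.0),
    (("Fasteners", "Screws/Bolts", "Hex Bolt"), "Supplier C", 3000.0),
]

# Assumed mapping maintained by a domain expert: legacy path -> unified category.
LEGACY_TO_UNIFIED = {
    ("Control", "Control Equipment", "Valve"): "Valve",
    ("Liquid/Gas", "Stainless Steel", "Pneumatic Valve"): "Valve",
    ("Vacuum", "Vacuum Installations", "Right Angle Valve"): "Valve",
}


def spend_by_category(purchases, mapping):
    """Roll up spend and distinct supplier counts per unified category."""
    totals = defaultdict(float)
    suppliers = defaultdict(set)
    for legacy_path, supplier, amount in purchases:
        unified = mapping.get(legacy_path, "Unclassified")
        totals[unified] += amount
        suppliers[unified].add(supplier)
    supplier_counts = {name: len(found) for name, found in suppliers.items()}
    return dict(totals), supplier_counts


totals, supplier_counts = spend_by_category(purchases, LEGACY_TO_UNIFIED)
print(totals["Valve"], supplier_counts["Valve"])
```

The hard part in practice is building and maintaining the mapping, not the roll-up itself.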

Most companies have no suitable taxonomy and, as a result, all their product data quality efforts are built on shaky foundations. If exactly the same product is classified under several different categories, decision making regarding spending and supply chain efficiency becomes guesswork.

I’ll talk more about taxonomy in future posts.

Monday, October 13, 2008

The Proof of The Pudding is In The Eating

Recently, I met with a potential customer (a small-to-medium multinational enterprise) who, like many of our customers and potential customers, claimed there was no way the database had duplicate products. He’d established a dedicated team that was responsible for creating new item records (SKUs). The team aimed to keep product descriptions as consistent as possible, inputting the product features in the same order and manner. Furthermore, the company has kept the same team for many years to ensure consistency.

Well, it’s a nice approach, but I was skeptical. I don't believe that any human being, however talented, can manually maintain master data at the same quality level that can be achieved by a suitable computerized system. The proof of the pudding is in the eating, so I asked him to send us some of their data and let our domain experts evaluate its quality. They did. At first sight, the data looked really good, relatively speaking – the best I have seen so far. But a deeper analysis by an expert quickly revealed the problems.

Here’s a typical example. The company standardizes its product descriptions to eliminate duplications, listing each product’s diameter, steel code, length, and hardness:

Rod Dia 1" SAE4340 Lng 18' 420BH
Bar Round Dia 25.4MM SNCM8 Lng 6M 45Rc

There’s just one problem. These seemingly different descriptions refer to the same product, written with different units of measurement and technical standards.

· Rod is Bar Round
· Diameter of 1" is 25.4mm
· US standard SAE4340 is SNCM8 JIS standard
· Length of 18' is 6m
· Hardness 45Rc is 420BH

No one can expect that a team responsible for creating new item records can be an expert in all domains, know all the standards and common abbreviations used in each domain, be able to correctly classify and understand the various technical features relevant to each domain, or even distinguish between the varied descriptions of the same product used by different suppliers. Not to mention the huge obstacle of different languages.

The ultimate advantage of a computerized data quality system is the ability to harness and reuse the domain experts’ knowledge, creating a data quality firewall that prevents the creation of duplicate records.
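As a very rough sketch of what such reuse of expert knowledge might look like, the code below normalizes three of the features from the rod/bar example (noun, diameter, and steel grade) into a comparable key. The lookup tables, function names, and regular expression are my own illustrative assumptions, not how any particular system works. Length and hardness are deliberately left out, since they need unit handling and conversion tables of their own.

```python
import re

# Assumed lookup tables a domain expert would maintain; illustrative, not complete.
NOUN_SYNONYMS = {"ROD": "BAR ROUND", "BAR ROUND": "BAR ROUND"}
GRADE_EQUIVALENTS = {"SAE4340": "SAE4340", "SNCM8": "SAE4340"}

# Matches e.g. 'Dia 1"' or 'Dia 25.4MM' and captures the number and the unit.
DIA_PATTERN = re.compile(r'DIA\s+([\d.]+)\s*("|MM)', re.IGNORECASE)


def diameter_mm(description):
    """Extract the diameter and express it in millimetres."""
    match = DIA_PATTERN.search(description)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2).upper()
    return round(value * 25.4, 2) if unit == '"' else round(value, 2)


def noun(description):
    """Take the text before 'Dia' as the noun and map it to a standard name."""
    head = description.upper().split(" DIA ")[0].strip()
    return NOUN_SYNONYMS.get(head, head)


def grade(description):
    """Map the first recognized steel designation to one reference grade."""
    for token in description.upper().split():
        if token in GRADE_EQUIVALENTS:
            return GRADE_EQUIVALENTS[token]
    return None


def match_key(description):
    """Build a normalized key; equal keys flag likely duplicates for review."""
    return (noun(description), diameter_mm(description), grade(description))


a = 'Rod Dia 1" SAE4340 Lng 18\' 420BH'
b = "Bar Round Dia 25.4MM SNCM8 Lng 6M 45Rc"
print(match_key(a) == match_key(b))
```

A new record whose key matches an existing one could then be held for review before it is created, which is the "firewall" idea in miniature.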

Tuesday, April 15, 2008

Humpty Dumpty Words with Tweedledee Logic

"When I use a word," Humpty Dumpty said in a rather scornful tone, "it means just what I choose it to mean – neither more nor less."
"The question is," said Alice, "whether you can make words mean so many things."
"The question is," said Humpty Dumpty, "which is to be master — that's all."

"Contrariwise," continued Tweedledee, "if it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." —
Through the Looking-Glass, Lewis Carroll

Many times when I review product data, I can't avoid the impression that the product descriptions were written by Humpty Dumpty following Tweedledee logic. So, “which is to be master?” That’s the challenge of product data quality.

Wednesday, March 26, 2008

Data Quality, Fairy Tales, Dragons and Knights

"At this moment her thoughts were interrupted by a loud shouting of ‘Ahoy! Ahoy! Check!’ and a Knight dressed in crimson armour came galloping down upon her, brandishing a great club. Just as he reached her, the horse stopped suddenly: ‘You’re my prisoner!’ the Knight cried, as he tumbled off his horse." – Through the Looking-Glass, Lewis Carroll


A recent post on the Data Governance blog mentioned this entertaining and thoughtful video about a dragon named Data Quality — indeed an innovative way to spread the data quality message. It’s a nice fairy tale with knights, dragons and all that stuff, and there’s even a moral to the story.

In this fairy tale, it took a costly knight with a bag full of tricks, and a really long time, to bring the dragon under control. And it’s still a fairy tale. The problem is that in real life, there are many such knights, complete with shiny bags and glossy appearance, who all promise to conquer the dragon with almost no effort. Well, real life is no fairy tale. It takes more than a lone knight, no matter how shiny he looks, to conquer the problem. It takes powerful technology and experienced fighters to get control (and keep control) over the Data Quality dragon. No magic, no knights, no shortcuts.