"One man’s “magic” is another man’s engineering. “Supernatural” is a null word." – Robert A. Heinlein

"If you torture data long enough, it will tell you anything you want!" – Unknown


Thursday, January 29, 2009

Taxonomy, Divide and Merge (Part 2)

Following on from my last post, I’d like to focus on the question: what makes a good taxonomy? Because taxonomy sits somewhere between art and science, consensus is hard to reach. Luckily, this discussion is limited to the product data realm and rests on practical lessons from many years of working with product data.

A product taxonomy should be practical

We need a taxonomy that enables us to search, compare, group, and analyze products quickly and easily. The pure 'academic' approach to defining categories is to group products that share exactly the same attributes, so that each group constitutes a category. This bottom-up approach results in a long, flat list of categories. Such a list does not serve us well when searching and navigating through products, and many of its categories will contain only a few products.
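To make the bottom-up approach concrete, here is a minimal Python sketch. The product names and attribute sets are invented for illustration, and `group_by_attributes` is a hypothetical helper, not part of any real tool. It simply puts products in the same category when their attribute sets match exactly.

```python
from collections import defaultdict

# Hypothetical sample data: product name -> set of technical attribute names.
products = {
    "wood screw": {"length", "diameter", "head type", "material"},
    "machine screw": {"length", "diameter", "head type", "material", "thread pitch"},
    "hex bolt": {"length", "diameter", "head type", "material", "thread pitch"},
    "hex nut": {"diameter", "material", "thread pitch"},
    "common nail": {"length", "diameter", "material"},
}


def group_by_attributes(products):
    """Group products whose attribute sets are exactly equal (bottom-up)."""
    groups = defaultdict(list)
    for name, attributes in products.items():
        # A frozenset is hashable, so it can serve as the category signature.
        groups[frozenset(attributes)].append(name)
    return list(groups.values())


categories = group_by_attributes(products)
for members in categories:
    print(members)
```

With these five sample products the sketch yields four categories, three of which hold a single product. That is the fragmentation the bottom-up approach tends to produce.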

The alternative, top-down approach is based on logically dividing the product world into groups (e.g. hand, power, and machine tools as one group, fasteners as another) and then dividing those worlds into sub-worlds (e.g. fasteners are sub-divided into screws/bolts, nuts, nails, etc.), and so on. This approach results in a subjective structure that is prone to errors. The best practical method is a combination of the two approaches, resulting in an optimized taxonomy whose categories each group products sharing the same technical attributes while including as many products as possible.

But let’s go back to the beginning. The first and most important factor in creating a taxonomy is the definition of the categories. A category should always reflect the essence or the nature of the classified object. It sounds trivial, but unfortunately, most of the categories I have come across are based on the usage of the object. For example, we had a customer (an academic research institute) that classified sugar (yes, sugar!) under several categories:

  • Animal food
  • Chemical materials
  • Refreshments
  • Office supplies

But sugar is sugar, whatever you do with it. Another common example is capacitors. A capacitor is a capacitor, but in many organizations they are divided into Electrical Capacitors and Electronic Capacitors, while the products in both categories share the same technical attributes.

(By the way, the main reason current taxonomies are usage-oriented is that maintenance departments were among the first to define and build classification systems. For them, the best way to classify was according to the usage/facility/machine. They preferred to say "a ball bearing for X machine" rather than "a ball bearing with diameter D, material M, etc.")

Notes on the relationship between categories and attributes

Another interesting aspect of categories is the ability to transform attributes into categories and vice versa. This feature enables us to optimize our taxonomy for a given organization or situation. For example, in the public taxonomy UNSPSC, there are several categories of RAM: Random Dynamic RAMs, Random RAMs, and Static Random RAMs. The three categories share most of the same attributes, so if your main business is not RAMs, you may prefer to have a single category of Random Access Memory and define a technical attribute called Type with three values: Random, Dynamic, and Static.
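As a rough illustration of this transformation, here is a minimal Python sketch. The product records, the `capacity` attribute, and both helper functions are hypothetical. Only the category names and the Type values come from the example above. The first function folds several categories into one and records the old category as an attribute value. The second reverses the operation.

```python
# Hypothetical product records; only the category names come from the text.
products = [
    {"name": "chip A", "category": "Random Dynamic RAMs", "attributes": {"capacity": "4 Gbit"}},
    {"name": "chip B", "category": "Random RAMs", "attributes": {"capacity": "1 Gbit"}},
    {"name": "chip C", "category": "Static Random RAMs", "attributes": {"capacity": "16 Mbit"}},
]

# Each old category becomes one value of the new Type attribute.
TYPE_BY_CATEGORY = {
    "Random Dynamic RAMs": "Dynamic",
    "Random RAMs": "Random",
    "Static Random RAMs": "Static",
}


def categories_to_attribute(products, value_by_category, merged_category, attribute="Type"):
    """Merge several categories into one, storing the old category as an attribute."""
    result = []
    for product in products:
        value = value_by_category.get(product["category"])
        if value is None:
            result.append(product)  # not one of the merged categories, keep as-is
            continue
        result.append({
            "name": product["name"],
            "category": merged_category,
            "attributes": {**product["attributes"], attribute: value},
        })
    return result


def attribute_to_categories(products, category_by_value, merged_category, attribute="Type"):
    """Split one category into several, based on the value of an attribute."""
    result = []
    for product in products:
        value = product["attributes"].get(attribute)
        if product["category"] != merged_category or value not in category_by_value:
            result.append(product)  # nothing to split for this product
            continue
        remaining = {k: v for k, v in product["attributes"].items() if k != attribute}
        result.append({
            "name": product["name"],
            "category": category_by_value[value],
            "attributes": remaining,
        })
    return result


merged = categories_to_attribute(products, TYPE_BY_CATEGORY, "Random Access Memory")
category_by_type = {value: category for category, value in TYPE_BY_CATEGORY.items()}
restored = attribute_to_categories(merged, category_by_type, "Random Access Memory")
```

In this sketch the two directions are symmetric, so an organization whose main business is RAMs could run the split while everyone else runs the merge, without losing information either way.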

A plate by any other name

Another factor is the name of the category. It’s important to bear in mind that we all have our own perceptions, so if category names are not simple, clear, and unambiguous, many products will be wrongly classified. Think, for example, of the word “plate.” It may refer to a coating, a dish, or a board, and there may be other meanings as well. It is definitely a bad category name!

In my next post, I’ll take a closer look at the category hierarchy — the second most important factor in creating a good taxonomy.