Wikipedia’s approach to categorization

September 22, 2008

I was intrigued by Silver Oliver’s posting asking for information on Wikipedia’s approach to categorization. Since I was busy at the time, I hoped that someone else would respond, but no one has. So in a brief spare moment, I have tried to work out for myself what they’re doing.

Let’s say that it’s not obvious! There is plenty of documentation here and here and elsewhere. Perhaps the most significant clue to what they’re doing lies in the latter page. They say:

“Each Wikipedia article can appear in more than one category, and each category can appear in more than one parent category. Multiple categorization schemes co-exist simultaneously. In other words, categories do not form a strict hierarchy or tree structure, but a more general directed acyclic graph (or close to it; see below).”
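For anyone who prefers to see that claim in concrete terms, here is a minimal sketch of what 'not a tree, but a directed acyclic graph' means. It is only an illustration: the category names are borrowed from Wikipedia, but the parent links are a hand-made toy sample, and the function names `is_strict_tree` and `has_cycle` are my own inventions, not part of any Wikipedia tooling.

```python
from typing import Dict, Set

# Toy sample (an assumption for illustration, not real Wikipedia data):
# each category maps to the set of its parent categories.
PARENTS: Dict[str, Set[str]] = {
    "Trains": {"Rail transport"},
    "Rail transport": {"Transportation"},
    "Public transport": {"Transportation"},
    "Transportation": {"Industries", "Technology by type", "Travel"},
    "Industries": set(),
    "Technology by type": set(),
    "Travel": set(),
}


def is_strict_tree(parents: Dict[str, Set[str]]) -> bool:
    """A strict hierarchy allows at most one parent per category."""
    return all(len(p) <= 1 for p in parents.values())


def has_cycle(parents: Dict[str, Set[str]]) -> bool:
    """Depth-first search along parent links; a cycle means the graph is not a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour: Dict[str, int] = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for parent in parents.get(node, ()):
            state = colour.get(parent, WHITE)
            if state == GREY:  # we have walked back onto our own path
                return True
            if state == WHITE and visit(parent):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(parents))


print("strict tree:", is_strict_tree(PARENTS))
print("contains a cycle:", has_cycle(PARENTS))
```

In this toy sample 'Transportation' has three parents, so the structure is not a strict tree, yet no category is its own ancestor, so it is still a DAG. The 'or close to it' in the Wikipedia quote suggests that the real category graph does not always pass the second check.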

The ‘see below’ refers to an image showing a representative sample of the category structure and this is where we get somewhat contentious. This image looks to me like a mish-mash of hierarchical and associative relationships (some of which are questionable IMHO) which is far closer to the realm of ‘real world’ perceptions than the neat, clinically precise representations of classic knowledge organization (KO). Is this an example perhaps of ‘Freely Faceted Classification’ as described to us by our Italian colleague Claudio Gnoli at our Ranganathan Revisited event in November 2007? Or is it something else?

Taking a specific example, I used the CategoryTree tool to explore a section of the Wikipedia category structure. I specified ‘en.wikipedia.org’ as the Wiki and ‘transportation’ as the category in order to examine how ‘trains’ are represented. I made the facile (but not unwarranted) assumption that ‘trains’ would appear somewhere as a lower-level category of ‘Transportation’. Indeed it does, being reported as Transportation > Rail Transport > Trains.
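The same kind of lookup can be sketched in a few lines of code. To be clear about the assumptions: this does not call the CategoryTree tool or any Wikipedia API. The `CHILDREN` table is a hand-typed fragment containing only the links mentioned in this post (the real categories have many more sub-categories), and `find_paths` is a hypothetical helper of my own.

```python
from typing import Dict, List

# Hand-typed fragment (an assumption): parent category -> sub-categories.
# Only the links named in this post are included.
CHILDREN: Dict[str, List[str]] = {
    "Transportation": ["Public Transport", "Rail Transport"],
    "Rail Transport": ["Trains"],
    "Public Transport": [],
    "Trains": [],
}


def find_paths(children: Dict[str, List[str]], root: str, target: str,
               max_depth: int = 5) -> List[List[str]]:
    """Return every downward route from root to target, up to max_depth categories long."""
    paths: List[List[str]] = []

    def walk(node: str, trail: List[str]) -> None:
        trail = trail + [node]
        if node == target:
            paths.append(trail)
            return
        if len(trail) >= max_depth:
            return
        for child in children.get(node, []):
            if child not in trail:  # guard against the occasional loop
                walk(child, trail)

    walk(root, [])
    return paths


for path in find_paths(CHILDREN, "Transportation", "Trains"):
    print(" > ".join(path))
# prints: Transportation > Rail Transport > Trains
```

Because a category may have several parents, a fuller table could return more than one route to 'Trains', which is why the helper collects a list of paths rather than stopping at the first one.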

What you notice in passing, though, is interesting. ‘Transportation’ itself has parent categories ‘Industries’, ‘Technology by type’ and ‘Travel’. Fair enough, I suppose, given that we embrace polyhierarchy and acknowledge the need to provide for multiple access routes to specific concepts. However, ‘Rail Transport’ and ‘Public Transport’ occur adjacent to each other at the same level. Hmmm. Some overlap of categories methinks, since I’m not aware of any form of rail transport which isn’t also public (except freight, but that’s out-of-scope for our purposes). But then, if you examine the sub-categories of ‘Public Transport’, you find that the principle of differentiation is quite different and at a higher level.

Screen shots of the CategoryTree hierarchies I examined are provided below so that anyone interested can peruse them before perhaps investigating the question for real online.

Conclusion? The Wikipedia categorization system reflects but does not consistently apply the principles of KO as expounded in the formal literature. It is nevertheless interesting because it might well represent what results when folksonomy meets formal KO and agrees to a compromise.

If anyone has the time and patience to analyze this interesting phenomenon further and comment on it, then I for one would be grateful. And I’m sure Silver Oliver would too, since he and his colleagues at the BBC have invested considerable effort in building a system which utilizes Wikipedia topics as subject identifiers for their own internal use. Obviously, they would like to know if Wikipedia’s categories can be utilized ‘as is’ or whether they need to embark on a categorization exercise of their own.