Wikipedia’s approach to categorization

I was intrigued by Silver’s posting asking for information on Wikipedia’s approach to categorization. Since I was busy at the time, I hoped that someone else would respond, but no-one has. So in a brief spare moment, I have tried to work out what they’re doing myself.

Let’s say that it’s not obvious! There is plenty of documentation here and here and elsewhere. Perhaps the most significant clue to what they’re doing lies in the latter page. They say:

“Each Wikipedia article can appear in more than one category, and each category can appear in more than one parent category. Multiple categorization schemes co-exist simultaneously. In other words, categories do not form a strict hierarchy or tree structure, but a more general directed acyclic graph (or close to it; see below).”

The ‘see below’ refers to an image showing a representative sample of the category structure and this is where we get somewhat contentious. This image looks to me like a mish-mash of hierarchical and associative relationships (some of which are questionable IMHO) which is far closer to the realm of ‘real world’ perceptions than the neat, clinically precise representations of classic KO. Is this an example perhaps of ‘Freely Faceted Classification’ as described to us by our Italian colleague Claudio Gnoli at our Ranganathan Revisited event in November 2007? Or is it something else?

Taking a specific example, I used the CategoryTree tool to explore a section of the Wikipedia category structure. I specified ‘en.wikipedia.org’ as the Wiki and ‘transportation’ as the category in order to examine how ‘trains’ are represented. I made the facile (but not unwarranted) assumption that ‘trains’ would appear somewhere as a lower-level category of ‘Transportation’. Indeed it does, being reported as Transportation > Rail Transport > Trains.

What you note in passing though is interesting. ‘Transportation’ itself has parent categories ‘Industries’, ‘Technology by type’ and ‘Travel’. Fair enough, I suppose, given that we embrace polyhierarchy and acknowledge the need to provide for multiple access routes to specific concepts. However, ‘Rail Transport’ and ‘Public Transport’ occur adjacent to each other at the same level. Hmmm. Some overlap of categories methinks, since I’m not aware of any form of rail transport which isn’t also public (except freight, but that’s out-of-scope for our purposes). But then, if you examine the sub-categories of ‘Public Transport’, you find that the principle of differentiation is quite different and at a higher level.
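To make the ‘multiple access routes’ point concrete, here is a minimal Python sketch that lists every upward route from ‘Trains’. The `PARENTS` mapping is a hand-typed excerpt containing only the links mentioned above, and the function name `paths_to_top` is my own invention, not anything Wikipedia provides. It also assumes the excerpt contains no circular references.

```python
# Hand-typed excerpt of the category graph: child -> list of parent categories.
# Only the links named in this post are included; the real graph is far larger.
PARENTS = {
    "Trains": ["Rail Transport"],
    "Rail Transport": ["Transportation"],
    "Public Transport": ["Transportation"],
    "Transportation": ["Industries", "Technology by type", "Travel"],
}


def paths_to_top(category, parents):
    """Return every upward path from `category` to a category with no listed parent.

    Assumes the mapping is acyclic; a circular reference would recurse forever.
    """
    ups = parents.get(category, [])
    if not ups:
        return [[category]]
    paths = []
    for parent in ups:
        for head in paths_to_top(parent, parents):
            paths.append(head + [category])
    return paths


for path in paths_to_top("Trains", PARENTS):
    print(" > ".join(path))
```

In a strict tree there would be exactly one such route; here ‘Trains’ is reachable by three, one for each parent of ‘Transportation’.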

Screen shots of the CategoryTree hierarchies I examined are provided below so that anyone interested can peruse them before perhaps investigating the question for real online.

Conclusion? The Wikipedia categorization system reflects but does not consistently apply the principles of KO as expounded in the formal literature. It is nevertheless interesting because it might well represent what results when folksonomy meets formal KO and agrees to a compromise.

If anyone has the time and patience to analyze this interesting phenomenon further and comment on it, then I for one would be grateful. And I’m sure Silver Oliver would too, since he and his colleagues at the BBC have invested considerable effort in building a system which utilizes Wikipedia topics as subject identifiers for their own internal use. Obviously, they would like to know if Wikipedia’s categories can be utilized ‘as is’ or whether they need to embark on a categorization exercise of their own.


9 Responses to Wikipedia’s approach to categorization

  1. It is an interesting analysis indeed, thank you Bob!

    No, I wouldn’t say the Wikipedia system is a freely faceted classification, except for the feature of allowing everything to be combined with everything (i.e. a Wikipedia entry can have any number of categories, which are coordinated with one another).

    Polyhierarchy can be fine. What seems to be lacking is consistency in facet analysis (as rail transport and public transport should be complementary facets, not siblings within the same array) and in the design of a general structure. I wonder which are the top categories?

  2. Many thanks Bob for such an insightful review of this problem. It is interesting to look at some of the work going on to derive structure from the Wikipedia dataset.

    In particular, see DBpedia’s work. The DBpedia project currently exposes a number of hierarchical vocabulary structures:

    http://blog.georgikobilarov.com/2008/10/dbpedia-rethinking-wikipedia-infobox-extraction/

    Freebase has also done some work in this area.

    At the BBC we are hoping to do a research piece in exactly this area: how useful are these emerging vocabulary structures for end-user browsing?

  3. Bob Bater says:

    Claudio,

    On quick inspection, the effective top-level categories appear to be:

    [+] Topical indexes (2)
    [+] Categories by topic (48)
    [+] Agriculture (50)
    [+] Applied sciences (25)
    [+] Arts (24)
    [+] Belief (10)
    [+] Business (62)
    [+] Chronology (16)
    [+] Computing (43)
    [+] Crafts (25)
    [+] Culture (43)
    [+] Education (49)
    [+] Environment (31)
    [+] Geography (29)
    [+] Health (29)
    [+] History (44)
    [+] Humanities (18)
    [+] Language (14)
    [+] Law (53)
    [+] Mathematics (53)
    [+] Music (56)
    [+] Nature (18)
    [+] People (39)
    [+] Politics (51)
    [+] Science (32)
    [+] Society (67)
    [+] Technology (45)
    [+] Visual arts (40)

    I found these by navigating up from an arbitrary sub-category, choosing the parent each time, until I came to the above list whose single parent is ‘Articles’.
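For anyone wanting to repeat that climb programmatically, a rough Python sketch follows. The `EXAMPLE` mapping is a made-up excerpt: the links above ‘Transportation’ are assumed for illustration and have not been checked against the live category data. The function name `climb` is likewise hypothetical.

```python
# Made-up excerpt: child -> list of parents. Links above "Transportation"
# are assumptions for illustration only.
EXAMPLE = {
    "Trains": ["Rail Transport"],
    "Rail Transport": ["Transportation"],
    "Transportation": ["Technology by type"],
    "Technology by type": ["Technology"],
    "Technology": ["Articles"],
}


def climb(category, parents, stop="Articles"):
    """Follow the first listed parent repeatedly and return the chain visited.

    The climb ends when the next parent is `stop`, when a category has no
    listed parent, or when a category would be visited twice.
    """
    chain = [category]
    seen = {category}
    while True:
        ups = parents.get(chain[-1], [])
        if not ups or ups[0] == stop:
            return chain
        nxt = ups[0]
        if nxt in seen:  # guard against circular references
            return chain
        chain.append(nxt)
        seen.add(nxt)


chain = climb("Trains", EXAMPLE)
print(" < ".join(chain))
print("Top-level category reached:", chain[-1])
```

Always taking the first parent is an arbitrary choice; a different choice at each step may reach a different top-level category.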

  4. Andreas says:

    Hi all,
    Very interesting blog! I am trying to work on Wikipedia’s categories (with respect to renewable energy). To that end I am developing a tool which works on an offline Wikipedia. To reconstruct the category tree from a number of subcategories, the tool automatically does what Bob Bater did (above), also bottom up, but just for a set of “green” categories.
    The tool can visualize the graph or subgraphs and can identify cycles (circular category references). They exist, although it’s supposed to be a directed acyclic graph. My idea is to set up an ontology of renewable energy related terms, but fully automated approaches (e.g. the Heymann algorithm) leave room for improvement. So I expect Wikipedia can help out a lot, I hope. I am willing to share my efforts if there is interest, and would be happy about any comments from the KOmmunity 😉
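One common way to identify such circular references is a depth-first search that marks categories as ‘in progress’ and reports a cycle when it meets one of them again. The Python sketch below illustrates that general technique and is not the code of the tool described here. The category names in `LOOP` are invented, and `find_cycle` is a hypothetical helper name.

```python
def find_cycle(children):
    """Return one cycle as a list of categories, or None if the graph is acyclic.

    `children` maps each category to a list of its subcategories.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for child in children.get(node, []):
            state = colour.get(child, WHITE)
            if state == GREY:
                # `child` is already on the current path: a cycle is closed.
                return stack[stack.index(child):] + [child]
            if state == WHITE:
                found = visit(child)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in list(children):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Invented example with a circular reference.
LOOP = {
    "Energy": ["Renewable energy"],
    "Renewable energy": ["Solar power"],
    "Solar power": ["Energy"],
}
print(find_cycle(LOOP))
```

The recursion depth grows with the depth of the graph, so for the full category graph an iterative version with an explicit stack would probably be safer.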
    Best, Andreas

    • Zahra says:

      Hi Andreas,
      I need the Wikipedia category tree (or more precisely ‘graph’) for my thesis and, as you said, you are developing a tool which can reconstruct the tree. Can you please guide me on this problem? How do you construct the tree?
      It would be very nice of you to guide me on this issue.

      • Andreas says:

        Hi Zahra,

        I started with an initial list of categories relevant for Renewable Energy. A recursive procedure goes through that list and identifies parent categories. This can be done using an offline Wikipedia (which is a bit of a hassle and maybe overkill, if all you want is a category tree). Alternatively, I believe DBpedia could be useful. Do you need the whole tree or only subtopics?
        Once the list of parents is detected, I maintain a data structure with all parent-child relations and call the procedure recursively with the parents. If there are no parents for a category x, the recursion base case is reached and I connect x to the root. There is also a tool called CategoryTree, see http://en.wikipedia.org/w/index.php?title=Special:CategoryTree
        but I didn’t figure out how to get the source code… Hope that helps. Let me know if you need more information,
        best
        Andreas
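The procedure described in this reply can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual tool: `lookup_parents` is a hypothetical callable standing in for whatever source supplies parent categories (an offline Wikipedia, DBpedia, or similar), `ROOT` is an invented name for the artificial root, and the `TOY` data is made up.

```python
ROOT = "ROOT"  # invented name for the artificial root node


def build_graph(seeds, lookup_parents):
    """Collect (parent, child) edges by walking upwards from the seed categories.

    `lookup_parents` is a hypothetical callable: category -> list of parents.
    """
    edges = set()
    visited = set()

    def expand(category):
        if category in visited:  # already handled; also stops circular references
            return
        visited.add(category)
        parents = lookup_parents(category)
        if not parents:
            # Base case: no parents, so attach the category to the root.
            edges.add((ROOT, category))
            return
        for parent in parents:
            edges.add((parent, category))
            expand(parent)

    for seed in seeds:
        expand(seed)
    return edges


# Made-up data for illustration.
TOY = {
    "Wind power": ["Renewable energy"],
    "Solar power": ["Renewable energy"],
    "Renewable energy": ["Energy"],
}
for parent, child in sorted(build_graph(["Wind power", "Solar power"], lambda c: TOY.get(c, []))):
    print(parent, "->", child)
```

Keeping the edges in a set means a parent shared by several seed categories is recorded only once.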


    • Zahra says:

      Hi Andreas,
      Thank you for your attention and reply. It helped me very much. Actually, I needed the whole category tree. Using the DBpedia category dataset, the whole tree was reconstructed.
      Good luck
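For readers who want to try the same route, here is a rough Python sketch of reading parent links out of a DBpedia category dump. It rests on assumptions not confirmed in this thread: that the dump is in N-Triples format and that subcategory links are expressed with `skos:broader` between `Category:` resources. The two sample lines are typed by hand for illustration, and `parents_from_ntriples` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed predicate and resource prefix for category links in the dump.
BROADER = "http://www.w3.org/2004/02/skos/core#broader"
PREFIX = "http://dbpedia.org/resource/Category:"
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def parents_from_ntriples(lines):
    """Build a child -> set-of-parents mapping from N-Triples lines."""
    parents = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip comments, literals and anything malformed
        subject, predicate, obj = match.groups()
        if predicate == BROADER and subject.startswith(PREFIX) and obj.startswith(PREFIX):
            parents[subject[len(PREFIX):]].add(obj[len(PREFIX):])
    return parents


# Hand-typed sample lines, not copied from a real dump.
SAMPLE = [
    "<http://dbpedia.org/resource/Category:Trains> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Rail_transport> .",
    "<http://dbpedia.org/resource/Category:Rail_transport> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Transportation> .",
]
for child, ups in parents_from_ntriples(SAMPLE).items():
    print(child, "->", sorted(ups))
```

For a real dump the lines would be read from the file one at a time rather than held in a list.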

  5. Polack says:

    Thank you, this was a good article. I wish you success. Good day.

  6. […] to Categorization,” September 22, 2008, provides useful comments on category issues; see https://iskouk.wordpress.com/2008/09/22/wikipedias-approach-to-categorization/. [18] Olena Medelyan and Cathy Legg, 2008. Integrating Cyc and Wikipedia: Folksonomy Meets […]
