Graph-ted on: Dewey and Wikipedia

February 24th, 2014 by Max

Introduction

Since entering the library world, I’ve been fascinated by the substream of uber-geeks which are the Dewey Decimal Classification (DDC) nerds. It was the enthusiastic and technical-leaning banter of Editor-in-Chief Michael Panzer, that enticed me. When I learned that the DDC is designed to be able to describe all possible concepts, and then also that is not a strict hierarchy, but really a web of complex relations, I was intrigued by the foresight of the approach. I see the DDC as a best effort 100 years before anyone would utter “crowdsourcing,” – which is how a modern analog the Wikipedia Category systems works. Using Graph Theory we develop a new way to look at, and compute with DDC and then connect it to Wikipedia Categories.

(In this post we assume some knowledge of Graph Theory, so if you need a quick primer, it’s at the end: Graph Theory tl;dr.)

Representing the Dewey Decimal Classification in Graph Theory

In order to represent DDC using Graph Theory we have to translate its components into graph theory. That means,

  • Classifications become the “nodes”.
  • Relationships between Classifications become the “edges.”
  • Relation colourings:
    • Notational hierarchy – Red
    • See References – Green
    • Class elsewhere – Blue

We are actually defining a directed graph because the relations we have are not in general reciprocal. That is, if a classification A is below B in hierarchy,  or  A has a “see also” reference to B, B does not necessarily have that reference to A. (Although there are some reciprocal “class elsewhere” relations for example 391 and 646.3  in the 4 hop graph below). Note also that colourings – the mathematical parlance – indicates that relationships are different without assigning the relationship a numerical value.

Visualizations

This has all been quite abstract so far so let’s look at something concrete. We’ll pick classification 394.1, whose caption is the enjoyable “Eating, drinking; using drugs” .  The distance to other nodes is how many edges we travel from our source. We will look at all the connections that are n-hops from 394.1. Below is the evolution of visualizations from distances 2, 3, 4, and 5. By 5 hops you can see that the graph begins to be too complicated to visually represent in a single image.

 

394_1-2hops

394_1-3hops

394_1-4hops

394_1-5hops

Weighting the Graph

One of the powerful aspects of rendering the classification system as a graph is that we can determine the shortest distance between two nodes. However, as mentioned, our colourings have no intrinsic value, so in order to to find the shortest paths, we have to assign weights to colours. This is variable, but for the rest of the examples I am using the settings: Notational Hierarchy = 0.5, See Reference = 1, Class_elsewhere = 2.  That means that our shortest path between 394.1 and 341.33 (“Diplomatic Law”) is 394.1 -> 394-> 395  -> 327.2 -> 341.33, which has a total value of 5.5.

Transforming Wikipedia Categories

Consider a Wikipedia Category. It can contain many Wikipedia  articles, each of which has zero or more citations of books. Some of those book citations include the ISBN, which is precisely what the OCLC  Classify API is hungry for. Classify will transform an ISBN or OCLC number into a DDC, based on how  it is most popularly cataloged in WorldCat. That means we can arrive at a set of classifications associated with the category. From this set it would be instructive if we could somehow average them into one representative classification. In Graph Theory terms this is known as finding the center, and there are several ways to approach this problem. The method we employ here is the betweenness centrality. That is, we calculate the weighted shortest paths between all the classifications associated with the category, and find the nodes that were most touched in all the shortest paths. This is rather abstruse without some examples. We run the algorithm on some categories and below are the top 5 between nodes of:

  • Category:Cooking techniques:
    1. 641 – Food and Drink
    2. 641.5 – Cooking
    3. 641.3 – Food
    4. 613.2 – Dietetics
    5. 641.815 – Bread and bread-like foods
  • Category:Feminism
    1. 305.42 – Social role and status of Women
    2. 305.43 – Women by occupation
    3. 305.9 – People by occupation and miscellaneous social statuses
    4. 305 – Social Groups
    5. 305.4 – Women

This shows there is some basis for proving that this technique works. The centre of the classifications we find are around the same topic of that organizes the classification. We usually finds a strand or run in the system that hover around the main ideas. One caveat though is that for both of these Categories only about 25% of Wikipedia articles actually contain an ISBN citation. Yet of articles that do include at least one ISBN citation, they average 2.7 citations per article. With that in mind we continue to look at some set of articles that are not organized by topic, and are also more ISBN-rich. What better category than Featured Articles?

  • Category:Featured articles
    1. 930 – History of the ancient world c. 499
    2. 940 – History of Europe
    3. 361 – Social problems and welfare in general
    4. 800 – Literature and Rhetoric
    5. 362 – Social welfare problems & services
    6. 580 – Plants (Botany)

Featured articles, of which there were ~4,000 gave lead to 25,000 ISBNs, which related to 17,000 Classifications. There are only approximately 50,000 standard Classifications. This at first would seem impressive coverage, but distribution of the coverage in the graph was not calculated, so its still difficult to say how diverse the featured article set is. Computing on this set was very intensive, in fact just computing the centre of these 17,000 Classifications took 9 hours parallelized over 16 cores. These topics seem to be more abstract, and less interrelated. Without being able to infer some sort of connect, next we run the algorithm not on a category, but on the list of articles that your blogger has ever edited (307 articles).

  • Contributions of User:Maximilianklein
    1. 930 – History of the ancient world c. 499
    2. 800 – Literature and Rhetoric
    3. 302.23 -  Media (Means of communication)
    4. 910.45 – Ocean travel and seafaring adventures
    5. 361 – Social problems and welfare in general

Uncovered here is a latent passion for Seafaring adventures! This is probably owing to the fact that travel articles may have cited ISBN-having travel books which were disproportionately focused around sea-travel. More importantly what arises here is that several of the central Classifications that describe the edit history are similar to the featured articles results. This would seem to suggest that there are some very central nodes that most connections are channelled through if they lie at disparate regions of the graph. The implication is that the problem of classifying an arbitrary group of object remains difficult! Of course these results are only for one possible set of values for our weights dictionary, where these top level channels would be more “expensive” to hop-through. So future research should try and calibrate these values to produce the most varied set of central nodes.

This work is available as an Ipython notebook and on github, and comments welcome
notconfusing

Graph Theory? tl;dr

We are not talking about the colloquial meaning of “graph”, but rather the mathematical concept.

A Graph is a collection of:

  • “Nodes” or “Vertices” – the objects
    • Here letters A through F
  • “Edges” – relations (directed or undirected)
    • Here the arrows between letters

Not talking about this kind of graph. (Image By Inductiveload [Public domain], via Wikimedia Commons)

An example of a “Graph Theory” Graph (By David W. at de.wikipedia (Original text : David W.) [Public domain], from Wikimedia Commons)

 

Related posts:

Leave a Reply