xkcd Wikipedia Steps to Philosophy: About the script

Introduction

In early 2011, episode #903 of the xkcd web comic was published, with the following image title (the little descriptive text box that appears when you hover your mouse cursor over the image):

Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at "Philosophy".

This comic was apparently not the original source of the theory, but it contributed to the theory's popularity.

Many people were interested in testing this claim, and it seems to hold in general. However, some commentators noticed Wikipedia article edits, inspired by the xkcd comic, that attempted to "game" the path to the Philosophy article, so the extent to which it is an entirely natural phenomenon is debatable.

Original and alternative scripts

Ryan Elmquist wrote a script [dead link] to test the theory out. Unfortunately, that script is no longer online, so my friend Emil Kirkegaard convinced me to write this replacement. A clone of it is available on his site in case I and/or this site ever go MIA. I've since discovered several other scripts/apps that attempt to test this theory out:

Information and studies

There's a Wikipedia article about this phenomenon. Referenced in that article is a study by Ilmari Karonen, performed on a Wikipedia database dump, which concludes that as of 26 May 2011, 94.52% of Wikipedia articles terminated at Philosophy using this algorithm. A separate study by Mat Kelcey reached a comparable result: 95+% of articles terminate at Philosophy.
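To make concrete what "terminating at Philosophy" means in such a measurement, here is a minimal Python sketch. It is not the code used in either study. The names `reaches_target`, `fraction_reaching` and `TOY_FIRST_LINKS` are hypothetical, and the toy link map is invented purely for illustration. It assumes each article's first link has already been extracted into a dictionary.

```python
from typing import Dict, Optional


def reaches_target(start: str,
                   first_link: Dict[str, Optional[str]],
                   target: str = "Philosophy") -> bool:
    """Follow first links from `start`; True if the chain hits `target`."""
    seen = set()
    current: Optional[str] = start
    # Stop on a dead end (None) or when an article repeats (a loop).
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)
        current = first_link.get(current)
    return False


def fraction_reaching(first_link: Dict[str, Optional[str]],
                      target: str = "Philosophy") -> float:
    """Share of the articles in the map whose chain reaches `target`."""
    starts = list(first_link)
    hits = sum(reaches_target(s, first_link, target) for s in starts)
    return hits / len(starts)


# Invented toy data: article title -> title of its first link.
TOY_FIRST_LINKS: Dict[str, Optional[str]] = {
    "Mathematics": "Quantity",
    "Quantity": "Property",
    "Property": "Philosophy",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
    "Dead end": None,
}
```

Chains that run into a loop or a page with no usable link are counted as not terminating at Philosophy, which is why a real measurement comes out below 100%.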

Miscellaneous notes on and potential problems with the script

The script adds one condition to the original two: ignore links to articles in the Image, Filename, File, Special and Portal namespaces (i.e. where the article title starts with "Image:", "Filename:", "File:", "Special:" or "Portal:" respectively). This is simply because the contents of those namespaces aren't encyclopedic articles.
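The namespace condition can be sketched in a few lines of Python. This is an illustration only, not the script's actual source. The names `EXCLUDED_PREFIXES` and `is_countable_link` are hypothetical, and the case-sensitive prefix match is an assumption.

```python
# Namespace prefixes listed above (assumed to be matched case-sensitively).
EXCLUDED_PREFIXES = ("Image:", "Filename:", "File:", "Special:", "Portal:")


def is_countable_link(title: str) -> bool:
    """Return True if a link target may count as the article's first link."""
    # str.startswith accepts a tuple and checks every prefix in it.
    return not title.startswith(EXCLUDED_PREFIXES)
```

For instance, `is_countable_link("File:Example.jpg")` is False, while `is_countable_link("Philosophy")` is True.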

As a heuristic to avoid links in sidebars and in informational header panels, the script first strips out all <table></table>, <div></div> and <span></span> elements (in the process removing the links within them) from the HTML of each rendered Wikipedia article page before searching for the first link. Such links are generally in italics and, in any case, are not part of the article proper, so in my opinion they should not be counted as the first link. The script then searches for links within <p></p> or <li></li> tags. Should it fail to find a link, it restores the <table></table>, <div></div> and <span></span> elements and again searches for links within <p></p> or <li></li> tags. Should it still fail to find a link, it searches once more without requiring links to be within <p></p> or <li></li> tags. If it still doesn't find a link, it gives up.

It's possible that this heuristic is flawed and that on some pages it does not correctly select the link that a human would consider to be the "first" on the page. If anyone encounters a situation where this is the case, then please let me know.
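The three-pass fallback described above can be sketched as follows. This is a rough Python illustration, not the script's actual code. The regular expressions and the names `find_first_link` and `first_link_in` are assumptions, nested elements of the same kind are not handled properly by the non-greedy patterns, and the parentheses and italics rules are left out to keep the sketch short.

```python
import re
from typing import Optional

# A whole <table>, <div> or <span> element (non-greedy: nesting is not handled).
CONTAINER_RE = re.compile(r"<(table|div|span)\b[^>]*>.*?</\1>",
                          re.DOTALL | re.IGNORECASE)
# The body of a <p> or <li> element.
BLOCK_RE = re.compile(r"<(p|li)\b[^>]*>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
# An anchor whose href points at a /wiki/ path; captures the article title.
LINK_RE = re.compile(r'<a\b[^>]*\bhref="/wiki/([^"#]+)', re.IGNORECASE)


def first_link_in(html: str, require_block: bool) -> Optional[str]:
    """First /wiki/ link, optionally only inside <p> or <li> elements."""
    if not require_block:
        match = LINK_RE.search(html)
        return match.group(1) if match else None
    for block in BLOCK_RE.finditer(html):
        match = LINK_RE.search(block.group(2))
        if match:
            return match.group(1)
    return None


def find_first_link(html: str) -> Optional[str]:
    """Try the three passes in order; None means the search gave up."""
    stripped = CONTAINER_RE.sub("", html)
    attempts = [
        (stripped, True),   # pass 1: containers removed, <p>/<li> only
        (html, True),       # pass 2: containers restored, <p>/<li> only
        (html, False),      # pass 3: containers restored, any link
    ]
    for candidate_html, require_block in attempts:
        link = first_link_in(candidate_html, require_block)
        if link is not None:
            return link
    return None
```

A real implementation would more likely use an HTML parser than regular expressions. The point here is only the order of the fallbacks.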

Two-second delay between fetches

Out of courtesy to Wikipedia and as an anti-hammering measure, the script pauses for two seconds between article fetches.
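A delay like this might look as follows in Python. This is a sketch under assumptions, not the script's own code. `fetch_politely` and its `fetch` and `sleep` parameters are hypothetical names, and `fetch` stands in for whatever function actually downloads an article.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(titles: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each title in turn, pausing between consecutive fetches."""
    pages = []
    for index, title in enumerate(titles):
        if index > 0:
            # No pause before the first fetch, only between fetches.
            sleep(delay_seconds)
        pages.append(fetch(title))
    return pages
```

Passing `sleep` as a parameter is a design choice that lets the pause be replaced with a stub when testing, so a test run doesn't actually wait.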


Try the script or check out the statistics or changelog pages.