xkcd Wikipedia Steps to Philosophy: Changelog

The following changes have been made to this script to correct bugs, to keep up with Wikipedia changes, or simply to add features:

17 November 2024:
- Updated the detection of the start of an article's content within its HTML due to apparent changes on Wikipedia: added another possible string, <div class="mw-content-ltr mw-parser-output" lang="en" dir="ltr">, and changed the check to use the position of whichever of the possible strings occurs last.
- Updated the statistics generation code to use database tables as caches rather than caching HTML pages, because the HTML cache generation had been taking so long that it was timing out.
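
The start-of-content check from the 17 November 2024 entry might look roughly like the Python sketch below. It is an illustration only, not the script's actual code. The name find_content_start is hypothetical, the marker list is simply collected from the entries in this changelog, and it assumes that "the last of all of the possible strings" means the marker found furthest into the page.

```python
# Marker strings collected from the entries in this changelog.
CONTENT_START_MARKERS = [
    '<div id="bodyContent" class="mw-body-content">',
    '<div id="mw-content-text"',
    '<div class="mw-parser-output">',
    '<div class="mw-content-ltr mw-parser-output" lang="en" dir="ltr">',
]


def find_content_start(html):
    """Return the offset of the marker found furthest into the page, or -1.

    Hypothetical helper: only the first occurrence of each marker is
    considered, which is an assumption about the intended behaviour.
    """
    positions = [html.find(marker) for marker in CONTENT_START_MARKERS]
    found = [pos for pos in positions if pos != -1]
    return max(found) if found else -1
```

Taking the furthest marker means that when several of these wrappers are nested, link searching starts inside the innermost one.
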
18 January 2022:
- Updated the regular expression that detects the article heading due to apparent changes on Wikipedia.
13 June 2018:
- [Note that I neglected to upload these changes until 19 July 2019]
- Added another two possible strings that indicate the start of an article's content within its HTML, <div class="mw-parser-output"> and <div id="mw-content-text". This corrects an error where (sometimes?) the chosen first link was incorrect.
- Fixed a minor problem with the write-error-message-and-exit function: ensure that when it is called in the middle of an open unordered list tag, it closes that tag before writing the closing HTML.
23 February 2018:
- Fixed a minor misplacement of a file close function.
- Set the correct modes (permissions) for cached statistics files and directories.
19 March 2017:
- Added a couple of missing closing tags which had been causing validation errors when a database error was displayed.
- Worked around a library error by stripping everything from the hash (#) onward from a URL before querying it.
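
As a rough illustration of the hash workaround above, here is a minimal Python sketch. It is not the script's own code: strip_fragment is a hypothetical name, and the standard library's urldefrag stands in for whatever the script actually does.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#fragment' part, if it has one."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url
```

For example, strip_fragment("https://en.wikipedia.org/wiki/Logic#History") gives "https://en.wikipedia.org/wiki/Logic".
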
17 February 2016:
- Updated the statistics code so that existing article titles in the "terminating" table which now redirect to a different article title get recorded with a "number of links in the chain to Philosophy" of zero, to indicate that they are no longer active article titles.
16 December 2014:
- Fixed a bug in the statistics generation which was causing updates to be written to the wrong directory for all pages other than full pages, so that statistics weren't actually updating. This bug was probably introduced in the reworking of the code between 2 May and 10 June 2014. If you viewed the stats between then and the date of this fix (16 December 2014), then they were probably out of date.
- Added to the output the duration of the stats rebuild when the page is left open during rebuild.
10 June 2014:
- Added to the main statistics page the counts of pages most recently in each of the terminating and looping tables, and, based on these counts, the percentage of pages that terminate.
Sometime between 2 May and 10 June 2014:
- Reworked the statistics generation code to dramatically reduce memory requirements (at the cost of speed).
2 May 2014:
- Added a check for another possible string indicating the start of an article's content within its HTML, <div id="bodyContent" class="mw-body-content">.
25 June 2013:
- URL-encoded the initially-input article so that URL-sensitive characters like "?" now work in article titles.
- Added a [wiki] link to articles on statistics pages.
- Fixed up support for forward slashes in input-article titles.
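
The title encoding described in the 25 June 2013 entry could be sketched in Python as below. This is a guess at the general idea rather than the script's code: encode_title is a hypothetical name, and leaving forward slashes unencoded is an assumption.

```python
from urllib.parse import quote


def encode_title(title):
    """Percent-encode URL-sensitive characters in an article title.

    Assumption: forward slashes are kept as-is so that titles
    containing "/" still map onto a single article path.
    """
    return quote(title, safe="/")
```

With this sketch, a title such as "Which?" becomes "Which%3F", while "AC/DC" is left unchanged.
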
21 June 2013:
- Split the ever-growing statistics page into multiple pages so people can avoid loading excessive content into their browser if they want to.
6 January 2013:
- Adjusted the regular expression detecting the article title to keep up with changes to the output HTML on Wikipedia.
11 May 2012:
- Optimised the database queries on the statistics page so they now run within seconds rather than minutes.
22 April 2012:
- Added checking in the tag-stripping code for a div tag with a class attribute ("hlist") indicating that it encloses a horizontal list, so that that list isn't stripped out - this had been leading to a wrong first link sometimes being chosen.
- Deleted from the statistics in the database all known results that would not have been there had the previous fix been in effect.
31 March 2012:
- Amended the parsing of the HTML to detect nested parentheses. Failure to take account of these (for example, when the value of an HTML attribute within an open parenthesis itself contains parentheses) was sometimes leading to a link being selected that was actually within parentheses.
- Added checking in the tag-stripping code for a div tag with an id attribute indicating that it encloses the entire contents of the page, so that the entire contents aren't stripped out - this had been leading to a wrong first link sometimes being chosen.
- Added "Special" and "File" to the list of namespaces whose occurrence in a link excludes it from being counted.
- Deleted from the statistics in the database all known results that would not have been there had the previous fixes been in effect.
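
The nested-parentheses problem from the 31 March 2012 entry can be illustrated with a depth counter, as in the Python sketch below. It is a simplified stand-in for the script's parser, not its actual code. The function name is hypothetical, it assumes links are written exactly as <a href="...">, and it ignores the other rules the script applies (such as excluded namespaces).

```python
def first_link_outside_parentheses(html):
    """Return the href of the first link at parenthesis depth 0, or None.

    Parentheses are counted everywhere, including inside attribute
    values, so a title such as "Greek_(language)" inside an open
    parenthesis raises and then lowers the depth without closing
    the outer parenthesis.
    """
    prefix = '<a href="'
    depth = 0
    for i, ch in enumerate(html):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(0, depth - 1)  # tolerate stray closing parentheses
        elif depth == 0 and html.startswith(prefix, i):
            start = i + len(prefix)
            end = html.find('"', start)
            return html[start:end] if end != -1 else None
    return None
```

A simple inside/outside flag would treat the ")" in "Greek_(language)" as the end of the outer parenthesis and could then pick a link that is still parenthesised. Counting depth avoids that.
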
13 March 2012:
- Amended the parsing of the HTML to detect Wikipedia article titles given that Wikipedia is now wrapping titles in an extra tag: <span dir="auto">Article title</span>. This change by Wikipedia had been preventing the script from detecting the Philosophy article, so that it was detecting infinite loops instead of terminations on the Philosophy article. It was also adding these infinite loops to the statistics in the database, and I cleaned these incorrect entries from the database.
Earlier:
- Various undocumented fixes and changes.

Try the script or check out the statistics or about pages.