FUPS: Forum user-post scraper

Enter settings for your XenForo forum

To retrieve your posts, fill in the settings below (optionally after reading the questions and answers beneath the settings form), then click "Retrieve posts!". A status page will appear and update progress automatically in a status box. When scraping is complete, the status page will link to the results file(s).

Specifies the forum type (e.g. "phpBB" or "XenForo").
Set this to the base URL of the forum. This is the URL that appears in your browser's address bar when you access the forum, with everything from the path of whichever script is being accessed (e.g. /threads or /forums) onwards stripped off. The default URL provided is for the particular XenForo board known as "CivilWarTalk".
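As a rough illustration of the stripping rule above, here is a minimal Python sketch. It is not FUPS' own code: the function name base_forum_url and the markers tuple are assumptions made for this example, and the second URL in the usage note is hypothetical.

```python
from urllib.parse import urlsplit


def base_forum_url(page_url, markers=("/threads", "/forums")):
    """Strip everything from the first script path (e.g. /threads) onwards.

    `markers` is an assumed list of script paths; extend it if the forum
    you are looking at uses others.
    """
    parts = urlsplit(page_url)
    path = parts.path
    # Find where each marker starts in the path, ignoring markers that are absent.
    cuts = [i for i in (path.find(m) for m in markers) if i != -1]
    if cuts:
        path = path[:min(cuts)]
    return f"{parts.scheme}://{parts.netloc}{path}".rstrip("/")
```

With the thread URL used elsewhere on this page, base_forum_url("http://civilwartalk.com/threads/traveller.84936/page-2") gives "http://civilwartalk.com". For a forum installed in a subdirectory, such as the hypothetical "http://example.com/community/forums/general.5/", it gives "http://example.com/community".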
Set this to the user ID of the user whose posts are to be extracted. You can find a user's ID by hovering your cursor over a hyperlink to their name and noting, in the browser's status bar, everything that appears between "/members/" and the next "/" (this will be something like "my-member-name.12345").
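If you would rather copy the link and pull the ID out of it mechanically, the rule above can be sketched in Python as follows. The function name extract_user_id and the member URL in the usage note are hypothetical, chosen only for illustration.

```python
import re
from typing import Optional


def extract_user_id(member_url: str) -> Optional[str]:
    """Return the text between "/members/" and the next "/", or None."""
    # Stop at the next slash, or at a query string or fragment if one follows.
    match = re.search(r"/members/([^/?#]+)", member_url)
    return match.group(1) if match else None
```

For example, extract_user_id("http://civilwartalk.com/members/my-member-name.12345/") returns "my-member-name.12345", and a URL without a "/members/" part returns None.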
Set this to the datetime of the earliest post to be extracted, i.e. only posts of this datetime and later will be extracted. If you leave it blank, then all posts will be extracted. This value is parsed with PHP's strtotime() function, so see the PHP documentation for that function for details on what it should look like. An example of something that will work is: 2013-04-30 15:30.
Set this to the time zone in which the user's posts were made. Valid time zone values are those listed in the PHP manual's list of supported time zones. This only applies when "Start From Date+Time" is set above, in which case the value that you supply for "Start From Date+Time" will be assumed to be in the time zone you supply here, as will the date+times for posts retrieved from the forum. It is safe to leave this value set to the default if you are not supplying a value for the "Start From Date+Time" setting.
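To make the interaction between the start datetime and the time zone concrete, here is a minimal Python sketch. It is an illustration only, not how FUPS does it: FUPS parses the value with PHP's strtotime(), which accepts many more formats, whereas this sketch handles just the example format given above. The function name parse_in_zone, the fixed UTC+10 offset standing in for a named time zone, and the sample post times are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_in_zone(value: str, tz: timezone) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' and interpret it in the given time zone."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M").replace(tzinfo=tz)


# Hypothetical time zone: a fixed UTC+10 offset standing in for a named zone.
forum_tz = timezone(timedelta(hours=10))

start = parse_in_zone("2013-04-30 15:30", forum_tz)

# Hypothetical post times, read in the same time zone as the start value.
post_times = [
    parse_in_zone("2013-04-30 15:29", forum_tz),
    parse_in_zone("2013-04-30 15:30", forum_tz),
    parse_in_zone("2013-05-01 09:00", forum_tz),
]

# Only posts at the start datetime or later are kept.
kept = [t for t in post_times if t >= start]
```

Here kept contains the second and third post times: the post made exactly at the start datetime is included, and the one a minute earlier is not.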
Check this box if you want FUPS to scrape all images in posts too, and to adjust image URLs to refer to the local, downloaded images. Note that images which are attached to posts, but which are not included inline in the post itself, will not be scraped (because FUPS does not yet support the scraping of attachments for XenForo forums).
Check this box if the forum from which you're scraping outputs dates in the non-US ordering dd/mm rather than the US ordering mm/dd. Applies only if day and month are specified by digits and separated by forward slashes.
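The ambiguity that this checkbox resolves can be sketched in a few lines of Python. This is a simplified illustration: it assumes the date is written as three slash-separated numbers with a four-digit year, and the function name parse_slash_date is hypothetical.

```python
from datetime import date


def parse_slash_date(text: str, non_us: bool) -> date:
    """Parse 'a/b/yyyy' as dd/mm/yyyy if non_us is True, else as mm/dd/yyyy."""
    first, second, year = (int(part) for part in text.split("/"))
    day, month = (first, second) if non_us else (second, first)
    return date(year, month, day)
```

The same text "04/05/2013" is read as 4 May 2013 with non_us=True and as 5 April 2013 with non_us=False, which is why the setting matters whenever both numbers are 12 or less.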
Enter the number of seconds you wish for FUPS to delay between consecutive requests to the same web host (the minimum is five). This is required so as to avoid hammering other people's web servers.
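For readers curious how a per-host delay of this kind can work, here is a minimal Python sketch. It is not FUPS' implementation: the class name HostThrottle and its parameters are assumptions for illustration. The clock and sleep functions are injectable so that the logic can be tried out without actually waiting.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY = 5.0  # seconds; mirrors the minimum stated above


class HostThrottle:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = max(float(delay), MIN_DELAY)  # never go below the minimum
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting `url`; return the seconds slept."""
        host = urlsplit(url).netloc
        waited = 0.0
        last = self._last.get(host)
        if last is not None:
            remaining = self.delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last[host] = self._clock()
        return waited
```

A caller would invoke wait(url) immediately before each request. Requests to different hosts do not delay one another, because the last request time is tracked per host.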
Set this to the part of the URL for forum thread (topic) pages that lies between the beginning of the URL (the value entered above beside "Base forum URL", followed by a forward slash) and the end of the URL (the thread id, optionally followed by a forward slash and page number). By default, this setting should be "threads/", but the XenForo forum software supports changing this default through route filters, and some XenForo forums have been configured in this way such that this setting ("Thread URL prefix") needs to be empty. For example, with "Base forum URL" set to "http://civilwartalk.com", the value in the typical thread URL "http://civilwartalk.com/threads/traveller.84936/page-2" is "threads/". Here, the initial base URL plus forward slash is "http://civilwartalk.com/", the thread id part is "traveller.84936" and the optional forward-slash-followed-by-page-number part is "/page-2". If route filtering were set up on the CivilWarTalk forum such that this setting should be empty, then that same thread URL would have looked like this: "http://civilwartalk.com/traveller.84936/page-2". If, hypothetically, this "Thread URL prefix" setting were to correctly be "topic/here/", then that same thread URL would have looked like this: "http://civilwartalk.com/topic/here/traveller.84936/page-2".
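The way these pieces fit together can be sketched in Python as follows. The function name thread_url is hypothetical, and the sketch assumes that the first page of a thread carries no page suffix, which matches the "optionally followed by" wording above but is otherwise an assumption.

```python
def thread_url(base_url: str, prefix: str, thread_id: str, page: int = 1) -> str:
    """Assemble a thread URL from base URL, thread URL prefix, thread id and page."""
    url = f"{base_url.rstrip('/')}/{prefix}{thread_id}"
    if page > 1:
        # Assumption: only pages after the first carry a "/page-N" suffix.
        url += f"/page-{page}"
    return url


base = "http://civilwartalk.com"
default_style = thread_url(base, "threads/", "traveller.84936", 2)
empty_prefix = thread_url(base, "", "traveller.84936", 2)
custom_prefix = thread_url(base, "topic/here/", "traveller.84936", 2)
```

The three results reproduce the three example URLs given above: the default "threads/" prefix, the empty prefix, and the hypothetical "topic/here/" prefix.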

Answers to possible questions

How can I know if a forum is a XenForo forum?

Typically, XenForo forums can be identified by the presence of the text "Forum software by XenForo" in the footer of their forum pages. It is possible, however, that these footer texts have been removed by the administrator of the forum. In this case, the only way to know for sure is to contact your forum administrator.

Does the script work with forums using a language other than English?

Yes, or at least, it's intended to: if you experience problems, please contact me.

Which skins are supported?

Whichever skin(s) is/are default for the CivilWarTalk, ECIGS SA and Skeptiko forums. FUPS' XenForo scraping functionality was originally developed as a paid job to extract posts from the CivilWarTalk forum; since then it has been tested on the other two forums and seems to function fine. If you need support for another XenForo skin, feel free to contact me.

How long will the process take?

It depends on how many posts are to be retrieved, and how many pages they are spread across. You can expect to wait roughly one hour to extract and output 1,000 posts.

Are images supported?

Yes. If you check "Scrape images" (checked by default), then images are downloaded along with the posts. If not, then all relative image URLs are converted to absolute URLs, so images will display in the HTML output files so long as you are online at the time of viewing those files.
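As an illustration of what converting a relative image URL to an absolute one involves, here is a short Python sketch using the standard library's urljoin. It is not FUPS' own code, and the image paths are hypothetical.

```python
from urllib.parse import urljoin

# The page on which the post (and its image reference) was found.
page_url = "http://civilwartalk.com/threads/traveller.84936/page-2"

# Hypothetical image references as they might appear in a post's HTML.
root_relative = urljoin(page_url, "/data/example.jpg")
path_relative = urljoin(page_url, "styles/example.png")
already_absolute = urljoin(page_url, "http://example.com/pic.gif")
```

A root-relative reference resolves against the host, a path-relative one resolves against the page's directory, and a reference that is already absolute is left unchanged.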

Is the downloading of attachments supported?

In general, yes, but not yet for XenForo forums.

Why is this script so slow?

So as to avoid hammering other people's web servers, the script pauses between consecutive page retrievals for the delay you specify in the settings, which is at least five seconds.

Are there any resource issues of which I should be aware?

Yes - because this site is hosted on a shared server, I am limited to a fixed and fairly small number of processes, and each run of FUPS requires two processes, one for the background process doing the scraping, and another for the status web page. For most users, too, the number of posts is significant and the process will run for some time. Please, then, limit yourself to one run of the script at a time, and if you change your mind about wanting to run the script after having clicked "Retrieve posts!", then please click the cancellation link.