FUPS: Forum user-post scraper

Enter settings for your phpBB forum

To retrieve your posts, fill in the settings below (optionally after reading the questions and answers beneath the settings form), then click "Retrieve posts!". A status page will appear and report progress automatically in a status box. When scraping is complete, links to the results file(s) will be shown.

Specifies the forum type (e.g. "phpBB" or "XenForo").
Set this to the base URL of the forum. This is the URL that appears in your browser's address bar when you access the forum, only with everything onwards from (and including) the filename of whichever script is being accessed (e.g. /index.php or /viewtopic.php) stripped off. The default URL provided is for the particular phpBB board known as "Genius Forums".
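As an illustration of the rule above, here is a minimal Python sketch that derives a base URL from a page URL. It is not FUPS's own code. The function name `forum_base_url` and the host `forums.example.com` are made up for the example, and it assumes that the script being accessed is the final path segment and ends in ".php".

```python
from urllib.parse import urlsplit, urlunsplit


def forum_base_url(page_url: str) -> str:
    """Strip the script filename, and everything after it, from a forum page URL."""
    parts = urlsplit(page_url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: the script filename is the last path segment and ends in ".php".
    if last_segment.endswith(".php"):
        path = path[: len(path) - len(last_segment)]
    # The query string and fragment come after the filename, so they are dropped too.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(forum_base_url("http://forums.example.com/phpBB3/viewtopic.php?f=2&t=10"))
```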
Set this to the user ID of the user whose posts are to be extracted. You can find a user's ID by hovering your cursor over a hyperlink to their name and taking note of the number that appears after "&u=" in the URL in the browser's status bar.
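The same lookup can be done programmatically. The following Python sketch reads the "u" parameter from a profile link; the function name `user_id_from_profile_link` and the example host are hypothetical, and the sketch assumes the link carries the ID in a query parameter named "u", as described above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def user_id_from_profile_link(href: str) -> Optional[int]:
    """Return the numeric user ID found after "u=" in a profile link, if any."""
    query = parse_qs(urlsplit(href).query)
    values = query.get("u")
    if values and values[0].isdigit():
        return int(values[0])
    return None  # no usable "u" parameter in this link


print(user_id_from_profile_link(
    "http://forums.example.com/memberlist.php?mode=viewprofile&u=1234"))
```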
Set this to the username corresponding to the above ID. It does not replace that ID, which is always required. This setting may be left blank if the script has permission to view member information on the specified phpBB board, in which case the script extracts the username automatically from the member information page associated with the above ID. That extraction will fail if the forum requires users to be logged in to view member information and you do not provide valid login credentials (which can be specified below); in that case, specify the username here.
Set this to the username of the user whom you wish to log in as (it's fine to set it to the same value as Extract User Username above), or leave it blank if you do not wish FUPS to log in. Logging in is optional, but if you log in then the timestamps associated with each post will be according to the timezone specified in that user's preferences, rather than the board default. Also, some boards require you to be logged in before you can view posts. If you don't want to log in, simply leave this setting and the next one blank.
Set this to the password associated with the Login User Username (or leave it blank if you do not require login).
Set this to the datetime of the earliest post to be extracted, i.e. only posts of this datetime and later will be extracted. If you leave it blank then all posts will be extracted. This value is parsed with PHP's strtotime() function, so see the PHP documentation for strtotime() for details on the formats it accepts. An example of something that will work is: 2013-04-30 15:30.
Set this to the time zone in which the user's posts were made. Valid time zone values are those in the linked list of supported time zones. This only applies when "Start From Date+Time" is set above, in which case the value that you supply for "Start From Date+Time" will be assumed to be in the time zone you supply here, as will the date+times for posts retrieved from the forum. It is safe to leave this value set to the default if you are not supplying a value for the "Start From Date+Time" setting.
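To show how the start datetime and the time zone work together, here is a rough Python sketch. FUPS itself parses the value with PHP's strtotime(), which accepts many more formats; this sketch only handles the example format given above. The function names `parse_start_from` and `is_wanted` are invented for illustration. Any tzinfo object can be passed; a named zone could be supplied with Python's `zoneinfo.ZoneInfo`.

```python
from datetime import datetime, timezone, tzinfo
from typing import Optional


def parse_start_from(value: str, tz: tzinfo) -> Optional[datetime]:
    """Interpret a "YYYY-MM-DD HH:MM" value as a time in the given zone.

    A blank value means "no start limit", mirroring the setting above.
    """
    value = value.strip()
    if not value:
        return None
    naive = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=tz)


def is_wanted(post_time: datetime, start: Optional[datetime]) -> bool:
    """Keep posts made at or after the start datetime (or all posts if unset)."""
    return start is None or post_time >= start


start = parse_start_from("2013-04-30 15:30", timezone.utc)
print(is_wanted(datetime(2013, 5, 1, 9, 0, tzinfo=timezone.utc), start))
```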
Check this box if you want FUPS to scrape all images in posts too, and to adjust image URLs to refer to the local, downloaded images. Note that images which are attached to posts, but which are not included inline in the post itself, will not be scraped unless you also check "Scrape attachments" below.
Check this box if you want FUPS to scrape all attachments to posts too. Note however that attachments are not supported on all skins: if the version of the phpBB software that your forum is running is old then FUPS might not scrape attachments even if you do check "Scrape attachments".
Check this box if the forum from which you're scraping outputs dates in the non-US ordering dd/mm rather than the US ordering mm/dd. Applies only if day and month are specified by digits and separated by forward slashes.
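The ambiguity this checkbox resolves can be seen in a small Python sketch. The function `parse_slashed_date` is hypothetical, and it assumes the date has exactly three slash-separated numeric parts with the year last, which real forum date strings may not.

```python
from datetime import date


def parse_slashed_date(text: str, non_us: bool) -> date:
    """Parse "a/b/yyyy" as dd/mm/yyyy when non_us is True, else as mm/dd/yyyy."""
    first, second, year = (int(part) for part in text.split("/"))
    day, month = (first, second) if non_us else (second, first)
    return date(year, month, day)


# The same digits name two different days depending on the ordering.
print(parse_slashed_date("03/04/2013", non_us=True))   # 3 April 2013
print(parse_slashed_date("03/04/2013", non_us=False))  # 4 March 2013
```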
Enter the number of seconds you wish for FUPS to delay between consecutive requests to the same web host (the minimum is five). This is required so as to avoid hammering other people's web servers.
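One way such a per-host delay could be enforced is sketched below in Python. This is an assumption about the general technique, not FUPS's actual implementation; the class name `HostThrottle` is invented, and the clock and sleep functions are injectable only so the behaviour can be checked without really waiting.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 5  # the minimum stated above


class HostThrottle:
    """Wait so that requests to the same web host are at least `delay` seconds apart."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        # Values below the minimum are raised to it.
        self.delay = max(delay_seconds, MIN_DELAY_SECONDS)
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Call immediately before requesting `url`."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.delay - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request[host] = now
```

A scraper would call `throttle.wait(url)` before each request; requests to different hosts do not delay one another.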

Answers to possible questions

How can I know if a forum is a phpBB forum?

Typically, phpBB forums can be identified by the presence of the text "Powered by phpBB" in the footer of their forum pages. It is possible, however, that these footer texts have been removed by the administrator of the forum. In this case, the only way to know for sure is to contact your forum administrator.
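The footer check described above could be automated along these lines. This Python sketch is only a heuristic: the function name `looks_like_phpbb` is made up, and it strips HTML tags first on the assumption that the word "phpBB" in the footer may be wrapped in a link.

```python
import re


def looks_like_phpbb(html: str) -> bool:
    """Heuristic: does the page text contain "Powered by phpBB"?"""
    text = re.sub(r"<[^>]+>", "", html)  # crude tag removal, enough for this check
    return re.search(r"powered\s+by\s+phpBB", text, re.IGNORECASE) is not None


footer = 'Powered by <a href="https://www.phpbb.com/">phpBB</a> Forum Software'
print(looks_like_phpbb(footer))
```

A negative result proves nothing, since the administrator may have removed the footer text.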

Does the script work with forums using a language other than English?

Yes, or at least, it's intended to: if you experience problems, please contact me.

Do I need to supply a login username and password?

Probably not. These are the conditions under which you do:

Is it safe to supply my login username and password?

You will need to use your judgement here. I have attempted to make it as safe as possible without compromising simplicity. Your username and password, along with all other settings, will be stored in one or two files in a private directory (i.e. not accessible via the web) on my web hosting account for no longer than three days (a scheduled task deletes these files periodically; it runs once a day and deletes files more than two days old). In addition, once the script has finished running, or if you cancel it, you will be presented with an option to immediately delete all files associated with your request. I will never look inside the temporary files containing your username/password.

If this doesn't satisfy you, you might consider temporarily changing your password for the script, and then changing it back again once the script has finished.
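For readers curious about the retention rule in the previous answer, a daily cleanup of that kind might look like the Python sketch below. It is an illustration under assumptions, not the actual scheduled task: the function name `delete_old_files` is invented, and file age is judged here by modification time.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 2 * 24 * 60 * 60  # "more than two days old"


def delete_old_files(directory, now=None):
    """Delete regular files older than two days; return the names removed."""
    now = time.time() if now is None else now
    deleted = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

Because the task runs once a day and only removes files more than two days old, a file can survive for up to just under three days, which matches the "no longer than three days" figure.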

Is it safe to retrieve posts from a private forum through this script?

Your username and password are as safe as the previous answer describes. The content of your posts (the output file) is slightly less safe in that this output file is publicly accessible - but only to those who know the 32-character random token associated with it, and only until it is deleted either by you after you have saved it, or by the daily scheduled deletion task. As with usernames and passwords, I will never look inside the temporary file containing your posts' content.
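To give a sense of how hard such a token is to guess, here is a Python sketch that produces a 32-character random token. The source only states the token's length; the hexadecimal alphabet and the function name `new_result_token` are assumptions made for this example.

```python
import secrets


def new_result_token() -> str:
    """Return a 32-character token: 16 random bytes written as hexadecimal."""
    return secrets.token_hex(16)


print(new_result_token())
```

Under that assumption, a token carries 128 bits of randomness.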

Which skins are supported?

Both the prosilver and subsilver skins are supported. The script probably won't work with customised skins, but if you desire support for such a skin (for example, because you are getting error messages about regular expressions failing), feel free to contact me. A workaround is to set your skin to either prosilver or subsilver in the user control panel of your phpBB forum whilst you are logged in, and then to supply your login credentials in the settings above. After running FUPS, you can revert your skin in the user control panel to whatever it was before.

How long will the process take?

It depends on how many posts are to be retrieved, and how many pages they are spread across. You can expect to wait roughly one hour to extract and output 1,000 posts.

Are images supported?

Yes. If you check "Scrape images" (checked by default), then images are downloaded along with the posts. If not, then all relative image URLs are converted to absolute URLs, so images will display in the HTML output files so long as you are online at the time of viewing those files. Note that if you wish to scrape images which are attached to posts then you will also need to check "Scrape attachments". Note however that attachments are not supported on all skins: if the version of the phpBB software that your forum is running is old then FUPS might not scrape attachments even if you do check "Scrape attachments".
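The conversion of relative image URLs to absolute ones can be sketched in Python as follows. This is a simplified illustration rather than FUPS's own code: the function name `absolutise_image_urls` and the example host are hypothetical, and the regular expression assumes double-quoted src attributes.

```python
import re
from urllib.parse import urljoin

# Assumption: src attributes are double-quoted, as in typical forum output.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def absolutise_image_urls(html: str, page_url: str) -> str:
    """Rewrite each <img> src so that it is absolute, relative to page_url."""
    def fix(match):
        return match.group(1) + urljoin(page_url, match.group(2)) + match.group(3)
    return IMG_SRC.sub(fix, html)


post = '<p><img src="./images/smilies/icon_smile.gif" alt=":)"></p>'
print(absolutise_image_urls(post, "http://forums.example.com/phpBB3/viewtopic.php?t=1"))
```

URLs that are already absolute pass through `urljoin` unchanged.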

Is the downloading of attachments supported?

Yes. If you check "Scrape attachments" (checked by default), then attachments are downloaded along with the posts. Note however that attachments are not supported on all skins: if the version of the phpBB software that your forum is running is old then FUPS might not scrape attachments even if you do check "Scrape attachments".

Why is this script so slow?

So as to avoid hammering other people's web servers, the script pauses between consecutive requests to the same web host for the delay you specify in the settings above, which is at least five seconds.

Does this script have any relationship with the PHPBB-Extract script on GitHub?

No, they are separate projects.

Are there any resource issues of which I should be aware?

Yes - because this site is hosted on a shared server, I am limited to a fixed and fairly small number of processes, and each run of FUPS requires two processes, one for the background process doing the scraping, and another for the status web page. For most users, too, the number of posts is significant and the process will run for some time. Please, then, limit yourself to one run of the script at a time, and if you change your mind about wanting to run the script after having clicked "Retrieve posts!", then please click the cancellation link.