FUPS: Forum user-post scraper
Select forum software [1] to continue
What is FUPS?
FUPS is a web app that "scrapes" (downloads) posts from a specified board running either the XenForo or phpBB forum software. It can download either:
- All posts of a specified user, or,
- All posts in a (set of) (sub)forums of the board.
FUPS downloads from your specified board all of the posts that satisfy whichever of the above two conditions you choose. It does this by accessing the forum in the same way your web browser does when you browse the forum manually, only it does so automatically.
When scraping a user's posts, FUPS sorts those posts in several different ways. For each sort order, it produces an HTML page containing a table of contents for all threads the user was involved in, followed by the sorted posts themselves, with headings, and separated by horizontal lines. If images or files were downloaded, they are made available in the output too. FUPS also provides a JSON data structure for the scraped posts of the user.
When scraping an entire (set of) (sub)forum(s), FUPS outputs the scraped data (threads and posts) in a JSON data structure.
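As a rough illustration of the sorting and table-of-contents step described above for a user's posts, here is a minimal Python sketch. It is not FUPS's own code: the post records, their field names ("thread", "title", "timestamp", "content") and the function name render_sorted_posts are all assumptions made for the example, and the real output layout may differ.

```python
import json
from html import escape

# Hypothetical post records; these field names are assumptions for
# illustration and are not FUPS's actual data structure.
posts = [
    {"thread": "Welcome", "title": "Re: Welcome", "timestamp": 1700000300, "content": "Thanks!"},
    {"thread": "Bugs", "title": "Crash on login", "timestamp": 1700000100, "content": "It crashes."},
    {"thread": "Welcome", "title": "Hello", "timestamp": 1700000200, "content": "Hi all."},
]


def render_sorted_posts(posts, sort_key="timestamp"):
    """Return an HTML page: a table of contents of threads, then the sorted posts."""
    ordered = sorted(posts, key=lambda p: p[sort_key])
    # Thread names in order of first appearance, without duplicates.
    threads = list(dict.fromkeys(p["thread"] for p in ordered))
    parts = ["<html><body>", "<h1>Threads</h1>", "<ul>"]
    parts += [f"<li>{escape(t)}</li>" for t in threads]
    parts.append("</ul>")
    for p in ordered:
        parts.append("<hr>")  # horizontal line separating posts
        parts.append(f"<h2>{escape(p['title'])}</h2>")
        parts.append(f"<div>{escape(p['content'])}</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


html_page = render_sorted_posts(posts)           # one HTML page per sort order
json_dump = json.dumps(posts, ensure_ascii=False)  # the same posts as JSON
```

Calling render_sorted_posts with a different sort_key (for example "title") would give another of the sorted pages, while json_dump stands in for the JSON output.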
The output files are presented when FUPS finishes scraping. You can then save them to disk via your browser: in Firefox, for example, right-click on the file you want to save and choose "Save Link As...", or left-click on the file to open it, then click "File" in the main menu and choose "Save Page As". If you like, you can then open the HTML files in your word processor and save them in any other format you desire, e.g., ODF or Microsoft Word.
If you're asking, "How can I download all of my posts from a remote XenForo or phpBB forum to my local hard drive?", then this might be the script for you.
[1] How do I know whether my forum software is XenForo or phpBB, or neither?
- XenForo: Typically, XenForo forums can be identified by the presence of the text "Forum software by XenForo" in the footer of their forum pages. It is possible, however, that this footer text has been removed by the administrator of the forum. In this case, the only way to know for sure is to contact your forum administrator.
- phpBB: Typically, phpBB forums can be identified by the presence of the text "Powered by phpBB" in the footer of their forum pages. It is possible, however, that this footer text has been removed by the administrator of the forum. In this case, the only way to know for sure is to contact your forum administrator.
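The footer check described above can be sketched in a few lines of Python. This is only an illustration of the manual check, not something FUPS itself is stated to do: the function name guess_forum_software is hypothetical, and a None result means only that neither footer text was found, since an administrator may have removed it.

```python
import re


def guess_forum_software(page_html):
    """Guess the forum software from the footer text of a forum page.

    Returns "xenforo", "phpbb", or None if neither footer text is present.
    """
    # Drop HTML tags so that text split by links (e.g. "Powered by
    # <a ...>phpBB</a>") is still matched, then normalise whitespace and case.
    text = re.sub(r"<[^>]+>", "", page_html)
    text = re.sub(r"\s+", " ", text).lower()
    if "forum software by xenforo" in text:
        return "xenforo"
    if "powered by phpbb" in text:
        return "phpbb"
    return None
```

You would pass in the HTML source of any forum page, for example as saved from your browser.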
Changelog
- 2024-10-08
- Averted a PHP warning when the board title is unset.
- 2024-08-19
- Improved phpBB support by adjusting a couple of 'search_results_page_data' regexes to allow for the case in which the 'f=[integer]' query parameter in the 'viewtopic.php' URL does not exist.
- 2024-08-17
- Fixed another few PHP 8 warnings re potentially uninitialised array members.
- 2024-05-29
- Fixed another instance of the "Serialization of 'CurlHandle' is not allowed" error.
- 2023-12-26
- Fixed another PHP 8 warning re a potentially uninitialised array member when scraping a XenForo member's posts.
- 2023-08-23
- Fixed a PHP 8 warning re an uninitialised variable.
- Initialised the debug status earlier, so that debugging can be output during the first call to strtotime_intl() when parsing the "Start From Date+Time" setting.
- 2023-06-12
- Fixed GitHub issue #7: Serialization of 'CurlHandle' is not allowed.
- 2020-12-12
- Added an updated 'post_contents_ext' regular expression to support the scraping of forum threads for phpBB 3.3.2.
- Added support for new phpBB login form fields, namely 'form_token' and 'creation_time'.
- 2019-07-02
- Improved XenForo support, by adjusting the regular expressions that detect posts in search listings, and by adding support for stripping a Vietnamese word from international datetimes.
- 2019-01-25
- Improved code to detect the older phpBB version by not coming up with a false positive merely due to redirects diverting from http to https (or vice versa) and/or to or from the www-prefixed version of the domain, and/or from a path which begins with an extra "/" due to the user ignoring the request to strip the trailing slash off the base forum URL when entering options.
- 2018-11-11
- Improved a couple of phpBB prosilver_3.1.6 regexes:
- 'post_contents_ext' to allow for detection of posts with attachment boxes (though this does not resolve the limitation by which attachments are not downloaded when scraping by forums, i.e., when filling in the "Forum IDs" setting via the web interface or when setting "forum_ids" via the commandline interface).
- 'post_contents' to allow for attachments with a class of "file" on top of "thumbnail", and to allow for posts with signatures or that have been edited.
- Removed the 'post_contents' regex added on 2018-10-25, which is not only unnecessary given these changes but incomplete and thus sometimes gives incorrect results.
- Reordered the phpBB prosilver regexes so that the most recent are topmost.
- 2018-11-03
- 2018-10-25
- Added a phpBB prosilver thread page regex ('post_contents') which matches on some forums where none of the existing regexes did.
- 2018-09-16
- Amended a phpBB prosilver search page regex to handle empty post subjects.
- Fixed detection of both "Forum IDs" and "Extract User ID" settings being empty for XenForo forums.
- 2018-06-02
- Fixed error reporting when deleting files.
- Fixed a potential security hole by validating the token supplied by the user before deleting files in the output directory.
- Fixed a couple of small errors: a missing parameter to a call to the delete_files_in_dir_older_than_r() function and a misplaced call to closedir().
- 2018-03-05
- Amended a regex to better match posts under the prosilver skin on phpBB 3.1.6 forums (it had been failing in some instances).
- 2018-02-17
- Fixed GitHub issue #1: Php error.
- Improved support for downloading full forums from phpBB forums.
- 2017-08-21
- Added support for scraping entire XenForo forums.
- Added support for the "Extract User Username" setting for XenForo forums.
- 2017-06-06
- Added "skip current topic on resume" functionality.
- 2017-06-05
- Added support (phpBB-only for now) for scraping entire forums.
- Prepended required fields on the options entry page with asterisks.
- 2016-07-03
- Fixed a bug in the detection of the old version of phpBB.
- Fixed an image scraping bug.
- Improved detection of Romanian dates on some versions of phpBB.
- 2015-10-14
- Added support for detecting successful login under the prosilver skin on phpBB boards that redirect to the index page after login.
- 2015-10-03
- Fixed a bug: commandline chaining wasn't working due to 'output_filename' being used instead of 'output_dirname' in make_php_exec_cmd().
- Fixed a bug: empty posts caused an infinite loop.
- 2015-09-30
- Added support for scraping images.
- Reworked the settings code and added a "Consecutive request delay (seconds)" setting.
- 2015-09-29
- Added support for rebasing img/anchor URLs: relative image and anchor URLs in posts are now converted into the correct absolute URLs, so images should now always display (assuming an internet connection) and links in posts should now always direct to the correct place.
- Fixed a bug: post counts were sometimes doubled on phpBB forums due to the similarity of the prosilver.1 and prosilver.2 'search_results_page_data' regexes. Combining these into a single regex fixed the problem.
- 2015-09-28
- Added resumability functionality - now if a page retrieval times out, and the script exits, you can resume it from the point it left off (within two days).
- Fixed a bug: when the output directory specified via the commandline already existed, a new subdirectory named after the prefix was being created, instead of a new output directory with the prefix added to its name.
- Fixed a bug: prior non-fatal errors weren't being included in the admin emails for fatal errors.
- Fixed a bug: sometimes a preceding "on " interfered with the detection of post dates in phpBB search results.
- 2015-08-04
- Added support for forum character sets other than UTF-8.
- Fixed a bug in a regular expression for detecting phpBB search results for the subSilver skin 2005 vintage.
- 2015-07-25
- Added different download options, including various different sorting options for HTML output, as well as JSON, PHP and serialised PHP formats.
- Fixed a bug: sometimes phpBB search results weren't being detected due to a faulty regular expression ('search_results_not_found') for the newly-added "mobile" skin.
- 2015-07-22
- Added support for an older version of the subsilver skin for phpBB.
- Fixed a bug: the older phpBB variant was not being detected when login credentials weren't supplied.
- Improved error and diagnostic output.
- 2015-07-06
- Added support for the mobile skin for phpBB.
- Fixed a bug: the wrong URL was being constructed for next and previous pages on phpBB post pages (used when the post for an unknown reason isn't on the page it is supposed to be on).
- 2015-02-12
- Fixed a bug: the "Extract User Username" setting for phpBB forums was being ignored.
- 2015-02-11
- Fixed two bugs affecting phpBB forums: user name detection and post contents detection. The user name detection fix applies to certain non-English phpBB forums, in particular to German ones. The post contents detection fix applies to some setups which output HTML with lines ending in CRLF rather than in LF alone.
- 2015-02-04
- Fixed several deficiencies in XenForo scraping (see Git log), which included adding a "Thread URL prefix" XenForo setting, and a generic "Non-US date format" setting.
- Made other small changes to the code, and updated messages and the documentation to no longer suggest the possibility that FUPS can scrape only a single XenForo forum, now that it has been tested on another (which revealed the deficiencies mentioned above, but no fundamental skin incompatibility).
- 2015-01-23
- Made small improvements to admin error messaging and the README.
- 2014-11-14 - 2014-12-16
- Fixed bugs and made small improvements (see Git log).
- 2014-11-13
- Finalised the refactoring of the code into the object-oriented paradigm.
- Made some small changes for better viewing on mobile devices.
- 2014-06-24
- Added support for the XenForo forum software.
- Renamed the project from phpBB-extract to FUPS, and redirected links from /phpBB-extract to /fups.
- 2014-05-30
- Bugfix: posts without titles weren't being identified for the subsilver skin.
- Bugfix: post contents were sometimes being truncated for the subsilver skin.
- Enhancement: increased the odds that posts inexplicably missing from their page will be found by checking the next page as well as the previous page.
- Enhancement: added recognition of login for the subsilver skin.