WP-MIRROR
Purpose
WP-MIRROR is a free utility for building and maintaining a set of mirrors of Wikipedias (also known as a Wikipedia farm). The Wikimedia Foundation offers Wikipedias in nearly 300 languages. The user selects the set of languages to be mirrored.
Many users need off-line access, often for reasons of privacy, mobility, and availability. They currently use KIWIX, which provides selected articles and thumbnail images. WP-MIRROR builds more complete mirrors, including the original-size image files.
WP-MIRROR is designed for robustness. WP-MIRROR:
- Asserts hardware and software prerequisites,
- Skips over unparsable pages and bad file names,
- Validates image files and sequesters those that are corrupt,
- Waits for internet access when needed,
- Uses check-pointing to resume after interruption, and
- Uses concurrency to accelerate mirroring of the largest wikipedias.
Most features are configurable, either through command-line options or via a configuration file (/etc/wp-mirror/local.conf).
Use Cases
WP-MIRROR by default mirrors the simple wiki (Simple English means shorter sentences) which at 40G should fit on most laptops. Users should edit a configuration file (/etc/wp-mirror/local.conf) to specify a set of languages to be mirrored. For example:
Someone who would like to get involved with the Wikimedia Foundation might try:
Students learning English as a Second Language (ESL) might want Simple English side-by-side with their native language Wikipedia:
Someone interested in classical languages might configure:
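The setting for this last case appears in the configuration step later in this document, where `*mirror-languages*' is set to the list "el" "la" "simple". As a minimal sketch, the shell helper below prints such a line for any list of language codes. The function name `mirror_languages_line' is hypothetical and is not part of WP-MIRROR.

```shell
# Hypothetical helper (not part of WP-MIRROR): print a *mirror-languages*
# setting for the language codes given as arguments.
mirror_languages_line() {
    codes=""
    for code in "$@"; do
        # Quote each code as a Lisp string; separate codes with one space.
        codes="${codes:+$codes }\"$code\""
    done
    printf "(defparameter *mirror-languages* '(%s))\n" "$codes"
}

# Typical use:
#   mirror_languages_line el la simple
```

The printed line can then be pasted into /etc/wp-mirror/local.conf in place of the default setting.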
Access
For user convenience, WP-MIRROR sets up a virtual host (e.g. http://simple.mediawiki.site/) so that the user may access the local mirrors using a web browser.
Process
WP-MIRROR is non-interactive and normally runs in the background as a weekly cron job, updating the mirrors whenever the Wikimedia Foundation posts new dump files.
WP-MIRROR maintains the state of each mirror in a transactional database (InnoDB, the ACID-compliant storage engine for MySQL). This has three advantages:
- Checkpointing. The state information is Durable (the `D' in ACID). When WP-MIRROR is interrupted (e.g. power failure, laptop is closed, cat walks across the keyboard) the state information serves as a checkpoint. When WP-MIRROR is next started, it picks up where it left off.
- Concurrency. Multiple instances of WP-MIRROR can run concurrently. That is to say, each instance of WP-MIRROR is Isolated (the `I' in ACID) from every other instance. The concurrency feature is intended for desktop use when one is mirroring any of the top ten wikipedias (currently the en, de, fr, nl, it, pl, es, ru, ja, and pt wikipedias).
- Monitoring. WP-MIRROR can also be run in monitor mode (concurrently with instances that are building mirrors). Instances running in monitor mode display the state of each mirror. If a suitable windowing system is present, progress bars are rendered using graphics in a separate window, and otherwise using ASCII characters in a console (see figures below).
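The check-pointing idea can be illustrated with a much simpler stand-in. The sketch below is not how WP-MIRROR is implemented (it keeps its state in InnoDB tables). It swaps the database for a plain text file that lists finished tasks, so that a rerun skips whatever was already completed. The names `run_with_checkpoint' and `process_task' are hypothetical.

```shell
# Placeholder for the real work on one task (an assumption for this sketch).
process_task() {
    echo "processing $1"
}

# run_with_checkpoint STATE_FILE TASK...
# Run each task once, recording finished tasks in STATE_FILE so that a
# later run resumes after the last completed task.
run_with_checkpoint() {
    state_file=$1
    shift
    touch "$state_file" || return 1
    for task in "$@"; do
        # Skip tasks already recorded as done.
        if grep -qxF -- "$task" "$state_file"; then
            continue
        fi
        process_task "$task" || return 1
        # Record completion only after the task has succeeded.
        printf '%s\n' "$task" >> "$state_file"
    done
}
```

A plain file gives none of the isolation between concurrent instances described above, which is one reason a transactional database is the better fit for the real tool.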
Downloading WP-MIRROR
WP-MIRROR can be found on the main GNU server: http://download.savannah.gnu.org/releases/wp-mirror/ (via HTTP).
Documentation
Documentation for WP-MIRROR is available online. You may also find more information about WP-MIRROR by running info wp-mirror or man wp-mirror, or by looking at /usr/share/doc/wp-mirror/, /usr/local/doc/wp-mirror/, or similar directories on your system. A brief summary is available by running wp-mirror --help.
Dependencies
WP-MIRROR 0.2 was developed on a PC with the Debian GNU/Linux 6.0 (squeeze) distribution installed. WP-MIRROR should therefore be easy to install on Debian 6.0 (squeeze) or related distributions such as Ubuntu 11.04 (natty).
Currently, there are no plans to backport WP-MIRROR to earlier distributions.
Installation
Debian GNU/Linux 6.0 (squeeze)
1. Install clisp
(shell)# aptitude install clisp cl-asdf cl-getopt cl-md5
Check that you have CLISP 2.48 or higher.
(shell)$ clisp --version
GNU CLISP 2.48 (2009-07-28) (built 3487543663) (memory 3534965158)
...
Earlier distributions, such as Debian GNU/Linux 5.0 (lenny) and its derivatives such as Ubuntu 10.04 LTS (lucid), provide older versions of CLISP that lack some of the functions called by WP-MIRROR.
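The version check can also be scripted. The following is a minimal sketch that assumes the first line of the `clisp --version' output has the form shown above and that GNU `sort -V' is available. The function names are hypothetical.

```shell
# Extract the version number from the first line of `clisp --version' output,
# passed in as a string.
clisp_version_from() {
    printf '%s\n' "$1" | sed -n '1s/^GNU CLISP \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when version $1 is at least version $2.
# Relies on GNU `sort -V' (version sort) to order the two strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Typical use (requires clisp to be installed):
#   version_at_least "$(clisp_version_from "$(clisp --version)")" 2.48 \
#       || echo "CLISP is too old for WP-MIRROR" >&2
```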
2. Configure cl-asdf
All modern language systems come with libraries and packages that provide functionality greatly in excess of that needed for standards compliance. Often these packages are provided by third parties. Lisp systems are no exception. For Debian distributions, third-party libraries and packages are installed under `/usr/share/common-lisp/source/', and symbolic links to said source files are collected under `/usr/share/common-lisp/systems/'.
Another System Definition Facility (ASDF) is the link between a Lisp system and any libraries and packages that it calls. ASDF is not immediately usable upon installation. Your Lisp system (CLISP in this case) must first be made aware of its location. ASDF comes with documentation that discusses configuration.
(shell)$ less /usr/share/doc/cl-asdf/README.Debian
(shell)$ lynx /usr/share/doc/cl-asdf/asdf/index.html
Before configuring, first note that WP-MIRROR will be run as `root', so the configuration file `.clisprc' must be put in root's home directory (`/root'), rather than in your own home directory. To configure CLISP to use ASDF, append the following line to `/root/.clisprc'.
(load #P"/usr/share/common-lisp/source/cl-asdf/asdf.lisp")
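If this step may be repeated (for example, when reinstalling), it helps to append the line only when it is not already present. The helper below is a minimal sketch of that. The name `append_once' is hypothetical.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line
# is already there. Creates FILE if it does not exist.
append_once() {
    line=$1
    file=$2
    touch "$file" || return 1
    # -x matches whole lines, -F treats LINE as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Typical use, run as root:
#   append_once '(load #P"/usr/share/common-lisp/source/cl-asdf/asdf.lisp")' /root/.clisprc
```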
CLISP should now be ready to use ASDF. Check this by running
(shell)# clisp -q
[1]> *features*
(:ASDF2 :ASDF ...
[2]> (asdf:asdf-version)
"2.011"
3. Unpack WP-MIRROR
(shell)$ gunzip wp-mirror-0.2.tar.gz
(shell)# tar -xpvf wp-mirror-0.2.tar
(shell)# cd wp-mirror-0.2
4. Install WP-MIRROR
The build process generates over a dozen files. The install process copies them to the appropriate directories and sets permissions. Note that `local.conf' must be copied manually. This is done to avoid inadvertently overwriting any modifications that you might have previously made to `/etc/wp-mirror/local.conf'.
(shell)$ make build
(shell)# make install
(shell)# cp local.conf /etc/wp-mirror/.
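Note that a plain `cp' will itself overwrite an existing `/etc/wp-mirror/local.conf'. A more cautious copy can be sketched as follows. The function name `install_conf' and the `.new' suffix are assumptions made for this example and are not conventions of WP-MIRROR.

```shell
# Hypothetical helper: copy SRC to DEST only when DEST does not exist yet.
# Otherwise leave DEST untouched and save the new copy as DEST.new.
install_conf() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        cp -- "$src" "$dest.new" || return 1
        echo "kept existing $dest; new defaults saved as $dest.new"
    else
        cp -- "$src" "$dest" || return 1
        echo "installed $dest"
    fi
}

# Typical use, run as root from the unpacked source directory:
#   install_conf local.conf /etc/wp-mirror/local.conf
```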
5. Trial Run
WP-MIRROR is designed to be launched from a command-line interface (CLI). This is so that one may set up a mirror farm on a remote server accessed via SSH (servers usually do not have a GUI installed).
Open two consoles (terminals) and run one of the following commands in each:
(shell)# wp-mirror --mirror
(shell)# wp-mirror --gui
The first time you try this, you will get an error message. This is because WP-MIRROR first asserts hardware and software prerequisites. Hardware prerequisites include adequate memory and disk space, and internet connectivity. Software prerequisites include MySQL and MediaWiki configuration, directory and file permissions, etc. If anything is amiss, WP-MIRROR will exit with a neatly formatted error message. In most cases the error message requests that you install or configure something. So please proceed directly to the next step.
6. System planning
At this point you should pause to study the README file.
(shell)$ less /usr/share/doc/wp-mirror/README
This document contains highly valuable advice that could save you weeks or months of time. Why?
- The default configuration for MySQL is suboptimal for WP-MIRROR. With careful configuration, its performance can be improved by an order of magnitude. This is an indispensable condition for mirroring any of the top ten largest wikipedias.
- The default configuration for MediaWiki uses ImageMagick for resizing images. However, ImageMagick grabs too much system memory and will frequently hang your system. MediaWiki should instead be configured to use GraphicsMagick and RSVG. For the top ten wikipedias, which can have upwards of a million image files, this is a sine qua non.
- CURL sometimes fails to completely download a file. The default configuration lets CURL hang. CURL can instead be configured with a timeout. For the top ten wikipedias, where partial downloads can afflict hundreds or thousands of image files, correct configuration is a must.
- If your internet traffic goes through a caching web proxy, such as polipo, there are additional issues to address.
But the best reason to study the README is this: If you intend to build a mirror of any of the largest wikipedias, you will greatly benefit by first going down the learning curve with a small wikipedia.
- Building a mirror of the `en' wikipedia, which requires over 2T disk space, presents the most demanding case, and may take weeks to complete on a server.
- Building a mirror of the `simple' wikipedia, which requires 40G disk space, can be done in a day on a laptop.
The README treats both projects (`simple' and `en') in detail.
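Before committing to either project, it is worth checking the available disk space against these figures. WP-MIRROR asserts its own prerequisites at startup. The sketch below is only a quick manual check based on `df', and the function name `has_free_gib' is hypothetical.

```shell
# Hypothetical helper: succeed when the file system holding DIR has at
# least NEED_GIB gibibytes available to the current user.
has_free_gib() {
    dir=$1
    need_gib=$2
    # `df -Pk' prints a header line, then one POSIX-format line whose
    # fourth field is the available space in 1024-byte blocks.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $((need_gib * 1024 * 1024)) ]
}

# Typical use (replace /var with the directory that will hold your mirror):
#   has_free_gib /var 40 || echo "not enough space for the simple wikipedia" >&2
```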
7. Install mysql (if not already done)
(shell)# aptitude install hdparm mysql-client mysql-server
These need to be configured as described in the README.
8. Install mediawiki (if not already done)
(shell)# aptitude install graphicsmagick gv librsvg2-2
(shell)# aptitude install mediawiki mediawiki-extensions mediawiki-math
(shell)# aptitude install php5-suhosin texlive-latex-base tidy
The mediawiki* packages need to be configured as described in the README.
9. Install file handling packages (if not already done)
(shell)# aptitude install bzip2 curl openssl wget
CURL needs to be configured with a timeout as described in the README.
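To give a rough idea of what a timeout looks like on the command line, here is a minimal sketch using standard curl options. The wrapper name `fetch_with_timeout' is hypothetical, and the timeout and retry values are illustrative assumptions, not the settings recommended in the README.

```shell
# Hypothetical wrapper: download URL to DEST, giving up instead of hanging.
# The numeric values below are assumptions chosen for illustration.
fetch_with_timeout() {
    url=$1
    dest=$2
    # --connect-timeout limits the connection phase (seconds);
    # --max-time limits the whole transfer (seconds);
    # --retry repeats the transfer after transient failures;
    # --fail makes HTTP errors produce a non-zero exit status.
    curl --fail --silent --show-error --location \
         --connect-timeout 30 --max-time 1000 --retry 3 \
         --output "$dest" "$url"
}
```

A non-zero exit status tells the caller that the download did not complete, so the partial file can be discarded or fetched again later.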
10. Configure WP-MIRROR
Finally, choose the wikipedias that you wish to mirror. This is done by editing `/etc/wp-mirror/local.conf'. By default, WP-MIRROR builds a mirror of `simple'. The config file contains several additional examples (all commented out). If you like the classics, try
(defparameter *mirror-languages* '("el" "la" "simple"))
If you want to start very small, build a mirror of the `zu' wikipedia (isiZulu). It has just a few hundred articles and some nice animal photos (category:isiLwane).
(defparameter *mirror-languages* '("zu"))
Then click a few times on the "Special:Random" link to get an idea of what is there.
Enjoy!
Mailing lists
WP-MIRROR has the following mailing lists:
- wp-mirror-announce is used to announce releases.
- wp-mirror-devel is a closed list for developers and testers.
- wp-mirror-list is used to discuss most aspects of WP-MIRROR, including development and enhancement requests, as well as bug reports.
Getting involved
Development of WP-MIRROR, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you'd like to get involved, it's a good idea to join the discussion mailing list (see above).
- Test releases: Trying the latest test release (when available) is always appreciated. Test releases of WP-MIRROR can be found at http://download.savannah.gnu.org/releases/wp-mirror/ (via HTTP).
- Development: For development sources, issue trackers, and other information, please see the WP-MIRROR project page at savannah.gnu.org.
- Translating WP-MIRROR: To translate WP-MIRROR's messages into other languages, please see the Translation Project page for WP-MIRROR. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into WP-MIRROR. For more information, see the Translation Project.
- Maintainer: WP-MIRROR is currently being maintained by Dr. Kent L. Miller. Please use the mailing lists for contact.
Licensing
WP-MIRROR is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.