pwget - Perl Web URL fetch program


    pwget [URL ...]
    pwget --config $HOME/config/pwget.conf --Tag linux --Tag emacs ..
    pwget --verbose --overwrite
    pwget --verbose --overwrite --Output ~/dir/
    pwget --new --overwrite


General options


Create paths that do not exist in lcd: directives.

By default, any lcd: directive pointing to a non-existent directory interrupts the program. With this option, local directories are created as needed, making it possible to re-create the exact directory structure given in the configuration file.

-c|--config FILE

This option can be given multiple times. All configurations are read.

Read URLs from a configuration file. If no configuration file is given, the file pointed to by the environment variable is read. See ENVIRONMENT.

The configuration file layout is explained in section CONFIGURATION FILE.


Do a chdir() to DIRECTORY before any URL download starts. This is like doing:

--extract -e

Unpack any files after retrieving them. The commands to unpack typical archive files are defined in the program. Make sure these programs are along the path. Win32 users are encouraged to install the Cygwin utilities, where these programs come as standard. Refer to section SEE ALSO.

  .tar => tar
  .tgz => tar + gzip
  .gz  => gzip
  .bz2 => bzip2
  .zip => unzip
-F|--Firewall FIREWALL

Use FIREWALL when accessing files via ftp:// protocol.

-m|--mirror SITE

If the URL points to a Sourceforge download area, use mirror SITE for downloading. Alternatively the full URL can include the mirror information. An example:

    --mirror kent ...

Get the newest file. This applies to data files, which do not have extension .asp or .html. When new releases are announced, the version number in the filename usually tells which is the current one, so fetching a hardcoded filename with:

    pwget -o -v

is not usually practical from an automation point of view. Adding the --new option to the command line causes a double pass: a) the whole page is examined for all files and b) files approximately matching the filename program-1.3.tar.gz are examined, heuristically sorted, and the file with the latest version number is retrieved.
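The sorting heuristic can be pictured with a small sketch. This is illustrative Python, not the program's actual Perl code, and the helper name version_key is hypothetical:

```python
import re

def version_key(filename):
    """Extract a numeric version tuple from names like 'program-1.3.tar.gz'."""
    m = re.search(r'-([\d.]+)\.', filename)
    if not m:
        return ()
    # '1.10' -> (1, 10), so 1.10 correctly sorts after 1.9
    return tuple(int(n) for n in m.group(1).split('.') if n)

candidates = ['program-1.3.tar.gz', 'program-1.10.tar.gz', 'program-1.9.tar.gz']
newest = max(candidates, key=version_key)  # picks program-1.10.tar.gz
```

A plain string sort would rank 1.9 above 1.10, which is why numeric component-wise comparison is needed.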


Ignore lcd: directives in configuration file.

In the configuration file, any lcd: directives are obeyed as they are seen. But if you want to retrieve the URL into your current directory, be sure to supply this option. Otherwise the file will end up in the directory pointed to by lcd:.


Ignore save: directives in the configuration file. If the URLs have save: options, they are ignored during fetch. You usually want to combine --no-lcd with --no-save.


Ignore x: directives in configuration file.

-O|--Output DIR

Before retrieving any files, chdir to DIR.


Allow overwriting existing files when retrieving URLs. Combine this with --skip-version if you periodically update files.

--Proxy PROXY

Use PROXY server for HTTP. (See --Firewall for FTP.) The port number is optional in the call:

-p|--prefix PREFIX

Add PREFIX to all retrieved files.

-P|--Postfix POSTFIX

Add POSTFIX to all retrieved files.


Add an ISO 8601 ":YYYY-MM-DD" prefix to all retrieved files. This is added before a possible --prefix-www or --prefix.


Usually the files are stored with the same name as in the URL directory, but if you retrieve files that have identical names you can store each page separately, so that the file name is prefixed by the site name.
-r|--regexp REGEXP

Retrieve URLs matching REGEXP from the configuration file. This cancels any --Tag options on the command line.

-R|--Regexp REGEXP

Retrieve files matching REGEXP at the destination URL site. This is like "Connect to the URL and get all files matching REGEXP". Here all gzip compressed files are found from an HTTP server directory:

    pwget -v -R "\.gz"
-A|--Regexp-content REGEXP

Analyze the content of the file and match it against REGEXP. The file is downloaded only if the regexp matches its content. This option makes downloads slow, because each file is read into memory as a single line and the match is then searched against the content.

For example, to download Emacs lisp files (.el) written by Mr. Foo in a case-insensitive manner:

    pwget -v -R '\.el$' -A "(?i)Author: Mr. Foo" \

Retrieve URL and write to stdout.


Do not download files that have a version number and already exist on disk. Suppose you have these files and you use option --skip-version:

    file.txt
    file-1.1.tar.gz

Only file.txt is retrieved, because file-1.1.tar.gz contains a version number and the file has not changed since the last retrieval. The idea is that in every release the number in the distribution increases, but there may be distributions which do not contain a version number. At regular intervals you may want to load those kits again, but skip versioned files. In short: this option does not make much sense without the additional option --new.

If you want to reload a versioned file again, add option --overwrite.
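As a sketch of that rule (illustrative Python; the function name and the exact version regexp are assumptions, not the program's real internals):

```python
import os
import re

def should_skip(filename, directory):
    """--skip-version: skip when the name carries a version number and
    the same name already exists on disk."""
    has_version = re.search(r'-[\d.]+\.', filename) is not None
    exists = os.path.exists(os.path.join(directory, filename))
    return has_version and exists
```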

-T|--Tag NAME [NAME] ...

Search for tag NAME in the config file and download only entries defined under that tag. Refer to the --config FILE option description. You can give multiple --Tag switches. Combining this option with --regexp does not make sense and the consequences are undefined.

Miscellaneous options

-d|--debug [LEVEL]

Turn on debug with positive LEVEL number. Zero means no debug. This option turns on --verbose too.


Print help page in text.


Print help page in HTML.


Print help page in Unix manual page format. Feed this output to nroff -man in order to read it.

Print help page.


Run some internal tests. For maintainer or developer only.


Run in test mode.

-v|--verbose [NUMBER]

Print verbose messages.


Print version information.


Automate periodic downloads of files and packages.

Wget and this program

At this point you may wonder where you would need this perl program when the wget(1) C program has been the standard for ages. Well, 1) Perl is cross-platform and more easily extendable, 2) you can record file download criteria in configuration files and use perl regular expressions to select downloads, 3) the program can analyze web pages and "search" for the download links as instructed, and 4) last but not least, it can track the newest packages whose names have changed since the last download. There are heuristics to determine the newest file or package according to a file name skeleton defined in the configuration.

This program does not replace wget(1), because it does not offer as many options as wget, like recursive downloads. Use wget for ad hoc downloads and this utility for files that you monitor periodically.

Short introduction

This small utility makes it possible to keep a list of URLs in a configuration file and periodically retrieve those pages or files with simple commands. This utility is best suited for small batch jobs, e.g. to download the most recent versions of software files. If you use a URL that is already on disk, be sure to supply option --overwrite to allow overwriting existing files.

While you can run this program from the command line to retrieve individual files, the program has been designed to use a separate configuration file via the --config option. In the configuration file you can control the downloading with separate directives, like save:, which tells to save the file under a different name. The simplest way to retrieve the latest version of a kit from an FTP site is:

    pwget --new --overwrite --verbose \

Do not worry about the filename "kit-1.00.tar.gz". The latest version, say kit-3.08.tar.gz, will be retrieved. The option --new instructs to find a newer version than the provided URL.

If the URL ends in a slash, then the directory listing at the remote machine is stored to a file:


The content of this file is either index.html or the directory listing, depending on whether the http or ftp protocol was used.


Get files from site:

    pwget ..

Get all mailing list archive files that match "gz":

    pwget -R gz

Read a directory and store it to filename YYYY-MM-DD::!dir!000root-file.

    pwget --prefix-date --overwrite --verbose

To update to the newest version of the kit, but only if there is none on disk already. The --new option instructs to find newer packages; the filename is only used as a skeleton for the files to look for:

    pwget --overwrite --skip-version --new --verbose \

To overwrite file and add a date prefix to the file name:

    pwget --prefix-date --overwrite --verbose \

To add date and WWW site prefix to the filenames:

    pwget --prefix-date --prefix-www --overwrite --verbose \

Get all updated files under the default configuration file's tag KITS:

    pwget --verbose --overwrite --skip-version --new --Tag kits
    pwget -v -o -s -n -T kits

Get files as they are listed in the configuration file, into the current directory, ignoring any lcd: and save: directives:

    pwget --config $HOME/config/pwget.conf \
        --no-lcd --no-save --overwrite --verbose \

To check the configuration file, run the program with a non-matching regexp; it parses the file and checks the lcd: directives on the way:

    pwget -v -r dummy-regexp
    pwget.DirectiveLcd: LCD [$EUSR/directory ...]
    is not a directory at /users/foo/bin/pwget line 889.



The configuration file is NOT Perl code. Comments start with hash character (#).


At this point, variable expansions happen only in lcd:. Do not try to use them anywhere else, like in URLs.

Path variables for lcd: are defined using the following notation; spaces are not allowed in the VALUE part (no directory names with spaces). Variable names are case sensitive. A variable with the same name as an environment variable takes its value from the environment. Environment variables are immediately available.

    VARIABLE = /home/my/dir         # define variable
    VARIABLE = $dir/some/file       # Use previously defined variable
    FTP      = $HOME/ftp            # Use environment variable

The right hand side can refer to previously defined variables or existing environment variables. To repeat: this is not Perl code, although it may look like it, but just an allowed syntax in the configuration file. Notice that there is a dollar sign on the right hand side when a variable is referred to, but no dollar sign on the left hand side when a variable is defined. Here is an example of possible configuration file content. The tags are hierarchically ordered without a limit.

Warning: remember to use different variable names in separate include files. All variables are global.
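A sketch of that expansion order (illustrative Python; the real parser is Perl and may differ in details): each right-hand side is expanded using the variables defined so far, falling back to the environment.

```python
import os
import re

def expand(value, variables):
    """Replace $NAME using earlier definitions, then the environment."""
    def lookup(match):
        name = match.group(1)
        return variables.get(name, os.environ.get(name, match.group(0)))
    return re.sub(r'\$(\w+)', lookup, value)

variables = {}
for line in ['VARIABLE = /home/my/dir', 'OTHER = $VARIABLE/some/file']:
    name, _, value = (part.strip() for part in line.partition('='))
    variables[name] = expand(value, variables)
```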

Include files

It is possible to include more configuration files with statement

    INCLUDE <path-to-file-name>

Variable expansions are possible in the file name. There is no limit on how many includes are used or how deep the include structure goes. Every file is included only once, so it is safe to have multiple includes of the same file. Every include is read, so put the most important override includes last:

    INCLUDE <etc/pwget.conf>             # Global
    INCLUDE <$HOME/config/pwget.conf>    # HOME overrides it

A special THIS tag means the relative path of the current include file, which makes it possible to include several files from the same directory where the initial include file resides:

    # Start of config at /etc/pwget.conf
    # THIS = /etc, current location
    include <THIS/pwget-others.conf>
    # Refers to directory where current user is: the pwd
    include <pwget-others.conf>
    # end

Configuration file example

The configuration file can contain many directives, each of which ends with a colon. The usage of each directive is best explained by examining the configuration file below and reading the commentary near each directive.

    #   $HOME/config/pwget.conf - Perl pwget configuration file
    ROOT   = $HOME                      # define variables
    CONF   = $HOME/config
    UPDATE = $ROOT/updates
    DOWNL  = $ROOT/download
    #   Include more configuration files. It is possible to
    #   split a huge file in pieces and have "linux",
    #   "win32", "debian", "emacs" configurations in separate
    #   and manageable files.
    INCLUDE <$CONF/pwget-other.conf>
    INCLUDE <$CONF/pwget-more.conf>
    tag1: local-copies tag1: local      # multiple names to this category
        lcd:  $UPDATE                   # chdir directive
        #  This is shown to the user with option --verbose
        print: Notice, this site moved YYYY-MM-DD, update your bookmarks
    tag1: external
      lcd:  $DOWNL
      tag2: external-http save:/dir/dir/page.html
      tag2: external-ftp save:xx-file.txt.gz login:foo pass:passwd x:
        lcd: $HOME/download-kit new:
      tag2: package-x
        lcd: $DOWNL/package-x
        #  Person announces new files in his homepage, download all
        #  announced files. Unpack everything (x:) and remove any
        #  existing directories (xopt:rm)
        pregexp:\.tar\.gz$ x: xopt:rm
    # End of configuration file pwget.conf


All the directives must be on the same line as the URL. The program scans lines and determines all options given on a line for the URL. Directives can be overridden with command line options.


Currently only cnv:text is available.

Convert the downloaded page to text. This directive always needs either save: or rename:, because only those directives change the filename. Two examples:

    cnv:text save:file.txt
    pregexp:\.html cnv:text rename:s/html/txt/

A text: shorthand directive can be used instead of cnv:text.


Download the file only if its content matches REGEXP. This is the same as option --Regexp-content. In this example, Emacs lisp packages (.el) from a directory listing are downloaded, but only if their content indicates that the Author is Mr. Foo:

    cregexp:(?i)author:.*Foo pregexp:\.el$

Set the local download directory to DIRECTORY (chdir to it). Any environment variables in the path name are substituted. If this tag is found, it replaces the setting of --Output. If the path is not a directory, the program terminates with an error. See also --Create-paths and --no-lcd.


Ftp login name. Default value is "anonymous".


This is relevant to Sourceforge, which does not allow direct downloads. Visit the page and select the announced mirror, which can be seen from the URL that includes the string "use_mirror=site".

An example:

    new: mirror:kent

Get the newest file. This variable is reset to the value of --new after the line has been processed. Newest means that an ls() command is run in the ftp session (or something equivalent in HTTP "ftp directories"), any files that resemble the filename are examined and sorted, and the latest is determined heuristically from the version number in the file name. For example, files that have version information in YYYYMMDD format will most likely be retrieved correctly.

Time stamps of the files are not checked.

The only requirement is that the filename must follow the universal version numbering standard for released kits:

    FILE-VERSION.extension      # de facto VERSION is defined as [\d.]+
    file-19990101.tar.gz        # ok
    file-1999.0101.tar.gz       # ok
    file-         # ok
    file1234.txt                # not recognized. Must have "-"
    file-0.23d.tar.gz           # warning ! No letters allowed 0.23d

Files that have an alphabetic version indicator at the end of VERSION are not handled correctly. Contact the developer and persuade him to stick to the de facto standard so that files can be retrieved intelligently.
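A sketch of the naming rule as a regexp (Python, illustrative; the program's real heuristics are more involved, and the extension list here is an assumption): VERSION must be digits and dots, separated from FILE by a dash and followed by a known extension.

```python
import re

# FILE-VERSION.extension, where VERSION is [\d.]+ (digits and dots only)
KIT = re.compile(r'-([\d.]+)\.(?:tar\.gz|tgz|tar\.bz2|tar|gz|bz2|zip|txt)$')

def version_of(filename):
    """Return the VERSION part, or None when the name is not recognized."""
    m = KIT.search(filename)
    return m.group(1) if m else None
```

Note how file-0.23d.tar.gz fails the pattern: the letter "d" falls between the digits and the extension, so no clean VERSION can be extracted.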

overwrite: o:

Same as turning on --overwrite


Download the HTTP page or apply a command to it. A simple example, where the exact page name ("index.html", "welcome.html" etc.) is not known:

    page: save:foo-homepage.html

More about page: directive and downloading difficult packages

REMEMBER: all the regular expressions used in the configuration file must be kept together as one word. This means that there must be no space characters inside a regular expression, because a space terminates reading of the item. So if you write

    pregexp:(this regexp )

It must be written without literal spaces, for example by using \s in place of each space character.


Read the HTTP page "as is" and parse the page content. You need this directive if the archive is not stored in an HTTP server directory (similar to an ftp directory), but the maintainer has set up a separate HTML page where the details of how to get the archive are explained.

In order to find the information from the page, you must also supply some other directives to guide searching and constructing the correct file name:

1) A page regexp directive pregexp:ARCHIVE-REGEXP matches the A HREF filename location in the page.

2) Directive file:DOWNLOAD-FILE tells what is the template to use to construct the downloadable file (for the new: directive).

3) Directive vregexp:VERSION-REGEXP matches the exact location in the page from where the version information is extracted. The default regexp looks for line that says "The latest version 1.4.2". The regexp must return submatch 2 for the version number.

To put it all together, an example shows this in action. The following example should all be PUT ON ONE LINE; it has been split onto separate lines for legibility. The presented configuration line is explained in the next paragraphs.

Contact the absolute page: URL and search for A HREF urls in the page that match pregexp:. In addition, do another scan and search for the version number in the page at the position that matches vregexp: (submatch 2).

After all the pieces have been found, use template file: to make the retrievable file using the version number found from vregexp:. The actual download location is combination of page: and A HREF pregexp: location. Here is the whole "one line" definition in the configuration file:
    pregexp: package.tar.gz
    vregexp: ((?i)latest.*?version.*?\b([\d][\d.]+).*)
    file: package-1.3.tar.gz

Still not clear? Look at this complete HTML page where the above directives apply:

    The latest version of package is <B>2.4.1</B> It can be
    downloaded in several forms:
        <A HREF="download/files/package.tar.gz">Tar file</A>
        <A HREF="download/files/">ZIP file</A>

For this example it is assumed that package.tar.gz is actually a symbolic link to the latest standard release file package-2.4.1.tar.gz. From this page, the actual download location is the combination of the page: URL and the matched A HREF download/files/package.tar.gz. So why not simply download package.tar.gz? Because then the program can't decide whether the version on the page is newer than the one stored on disk from the previous download. With version numbers in the file names, it can.
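The mechanics can be sketched in Python (illustrative only; pwget itself is Perl, and the template-substitution regexp here is an assumption): vregexp: yields the version as submatch 2, which then replaces the version embedded in the file: template.

```python
import re

page = ('The latest version of package is <B>2.4.1</B> It can be\n'
        'downloaded in several forms:\n'
        '    <A HREF="download/files/package.tar.gz">Tar file</A>\n')

# The vregexp: pattern from the example; submatch 2 is the version number.
version = re.search(r'(latest.*?version.*?\b([\d][\d.]+).*)',
                    page, re.IGNORECASE).group(2)

# Substitute the found version into the file: template.
template = 'package-1.3.tar.gz'
wanted = re.sub(r'[\d.]+\d', version, template, count=1)
```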


It is possible to add a rename: directive to change the final name of the saved file in the above cases. Sometimes people put version numbers on "plain" files that are not archives, like


The .el files are Emacs editor package files, and it would be very inconvenient for Emacs users to refer to them by any name other than plain "file.el". To write a complete line that finds such files from a page and saves them under the plain name, see below. Lines have been broken for legibility:

It effectively says "See if there is a new version of something that looks like file.el-1.1 and save it under the name file.el by deleting the extra version number at the end of the original filename".


THIS IS NOT FOR FTP directories. Use directive regexp: for FTP.

This is more general instruction than the page: and vregexp: explained above.

Instruct to download every URL on an HTML page matching pregexp:RE. In a typical situation, the page maintainer lists his software on a development page. This example would download every tar.gz file mentioned in the page. Note that the REGEXP is matched against the A HREF link content, not the actual text that you see on the page:

    page:find pregexp:\.tar.gz$

You can also use an additional regexp-no: directive if you want to exclude files after the pregexp: has matched a link:

    page:find pregexp:\.tar.gz$ regexp-no:this-packet

For FTP logins. Default value is


Print the associated message to a user requesting a matching tag name. This directive must be on a separate line inside the tag.

    tag1: linux
      print: this download site moved 2002-02-02, check your bookmarks. new:

The print: directive for tag is shown only if user turns on --verbose mode:

    pwget -v -T linux

Rename each file using PERL-CODE. The PERL-CODE must be complete perl code with no spaces anywhere. The following variables are available during the eval() of the code:

    $ARG = current file name
    $url = complete url for the file
    The code must return $ARG which is used for file name

For example, if a page contains links to .html files that are in fact text files, the following statement would change the file extensions:

    page:find pregexp:\.html rename:s/html/txt/

You can also call the function MonthToNumber($string) if the filename contains a written month name, like <2005-February.mbox>. The function will convert the name into a number. Many mailing list archives can be downloaded cleanly this way.

    #  This will download SA-Exim Mailing list archives:
    pregexp:\.txt$ rename:$ARG=MonthToNumber($ARG)
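What a MonthToNumber-style conversion does can be pictured like this (a Python sketch; the real Perl function's exact behavior, such as zero padding, is an assumption here):

```python
MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def month_to_number(name):
    """Replace a written month name in a file name with its two-digit number."""
    for number, month in enumerate(MONTHS, start=1):
        if month in name:
            return name.replace(month, '%02d' % number)
    return name
```

With this, files such as 2005-February.mbox sort chronologically on disk as 2005-02.mbox.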

Here is a more complicated example:

    pregexp:mbox.*\d$ rename:my($y,$m)=($url=~/year=(\d+).*month=(\d+)/);$ARG="$y-$m.mbox"

Let's break that one apart. You may want to spend some time with this example, since the possibilities are limitless.

    1. Connect to the page
    2. Search the page for URLs matching regexp 'mbox.*\d$'. A found link
       could be, for example, a URL containing year=2004 and month=12.
    3. The found link is put into $ARG, from which a suitable mailbox
       name can be extracted by the perl code that is evaluated. The
       resulting name must appear in $ARG. Thus the code effectively
       extracts two items from the link to form a mailbox name:
        my ($y, $m) = ( $url =~ /year=(\d+).*month=(\d+)/ )
        $ARG = "$y-$m.mbox"
        => 2004-12.mbox

Just remember that there must not be any spaces in the code that follows the rename: directive.
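The same extraction, mirrored in Python for illustration (the URL is hypothetical; in the configuration file the code is Perl and operates on $url and $ARG):

```python
import re

# A hypothetical mailing list archive link matching pregexp 'mbox.*\d$'
url = 'http://example.com/mbox.cgi?year=2004&month=12'

year, month = re.search(r'year=(\d+).*month=(\d+)', url).groups()
mailbox = '%s-%s.mbox' % (year, month)  # -> '2004-12.mbox'
```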


Get all files in ftp directory matching regexp. Directive save: is ignored.


After the regexp: directive has matched, exclude files that match directive regexp-no:.


This option is for interactive use. Retrieve all files from HTTP or FTP site which match REGEXP.


Save file under this name to local disk.


Downloads can be grouped under tagN, so that e.g. giving option --Tag with a tag1 name would start downloading files from that point on until the next tag1 is found. There are currently three tag levels, tag1, tag2 and tag3, so you can arrange your downloads hierarchically in the configuration file. For example, to download all the Linux files that you monitor, you would give option --Tag linux. To download only the latest NT Emacs binary, you would give option --Tag emacs-nt. Notice that you do not give the level in the option; the program will find it out from the configuration file when the tag name matches.

The downloading stops at the next tag of the same level. That is, tag2 stops only at the next tag2, or when an upper level tag is found (tag1), or at the end of file.

    tag1: linux             # All Linux downloads under this category
        tag2: sunsite    tag2: another-name-for-this-spot
        #   List of files to download from here
    tag1: emacs-binary
        tag2: emacs-nt
        tag2: xemacs-nt
        tag2: emacs
        tag2: xemacs

Extract (unpack) the file after download. See also options --unpack and --no-extract. The archive file, say .tar.gz, will be extracted in the current download location (see directive lcd:).

The unpack procedure checks the contents of the archive to see if the package is correctly formed. The de facto archive format is


In the archive, all files are supposed to be stored under the proper subdirectory with version information:


IMPORTANT: if the archive does not have a subdirectory for all files, a subdirectory is created and all items are unpacked under it. The default subdirectory name is constructed from the archive name with a current date stamp in the format:


If the archive name contains something that looks like a version number, the created directory will be constructed from it, instead of current date.

    package-1.43.tar.gz    =>  package-1.43
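A sketch of how the created directory name might be derived (Python, illustrative; the program's actual rules and extension list may differ):

```python
import re
import time

def unpack_dir(archive):
    """Strip the archive extension; append a date stamp when the
    name carries no version number."""
    base = re.sub(r'\.(tar\.gz|tgz|tar\.bz2|tar|gz|bz2|zip)$', '', archive)
    if re.search(r'-[\d.]+$', base):
        return base                     # package-1.43.tar.gz -> package-1.43
    return '%s-%s' % (base, time.strftime('%Y.%m%d'))
```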

Like directive x:, but extract the archive as is, without checking the content of the archive. If you know that it is ok for the archive not to include any subdirectories, use this option to suppress creation of an artificial root directory package-YYYY.MMDD.


This option tells to remove any previous unpack directory.

Sometimes the files in the archive are all read-only, and unpacking the archive a second time, after some period of time, would display:

    tar: package-3.9.5/.cvsignore: Could not create file: Permission denied
    tar: package-3.9.5/BUGS: Could not create file: Permission denied

This is not a serious error, because the archive was already on disk and tar did not overwrite the previous files. It might be good to inform the archive maintainer that the files have wrong permissions. It is customary to expect that distributed kits have the writable flag set for all files.


Here is a list of possible error messages and how to deal with them. Turning on --debug will help you understand how the program has interpreted the configuration file or command line options. Pay close attention to the generated output, because it may reveal that a regexp for a site is too loose or too tight.

ERROR {URL-HERE} Bad file descriptor

This is a "file not found" error. You have probably written the filename incorrectly. Double-check the configuration file's line.


The variable PWGET_PL_CFG can point to the root configuration file, in which you can use include directives to read more configuration files. The configuration file is read at startup if it exists.

    export PWGET_PL_CFG=$HOME/conf/pwget.conf     # /bin/sh syntax
    setenv PWGET_PL_CFG $HOME/conf/pwget.conf     # /bin/csh syntax


The C program wget(1); and from the Libwww Perl library you will find the scripts lwp-download(1), lwp-mirror(1), lwp-request(1) and lwp-rget(1).

Win32 Cygwin unix utilities at


Latest version of this file is at Project homepage at


CPAN/Administrative CPAN/Web


LWP::UserAgent Net::FTP


HTML::Parse HTML::TextFormat HTML::FormatText

These modules are dynamically loaded only if directive cnv:text is used. Otherwise these modules are not loaded.

Crypt::SSLeay This module is loaded only if HTTPS scheme is encountered.






Copyright (C) 1996-2009 Jari Aalto. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License v2 or any later version.