TXR: a Pattern Matching
Language (Not Just) for
Convenient Text Extraction

Kaz Kylheku <kaz@kylheku.com>

Quick Links

What is it?

TXR ("texer" or "tee ex ar") is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren't imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

 The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a "here-document". Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as  various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today's web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn't want here documents as much as "there documents": I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features.  TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can't be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision ("bignum") integers.

Examples Online

There is a growing body of TXR examples on the Rosetta Code site, where you can compare them to solutions in other programming languages.

and others.

Rudimentary Concepts

A file containing UTF-8 text is already a TXR query which: almost. Care has to be taken to escape the meta-character @ which introduces all special syntax. This is done by writing it twice: @@ stands for a single literal @.  Thus, a text file which contains no @ signs, or whose @ signs are properly escaped by being doubled twice is a pattern match. So for instance:

Four score and
seven years ago
our fathers brought forth,

is a TXR query which matches the text itself. Actually, it matches more than just itself. It matches any text which begins with those three lines. Thus it also matches this text

Four score and
seven years ago
our fathers brought forth,
upon this continent

furthermore, spaces actually have a special meaning in TXR. A single space denotes a match for one or more spaces. So our query also matches this text, which is a convenient behavior.

Four   score   and
seven years ago
our fathers brought forth,
upon this continent

We can tighten the query so that it matches exactly three lines, and only single spaces in the first line.

Four@\ score@\ and
seven years ago
our fathers brought forth,
@(eof)

Here the @ character comes into play. The syntax @\space syntax encodes a literal space which doesn't have the "match one or more spaces" meaning. The @(eof) directive means "match the empty data set, consisting of no lines".

Variables are denoted as identifiers preceded by @, and  match pieces of text in mostly intuitive ways (and sometimes not so intuitive). Suppose we change the above to this:

Four@\ score@\ and
seven @units ago
our @relatives brought forth,
@(eof)

Now if this query is matched against the original file, the variable units will capture the character string "years" and relatives will capture "fathers". Of course, it matches texts which have words other than these, such as seven months ago, or our mothers brought forth.

From this departure point, things get rapidly complicated.

Advanced Example

Here is a TXR query which matches an arithmetic expression grammar, consisting of numbers, identifiers, basic arithmetic operators (+ - * /) and parentheses. The expression is supplied as a command line argument (this is done by @(next :args) which redirects the pattern matching to the argument vector).

Note that most of this code is not literal text. All of the pieces shown in color are special syntax. The @# os -> optional space text is a comment:

@(next :args)
@(define os)@/ */@(end)@# os -> optional space
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@(cases)
@ (expr)
@ (output)
parses!
@ (end)
@(or)
@ (expr)@bad
@ (output)
error starting at "@bad"
@ (end)
@(end)

Sample runs from Unix command line:

$ txr expr.txr 'a + (3 * b/(c + 4))'
parses!
$ txr expr.txr 'a + (3 * b/(c + 4)))'
error starting at ")"
$ txr expr.txr 'a + (3 * b/(c + 4)'
error starting at "+ (3 * b/(c + 4)"

As you can see, this program matches the longest prefix of the input which is a well-formed expression. The expression is recognized using the simple function call @(expr) which could be placed into the middle of a text template as easily as a variable.  The @(cases) directive is used to recognize two situations: either the argument completely parses, or there is stray material that is not recognized, which can be captured into a variable called bad. The grammar itself is straightforward.

Look at the grammar production for factor. It contains two literal characters: the parentheses around @(expr). The syntax coloring reveals them to be what they are: they stand for themselves.

The ability to parse grammars happened in TXR by accident. It's a consequence of combining pattern matching and functions. In creating TXR, I independently discovered a concept known as PEGs: Parsing Expression Grammars.

Note how the program easily deals with lexical analysis and higher level parsing in one grammar: no need for a division of the task into "tokenizing" and "parsing".  Tokenizing is necessary with classic parsers, like LALR(1) machines, because these parsers normally have only one token of lookahead and avoid backtracking. So they are fed characters instead of tokens, they cannot do very much due to running into ambiguities arising from complex tokens. By itself, a classic parser cannot decide whether "i" is the beginning of the C "int" keyword, or just the start of an identifier like "input".It needs the tokenizer to scan these (done with a regular language based on regular expression) and do the classification, so the parser sees a KEYWORD or IDENT token.

Embedded Lisp

Just like the TXR pattern matching primitves are embedded in plain text, within the pattern matching language, there is an embedded Lisp dialect. Here is one way to tabulate a frequency histogram of the letters A-Z:

@(do (defvar h (make-hash nil nil t)))
@(collect :vars ())
@(coll :vars ())@\
@{letter /[A-Za-z]/}@(filter :upcase letter)@\
@(do (inc (gethash h letter 0)))@\
@(end)
@(end)
@(do (dohash (key value h)
(format t "~a: ~a\n" key value)))

Where do I get it, et cetera

Releases and snapshots can be pulled directly from the git repository here. A HTML-ized version of the manual page, which may be out of date, is here. And, let's not forget the Savannah project page.

To build the program, you need a C compiler, a yacc utility (I've never tried anything but GNU Bison and Berkeley Yacc) and GNU flex. (Flex extensions are used: in particular start conditions). A few POSIX features are required from the host platform, like the popen function, and <dirent.h>. These are available on Windows through the MinGW compiler and environment.

The configure script and Makefile are geared toward a gcc and glibc environment, and rely on some GNU make features. Building for Windows therefore requires a GNU environment: MinGW. There is an issue with GNU flex on MinGW, requiring the following argument to the configure script: libflex="-L/usr/lib -lfl".

The program itself does some inherently non-portable things in the area of garbage-collected memory management. I build it with Ubuntu and Debian Linux for x86 and x86_64. I've run it on a MIPS Linux system (n32 ABI) and x86 based Linux systems, both 32 and 64 bit, but not recentl versions. If you have porting issues, contact the TXR mailing list!