TXR: a Pattern Matching
Language (Not Just) for
Convenient Text Extraction

Kaz Kylheku <kaz@kylheku.com>

Quick Links

What is it?

TXR is a fusion of different streams of thought. It is influenced by concepts from text processing languages such as awk or perl, pattern matching concepts from logic/AI programming, Lisp and functional languages. It is relatively new. Development began around September 2009.

TXR is a pragmatic, convenient text munging language. Or rather, has one built in. Similarly to other text processing tools, it has certain convenient implicit behavior with regard to input handling. Whereas, say, awk implicitly reads a file, breaking it into records and fields which are accessible as positional variables, TXR has quite a different way of making input handling implicit: namely via a nested, recursive pattern matching language which binds variables. This approach still handles delimited fields with relative convenience, but generalizes into handling messy, loosely structured data, or data which exhibits different regularities in different sections, etc. Constructs in TXR (the pattern language) aren't imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit. It has features like structured named blocks with nonlocal exits, structured exception handling, named pattern matching functions, and numerous other features.  TXR's pattern language is powerful enough to parse grammars, yet simple to use in an ad-hoc way on trivial tasks.

TXR also has the "brains" that the designers of other pragmatic, convenient text munging languages have neglected to put in: a built in, powerful functional and imperative language, with all the goodies: garbage collection, bignum integers; symbols; lists (eager and lazy); vectors; lexical closures; exception handling; dynamic nonlocal exists; named blocks; hash tables with optional weak semantics; streams; strings; quasiliterals; quasiquotes; and regular expressions with advanced operators like intersection, complement, difference.

In spite of all this, TXR is a clean language with an unambiguous syntax which makes a only a judicious and economic use of terse notations. TXR programs are compact, but not to the point of looking like the noise on a serial line when a connection is broken.

Examples

Here is a collection of TXR Solutions to a number of problems from Rosetta Code.

Rudimentary Concepts

A file containing UTF-8 text is already a TXR query which: almost. Care has to be taken to escape the meta-character @ which introduces all special syntax. This is done by writing it twice: @@ stands for a single literal @.  Thus, a text file which contains no @ signs, or whose @ signs are properly escaped by being doubled twice is a pattern match. So for instance:

Four score and
seven years ago
our fathers brought forth,

is a TXR query which matches the text itself. Actually, it matches more than just itself. It matches any text which begins with those three lines. Thus it also matches this text

Four score and
seven years ago
our fathers brought forth,
upon this continent

furthermore, spaces actually have a special meaning in TXR. A single space denotes a match for one or more spaces. So our query also matches this text, which is a convenient behavior.

Four   score   and
seven years ago
our fathers brought forth,
upon this continent

We can tighten the query so that it matches exactly three lines, and only single spaces in the first line.

Four@\ score@\ and
seven years ago
our fathers brought forth,
@(eof)

Here the @ character comes into play. The syntax @\space syntax encodes a literal space which doesn't have the "match one or more spaces" meaning. The @(eof) directive means "match the empty data set, consisting of no lines".

Variables are denoted as identifiers preceded by @, and  match pieces of text in mostly intuitive ways (and sometimes not so intuitive). Suppose we change the above to this:

Four@\ score@\ and
seven @units ago
our @relatives brought forth,
@(eof)

Now if this query is matched against the original file, the variable units will capture the character string "years" and relatives will capture "fathers". Of course, it matches texts which have words other than these, such as seven months ago, or our mothers brought forth.

As you can see, the basic concept in simple patterns like this very much resembles a "here document": it's a template of text with variables. But of course, this "here document" runs backwards! Rather than generating text by substituting variables, it does the opposite: it matches text and extracts variables. The need for a "here document run backwards" was what prompted the initial development of TXR!

From this departure point, things get rapidly complicated. The pattern language has numerous directives expressing parallel matching and iteration. Many of the directives work in vertical (line oriented) and horizontal (character oriented) modes. Pattern functions can be defined (horizontal and vertical) and those can be recursive, allowing grammars to be parsed.

Simple Collection/Generation Example

The following query reads a stream of comma-separated pairs and generates a HTML table. A complete version with sample data is given here.

@(collect)
@char,@speech
@(end)
@(output :filter :to_html)
<table>
@  (repeat)
  <tr>
     <td>@char</td>
     <td>@speech</td>
  </tr>
@  (end)
</table>
@(end)

Grammar Parsing Example

Here is a TXR query which matches an arithmetic expression grammar, consisting of numbers, identifiers, basic arithmetic operators (+ - * /) and parentheses. The expression is supplied as a command line argument (this is done by @(next :args) which redirects the pattern matching to the argument vector).

Note that most of this code is not literal text. All of the pieces shown in color are special syntax. The @; os -> optional space text is a comment.

@(next :args)
@(define os)@/ */@(end)@; os -> optional space
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@(cases)
@  (expr)
@  (output)
parses!
@  (end)
@(or)
@  (expr)@bad
@  (output)
error starting at "@bad"
@  (end)
@(end)

The grammar productions above represented by horizontal pattern functions. Horizontal pattern functions are denoted visually by a horizontal syntax: their elements are written side by side on a single logical line. Horizontal function definitions can be broken into multiple physical lines and indented, with the help of the @\ continuation sequence, which consumes all leading whitespace from the following line, like this:

@(define term)@\
  @(some)@\
    @(factor)@\
  @(or)@\
    @(factor)@(mulop)@(term)@\
  @(or)@\
    @(addop)@(factor)@\
  @(end)@\
@(end)

Sample runs from Unix command line:

$ txr expr.txr 'a + (3 * b/(c + 4))'
parses!
$ txr expr.txr 'a + (3 * b/(c + 4)))'
error starting at ")"
$ txr expr.txr 'a + (3 * b/(c + 4)'
error starting at "+ (3 * b/(c + 4)"

As you can see, this program matches the longest prefix of the input which is a well-formed expression. The expression is recognized using the simple function call @(expr) which could be placed into the middle of a text template as easily as a variable.  The @(cases) directive is used to recognize two situations: either the argument completely parses, or there is stray material that is not recognized, which can be captured into a variable called bad. The grammar itself is straightforward.

Look at the grammar production for factor. It contains two literal characters: the parentheses around @(expr). The syntax coloring reveals them to be what they are: they stand for themselves.

The ability to parse grammars happened in TXR by accident. It's a consequence of combining pattern matching and functions. In creating TXR, I independently discovered a concept known as PEGs: Parsing Expression Grammars.

Note how the program easily deals with lexical analysis and higher level parsing in one grammar: no need for a division of the task into "tokenizing" and "parsing".  Tokenizing is necessary with classic parsers, like LALR(1) machines, because these parsers normally have only one token of lookahead and avoid backtracking. So they are fed characters instead of tokens, they cannot do very much due to running into ambiguities arising from complex tokens. By itself, a classic parser cannot decide whether "i" is the beginning of the C "int" keyword, or just the start of an identifier like "input".It needs the tokenizer to scan these (done with a regular language based on regular expression) and do the classification, so the parser sees a KEYWORD or IDENT token.

Embedded Lisp

Just like the TXR pattern matching primitves are embedded in plain text, within the pattern matching language, there is an embedded Lisp dialect. Here is one way to tabulate a frequency histogram of the letters A-Z, using the pattern language to extract the letters from the input, and TXR Lisp to tabulate:

@(do (defvar h (make-hash nil nil t)))
@(collect :vars ())
@(coll :vars ())@\
  @{letter /[A-Za-z]/}@(filter :upcase letter)@\
  @(do (inc (gethash h letter 0)))@\
@(end)
@(end)
@(do (dohash (key value h)
       (format t "~a: ~a\n" key value)))

Here is an approach using purely TXR Lisp. The function lazy-char-stream takes an open stream and returns a lazy list of characters. The lazy list is generated using the gen operator: (gen x y) means that whenever the lazy list is probed to a new depth, the expression x is evaluated, and if it yields true, then y is evaluated to produce a list item. This lazy list of characters can be conveniently processed using the each operator. The expression (inc [h (chr-toupper ch) 0]) is a shorthand equivalent to (inc (gethash h (chr-toupper ch) 0)).

@(do (defun lazy-char-stream (s)
       (let (ch) (gen (set ch (get-char s)) ch)))

     (let ((h (make-hash nil nil t))
           (s (open-file "/usr/share/dict/words" "r")))
         (each ((ch (lazy-char-stream s)))
           (if (chr-isalpha ch)
             (inc [h (chr-toupper ch) 0])))
         (dohash (key value h)
           (format t "~a: ~a\n" key value))))

Source Code

Releases and snapshots can be pulled directly from the git repository.

To build the program, you need a C compiler, a yacc utility (I've never tried anything but GNU Bison and Berkeley Yacc) and GNU flex. (Flex extensions are used: in particular start conditions). A few POSIX features are required from the host platform, like the popen function, and <dirent.h>. These are available on Windows through the MinGW compiler and environment.

The configure script and Makefile are geared toward a gcc and glibc environment, and rely on some GNU make features. Building for Windows therefore requires a GNU environment: MinGW. There is an issue with GNU flex on MinGW, requiring the following argument to the configure script: libflex="-L/usr/lib -lfl".

If you have porting issues, contact the TXR mailing list!

Binary Downloads

Pre-compiled builds of TXR 65 are available for these platforms:

OS OS Version Arch MD5 Checksum File
Cygwin 1.7.13 i686 87d203b7d38d166e42c411b48c49e68a txr-65-Cygwin-1.7.13-i686
FreeBSD 9.0 amd64 405a34b00d7ac24e10a5b20393ca7e82 txr-65-FreeBSD-9.0-amd64
MinGW 1.0.17 i686 6ffa1a4eb4c6659991448d1e299089da txr-65-MinGW-1.0.17-i686.exe
NetBSD 5.1 amd64 0acba413e1de0930846c50e8311574ac txr-65-NetBSD-5.1-amd64
OSX Lion i386 4996ad83844fca8c91972bab667c8d8e txr-65-OSX-Lion-i386
Ubuntu 11.04 i686 9907b938699b61351c5fbd5524becc23 txr-65-Ubuntu-11.04-i686

Make a Donation

If you find TXR to be a valuable tool in your arsenal, here is one way to show your appreciation and support! Developing stuff like this takes countless hours.