TXR: an Original, New
Programming Language for
Convenient Data Munging

Kaz Kylheku <kaz@kylheku.com>

Help Needed

The TXR project is looking for hackers to develop features, such as:

TXR has clean, easy to understand and maintain internals that are a pleasure to work with. Be sure to read the HACKING guide.

What is it?

TXR is a pragmatic, convenient tool ready to take on your daily hacking challenges with its dual personality: its whole-document pattern matching and extraction language for scraping information from arbitrary text sources, and its powerful data-processing language to slice through problems like a hot knife through butter. Many tasks can be accomplished with TXR "one liners" directly from your system prompt.

TXR is a fusion of many different ideas, a few of which are original, and it is influenced by many languages, such as Common Lisp, Scheme, Awk, M4, POSIX Shell, Prolog, Ruby, Python, Arc, Clojure, S-Lang and others. It is relatively new: the project started in 2009.

Like some other data processing tools, it has certain convenient implicit behavior with regard to input handling, via its pattern-based text extraction language. A comparison to the Awk language may be drawn here: whereas Awk implicitly reads a file, breaking it into records and fields which are accessible as positional variables, TXR makes input handling implicit in quite a different way: via a nested, recursive pattern matching notation which binds variables. This approach still handles delimited fields with relative convenience, but generalizes to messy, loosely structured data, or data which exhibits different regularities in different sections. Constructs in TXR (the pattern language) aren't imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit. The pattern language offers structured named blocks with nonlocal exits, structured exception handling, named pattern matching functions, and numerous other features. It is powerful enough to parse grammars, yet simple to use in an ad-hoc way on trivial tasks.

TXR also has the "brains" that the designers of other pragmatic, convenient data munging languages have neglected to put in: a built in, powerful functional and imperative language, with lots of features, such as:

This embedded language, TXR Lisp, maintains strong ties to the Lisp family of languages, while its design also pays attention to newer scripting languages which have emerged in the last ten to twenty years, and takes cues from functional languages.

Examples

Here is a collection of TXR Solutions to a number of problems from Rosetta Code.

Rudimentary Concepts

A file containing UTF-8 text is already a TXR query which matches itself. Almost: care has to be taken to escape the meta-character @ which introduces all special syntax. This is done by writing it twice: @@ stands for a single literal @. Thus, a text file which contains no @ signs, or whose @ signs are properly escaped by being doubled, is a pattern match. So for instance:

Four score and
seven years ago
our fathers brought forth,

is a TXR query which matches the text itself. Actually, it matches more than just itself: it matches any text which begins with those three lines. Thus it also matches this text:

Four score and
seven years ago
our fathers brought forth,
upon this continent

Furthermore, spaces have a special meaning in TXR: a single space denotes a match for one or more spaces. So our query also matches this text, which is a convenient behavior:

Four   score   and
seven years ago
our fathers brought forth,
upon this continent

We can tighten the query so that it matches exactly three lines, and only single spaces in the first line.

Four@\ score@\ and
seven years ago
our fathers brought forth,
@(eof)

Here the @ character comes into play. The @\space syntax encodes a literal space which doesn't have the "match one or more spaces" meaning. The @(eof) directive means "match the empty data set, consisting of no lines".

Variables are denoted as identifiers preceded by @, and match pieces of text in mostly intuitive ways (and sometimes not so intuitive). Suppose we change the above to this:

Four@\ score@\ and
seven @units ago
our @relatives brought forth,
@(eof)

Now if this query is matched against the original file, the variable units will capture the character string "years" and relatives will capture "fathers". Of course, it matches texts which have words other than these, such as seven months ago, or our mothers brought forth.

As you can see, the basic concept in simple patterns like this very much resembles a "here document": it's a template of text with variables. But of course, this "here document" runs backwards! Rather than generating text by substituting variables, it does the opposite: it matches text and extracts variables. The need for a "here document run backwards" was what prompted the initial development of TXR!
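The "here document run backwards" idea can be sketched outside TXR. The following Python is only an analogy and is not how TXR is implemented: it handles just the few features introduced so far (literal text, a space matching one or more spaces, @\space, @@, simple @variables and a trailing @(eof)) by translating each template line into a regular expression. The names line_to_regex and match_template are made up for this illustration, and the sketch assumes that each template line must match a whole line of data.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def line_to_regex(line):
    """Translate one template line into a regular expression (simplified)."""
    out = []
    i = 0
    while i < len(line):
        if line.startswith("@@", i):        # doubled @ is a literal @
            out.append("@")
            i += 2
        elif line.startswith("@\\ ", i):    # @\space: exactly one space
            out.append(re.escape(" "))
            i += 3
        elif line[i] == "@":                # @name: a variable
            m = IDENT.match(line, i + 1)
            if m is None:
                raise ValueError(f"unsupported syntax at column {i}: {line!r}")
            out.append(f"(?P<{m.group()}>.+?)")
            i = m.end()
        elif line[i] == " ":                # plain space: one or more spaces
            out.append(" +")
            i += 1
        else:                               # anything else stands for itself
            out.append(re.escape(line[i]))
            i += 1
    return "".join(out)

def match_template(template, text):
    """Return a dict of variable bindings, or None if the text does not match."""
    pattern_lines = template.splitlines()
    data_lines = text.splitlines()
    if pattern_lines and pattern_lines[-1] == "@(eof)":
        pattern_lines.pop()
        if len(data_lines) != len(pattern_lines):   # no extra lines allowed
            return None
    elif len(data_lines) < len(pattern_lines):      # otherwise a prefix match
        return None
    bindings = {}
    for pattern, data in zip(pattern_lines, data_lines):
        m = re.fullmatch(line_to_regex(pattern), data)
        if m is None:
            return None
        bindings.update(m.groupdict())
    return bindings

template = ("Four@\\ score@\\ and\n"
            "seven @units ago\n"
            "our @relatives brought forth,\n"
            "@(eof)")
text = "Four score and\nseven years ago\nour fathers brought forth,"
print(match_template(template, text))
```

Unlike real TXR, this sketch does not check that a variable which is already bound matches the same text again, and it knows nothing about directives other than @(eof).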

From this departure point, things get rapidly complicated. The pattern language has numerous directives expressing parallel matching and iteration. Many of the directives work in vertical (line oriented) and horizontal (character oriented) modes. Pattern functions can be defined (horizontal and vertical) and those can be recursive, allowing grammars to be parsed.

Simple Collection/Generation Example

The following query reads a stream of comma-separated pairs and generates an HTML table. A complete version with sample data is given here.

@(collect)
@char,@speech
@(end)
@(output :filter :to_html)
<table>
@  (repeat)
  <tr>
     <td>@char</td>
     <td>@speech</td>
  </tr>
@  (end)
</table>
@(end)
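
For readers who think more easily in an imperative language, here is a rough Python counterpart of the query above. It is a sketch of the same collect-then-output idea, not a description of TXR's behavior: the function name pairs_to_html_table is invented, lines without a comma are simply skipped, and html.escape stands in for the :to_html filter (the exact set of characters each one escapes may differ).

```python
import html

def pairs_to_html_table(lines):
    """Collect "char,speech" pairs and render them as an HTML table."""
    rows = []
    for line in lines:
        # Split at the first comma; lines without one are not collected.
        char, comma, speech = line.rstrip("\n").partition(",")
        if comma:
            rows.append((char, speech))

    out = ["<table>"]
    for char, speech in rows:
        out.append("  <tr>")
        out.append(f"     <td>{html.escape(char)}</td>")
        out.append(f"     <td>{html.escape(speech)}</td>")
        out.append("  </tr>")
    out.append("</table>")
    return "\n".join(out)

sample = ["Hamlet,To be & not to be", "no comma here", "Puck,Lord <what> fools"]
print(pairs_to_html_table(sample))
```

The TXR version expresses the same thing declaratively: the @(collect) block gathers char and speech into lists, and @(repeat) inside @(output) iterates over them.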

Grammar Parsing Example

Here is a TXR query which matches an arithmetic expression grammar, consisting of numbers, identifiers, basic arithmetic operators (+ - * /) and parentheses. The expression is supplied as a command line argument (this is done by @(next :args) which redirects the pattern matching to the argument vector).

Note that most of this code is not literal text. All of the pieces shown in color are special syntax. The @; os -> optional space text is a comment.

@(next :args)
@(define os)@/ */@(end)@; os -> optional space
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@(cases)
@  (expr)
@  (output)
parses!
@  (end)
@(or)
@  (expr)@bad
@  (output)
error starting at "@bad"
@  (end)
@(end)

The grammar productions above are represented by horizontal pattern functions. Horizontal pattern functions are denoted visually by a horizontal syntax: their elements are written side by side on a single logical line. Horizontal function definitions can be broken into multiple physical lines and indented, with the help of the @\ continuation sequence, which consumes all leading whitespace from the following line, like this:

@(define term)@\
  @(some)@\
    @(factor)@\
  @(or)@\
    @(factor)@(mulop)@(term)@\
  @(or)@\
    @(addop)@(factor)@\
  @(end)@\
@(end)

Sample runs from Unix command line:

$ txr expr.txr 'a + (3 * b/(c + 4))'
parses!
$ txr expr.txr 'a + (3 * b/(c + 4)))'
error starting at ")"
$ txr expr.txr 'a + (3 * b/(c + 4)'
error starting at "+ (3 * b/(c + 4)"

As you can see, this program matches the longest prefix of the input which is a well-formed expression. The expression is recognized using the simple function call @(expr) which could be placed into the middle of a text template as easily as a variable.  The @(cases) directive is used to recognize two situations: either the argument completely parses, or there is stray material that is not recognized, which can be captured into a variable called bad. The grammar itself is straightforward.

Look at the grammar production for factor. It contains two literal characters: the parentheses around @(expr). The syntax coloring reveals them to be what they are: they stand for themselves.

The ability to parse grammars happened in TXR by accident. It's a consequence of combining pattern matching and functions. In creating TXR, I independently discovered a concept known as PEGs: Parsing Expression Grammars.

Note how the program easily deals with lexical analysis and higher level parsing in one grammar: there is no need to divide the task into "tokenizing" and "parsing". Tokenizing is necessary with classic parsers, like LALR(1) machines, because these parsers normally have only one token of lookahead and avoid backtracking. If they are fed characters instead of tokens, they cannot do very much, because they run into ambiguities arising from complex tokens. By itself, a classic parser cannot decide whether "i" is the beginning of the C "int" keyword, or just the start of an identifier like "input". It needs the tokenizer to scan these (done with a regular language based on regular expressions) and do the classification, so the parser sees a KEYWORD or IDENT token.

Embedded Lisp

Just as the TXR pattern matching primitives are embedded in plain text, there is a Lisp dialect embedded within the pattern matching language. Here is one way to tabulate a frequency histogram of the letters A-Z, using the pattern language to extract the letters from the input, and TXR Lisp to tabulate:

@(do (defvar h (hash :equal-based)))
@(collect :vars ())
@(coll :vars ())@\
  @{letter /[A-Za-z]/}@(filter :upcase letter)@\
  @(do (inc [h letter 0]))@\
@(end)
@(end)
@(do (dohash (key value h)
       (format t "~a: ~a\n" key value)))

Here is an approach using purely TXR Lisp. Some aspects of it may appear, to Lisp programmers, if not entirely familiar, then at least clear. For instance, it is probably obvious that open-file opens a file and returns a stream, and that the let construct binds that stream to the variable s. Note the gun operator. Its name stands for "generate until nil": it returns a lazy list, possibly infinite, whose elements are formed by repeated calls to the enclosed expression, in this case (get-char s). This lazy list of characters can then be conveniently processed using the each operator. The square bracket expression (inc [h (chr-toupper ch) 0]) is a shorthand equivalent for (inc (gethash h (chr-toupper ch) 0)), which means: increment the value in hash table h corresponding to the key (chr-toupper ch) (the character ch converted to upper case). If the entry does not exist, it is created and initialized with 0, then incremented.

@(do (let ((h (hash))
           (s (open-file "/usr/share/dict/words" "r")))
       (each ((ch (gun (get-char s))))
         (if (chr-isalpha ch)
           (inc [h (chr-toupper ch) 0])))
       (let ((sorted [sort (hash-pairs h) > second]))
         (each ((pair sorted))
           (tree-bind (key value) pair
              (put-line `@key: @value`))))))
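
Readers coming from other scripting languages may find a side-by-side analogy helpful. The Python sketch below mirrors the structure of the TXR Lisp version; it is an illustration only. The two-argument form of iter plays the role of gun (call an expression repeatedly until a sentinel value appears), and h.get(key, 0) + 1 plays the role of (inc [h key 0]). The function name letter_histogram is invented, and an in-memory stream replaces /usr/share/dict/words so that the example is self-contained.

```python
import io

def letter_histogram(stream):
    """Count letters case-insensitively; return (letter, count) pairs,
    most frequent first."""
    h = {}
    # iter(callable, sentinel) keeps calling stream.read(1) until it
    # returns "", which is how a text stream signals end of input.
    for ch in iter(lambda: stream.read(1), ""):
        if ch.isalpha():
            key = ch.upper()
            h[key] = h.get(key, 0) + 1   # missing entries start from 0
    return sorted(h.items(), key=lambda pair: pair[1], reverse=True)

for key, value in letter_histogram(io.StringIO("Hello, World")):
    print(f"{key}: {value}")
```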

Source Code

Releases and snapshots can be pulled directly from the git repository.

To build the program, you need a C compiler, a yacc utility (I've never tried anything but GNU Bison and Berkeley Yacc) and GNU flex. (Flex extensions are used: in particular start conditions). A few POSIX features are required from the host platform, like the popen function, and <dirent.h>. These are available on Windows through the MinGW compiler and environment.

The configure script and Makefile are geared toward a gcc and glibc environment, and rely on some GNU make features. Building for Windows therefore requires a GNU environment: MinGW. There is an issue with GNU flex on MinGW, requiring the following argument to the configure script: libflex="-L/usr/lib -lfl".

If you have porting issues, contact the TXR mailing list!

Licensing

TXR is truly free software because it is distributed under a variation of the two-clause BSD license which allows pretty much every kind of free use.

Make a Donation

If you find TXR to be a valuable tool in your arsenal, here is one way to show your appreciation and support! Developing stuff like this takes countless hours.

Binary Downloads

Compiled builds of TXR 96 are available in the TXR 96 file area at SourceForge.

Compiled builds of TXR 90 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   ee273d385ce9cd186dbdd5a55ae6109f  txr-90-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   039cb93d6a5e7c68f5bc055a5a4ac55d  txr-90-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   872840a59b031a00615b4578a668e8f8  txr-90-OSX-10.7.3-i386
Ubuntu   11.04          i686   eec8ec9c411dec6fe9456eca84e566b3  txr-90-Ubuntu-11.04-i686
Debian   5.0.8          amd64  0552f1dfc7eb5fd231b44174d8660ae3  txr-90-Debian-5.0.8-amd64
Solaris  10             i686   4f0bc2af6d8e615ff2c8d58fe69b810d  txr-90-Solaris-10-i686
FreeBSD  9              i686   3fb60b4b902939bcc207d869eb19dd51  txr-90-FreeBSD-9-i686

Pre-compiled builds of TXR 89 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   b9b885d82710bafa7514030817b5c2c5  txr-89-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   89de6c8cfa12f10cfd4f318f89425353  txr-89-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   c2f787cfaed7cba6755fb218b130935f  txr-89-OSX-10.7.3-i386
Ubuntu   11.04          i686   a67140136948d37b77eaacd04dba943c  txr-89-Ubuntu-11.04-i686
Debian   5.0.8          amd64  000c49c762f44ee04bc23936094a4a08  txr-89-Debian-5.0.8-amd64
Solaris  10             i686   92e693a0b4d0aa5d8bc87e4a8a49764c  txr-89-Solaris-10-i686
FreeBSD  9              i686   ad2f830234a2214522bb29bcadfaf45f  txr-89-FreeBSD-9-i686

Pre-compiled builds of TXR 88 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   a79ef9f820369bbb5777f96973246c94  txr-88-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   6d484dc5a1eef02cb3bc702900aaf5f3  txr-88-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   eb93923efbb47efbe246b4213409d6a4  txr-88-OSX-10.7.3-i386
Ubuntu   11.04          i686   a0695161a20a30404d9ca531452dd4de  txr-88-Ubuntu-11.04-i686
Debian   5.0.8          amd64  307e94294c3174343e6ca9ac4fdfa462  txr-88-Debian-5.0.8-amd64

Pre-compiled builds of TXR 87 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   7437bbbd6a4317469df7b1bb1eb652b2  txr-87-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   7eefdf5d81315848e884b678cd766637  txr-87-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   77fa58e52cbb513164bf5d9feeb52bc4  txr-87-OSX-10.7.3-i386
Ubuntu   11.04          i686   ae52fc55c3a5a47c5fb5ab7eb2a1d360  txr-87-Ubuntu-11.04-i686
Debian   5.0.8          amd64  c2bc5b1607e50c2f22835c0ee42d4cc7  txr-87-Debian-5.0.8-amd64

Pre-compiled builds of TXR 86 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   9337d1c5925fe0399c2a4bba785a3862  txr-86-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   7ee9cc008f7161a3e42bdda9629f5b76  txr-86-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   c29297f449eb6b1a0167ffb7a0b49d99  txr-86-OSX-10.7.3-i386
Ubuntu   11.04          i686   4c96d17314a9140c5a8b439a445404f5  txr-86-Ubuntu-11.04-i686
Debian   5.0.8          amd64  4f20db7b7bd23207babcbfa8be9b9f99  txr-86-Debian-5.0.8-amd64

Pre-compiled builds of TXR 85 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   310d669e2014ec1440291cabea272fce  txr-85-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   e05ac8705fa8b8f91e9ec8b7c7366599  txr-85-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   5c19d416416b214d7c6591d82c4b2753  txr-85-OSX-10.7.3-i386
Ubuntu   11.04          i686   b128c6f99a0aeb6ab92002e24f5500bc  txr-85-Ubuntu-11.04-i686
Debian   5.0.8          amd64  fe9f6d04cd018a20aad60eda8a484a0c  txr-85-Debian-5.0.8-amd64

Pre-compiled builds of TXR 84 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   461e1dbb717afd4b6f198813db3932c6  txr-84-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   a7c04d2d8620c74cdee9a2440a7ab38a  txr-84-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   ba10cf7473fc953cd8f65c9cb43d085f  txr-84-OSX-10.7.3-i386
Ubuntu   11.04          i686   a2383401ceea586daca1abf77583663f  txr-84-Ubuntu-11.04-i686
Debian   5.0.8          amd64  8dd661bd00eb0d77d1bf320638ac4db6  txr-84-Debian-5.0.8-amd64

Pre-compiled builds of TXR 83 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   64cf19dbb6d6589fbbaad70368b8a9f5  txr-83-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   d837657da67f2fb0b54a370b18337c59  txr-83-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   70510e636358a8fc06c22e55bf50bb65  txr-83-OSX-10.7.3-i386
Ubuntu   11.04          i686   03187df919b5d690377e24f71e96b068  txr-83-Ubuntu-11.04-i686
Debian   5.0.8          amd64  bfd5e938d0e4c2a1a75f53103fa45bcd  txr-83-Debian-5.0.8-amd64

Pre-compiled builds of TXR 82 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   08403862ba11eda4e72b1c1ee117bcc0  txr-82-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   ef225bee5cfce090d562233fa1ab54b5  txr-82-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   80f582bcb044e7eacabc98a1adf0651f  txr-82-OSX-10.7.3-i386
Ubuntu   11.04          i686   fb5bc961c221d6a53ba4c40b3dde2b0e  txr-82-Ubuntu-11.04-i686
Debian   5.0.8          amd64  186c11db5d20244e392878096b2a2d80  txr-82-Debian-5.0.8-amd64

Pre-compiled builds of TXR 81 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   a7db6d5c90be638db767890de02e7a66  txr-81-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   683bf4879446a91237a034fe036bc859  txr-81-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   0e3258785002218867b27be73e0c672a  txr-81-OSX-10.7.3-i386
Ubuntu   11.04          i686   f96a2a83baf0a73de4567ebf7b89aa78  txr-81-Ubuntu-11.04-i686
Debian   5.0.8          amd64  cfaa55fee210bbafa8f7d2b3a61d24cf  txr-81-Debian-5.0.8-amd64

Pre-compiled builds of TXR 80 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Cygwin   1.7.25         i686   2511c384561f08ba60179d007fbdf441  txr-80-Cygwin-1.7.25-i686.exe
MinGW    1.0.17         i686   1e788cc43a76a4daf4ebf034185b756f  txr-80-MinGW-1.0.17-i686.exe
OSX      10.7.3 (Lion)  i386   f4ec5149199ee66a854f490bf4f26f15  txr-80-OSX-10.7.3-i386
Ubuntu   11.04          i686   e1af8713358e14d583c639f69bf03ecb  txr-80-Ubuntu-11.04-i686
Debian   5.0.8          amd64  8704f52d2e4ce4687ce3c87679ac0a6f  txr-80-Debian-5.0.8-amd64

Pre-compiled builds of TXR 79 are available for these platforms:
OS       OS Version     Arch   MD5 Checksum                      File
Ubuntu   11.04          i686   74aceb43040372992aee32b469757080  txr-79-Ubuntu-11.04-i686
Debian   5.0.8          amd64  6ea2e8223070185ad55f3e932ff998c5  txr-79-Debian-5.0.8-amd64