txr: a Pattern Matching
Utility for
Convenient Text Extraction

Kaz Kylheku <kkylheku@gmail.com>

What is it?

txr is an interpreter for a query language (the txr language), a more-or-less complete description of which is given in the manual . A txr query matches text and extracts pieces by binding them to variables that are embedded in the query. txr can output the raw bindings gathered from the data, or substitute them into a template-driven report.

Working with txr is different from text processing programming languages, as well as from grammar-driven tools.

To develop a txr query, the user might start with sample data. The raw data itself is already a txr query which matches itself—almost: care must be taken to escape some characters which have a special meaning to txr. What follows is an iterative process of identifying the parts that need to be turned into query elements such as variables in order to summarize the variations in the data so that the query generalizes to other instances.

How about an example?

Instead of Hello, world, how about something more advanced? One tool that I dislike in Unix and Linux is the ps utility for listing processes. I've been using Unix since 1989 and Linux since 1993, and I'm not dumb; yet, whenever I need ps to do something slightly out of the ordinary, I have to resort to the man page, and then I still can't get it to do what I want half the time.

With txr, we can easily make a quick and dirty ps utility (which relies on the /proc filesystem on Linux). Here is what the query looks like. This might be saved in a file called ps.txr:


@(next)$/proc
@(collect)
@{process /[0-9]+/}
@ (next)/proc/@process/status
Name:@\t@name
State:@\t@state (@state_desc)
@(skip)
Tgid:@\t@tgid
Pid:@\t@proc_id
PPid:@\t@parent_id
@(bind pid proc_id)
@(bind ppid parent_id)
@(skip)
Uid:@\t@uid@\t@/.*/
Gid:@\t@gid@\t@/.*/
@ (next)$/proc/@process/task
@ (collect)
@thr
@ (end)
@ (bind thread thr)
@ (some)
@ (next)/etc/passwd
@ (skip)
@user:@pw:@uid:@/.*/
@ (or)
@ (bind user uid)
@ (end)
@(end)
@(output)
USER PID PPID S NAME THREADS
@ (repeat)
@{user 8} @{proc_id -5} @{parent_id -5} @state @{name 16} @(rep)@thr, @(first)@(last)@thr@(single)~@(end)
@ (end)
@(end)

Now we can run the query like this:


shell$ txr ps.txr

We get output which looks like this:


USER PID PPID S NAME THREADS
root 1 0 S init ~
root 2 1 S ksoftirqd/0 ~
root 3 1 S events/0 ~
root 4 3 S khelper ~
root 5 3 S kacpid ~
root 16 3 S kblockd/0 ~
root 29 3 S aio/0 ~
root 17 1 S khubd ~
root 2954 2953 S bash ~
[ ... ]
root 16134 1887 S sshd ~
kaz 16136 16134 S sshd ~
kaz 16137 16136 S bash ~
kaz 3628 2175 S slrn ~
root 3721 1963 S crond ~
root 3722 3721 S run-parts ~
root 3723 3722 S 00-logwatch ~
root 3724 3722 S awk ~
root 3940 3723 S mail ~
root 4049 3723 S zz-disk_space ~
root 4051 4049 S df ~
root 4052 4049 S grep ~
kaz 4266 1 S ssh-agent ~
kaz 4331 16137 S vim ~
kaz 4426 31908 R txr ~

The txr query works by processing the numeric entries under the /proc directory, reading the /proc/<pid>/status file of each process, and the list of threads under /proc/<pid>/tasks. The user ID's are resolved by matching through the /etc/passwd file.

Now, watch what you can also do, given the above query. Suppose you only want to see the processes for user kaz:


shell$ txr -Duser=kaz ps.txr
USER PID PPID S NAME THREADS
kaz 2551 2184 S gnome-session ~
kaz 2625 1 S dbus-launch ~
kaz 2626 1 S dbus-daemon-1 2627
kaz 2631 1 S gconfd-2 ~
[ ... ]
kaz 2685 1 S metacity ~
kaz 2689 1 S gnome-panel ~
kaz 2691 1 S nautilus 2696, 2708, 2709, 2710, 2711, 2712, 2713, 2714, 2715
kaz 2693 1 S gnome-volume-ma ~
kaz 2695 1 S eggcups ~
kaz 2698 1 S gnome-vfs-daemo 2699
[ ... ]
kaz 4331 16137 S vim ~
kaz 4442 31908 R txr ~

What happened? By specifying -Duser=kaz, we have established a binding for the user variable. This existing binding constrains the query; where the variable occurs, it now has to match the text kaz. Of course, you can constrain other things. For instance, how about all bash processes:


shell$ txr -Dname=bash ps.txr
USER PID PPID S NAME THREADS
kaz 31908 31907 S bash ~
kaz 27685 27684 S bash ~
kaz 2175 2174 S bash ~
root 2954 2953 S bash ~
kaz 16137 16136 S bash ~

Combine the two: all bash processes belonging to user kaz:


shell$ txr -Dname=bash -Duser=kaz ps.txr
USER PID PPID S NAME THREADS
kaz 31908 31907 S bash ~
kaz 27685 27684 S bash ~
kaz 2175 2174 S bash ~
kaz 16137 16136 S bash ~

How about answering the question, which process has thread 2709?


shell$ txr -Dthread=2709 ps.txr
USER PID PPID S NAME THREADS
kaz 2691 1 S nautilus 2696, 2708, 2709, 2710, 2711, 2712, 2713, 2714, 2715

 What does all that stuff mean?

Let's break it down, piece by piece, starting with the fiirst line:


@(next)$/proc

This @(next) is a directive. It means, switch from pattern matching the current input source and switch to the one specified on the right. The $ means, this specification is a directory, which is to be read as if it were a text file containing file names. So, basically, here, the /proc directory listing is opened as a source of data for pattern matching. Since the very first directive in this query opens a data source, there do not have to be any data sources named on the txr command line.


Next we have:


@(collect)

This begins a collect block, terminated somewhere by @(end). The collect directive searches for all matches of a pattern, and collects all variable bindings which the pattern finds into lists. The pattern begins with:


@{process /[0-9]+/}

Remember, the current input source is the /proc directory itself. First of all, this pattern matches an entire line. Patterns are implicitly anchored to the beginning and end of a line. This is a regular-expression based variable binding. The input line must consist of nothing but digits, and must not be empty, since the regular expression /[0-9]+/ means match one or more digits.

What happens if the input line has non-digits? After all, /proc has entries other than process ID's. In this case, the pattern match fails. This matching failure causes the surrounding @(collect) directive to search for another match. If the match succeeds, then the variable process is bound to the process ID, and the matching continues:


@ (next)/proc/@process/status

Now, a new input source is opened. However, the original input source is not forgotten. Remember, we are inside a large collect, which is pulling process ID's from the /proc directory. The processing of this new input stream is nested within that operation.

Here, we are opening the status proc entry for the given process whose ID we just extracted. Inside this status entry, we try to match:


Name:@\t@name
State:@\t@state (@state_desc)
@(skip)
Tgid:@\t@tgid
Pid:@\t@proc_id
PPid:@\t@parent_id
@(bind pid proc_id)
@(bind ppid parent_id)
@(skip)
Uid:@\t@uid@\t@/.*/
Gid:@\t@gid@\t@/.*/

Some explanations are in order. Firstly @\t is a way of encoding a literal tab character in the query. Of course, naked tabs are allowed too. The @name, @state, et cetera are simple variable matches which extract text based matching surrounding context. The @(skip) directive triggers a search: non-matching lines are skipped until the remainder of the query matches. Text like Name: and Tgid: stands for itself: it must exactly match the data. The syntax @/.*/ is a regular expression match. Here it is being used to match some trailing portion of a line. This is necessary because a txr query line must always match an entire input line. Matching a line only partially is a mismatch. The @(bind ...) directives are a noteworthy. They don't match anything in the input, but they are nevertheless a kind of match. For example, @(bind ppid parent_id) means to bind the ppid variable to the value of the parent_id variable. If ppid has no binding already, then it simply becomes a clone of parent_id. However, a curious thing happens if ppid already has a binding (like from the txr command line via the -D option). If there is already a binding for the variable on the left, then the @(bind) will behave like a failed pattern match, unless the two variables match exactly. So thanks to this particular @(bind) we can restrict the query to report only processes which have a particular parent process id.

What our query does next is to open the task subdirectory, to enumerate the list of threads. This is straightforward:


@ (next)$/proc/@process/task
@ (collect)
@thr
@ (end)

Note that here we have a collect nested within a collect. So what will come out at the end of the overall query is a nested list: a list of lists of thread ID's. Each process has its own list of threads, and since there is a list of processes collected, there is a parallel list of lists of threads to go with those processes. Note the space between the @ and (. This space is ignored, and allows some formatting which resembles indentation, which improves the clarity of queries. Spaces outside of a directive, of course, are literal text which must match something.


@ (bind thread thr)

This bind allows the thread variable to be used as an external constraint to only report a process which has a particular thread. The next section introduces a new directive, @(some):


@ (some)
@ (next)/etc/passwd
@ (skip)
@user:@pw:@uid:@/.*/
@ (or)
@ (bind user uid)
@ (end)

The query material within @(some) can be divided into subclauses by one or more @(or) directives. Each of these is independently matched against the input, starting at the same position. The rule is imposed that at least one of the clauses must produce a match. If none of the clauses produces a match, then the @(some) block fails. Here is a twist, though: the clauses are actually processed in order, and later clauses can see the bindings caught by earlier clauses.

So, what is going on above is that the fist subclause opens up the password file, and then searches it (thanks to @(skip)) for the pattern: @user:@pw:@uid:@/.*/ . Note that the @uid variable is one which was already bound; the query extracted it from the /proc/<pid>/status file. It holds the numeric user ID of the process. The other variables are yet unbound. So, this query will match the first line of the password file which has the password entry for the given numeric user ID, and then extract from that line the user name, and encrypted password (which we are not really interested in, but why not: extracting is so easy with txr!) But what if the user ID is not found in the password file? That's what the second clause is for. Here, we see another use for the bind directive: it provides a fallback strategy. If the @user variable is not successfully bound in the first clause, then @(bind user uid) will simply bind it to the numeric ID. What if @user is successfully pulled from the password file? In that case, the bind will fail, since the two bound variables do not match. But the other clause was successful, so the overall @(some) block is a successful match. And that brings us to the end of the pattern matching section of the query:


@(end)
@(output)

The @(end) terminates the @(collect), and @(output) introduces an output clause. Output clauses may be mixed into the query; they don't have to be at the end. The clause begins with some literal text, which is output as is:


USER PID PPID S NAME THREADS

Next, a repeat directive will iterate over all of the list variables embedded in it and repeat the text for successive values pulled from all of the lists in parallel:


@ (repeat)
@{user 8} @{proc_id -5} @{parent_id -5} @state @{name 16} @(rep)@thr, @(first)@(last)@thr@(single)~@(end)
@ (end)

Here, special syntax is used for formatting fixed-with fields, either left or right adjusted. (The same syntax can be used in pattern matching also to extract text from fixed-width fields.)

The thread ID's are printed by an embedded @(rep), which is a within-the-line flavor of @(repeat). It contains some subclauses: @(first), @(last) and @(single). The thread ID's is normally printed followed by a comma and space: @(rep)@thr, . But the @(first) clause is empty, which means that the first thread ID is not printed. This is done because the first thread ID is the same as the process ID, and thus not interesting. In other words, the thread list always has one or more items, even if the process is not multi-threaded. The @(last) directive provides an alternate way to generated the last item of the thread list, which is output without a trailing comma. Finally, the @(single) directive provides text to output if the list has only one item. A tilde is output. Thus in the process listing, the THREADS column will contain a tilde for all processes which have no threads (other than their primary thread).

Finally, the output directive is terminated. This is the last line, bringing the query to a close.


@(end)

Where do I get it, et cetera

Releases and snapshots can be pulled directly from the git repository here. A HTML-ized version of the manual page, which may be out of date, is here. And, let's not forget the Savannah project page.

To build the program, you need a C compiler, a yacc utility (I've never tried anything but GNU Bison) and GNU flex. (Flex extensions are used, like start conditions). A few POSIX features are required from the host platform, like the popen function, and <dirent.h>. txr is a utility for POSIX systems, and thus txr queries may open pipes to and from external commands, and use shell syntax.

The configure script and Makefile are geared toward a gcc and glibc environment, and rely on some GNU make features.

The program itself does some inherently non-portable things in the area of garbage-collected memory management. I've run in on a MIPS Linux system (n32 ABI) and x86 based Linux systems, both 32 and 64 bit. Not sure how the stack-scanning garbage collector will interact with the register windows on a SPARC system, but the setjmp invocation in the gc function should, in theory, trigger a register window flush. (I looked at the SPARC setjmp code in glibc 2.5 to verify that it does so on glibc-based SPARC systems! On some other systems, it may be necessary to do both a setjmp and a longjmp.)

Building txr

txr has a tidy, hand-written configure scrip, which has numerous options that can be listed with ./configure --help. For instance, it can be configured to build with a C++ compiler instead of  C++ compiler, or with special support for running under Valgrind.

The script does not follow the --build --host --target triplet convention for cross-compiling; rather, directory and name prefixes for the toolchain commands can be specified.

Example build and install session:


shell:~$ tar xzf txr-033.tar.gz
shell:~$ cd txr-033
shell:~/txr-033$ ./configure --prefix=/home/kaz
+---------------------+
| Configuring txr 033 |
+---------------------+
Checking for GNU Make ... yes (3.81 found)
Checking installation paths:
$(install_prefix)$(prefix)=/home/kaz ... okay
$(install_prefix)$(bindir)=$(prefix)/bin ... okay
(make variable derivation)
$(install_prefix)$(datadir)=$(prefix)/share/txr ... okay
(make variable derivation)
$(install_prefix)$(mandir)=$(prefix)/share/man ... okay
(make variable derivation)
Checking source directory /home/kaz/txr ... okay
warning: its recommended to build in a separate directory
generating config.make ...
Checking whether your C compiler can make a simple executable ... okay
Checking what C type we have for integers wider than "long" ... "long long"
Checking what C integer type can hold a pointer ... "int"
Checking how to declare inline functions ... "static __inline__"
Checking for yacc program ... "yacc"
regenerating config.make ...

The configuration seems to have been successful. That doesn't mean it's
correct. Please check the above output for any problems, and verify that the
contents of the generated files config.make and config.h are sane for the
target platform.

The next step is one of these two.

Method 1: if you have permissions to the installation directories:
make install

Method 2: if you have to or want to be another user such as root to install:
make
su <userid>
(give password, if asked)
make install

shell:~/txr-033$ make
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o txr.o txr.c
flex parser.l
if yacc -v -d parser.y ; then true ; else rm y.tab.h ; false ; fi
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o lex.yy.o lex.yy.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -Dlint -c -o y.tab.o y.tab.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o match.o match.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o lib.o lib.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o regex.o regex.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o gc.o gc.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o unwind.o unwind.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o stream.o stream.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o hash.o hash.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o utf8.o utf8.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -o txr txr.o lex.yy.o y.tab.o match.o lib.o regex.o gc.o unwind.o stream.o hash.o utf8.o -lfl

shell:~/txr-033$ make install
mkdir -p /home/kaz/bin
cp -f txr /home/kaz/bin
chmod 0755 /home/kaz/bin/txr
touch -r txr /home/kaz/bin/txr
mkdir -p /home/kaz/share/man/man1
cp -f /home/kaz/txr/txr.1 /home/kaz/share/man/man1
chmod 0444 /home/kaz/share/man/man1/txr.1
touch -r /home/kaz/txr/txr.1 /home/kaz/share/man/man1/txr.1