txr:
a Pattern Matching<kkylheku@gmail.com>txr
is an interpreter for a query language (the
), a
more-or-less complete description of which is given in the manual .
A txr
languagetxr query matches text and extracts pieces by binding
them
to variables that are embedded in the query. txr can
output the
raw bindings gathered from the data, or substitute them into
a template-driven report.
Working with txr is different from text processing
programming
languages, as well as from grammar-driven tools.
To develop a txr query, the user might start with
sample data. The
raw data itself is already a txr query which matches
itself—almost:
care must be taken to escape some characters which have a special
meaning to txr. What follows is an iterative process of
identifying the
parts that need to be turned into query elements such as variables in
order to summarize the variations in the data so that the query
generalizes to other instances.
Instead of Hello, world
, how about something more advanced?
One
tool that I dislike in Unix and Linux is the ps utility
for
listing processes. I've been using Unix since 1989 and Linux since
1993,
and I'm not dumb; yet, whenever I need ps to do something
slightly out of the ordinary, I have to resort to the man page, and
then
I still can't get it to do what I want half the time.
With txr, we can easily make a quick and dirty ps
utility (which
relies
on the /proc filesystem on Linux). Here is what the query
looks like. This might be saved in a file called ps.txr:
@(next)$/proc
@(collect)
@{process /[0-9]+/}
@ (next)/proc/@process/status
Name:@\t@name
State:@\t@state (@state_desc)
@(skip)
Tgid:@\t@tgid
Pid:@\t@proc_id
PPid:@\t@parent_id
@(bind pid proc_id)
@(bind ppid parent_id)
@(skip)
Uid:@\t@uid@\t@/.*/
Gid:@\t@gid@\t@/.*/
@ (next)$/proc/@process/task
@ (collect)
@thr
@ (end)
@ (bind thread thr)
@ (some)
@ (next)/etc/passwd
@ (skip)
@user:@pw:@uid:@/.*/
@ (or)
@ (bind user uid)
@ (end)
@(end)
@(output)
USER PID PPID S NAME THREADS
@ (repeat)
@{user 8} @{proc_id -5} @{parent_id -5} @state @{name 16} @(rep)@thr, @(first)@(last)@thr@(single)~@(end)
@ (end)
@(end)
Now we can run the query like this:
shell$ txr ps.txr
We get output which looks like this:
USER PID PPID S NAME THREADS
root 1 0 S init ~
root 2 1 S ksoftirqd/0 ~
root 3 1 S events/0 ~
root 4 3 S khelper ~
root 5 3 S kacpid ~
root 16 3 S kblockd/0 ~
root 29 3 S aio/0 ~
root 17 1 S khubd ~
root 2954 2953 S bash ~
[ ... ]
root 16134 1887 S sshd ~
kaz 16136 16134 S sshd ~
kaz 16137 16136 S bash ~
kaz 3628 2175 S slrn ~
root 3721 1963 S crond ~
root 3722 3721 S run-parts ~
root 3723 3722 S 00-logwatch ~
root 3724 3722 S awk ~
root 3940 3723 S mail ~
root 4049 3723 S zz-disk_space ~
root 4051 4049 S df ~
root 4052 4049 S grep ~
kaz 4266 1 S ssh-agent ~
kaz 4331 16137 S vim ~
kaz 4426 31908 R txr ~
The txr query works by processing the numeric entries
under the /proc
directory, reading the /proc/<pid>/status file of each process,
and the list of threads under /proc/<pid>/tasks. The user ID's
are resolved by matching through the /etc/passwd file.
Now, watch what you can also do, given the above query. Suppose you
only want to see the processes for user kaz:
shell$ txr -Duser=kaz ps.txr
USER PID PPID S NAME THREADS
kaz 2551 2184 S gnome-session ~
kaz 2625 1 S dbus-launch ~
kaz 2626 1 S dbus-daemon-1 2627
kaz 2631 1 S gconfd-2 ~
[ ... ]
kaz 2685 1 S metacity ~
kaz 2689 1 S gnome-panel ~
kaz 2691 1 S nautilus 2696, 2708, 2709, 2710, 2711, 2712, 2713, 2714, 2715
kaz 2693 1 S gnome-volume-ma ~
kaz 2695 1 S eggcups ~
kaz 2698 1 S gnome-vfs-daemo 2699
[ ... ]
kaz 4331 16137 S vim ~
kaz 4442 31908 R txr ~
What happened? By specifying -Duser=kaz, we have
established a binding for the user variable. This
existing binding constrains the query; where the variable occurs, it
now has to match the text kaz. Of course, you can
constrain other things. For instance, how about all bash
processes:
shell$ txr -Dname=bash ps.txr
USER PID PPID S NAME THREADS
kaz 31908 31907 S bash ~
kaz 27685 27684 S bash ~
kaz 2175 2174 S bash ~
root 2954 2953 S bash ~
kaz 16137 16136 S bash ~
Combine the two: all bash processes belonging to user
kaz:
shell$ txr -Dname=bash -Duser=kaz ps.txr
USER PID PPID S NAME THREADS
kaz 31908 31907 S bash ~
kaz 27685 27684 S bash ~
kaz 2175 2174 S bash ~
kaz 16137 16136 S bash ~
How about answering the question, which process has thread 2709?
shell$ txr -Dthread=2709 ps.txr
USER PID PPID S NAME THREADS
kaz 2691 1 S nautilus 2696, 2708, 2709, 2710, 2711, 2712, 2713, 2714, 2715
Let's break it down, piece by piece, starting with the fiirst line:
@(next)$/proc
This @(next) is a directive. It means, switch from
pattern matching the current input source and switch to the one
specified on the right. The $ means, this specification
is a directory, which is to be read as if it were a text file
containing file names. So, basically, here, the /proc
directory listing is opened as a source of data for pattern matching.
Since the very first directive in this query opens a data source, there
do not
have to be any data sources named on the txr command
line.
Next we have:
@(collect)
This begins a collect block, terminated somewhere by @(end).
The collect directive searches for all matches of a pattern, and
collects
all variable bindings which the pattern finds into lists. The pattern
begins with:
@{process /[0-9]+/}
Remember, the current input source is the /proc directory
itself. First of all, this pattern matches an entire line. Patterns
are implicitly anchored to the beginning and end of a line.
This is a regular-expression based variable binding. The input
line must consist of nothing but digits, and must not be empty,
since the regular expression /[0-9]+/ means match one or
more digits.
What happens if the input line has non-digits? After all, /proc
has entries other than process ID's. In this case, the pattern match
fails. This matching failure causes the surrounding @(collect)
directive to search for another match. If the match succeeds, then
the variable process is bound to the process ID, and
the matching continues:
@ (next)/proc/@process/status
Now, a new input source is opened. However, the original input
source
is not forgotten. Remember, we are inside a large collect, which is
pulling process ID's from the /proc directory. The
processing of this new input stream is nested within that operation.
@(next) syntax shown in this document is now
obsolescent. The recommended way is @(next
`/proc/@process/status`, which uses a quasiliteral. This
directive throws an exception if it is not able to open the file, a
behavior which can be overridden
with the :nothrow keyword so that the situation behaves
like a failed match.Here, we are opening the status proc entry for the given process whose ID we just extracted. Inside this status entry, we try to match:
Name:@\t@name
State:@\t@state (@state_desc)
@(skip)
Tgid:@\t@tgid
Pid:@\t@proc_id
PPid:@\t@parent_id
@(bind pid proc_id)
@(bind ppid parent_id)
@(skip)
Uid:@\t@uid@\t@/.*/
Gid:@\t@gid@\t@/.*/
Some explanations are in order. Firstly @\t is a way
of encoding a literal tab character in the query. Of course, naked
tabs are allowed too. The @name, @state,
et cetera are simple variable matches which extract text based
matching surrounding context. The @(skip) directive
triggers a search: non-matching lines are skipped until the remainder
of the query matches. Text like Name: and
Tgid: stands for itself: it must exactly match the data.
The syntax @/.*/ is a regular expression match. Here it
is
being used to match some trailing portion of a line. This is necessary
because a txr query line must always match an entire
input line.
Matching a line only partially is a mismatch. The @(bind
...) directives are a noteworthy. They don't
match anything in the input, but they are nevertheless a kind of match.
For example, @(bind ppid parent_id) means to bind the
ppid variable to the value of the parent_id
variable. If ppid has no binding already, then it
simply becomes a clone of parent_id. However,
a curious thing happens if ppid already has a binding
(like from the txr command line via the -D
option). If there is already a binding for the variable on the left,
then the @(bind) will behave like a failed pattern match,
unless the two variables match exactly. So thanks to this particular
@(bind) we can restrict the query to report only processes
which have a particular parent process id.
parent_id? The
answer is that we need that variable to contain a list, because the @(output)
clause at the end of the query
expects it. If we impose a binding for parent_id from
outside of the @(collect), the variable's value will be
an atom rather than a list.What our query does next is to open the task subdirectory, to enumerate the list of threads. This is straightforward:
@ (next)$/proc/@process/task
@ (collect)
@thr
@ (end)
Note that here we have a collect nested within a collect. So what will
come out
at the end of the overall query is a nested list: a list of lists of
thread
ID's. Each process has its own list of threads, and since there is a
list of
processes collected, there is a parallel list of lists of threads to go
with
those processes. Note the space between the @ and (.
This space is ignored, and allows some formatting which resembles
indentation,
which improves the clarity of queries. Spaces outside of a directive,
of
course, are literal text which must match something.
@ (bind thread thr)
This bind allows the thread variable to be used as an
external constraint to only report a process which has a particular
thread. The next section introduces a new directive,
@(some):
@ (some)
@ (next)/etc/passwd
@ (skip)
@user:@pw:@uid:@/.*/
@ (or)
@ (bind user uid)
@ (end)
The query material within @(some) can be divided into
subclauses by one or more @(or) directives. Each of these
is independently matched against the input, starting at the same
position. The rule is imposed that at least one of the clauses must
produce a match. If none of the clauses produces a match, then the
@(some) block fails. Here is a twist, though: the clauses
are actually processed in order, and later clauses can see the bindings
caught by earlier clauses.
So, what is going on above is that the fist subclause opens up the
password file, and then searches it (thanks to @(skip))
for the pattern: @user:@pw:@uid:@/.*/ .
Note that the @uid variable is one which was already
bound; the query extracted it from the /proc/<pid>/status
file. It holds the numeric user ID of the process. The other variables
are yet unbound. So, this query will
match the first line of the password file which has the password
entry for the given numeric user ID, and then extract from that line
the user name, and encrypted password (which we are not really
interested
in, but why not: extracting is so easy with txr!)
But what if the user ID is not found in the password file? That's what
the second clause is for. Here, we see another use
for the bind directive: it provides a fallback strategy. If the
@user variable is not successfully bound in the first
clause, then @(bind user uid) will simply bind it
to the numeric ID. What if @user is successfully
pulled from the password file? In that case, the bind will fail,
since the two bound variables do not match. But the other clause
was successful, so the overall @(some) block
is a successful match. And that brings us to the end of the pattern
matching section of the query:
@(end)
@(output)
The @(end) terminates the @(collect),
and @(output) introduces an output clause.
Output clauses may be mixed into the query; they don't have
to be at the end. The clause begins with some literal text,
which is output as is:
USER PID PPID S NAME THREADS
Next, a repeat directive will iterate over all of the list variables embedded in it and repeat the text for successive values pulled from all of the lists in parallel:
@ (repeat)
@{user 8} @{proc_id -5} @{parent_id -5} @state @{name 16} @(rep)@thr, @(first)@(last)@thr@(single)~@(end)
@ (end)
Here, special syntax is used for formatting fixed-with fields, either left or right adjusted. (The same syntax can be used in pattern matching also to extract text from fixed-width fields.)
The thread ID's are printed by an embedded @(rep), which
is a within-the-line flavor of @(repeat). It contains
some
subclauses: @(first), @(last)
and @(single). The thread ID's is normally printed
followed by a comma and space: @(rep)@thr, .
But the @(first) clause is empty, which means that the
first thread ID is not printed. This is done because the first
thread ID is the same as the process ID, and thus not interesting.
In other words, the thread list always has one or more items,
even if the process is not multi-threaded. The @(last)
directive provides an alternate way to generated the last item
of the thread list, which is output without a trailing comma.
Finally, the @(single) directive provides text to
output if the list has only one item. A tilde is output.
Thus in the process listing, the THREADS column will contain a tilde
for all processes which have no threads (other than their
primary thread).
Finally, the output directive is terminated. This is the last line, bringing the query to a close.
@(end)
Releases and snapshots can be pulled directly from the git repository here. A HTML-ized version of the manual page, which may be out of date, is here. And, let's not forget the Savannah project page.
To build the program, you need a C compiler, a yacc utility
(I've never tried anything but GNU Bison) and GNU flex.
(Flex extensions are used, like start conditions).
A few POSIX features are required from the host platform, like the
popen function, and <dirent.h>. txr
is
a utility for POSIX systems, and thus txr queries may
open pipes to and
from
external
commands, and use shell syntax.
The configure script and Makefile are geared toward a gcc and glibc environment, and rely on some GNU make features.
The program itself does some inherently non-portable things in
the
area of garbage-collected memory management. I've run in on a MIPS
Linux system (n32 ABI) and x86 based Linux systems, both 32 and 64 bit.
Not sure how the
stack-scanning garbage collector will interact with the register
windows
on a SPARC system, but the setjmp invocation in the gc
function should, in theory, trigger a register window flush. (I looked
at the SPARC setjmp code in glibc 2.5 to verify that it does so on
glibc-based SPARC systems! On some other systems, it may be necessary
to do both a setjmp and a longjmp.)
txrtxr has a tidy,
hand-written configure scrip, which
has numerous options that can be listed with ./configure --help.
For instance, it can be configured to build with a C++ compiler instead
of C++ compiler, or with special support for running under
Valgrind.
The script does not follow the --build --host --target
triplet convention for cross-compiling; rather, directory and name
prefixes for the toolchain commands can be specified.
Example build and install session:
shell:~$ tar xzf txr-033.tar.gz
shell:~$ cd txr-033
shell:~/txr-033$ ./configure --prefix=/home/kaz
+---------------------+
| Configuring txr 033 |
+---------------------+
Checking for GNU Make ... yes (3.81 found)
Checking installation paths:
$(install_prefix)$(prefix)=/home/kaz ... okay
$(install_prefix)$(bindir)=$(prefix)/bin ... okay
(make variable derivation)
$(install_prefix)$(datadir)=$(prefix)/share/txr ... okay
(make variable derivation)
$(install_prefix)$(mandir)=$(prefix)/share/man ... okay
(make variable derivation)
Checking source directory /home/kaz/txr ... okay
warning: its recommended to build in a separate directory
generating config.make ...
Checking whether your C compiler can make a simple executable ... okay
Checking what C type we have for integers wider than "long" ... "long long"
Checking what C integer type can hold a pointer ... "int"
Checking how to declare inline functions ... "static __inline__"
Checking for yacc program ... "yacc"
regenerating config.make ...
The configuration seems to have been successful. That doesn't mean it's
correct. Please check the above output for any problems, and verify that the
contents of the generated files config.make and config.h are sane for the
target platform.
The next step is one of these two.
Method 1: if you have permissions to the installation directories:
make install
Method 2: if you have to or want to be another user such as root to install:
make
su <userid>
(give password, if asked)
make install
shell:~/txr-033$ make
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o txr.o txr.c
flex parser.l
if yacc -v -d parser.y ; then true ; else rm y.tab.h ; false ; fi
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o lex.yy.o lex.yy.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -Dlint -c -o y.tab.o y.tab.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o match.o match.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o lib.o lib.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o regex.o regex.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o gc.o gc.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o unwind.o unwind.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o stream.o stream.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o hash.o hash.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -c -o utf8.o utf8.c
gcc -I. -I/home/kaz/txr -ansi -D_POSIX_C_SOURCE=2 -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -g -o txr txr.o lex.yy.o y.tab.o match.o lib.o regex.o gc.o unwind.o stream.o hash.o utf8.o -lfl
shell:~/txr-033$ make install
mkdir -p /home/kaz/bin
cp -f txr /home/kaz/bin
chmod 0755 /home/kaz/bin/txr
touch -r txr /home/kaz/bin/txr
mkdir -p /home/kaz/share/man/man1
cp -f /home/kaz/txr/txr.1 /home/kaz/share/man/man1
chmod 0444 /home/kaz/share/man/man1/txr.1
touch -r /home/kaz/txr/txr.1 /home/kaz/share/man/man1/txr.1