NAME

sla – an interpreter for push-down transducers

SYNOPSIS

sla autoseq <infile >outfile

DESCRIPTION

sla is a simple tool to apply sequences of push-down transducers (PDTs) to text files. It works as a filter only. sla should normally come along with Tensile interpreter.

The PDT stuff described below is also applicable to Tensile (see sl(1)).

WHAT ARE THOSE PDTS?

The following is just a brief introduction. If you're completely unfamiliar with the automata theory, you'd better learn something more from other sources.

Push-down transducers are a class of formal automata, abstract mathematical structures which may change their internal state and produce some output depending on the input. One of the most generic (and most well-known) examples of automata is the Turing's machine.

Strictly speaking, the term automata is properly applicable only to those structures which has no output and can only decide whether a given input sequence belong to a certain class or not (the class which is usually called a language is said to be recognized by the automaton. Those devices which can produce some output are properly called transducers but the latter term is not very wide spread, so further we will use the both as mutual synomyms.

The most primitive automata class is finite-state automata. Their state and output are determined by a single character of the input and the current state, the sets of input characters and inner states both being finite.

Push-down automata have in addition an internal stack of unlimited length. The topmost character in the stack determines the behaviour of an automaton alongside with its previous state and the input character. After changing the state and emitting the output, the topmost stack character is replaced by an arbitrary (poss. empty) string of characters. The set of tasks that can be modelled with push-down automata/transducers is much larger than that of finite-state automata so the latter should be clearly distinguished from the former. However, in practice, push-down automata are often called finite-state.

Automata are an ideal device to deal with complex string transformations. Well-known regular expressions are nothing but another notation for finite-state (outputless) automata. However, they may be quite hard to learn, so they are not very popular nowadays (I mean, as a tool for text processing). I personally know the only project which uses this conception – namely, Omega project by John Plaice anf Yannis Haralambous.

EXPRESSIONS

Automata expressions are much like C expressions. However, in the current version they have less precedence levels that the latter. Expressions may contain the following operators (in descending precedence)

There are yet three special operators (`@', `:', and application) which will be described later. Parentheses may be used in a common way. Primitive expressions are integral constant strtol(3) syntax), character constants (C syntax), register references, and macro names

RANGES

(aka tables) Ranges may have two syntactic forms: constant ranges and range references.

Range references have the form [=range-name], where range-name is defined by a preceding .range directive

Constant ranges are a comma-separated list of segments, enclosed in brackets. Each segment has the following structure: start-end/dither*multi where every part except for start may be omitted: if end is omitted, it is implied equal to start; if dither or multi is omitted, 1 is implied This means ``take multi times from start to end each dither element''

start, end, dither, multi should be integral or character constants; there should be no spaces around -, /, *

To check whether a value is in a range, @ operator is used. It has the same precedence as unary operator; its left operand is a range, and the right is some expression. It returns a zero-base index of the first occurrence of a value within the range; or -1 if the value is outside the range.

To select n-th item from a range, application operator is used; it has no symbolic representation; just write a range followed by an expression. The application has the same priority as @.

REGISTERS

The memory of an automaton consists of a stack, a current state, an input characters and 8 temporary registers. They are accessed in the following way:

The registers may be assigned to with a : operator; its left arg is an expression and its right args is a register number. It has the same precedence as @

NOTE:0 and 1 regs may be assigned, but this would not really change the state or the stack; they are just shadows of the actual values.
NOTE:temporary registers (2..9) are persisent in the sense that they do keep their value across transitions

TRANSITIONS

Each automaton is essentially a set of transitions. A transition definition has the following syntax:
transition ::state-expr `;' eat filter-expr1 `;' filter-expr2 `;' newstate-expr `;' `{' expr-list1 `}' `{' expr-list2 `}'
eat ::null

`^'

expr-list ::null

expr-or-string expr-list

expr-or-string ::expr `;'

quoted-string

A state-expr is evaluated at parse time yielding the number of a state. It must non-negative.

If eat is `^', an input character of the automaton will not change after transition. If filter-exprs are always true, this is equivalent to empty-string transition.

Filter-exprs are evaluated (at run time) in a short-circuit manner as if connected by `&&' operator. If they're both true, a transition for a given state is executed.

NOTE:Transitions are searched in the order of their definitions. After a matching transition is found, search is complete.
By convention, filter-expr1 shall check conditions related to input character, while filter-expr2 – those related to stack and temporary registers. This is not obligatory, however.

When a transition is executed, expressions from expr-list1 are evaluated and their values are passed to output. If a list item is a string, it is considered equivalent to a list of character constants.

Then a topmost stack item is eliminated, and expr-list2 is evaluated.

NOTE:Here string items are, too, equivalent to sequence of characters. As a consequence, a string will be pushed into the stack reversed.

After all, the state is changed to the value of newstate-expr.

NOTE:Topmost stack element and the stack depth are cached in `$1' and '$#', so when executing expr-list2, `$1' and `$#' will hold value the topmost stack element and stack depth had at the beginning of a transition, not the actual values.

DIRECTIVES

.define name expression;

defines a macro name which is expanded to expression
NOTE:no evaluation occur at the point of definition
NOTE:macros are expanded as if their body is surrounded by parentheses.

.include "filename"

includes a file filename
NOTE:filename may be in fact surround by arbitrary characters, but they need be the same – .include <... is not allowed.

.start expression;

declares the state whose number is the value of expression as a default start state

.final expression;

declares the state whose number is the value od expression as final.

.if (.ifndef) name ... .else ... .endif

Conditionally processes a piece of file depending on whether a macro name is defined

.ditto expression;

copies the most recently defined transition to a state with the number which is equal to expression
BUG:implementation is broken

.range name range-constant[;]

defines a range name whose value is range-constant

.end

stops processing the current file. Mostly useful for automata definitions embedded into Tensile programs.

AUTOMATA SEQUENCES

Syntax

Automata sequences has the following syntax:
autoseq ::null

branch

branch `|' autoseq

branch ::item

item `&' branch

item ::ismaster automaton param-spec outfunc
ismaster ::null

`*'

automaton ::auto-name state-spec

`{' extension-name extension-param '}'

param-spec ::null

`[' params `]'

params ::= param

param `,' params

param ::= reg-spec `=' number-or-name
reg-spec ::= digit

name

number-or-name ::number

char-constant

name

auto-name ::name

apos-string

state-spec ::null

`:' name

extension-name ::any characters except ] and
extension-param ::null

any characters except ]

outfunc ::null

`$$'

`$1'

`$0'

`$#'

Arbitrary spaces are allowed between tokens

Evaluation

A sequence is evaluated right to left, the input of an automata branch being the output of the next one (in evaluation order).

A branch is evaluated left to right, the output of the branch being the first non-empty output. However, all the branches are always executed.

An output value of an automaton can be modified by an output function which now can substitute a current state, a topmost stack element or a stack depth for a real output value.

When a sequence is initialized or reset, all its automata states are set to the value of state-spec or to a value defined by .start if the former is missing. Temporary registers and stack may be initialized by additional params. A param designator may be either a digit or a macro name which expands to a register reference. A param value may be either a number or a character constant, or a macro name which is get evaluated at parse time.

By default, when polling a state of a sequence, the last branch of the last sequence item (i. e. the rightmost one) is used. This can be changed by prefixing a given item with `*'.

SEE ALSO

sl(1), strtol(3)

AUTHOR

Artem V. Andreev

COPYRIGHT

Copyright © 2001, 2002 Artem V. Andreev. See
sl(1) for details