TEI ED W41
Notes on Direct Interpretation of Syntax Trees
C. M. SperbergMcQueen
(University of Illinois at Chicago)
Anne BrueggemannKlein (ABK) has shown how to create finitestate
automata from regular expressions, by deriving a deterministic
finitestate automaton (the Gluschkov automaton) from the syntax tree
for the regular expression. This is extremely useful work. Its major
drawback for SGML parsing is that SGML content models which include the
'&' operator require translation into a canonical regular expression
which does not use the '&' operator; since an '&'expression with n
elements can be satisfied in any of n! different ways, this translation
leads to combinatorial explosion in the number of elements in the
canonical regular expression (and thus in the syntax tree and the FSA
derived from it).
We hope to find a way to work directly from the syntax tree of the
regular expression or SGML content model, in such a way as to simplify
the handling of '&'. This will require an automaton slightly more
complex than a standard FSA, which we will refer to as the `machine'.
Consider the element declaration
and a syntax tree derived from it, with ten nodes labeled 1 to 10.
1 ,

+++
  
2 + 3 * 4 ?
  
5 b 6 , 7 c

+++
  
8 d 9 e 10 f
ABK derives from this an automaton of six states (labeled Qi, b, c, d,
e, f), with an alphabet ¤b, c, d, e, f‡, an initial state Qi, a set of
final states last = ¤b, f, c‡, and a transition function delta:
1 delta(Qi,b) = b
2 delta(b,b) = b
delta(b,d) = d
delta(b,c) = c
3 delta(d,e) = e
4 delta(e,f) = f
5 delta(f,c) = c
6 delta(c,c) = c
If we interpret the syntax tree directly, we have a machine comprising
ten nodes (one for each node of the syntax tree), each bearing some
state variables.
ABK generates the FSA by calculating, for each node of the syntax tree,
the following variables:
nullable : Boolean; /* can node be satisfied by null string? */
first : set of node; /* which nodes can be first? */
last : set of node; /* which nodes can be last? */
follow(n) : set of node; /* which nodes can be follow n within this
expression? */
ABK's code for generating these variables can be paraphrased thus:
select
when label.n = '[emptyset]' then do
nullable.n = false;
first.n = ''
last.n = ''
end
when label.n = '[nullstring]' then do
nullable.n = true;
first.n = ''
last.n = ''
end
when label.n = '[literal]' then do
nullable.n = false;
first.n = value.n
last.n = value.n
val = value.n
follow.n.val = ''
end
when label.n = '' then do
left = child.n.1; right = child.n.2
nullable.n = nullable.(left) or nullable(right)
first.n = first.left first.right /* guaranteed disjoint */
last.n = last.left last.right /* guaranteed disjoint */
end
when label.n = ',' then do
left = child.n.1; right = child.n.2
nullable.n = nullable.(left) and nullable(right)
ln = last.left
do while ln <> ''
parse var ln node ln
val = value.ln
follow.n.val = follow.left first.right /* disjoint */
end
if nullable.left
then first.n = first.left first.right
else first.n = first.left
if nullable.right
then last.n = last.left last.right
else last.n = last.right
end
when label.n = '*' then do
nCh = child.n.1
nullable.n = true
ln = last.nCh
do while ln <> ''
parse var ln nTemp ln
val = value.ln
follow.n.val = follow.child.val first.child
end /* do ln */
first.n = first.nCh
last.n = last.nCh
end
otherwise call error 'Unknown node label: ' label.n '!'
end /* select */