ML-W-38: Simple Transduction Example SGML to LaTeX

Lou Burnard

25 April 1991


1. Program Overview

This program has the very simple and pragmatic objective of taking documents drafted in something like SGML and formatting them with LaTeX. Early versions of the program were written to work with the TEIDOC0 DTD developed for P1 by myself and Michael; it has now been generalised somewhat to accept any tagset, but no very principled approach has been taken to supporting every possible feature of SGML. In particular, the program does no validation but assumes the document has already been checked; only a limited number of attributes are acted on; and there are a couple of dirty tricks used -- for example, comments are assumed to be closed on the line in which they are opened (they are simply echoed to the terminal rather than being translated, as I suppose they should be, to LaTeX comments).

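The program is driven by a `dictionary file' which defines the mapping to be applied between SGML tags and LaTeX tags. It does a little more than simply translate the tags, however: an action code may be defined for any tag to specify an additional action, as follows:

:    the tag is implicitly closed at the end of the input line
!    the tag name is output as well as its content
*    no action is taken for the tag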

The program in its current incarnation does not support the - action. Not yet having found any use for it, I never got round to implementing it.

The dictionary file also contains substitution strings for any entity references in the document.
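The dictionary layout can be sketched in a few lines of Python (an illustration only; the program itself is SPITBOL). The layout assumed here is inferred from the loader in the program source: `tag:replacement` lines, then a line reading `*ENTITIES*`, then `name=replacement` lines. The function names `load_dictionary` and `split_action`, and the `head` and `tex` entries in the sample, are hypothetical.

```python
ENTITY_MARKER = "*ENTITIES*"


def load_dictionary(lines):
    """Split dictionary lines into a tag table and an entity table."""
    tags, entities = {}, {}
    in_entities = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_entities and ":" in line:
            # tag lines are split at the first colon only
            name, _, value = line.partition(":")
            tags[name] = value
        elif not in_entities and line.startswith(ENTITY_MARKER):
            in_entities = True
        elif in_entities and "=" in line:
            name, _, value = line.partition("=")
            entities[name] = value
        else:
            raise ValueError("unexpected dictionary line: " + repr(line))
    return tags, entities


def split_action(replacement):
    """Return (action code, LaTeX text) for one tag replacement.

    Assumption: '*' is the whole replacement, while ':' and '!' are
    a leading character, as the program source appears to treat them.
    """
    if replacement == "*":
        return "*", ""
    if replacement[:1] in (":", "!"):
        return replacement[0], replacement[1:]
    return "", replacement


# Hypothetical sample: only the 'body' line comes from the example dictionary.
sample = [
    r"body:\begin{document} \maketitle",
    r"head::\section{",
    "*ENTITIES*",
    r"tex=\TeX{}",
]
tags, entities = load_dictionary(sample)
```

Splitting at the first colon matters because a replacement may itself contain a colon, as the `action` entry in the example dictionary does.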

The program does its best to render `safe' any of the (numerous) characters which LaTeX finds upsetting; this is done in a separate scan of the input strings carried out after tags have been identified. It also tries, not very successfully, to put spaces in where LaTeX will not subsequently remove them, e.g. between LaTeX tags and following content.
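As a rough illustration of this scan, here is a minimal Python sketch (not the program's own SPITBOL). It assumes two classes of character, following the character sets that appear in the program source: those made safe with a preceding backslash and those wrapped in `\verb+...+`. The name `make_safe` is hypothetical. The sketch leaves out the interactive prompt for undefined entities and the extra spaces the program puts round a bare ampersand.

```python
import re

BACKSLASH_ESCAPED = set("$&%#{}_")   # emitted as \$ \& \% ...
VERB_WRAPPED = set("\\^|")           # emitted as \verb+c+
ENTITY = re.compile(r"&([A-Za-z]+)[;.]")


def make_safe(text, entities):
    """Expand known entity references and escape LaTeX specials."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "&":
            m = ENTITY.match(text, i)
            if m and m.group(1) in entities:
                # known entity: substitute its dictionary string
                out.append(entities[m.group(1)])
                i = m.end()
                continue
        if ch in BACKSLASH_ESCAPED:
            out.append("\\" + ch)
        elif ch in VERB_WRAPPED:
            out.append("\\verb+" + ch + "+")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

An ampersand that does not start a known entity reference is simply escaped, so `&foo;` with an empty entity table comes out as `\&foo;`.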

If tags or entities are found in the document that do not exist in the dictionary, the user is given the option to define them, and to update the dictionary.

Three tags have a special effect: <xmp>, which is translated into \begin{verbatim}, also switches off suppression of LaTeX special characters (though not if they appear on the same input record as the <xmp> itself -- I told you this was a simple-minded program); <eg>, which is translated into \begin{verse}, also causes the LaTeX hard-line tag \\ to be added to the end of each input line; <div> and <head> tags are the really insanitary ones -- they are translated into the appropriate LaTeX \section, \subsection etc. and any following <head> tag is simply removed. Any uninterrupted sequence of digits, stops and spaces at the beginning of the heading text is silently removed (this is to cater for Michael's habit of explicitly numbering titles etc. to make the SGML eye-readable). The tag <h1> is treated as a synonym for <div1><head>, and likewise for <h2> etc.
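The heading treatment can be sketched in Python as follows (an illustration, not the program's SPITBOL). Two things are assumptions: the fixed mapping of levels 1 to 3 onto `\section`, `\subsection` and `\subsubsection`, which in the program comes from the dictionary, and the function name `heading_to_latex`. The sketch also accepts only a single level digit.

```python
import re

# Assumed mapping; the real program takes it from the dictionary file.
SECTION_COMMANDS = {
    "1": r"\section",
    "2": r"\subsection",
    "3": r"\subsubsection",
}
HEADING_TAG = re.compile(r"(?:div|[hH])([123])$")
LEADING_NUMBER = re.compile(r"^[ 0-9.]+")


def heading_to_latex(tag, rest_of_line):
    """Return a LaTeX sectioning command, or None for other tags."""
    m = HEADING_TAG.match(tag)
    if m is None:
        return None
    text = rest_of_line
    if text.startswith("<head>"):
        # a <head> directly after <divN> is simply dropped
        text = text[len("<head>"):]
    # strip explicit numbering such as "2.1 " from the title
    text = LEADING_NUMBER.sub("", text)
    return SECTION_COMMANDS[m.group(1)] + "{" + text + "}"
```

Note the cost of the silent stripping: a title that really does begin with a number, such as "1991 in review", loses it as well.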

The only attribute values acted on are ID and TARGET: these are used to generate the LaTeX equivalent `labels', so that cross-references at least work.
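A minimal Python sketch of this attribute handling follows (an illustration, not the program's SPITBOL). It assumes that attribute values are either single-quoted or run up to the next space, that an ID value becomes the argument of a `\label`, and that a TARGET value is appended to the tag's LaTeX replacement and closed with a brace. The function name `apply_attributes` and the `\ref{` replacement used in the example are hypothetical.

```python
import re

# name=value pairs; value is 'single quoted' or a run of non-spaces
ATTRIBUTE = re.compile(r"\s*([^=\s]+)=\s*(?:'([^']*)'|(\S+))")


def apply_attributes(latex, attrs):
    """Return (latex, label, ignored) after acting on ID and TARGET."""
    label = None
    ignored = []
    for m in ATTRIBUTE.finditer(attrs):
        name = m.group(1).lower()
        value = m.group(2) if m.group(2) is not None else m.group(3)
        if name == "id":
            label = value                    # caller emits \label{value}
        elif name == "target":
            latex = latex + value + "}"      # e.g. \ref{ + value + }
        else:
            ignored.append(name)             # every other attribute
    return latex, label, ignored
```

With a hypothetical dictionary replacement of `\ref{` for a cross-reference tag, the attribute string `target=sec1` yields `\ref{sec1}`, which LaTeX resolves against a `\label{sec1}` generated from the matching ID.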

The source code for the program follows, in section 3. An example dictionary file, corresponding with the DTD used for the present document, is given in section 4.

2. Discussion

Is this program reversible? Clearly not, since there are many LaTeX tags for which no descriptive equivalent exists, and even more which might be used ambiguously in a number of situations. As a trivial example -- what is one to do with the sequence \it some string \rm ? It might just be emphasis (although LaTeX has an \em tag, it doesn't require you to use it for all occasions when emphasis is intended, only those where you might want to change the typeface of the surrounding body text) or it might be a citation or .

I found the LaTeX conventions for dealing with spaces following tags particularly aggravating. These are even worse than the SGML rules about record separators, believe it or not. Otherwise, writing this program and getting it to produce reasonably acceptable output was very easy indeed.

3. The program source

*  runs with any MACRO spitbol implementation
*  check the NEWFILE routine for system dependent filenames
*  Lou Burnard, OUCS, March 1991
     &trim = &anchor = 1
     osp = span(' ') | null
     los =  'abcdefghijklmnopqrstuvwxyz'
     ups =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
* pattern to recognise and remove comments
     comment_pat = ('.' | '<!--') notany('.') rem . terminal
* pattern to find tags
     p = break('<') . s1 '<'   break('>') . tag '>'
* pattern to break attributes out of a tag
     attr_pat = osp
.         break('=') . a_name '=' osp
.    (  ("'" break("'") . a_value "'")
.     | (break(' ') . a_value span(' '))
.     | rem . a_value)
* -- function definitions (bodies follow the main program)
     define('specials(s)s1,s2,c')
     define('tagcheck(tag)')
     define('entcheck(str)')
     define('new_file()f')
     define('new_dic(s)t1,t2')
     define('yn(s)')
* -- open next data file and next output file
file new_file()                                      :f(close)
* -- output any LaTeX initialisation stuff
     out = '\documentstyle[11pt,A4]{article}'
     out = '\frenchspacing'
     out = '\def\tag#1{{\boldmath\bf #1}}'
     out = '\def\smartitalicx{\ifx\next,\else'
.       '\ifx\next-\else\ifx\next.\else\/\fi\fi\fi}'
     out = '\def\smartitalic#1{{\it'
.         '#1}\futurelet\next\smartitalicx}'
* -- read in next line from data file
read str = in                                       :f(eofile)
     str comment_pat                                  :s(read)
     str = specials(str)
     str = differ(verse) str '\\'
* -- get next tag from input string
check     str p =                                    :f(write)
     new_str =  new_str s1
     attrs =
     tag break(' ') . tag span(' ')  rem . attrs
* kludge city -- bad guys dont look
     leq(tag,'xmp')                            :s(do_verbatim)
     verse = leq(tag,'eg') 1
     verse = leq(tag,'/eg')
     tag ('div' | any('hH')) span('123') rpos(0)
.                                                      :f(eok)
     str '<head>' =
     str span(' 0123456789.') =      :(eok)
do_verbatim out = '\begin' '{verbatim}' str
in_verb str = in                                 :f(done_verb)
     str breakx('<') . s len(1) '/xmp>' =
.                                                :s(done_verb)
     out = str                                      :(in_verb)
done_verb  out = s ; s =  ;
        out = '\end' '{verbatim}'                     :(check)
* end of kludge city
eok  new_tag = tagcheck(tag)
     terminal = leq(new_tag,'*')
.         'No action taken for ' tag ' tag'
.                                                    :s(check)
*-- here we deal with attributes
chk_attr ident(attrs)                                :s(colon)
     attrs attr_pat      =
* Extract ID and TARGET attribute values for use as labels
* Other attributes are ignored
     a_name ('id' | 'target')                    :s($(a_name))
     terminal = 'Attribute ' a_name '=' a_value ' ignored'
.                                                  :(chk_attr)
id   label = a_value                               :(chk_attr)
target  new_tag = new_tag a_value '}'              :(chk_attr)
* -- check action code on new tag
colon     new_tag ':' =                             :f(shriek)
* implied close at end of line
     new_tag = new_tag str '}'
     str =
     ident(label)                                    :s(store)
     new_tag = new_tag ' \label{' label '}'
     label =
shriek  new_tag '!'                                  :f(store)
* bang out the tag name as well as its content
     new_tag = '\bf ' tag ': \rm '
* - here we buffer it
store     new_str = differ(new_tag)  new_str new_tag
.                                                     :(check)
* - and here we output it
write     out = new_str str
     new_str =                                         :(read)
eofile    terminal = 'File ended'
*    out = '\end{document}'
     yn('Rewrite dictionary?')                         :f(end)
     t = sort(convert(t,'array'))
     i = 0
ti   i = i + 1
     out = t<i,1> ':' t<i,2>                            :s(ti)
     out = '*ENTITIES*'
     et = sort(convert(et,'array'))   :f(end)
     i = 0
ti2  i = i + 1
     out = et<i,1> '=' et<i,2>                   :s(ti2)f(end)
* -- specials(s): expand entity references and escape LaTeX specials
specials  s break('$&%#{}\^|') . s1 len(1) . c = :f(no_spec)
* look for entity refs
     c '&'                                         :f(not_ent)
     s2 =
     s (span(ups los) (';' | '.') ) . s2 =
.                                                 :f(keep_amp)
     s2 = entcheck(s2)                            :f(keep_amp)
     specials = specials s1 s2                     :(specials)
keep_amp  specials = specials s1 ' \& '
     s = s2 s               :(specials)
* look for latex specials
not_ent   c any('$&%#{}_') = '\' c                :s(sp_store)
* anything else can be done with \verb
     c any('\^<>|') = '\verb+' c '+'
sp_store  specials = specials s1  c
.                                                  :(specials)
no_spec   specials = specials s                      :(return)
* -- tagcheck(tag): look the tag up in the dictionary, prompting if unknown
tagcheck  tag = replace(tag, ups , los)
     tagcheck =  replace(t<tag>,'_',' ')
     terminal = ident(tagcheck)
.         'What sort of a tag is a ' tag '???'
.                                                   :f(return)
     tagcheck = terminal
.    yn('Shall I add ' tagcheck ' to dictionary?')
.                                                      :s(add)
     t<tag> = '<' tag '>'                            :(return)
add  t<tag> = tagcheck                               :(return)
* strip off last ; or . before looking it up
entcheck  str rtab(1) . str
     entcheck = et<str>
     terminal = ident(entcheck) 'Undefined entity ' str
.                                                   :f(return)
     yn('Do you want to define this entity?')
.                                                 :f(skip_ent)
     terminal = str '=?'
     entcheck = terminal
     et<str> = differ(entcheck) entcheck
.                                                   :s(return)
skip_ent entcheck = str ';'                          :(return)
new_file  endfile(3)
     terminal = 'Filename?'
     f = terminal                                  :f(freturn)
     ident(f)                                      :s(freturn)
     input(.in,3,f)                               :f(new_file)
* filename tweaked for VMS and PC only
     f ((break(']') | pos(0)) break('.')) . f
.                                            :f(funnyfilename)
     in '<' break(' >') . doctype                   :f(no_dtd)
     new_dic(doctype)                :f(freturn)
     output(.out,4,f '.tex')                           :f(wot)
     terminal = 'Output to ' f '.tex'
.                                                    :(return)
no_dtd  terminal = 'Input file must begin with <xxx>'
        terminal = '          xxx identifies the doctype '
.                                                   :(freturn)
* -- open a new dictionary file and load its contents
*    parameter is the name of the file
new_dic   dic_name = s '.dic'
          input(.din,7,dic_name)                    :f(no_dic)
* t is the table for tags and et for entity names
     t = table()  ; et = table()
     nt = ne = 0
load_t    s = din                                   :f(loaded)
     s break(':') . t1 ':' rem . t2            :f(do_entities)
     nt = nt + 1 ; t<t1> = t2                        :(load_t)
do_entities s '*ENTITIES*'                           :f(dud_t)
load_t2 s = din                                     :f(loaded)
     s break('=') . t1 '=' rem . t2                  :f(dud_t)
     ne = ne + 1 ; et<t1> = t2                       :(load_t2)
dud_t     terminal = 'dictionary starts ' s
.         ' - what nonsense is this?'               :(freturn)
no_dic    terminal = 'Where is dictionary ' dic_name  '???'
.                                                   :(freturn)
loaded      terminal = nt ' tags, ' ne
.    ' entities loaded from ' dic_name
.                                                    :(return)
* -- yn(s): ask a yes/no question at the terminal
yn   terminal = s
     terminal any('yY')                   :s(return)f(freturn)
end

4. Example dictionary file

action:  \linebreak \begin{flushright} \scACTION:\rf
body:\begin{document} \maketitle

HTML generated 18 May 1998