Text Encoding Initiative

Committee on Text Representation


Stig Johansson

University of Oslo

8 February 1989

To be amenable to computer processing, texts must be properly encoded. The user must know 1) what the text is, 2) how the various textual features are encoded, and 3) whether there are extra-textual, interpretive features and, if so, how they are encoded. Standards or, at least, some basic guidelines will simplify the task for both producers and users of machine-readable texts. The three points above correspond to three of the working committees of the Text Encoding Initiative. This committee addresses the problem of text representation, i.e. the second point.

Conventional printed texts use standard alphabets and typographical conventions to structure the text. In devising guidelines for machine-readable texts, it is natural to take conventions from printed texts as a starting-point and suggest ways of expressing typographical distinctions in machine-readable texts. The committee on text representation will handle features for which there are accepted typographical conventions. Topics within the field of this committee include the marking or encoding of:

  1. alphabets, including diacritics
  2. change of language or alphabet
  3. significant typographical shifts (e.g. italics, boldface)
  4. punctuation
  5. hyphenation (including declaration of how hyphenation is treated)
  6. headings, paragraphs, and other devices marking the logical structure of texts
  7. lineation (on page, in column, in logical subdivision, etc.)
  8. pagination
  9. quotations and dialogue in fiction
  10. figures, tables, and illustrations and their captions
  11. mathematical and other special symbols
  12. foot-notes
  13. editorial additions, deletions, or corrections
  14. editorial apparatus (apparatus criticus)
  15. layout of the text and other physical characteristics
At the outset the committee will primarily be concerned with points 1 through 9 and will cater for the types of texts most commonly held in existing text archives (unillustrated non-fiction, fiction in critical and popular editions). Only alphabetic languages will be considered at this stage. Subcommittees will be set up for points 1 and 6.
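By way of illustration, several of the points above (headings, paragraphs, typographical shifts, change of language) might be marked with descriptive tags in an SGML-style notation. The element and attribute names below are purely illustrative assumptions, not part of any agreed scheme:

```sgml
<head>Chapter 1</head>
<p>The word <hi rend=italic>ennui</hi> records a significant
typographical shift, while the phrase
<foreign lang=FR>je ne sais quoi</foreign> records a change
of language.</p>
```

Marking the function of such features, rather than their mere appearance on the page, is what allows the same encoded text to serve many different kinds of processing.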

While the encoding may in some respects result in loss of information compared with the printed text (e.g. as regards physical characteristics), in other respects it may well go beyond it. For example, provision may be made for disambiguation of features such as: capitalisation to represent names and sentence openings, full stop to mark abbreviations and end of sentences, apostrophe vs end-of-quote, italics to mark emphasis vs foreign words or expressions.
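Two of the ambiguities just mentioned might be resolved along the following lines; again, the tag names are illustrative assumptions only:

```sgml
<!-- Capitalisation: sentence opening vs proper name -->
<s>Reading improves the mind.</s>
<s><name>Reading</name> lies west of London.</s>
<!-- Italics: emphasis vs foreign expression -->
<p>It was <emph>not</emph> a <foreign lang=FR>fait accompli</foreign>.</p>
```

In each case the printed text shows only one undifferentiated convention (a capital letter, italic type), whereas the encoding records the distinct function of each occurrence.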

The suggested guidelines for text representation should ultimately be able to handle texts originally produced in machine-readable form as well as machine-readable versions of printed texts and unprinted texts (such as letters and diaries). The special problems of spoken texts (with the exception of the International Phonetic Alphabet, IPA, which will be treated as a character set) and of dictionaries will for pragmatic reasons be taken up in the Committee on Text Analysis and Interpretation.

In devising coding conventions it is essential to study existing schemes and attempt to discover the consensus of the textual computing community. Existing standards will be honoured wherever possible. It is expected that the suggested guidelines will conform to the Standard Generalized Markup Language (SGML) defined by the international standard ISO 8879, unless the needs of textual research make it impossible to conform strictly to SGML.
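In SGML terms, conventions of the kind sketched above would be declared in a document type definition (DTD). A minimal sketch, using the declaration syntax of ISO 8879 (the element names are again illustrative, not a proposal):

```sgml
<!DOCTYPE text [
<!ELEMENT text    - - (head?, p+)>
<!ELEMENT head    - - (#PCDATA)>
<!ELEMENT p       - - (#PCDATA | emph | foreign)*>
<!ELEMENT emph    - - (#PCDATA)>
<!ELEMENT foreign - - (#PCDATA)>
<!ATTLIST foreign lang NAME #IMPLIED>
]>
```

A DTD of this kind makes the encoding scheme explicit and machine-checkable: any conforming SGML parser can verify that a text uses only the declared elements in the declared structure.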

In a second document I will come back to practical matters connected with the work of the committee (division of work, meetings, timetable, financial arrangements).

