Live long and prosper! Lessons from the TEI

1. 1986 was a long time ago…

  • The first PC virus – Brain – appears
  • Construction of the Channel Tunnel begins
  • The Soviet Union launches space station Mir
  • Disaster at Chernobyl
  • Olof Palme assassinated
  • Records of the year: Raising Hell (Run DMC)… Graceland (Paul Simon)… Blood Type / Группа крови (Viktor Tsoi)

2. …but we used computers then

  • Corpus linguistics
  • Databases on CD ROM
  • Large-scale lexical resources already existed (e.g. TLF, TLG, LASLA…)
  • Digital lexicography (e.g. OED)
  • Document preparation systems (e.g. TeX, Scribe, troff…)
    • some proprietary (and expensive), some research
  • Text archives
  • Hypertext theory

But there was no World Wide Web and not many desktop PCs…

3. Birth of the Text Encoding Initiative

  • Spring 1987: European workshops on standardisation of historical data (J.P. Genet, M. Thaller)
  • Autumn 1987: NEH funds an exploratory international workshop on the feasibility of defining "text encoding guidelines"
Figure 1. Vassar College, Poughkeepsie

4. Today's question:

  • So the TEI is very old!
  • It comes from a time before the Web, the DVD, the mobile phone, cable TV, or Microsoft Excel
  • Not much in computing survives 5 years, never mind 20
  • What relevance can it possibly have today?
  • Why is it still here, and how has it survived?

5. Is the TEI still relevant?

  • With XML everyone can create their own markup system and still share data!
  • In the Semantic Web, XML systems will all understand each other's data!
If we have
  • historical data marked up with a Historical Markup Language
  • linguistic data marked up with a Linguistic Markup Language
  • metadata marked up with a Metadata Markup Language
how will we integrate resources or ask interesting questions?

Haven't we been here before?

6. Relevance 1

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources…
  • … which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active user community, run in the open-source style

7. Relevance 2

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

8. The virtuous circle of encoding

9. The scope of intelligent markup

Even within the original scope of the TEI we have
  • basic structural and functional components
  • diplomatic transcription, images, annotation
  • links, correspondence, alignment
  • data-like objects such as dates, times, places, persons, events (named entity recognition)
  • meta-textual annotations (correction, deletion, etc)
  • linguistic analysis at all levels
  • contextual metadata of all kinds
  • … and so on and so forth

Is it possible to delimit encyclopaedically all possible kinds of markup?

10. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

11. Old Skool TEI

  • A traditional (if large) research project with soft funding, driven by academic curiosity
  • a codification of best practice, with no formal maintenance method
  • uncertain licensing and development practices
  • perceived as unmanageably complex except by the priesthood — or simultaneously as too simple for real scholarly work
  • lack of specific tools to do something with a TEI text
  • failure to market the advantages of rich markup

12. TEI New

  • Proper open source licence, with visible development on SourceForge
  • Architecture rethought to facilitate expansion and integration with other systems
  • Self-documenting, each release fully validated, delivered using standard mechanisms
  • Publicly available processing tools managed together with the Guidelines
  • Active developer community, wiki, etc. Test files, exemplars, regular updates…
  • New governance structure, new tools, new modules…

13. Three important things about TEI P5

  1. Being a good digital citizen:
    • Support for multiple schema languages and namespaces
    • Reliance on XML, and hence on Unicode
    • Validation of attributes and datatyping
    • Use of W3C pointers and paths
  2. Making it flexible:
    • ODD: a single specification language for developers, users, and teachers, integrating schema and documentation;
    • Verifiable conformance
  3. Old annoyances removed and some new topics added

14. One Specification Language

  • A set of TEI documents is described by an ODD, which is itself a TEI document that combines:
    • references to existing declarations
    • formal declarations for elements and attributes
    • documentation and usage notes
  • Underlying this:
    • a conceptual model which abstracts from specific elements to generic classes
    • a modular architecture for combining sets of definitions
  • specifications are chainable; modifications are written in ODD with ODD as input and output
  • Roma is one interface to this: there will be others

15. For example

An ODD file is a valid TEI document, containing descriptive prose, and a <schemaSpec> element to define the schema it documents

<div>
  <head>Our Project Manual</head>
  <p>In this project we use the basic TEI structures with a few minor modifications to exclude elements we do not need</p>
  <schemaSpec ident="TEI-minimal" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
    <!-- We don't need these drama elements: -->
    <elementSpec ident="sp" mode="delete" module="core"/>
    <elementSpec ident="speaker" mode="delete" module="core"/>
  </schemaSpec>
</div>

16. Support for many schema languages

  • TEI schemas can be generated for
    • XML DTD language
    • ISO RELAX NG language
    • W3C Schema Language
  • Content models are defined using RELAX NG syntax
  • Datatypes are defined in terms of W3C datatypes
  • Some facilities (e.g. alternation, namespaces) cannot be expressed in DTD
  • Additional constraints can be expressed in Schematron
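
As a rough illustration of the two middle points, here is a minimal sketch of an ODD declaration whose content model uses RELAX NG syntax and whose attribute borrows a W3C datatype. The element name `sighting`, the attribute `seen`, and the example.org namespace are hypothetical, invented for this sketch; `model.pLike` is an existing TEI class. The exact syntax accepted may differ between TEI P5 releases.

```xml
<elementSpec ident="sighting" ns="http://www.example.org/ns/nonTEI" mode="add"
             xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <desc>(hypothetical) records one observation.</desc>
  <!-- Content model in RELAX NG syntax: one or more paragraph-like elements -->
  <content>
    <rng:oneOrMore>
      <rng:ref name="model.pLike"/>
    </rng:oneOrMore>
  </content>
  <attList>
    <attDef ident="seen" usage="opt">
      <desc>(hypothetical) the date of the observation.</desc>
      <!-- Datatype taken from the W3C Schema datatype library -->
      <datatype>
        <rng:data type="date"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

From a declaration like this, an ODD processor can derive the equivalent DTD, RELAX NG, or W3C Schema fragment.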

17. Two reasons why standards fail

  • The theory is not yet ripe
  • The "not invented here" attitude: the community of users is too diverse

18. Coping with partially-baked ideas

In a TEI ODD, you can …
  • constrain the domain of a value list
  • enforce Schematron rules about e.g. co-dependency
  • provide new elements in your own namespace
  • remove (non-mandatory) child elements
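
Three of these customizations can be sketched in a single `<schemaSpec>`. This is an illustrative sketch only: the schema name `myProject`, the chosen values, and the wording of the rule are assumptions, and the `scheme` value and Schematron namespace shown here may differ between TEI P5 releases.

```xml
<schemaSpec ident="myProject" start="TEI"
            xmlns="http://www.tei-c.org/ns/1.0"
            xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>

  <!-- Constrain a value list: @type on <div> must be one of two values -->
  <elementSpec ident="div" module="textstructure" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="chapter"><desc>a chapter</desc></valItem>
          <valItem ident="appendix"><desc>an appendix</desc></valItem>
        </valList>
      </attDef>
    </attList>
  </elementSpec>

  <!-- Enforce a co-dependency: @from on <date> requires @to.
       Assumes the processor binds the tei: prefix to the TEI namespace. -->
  <elementSpec ident="date" module="core" mode="change">
    <constraintSpec ident="fromNeedsTo" scheme="schematron">
      <constraint>
        <sch:rule context="tei:date[@from]">
          <sch:assert test="@to">A date with @from must also have @to.</sch:assert>
        </sch:rule>
      </constraint>
    </constraintSpec>
  </elementSpec>

  <!-- Remove a non-mandatory child element -->
  <elementSpec ident="byline" module="textstructure" mode="delete"/>
</schemaSpec>
```

Adding a new element in your own namespace follows the same pattern, using `mode="add"` and an `ns` attribute on `<elementSpec>`.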

19. New elements

A schema is a grammar. How can you add new terminals to an existing syntax?

  • All content models are expressed indirectly, by reference to element classes rather than elements
  • Hence adding a new element is simply a matter of saying which class/es it belongs to

The TEI schema is also enriched with semantics. How can you explain what a new element means?

  • Class membership also conveys some semantics
  • ODD includes detailed documentation
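
A minimal sketch of both answers, assuming a hypothetical element `rabbit` in a hypothetical example.org namespace. Membership of the existing class `model.nameLike` places the element wherever names are allowed, and the `<desc>` carries its meaning. The declaration would sit inside a `<schemaSpec>`.

```xml
<elementSpec ident="rabbit" ns="http://www.example.org/ns/nonTEI" mode="add"
             xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <!-- Documentation: says what the (hypothetical) element means -->
  <desc>contains the name of a rabbit mentioned in the text.</desc>
  <!-- Class membership: no existing content model has to be edited -->
  <classes>
    <memberOf key="model.nameLike"/>
    <memberOf key="att.global"/>
  </classes>
  <!-- Content model: plain text only -->
  <content>
    <rng:text/>
  </content>
</elementSpec>
```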

20. Coping with the NIH Syndrome

  • TEI P5 has extensive I18N features for translation of …
    • schema objects
    • schema documentation
  • See Roma at http://www.tei-c.org/Roma/
  • TEI is hospitable to other namespaces
    • You can use SVG for graphics, MathML for math, Word Table markup if you like
    • (but note this doesn't solve the Other Overlap Problem)
  • ODD also includes an <equiv> element for mapping to external ontologies
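
A hedged sketch of how these features might look in an ODD. The alternative name `para`, the French wording, and the ontology URI at example.org are all invented for illustration, and how a given ODD processor merges such changes with the existing declaration is an assumption.

```xml
<elementSpec ident="p" module="core" mode="change"
             xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Alternative name for the element in the generated schema -->
  <altIdent>para</altIdent>
  <!-- Hypothetical mapping to a concept in an external ontology -->
  <equiv name="Paragraph" uri="http://www.example.org/ontology#Paragraph"/>
  <!-- Documentation in another language, selected by xml:lang -->
  <gloss xml:lang="fr">paragraphe</gloss>
  <desc xml:lang="fr">marque un paragraphe dans un texte en prose.</desc>
</elementSpec>
```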

21. For example

Embedding SVG within TEI:
<figure>
  <svg xmlns="http://www.w3.org/2000/svg" width="6cm" height="5cm" viewBox="6 3 6 5">
    <ellipse xmlns="http://www.w3.org/2000/svg" style="fill:#ffffff" cx="9.75" cy="6.35" rx="2.75" ry="2.35"/>
  </svg>
</figure>
A user-defined attribute:
<div xmlns:my="http://www.example.org/ns/nonTEI">
  <p n="12" my:topic="rabbits">Flopsy, Mopsy, Cottontail, and Peter…</p>
</div>

An NVDL processor can validate a document using multiple namespace schemas

22. Conformance issues

A document is TEI Conformant if and only if it …
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines

Or if it can be transformed automatically using some TEI-defined procedures into such a document (it is TEI-conformable)
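
For orientation, a minimal sketch of a document that is well-formed, uses the TEI namespace, and should validate against an unmodified TEI P5 schema. The title and prose are placeholders; the remaining conditions (the Abstract Model and the accompanying ODD) cannot be shown by a document alone.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital; no source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>
```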

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’

23. Evolution works!

  1. Make modifications in your own namespace
  2. Document them in an ODD
  3. Propose them to the TEI Council as amendments or feature requests
  4. TEI P5 now has a six-month release cycle…

Visit http://www.tei-c.org for more background info

Visit http://tei.sf.net to download