Using the basic TEI structural elements

1. TEI Infrastructure

  • The TEI encoding scheme consists of a number of modules
  • These declare XML elements and their attributes
  • An element's declaration assigns it to one (or more) model classes
  • Another part declares its possible content and attributes with reference to these classes
  • This indirection allows strength and flexibility
  • It makes it easy to add new elements, or to exclude existing ones, by referencing existing classes

2. What is a module?

  • A convenient way of grouping together a number of element declarations
  • These are usually on a related topic or specific application
  • Most chapters focus on elements drawn from a single module, which that chapter then defines
  • A TEI Schema is created by selecting modules and adding or removing elements from them as needed
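
This selection is usually recorded in a TEI customization file (an ODD). The fragment below is a minimal sketch of that idea, not the project's actual schema: the name "punch", the particular modules chosen, and the deleted element are all assumptions made for illustration.

```xml
<!-- Sketch only: ident="punch" and the choice of modules are assumptions -->
<schemaSpec ident="punch" start="TEI">
  <!-- modules taken as they are -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="figures"/>
  <moduleRef key="drama"/>
  <moduleRef key="verse"/>
  <!-- one element removed from a selected module (hypothetical choice) -->
  <elementSpec ident="binaryObject" module="core" mode="delete"/>
</schemaSpec>
```

The module names in @key come from the list of modules; everything else would be adapted to the needs of the project.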

3. Modules

Module name Chapter
analysis Simple Analytic Mechanisms
certainty Certainty and Responsibility
core Elements Available in All TEI Documents
corpus Language Corpora
dictionaries Dictionaries
drama Performance Texts
figures Tables, Formulae, and Graphics
gaiji Representation of Non-standard Characters and Glyphs
header The TEI Header
iso-fs Feature Structures
linking Linking, Segmentation, and Alignment
msdescription Manuscript Description
namesdates Names, Dates, People, and Places
nets Graphs, Networks, and Trees
spoken Transcriptions of Speech
tagdocs Documentation Elements
tei The TEI Infrastructure
textcrit Critical Apparatus
textstructure Default Text Structure
transcr Representation of Primary Sources
verse Verse

4. The Imaginary Punch Project

  • Punch is a famous English humorous journal, published regularly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
  • A project plans to make available fully marked up texts of the journal, in conjunction with page images...
    • for social historians
    • for librarians
    • for linguists
  • How will the TEI help? And which parts of the TEI will we use?

5. Looking at Punch, what do we need to mark up?

  • issue information and page number for reference purposes
  • "chunks" or divisions of text, which may contain a picture, a poem, some prose, some drama, or a combination
  • within the chunks, we can identify formal units such as
    • a picture, a caption
    • stanzas, lines
    • paragraphs
    • speeches and stage-directions
  • and more...

6. TEI tags for the high level structure

We will treat each issue as a single <text> element, and each identifiable chunk within it as a <div> element of a particular type (e.g. cartoon, verse, prose).

For example, page 1 has two divisions:
<pb n="1"/>
<div type="cartoon">....</div>
<div type="poem">
  <head>Progress</head>....
</div>
Page 2 also has two, of different types:
<pb n="2"/>
<div type="prose">
  <head>The enchanted castle</head>....
</div>
<div type="snippet">
  <head>Correspondence</head>....
</div>

7. Why divisions rather than pages?

Because a division can start on one page (page 5, for example) and finish on another (page 6).

We use an empty element <pb> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="cartoon">...</div>
<div type="review">
  <head>Egypt in Venice</head>...
  <pb n="6"/> ...
</div>
<div type="cartoon">...</div>
<div type="verse">
  <head>Enigma</head>...
</div>
<div type="snippets">...</div>

The sequence in which divisions appear is rather arbitrary.

8. Divisions can contain divisions...

<div type="snippets">
  <div type="snippet">Curiously....Chancellor</div>
  <div type="snippet">Men for the Antarctic... Canadians</div>
</div>
  • TEI also provides division elements with names that indicate their degree of nesting (<div1>, <div2> etc.) which some people prefer
  • Divisions must always tessellate: once "down" a level, you cannot pop "up" again within the same division.

9. Floating text

As mentioned above, <div>s must tessellate over the entire text:
<div1>
  <p> ... </p>
  <div2>
    <p> ... </p>
  </div2>
  <div2>
    <p> ... </p>
  </div2>
</div1>
is valid, but
<div1>
  <p> ... </p>
  <div2>
    <p> ... </p>
  </div2>
  <p> ... </p>
</div1>
is not valid, because the final paragraph pops back "up" to the <div1> level after a <div2>.

A special <floatingText> element is available for "interruptions"
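
As a minimal sketch of how such an interruption might be encoded, the fragment below embeds a self-contained text between two paragraphs of a division. The type value "letter" and the elided content are assumptions for illustration; nothing here is transcribed from Punch.

```xml
<!-- Sketch only: type="letter" is a hypothetical value -->
<div type="prose">
  <p> ... </p>
  <floatingText type="letter">
    <body>
      <p> ... </p>
    </body>
  </floatingText>
  <!-- the enclosing division resumes here without breaking the tessellation rule -->
  <p> ... </p>
</div>
```

Because the embedded material is wrapped in its own <body>, it can have divisions of its own while the surrounding <div> carries on afterwards.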

10. What are divisions made of?

(apart from other smaller divisions)

  • <head> (heading)
  • <p> (paragraph)
  • <sp> (speech, contains any of the foregoing, also <stage> and <speaker>)
  • <list> (contains <head>, <label>, <item>)
  • <table> (contains <row> containing <cell>) ...
  • <l> (verse line) optionally grouped into <lg> (line group) stanzas
  • <figure> (contains <graphic>, <figDesc>, <head>...)

11. For example....

Page 3 contains a figure and a dialogue...
<div type="cartoon">
  <figure>
    <head>When the ships come home</head>
    <figDesc>A man in Turkish dress lounges on a sofa,
      smoking a cigarette and consulting a book
      labelled "Naval ledger". Another man, in
      traditional Greek costume, stands beside him,
      also reading a notebook.</figDesc>
    <graphic url="Graphics/page3.jpg"/>
  </figure>
  <sp>
    <speaker>Greece.</speaker>
    <p> Isn't it time we started fighting again?</p>
  </sp>
  <sp>
    <speaker>Turkey.</speaker>
    <p> Yes, I daresay. How soon could you begin?</p>
  </sp>
  <sp>
    <speaker>Greece.</speaker>
    <p> Oh, in a few weeks.</p>
  </sp>
  <sp>
    <speaker>Turkey.</speaker>
    <p> No good for me. Shan't be ready till
      the autumn.</p>
  </sp>
</div>

12. For example...

The militants' tariff (on Page 15) contains headings, paragraphs, and a table...
<div type="prose">
  <head>THE MILITANTS' TARIFF.</head>
  <head rend="right">Etna Lodge, W.</head>
  <p>Mrs. Bangham Smasher, having entered into partnership with the
    Misses Burnham Blazer, as General Agents of Destruction, begs to
    inform the public that the firm will be prepared to execute
    commissions of all kinds, at the shortest notice, on the very
    moderate terms given below : —
  </p>
  <table>
    <row role="label">
      <cell/>
      <cell>£</cell>
      <cell>s.</cell>
      <cell>d.</cell>
    </row>
    <row>
      <cell>For breaking windows, per window ...</cell>
      <cell>0</cell>
      <cell>7</cell>
      <cell>6</cell>
    </row>
    <row>
      <cell>For howling, kicking, or biting during service
        in church, per howl, kick, or bite ...</cell>
      <cell>0</cell>
      <cell>10</cell>
      <cell>6</cell>
    </row>
    <row>
      <cell>For sitting on doorsteps of obnoxious persons,
        per hour, if fine ...</cell>
      <cell>0</cell>
      <cell>15</cell>
      <cell>0</cell>
    </row>
    <row>
      <cell>For sitting on doorsteps of obnoxious persons,
        per hour, if wet ...</cell>
      <cell>1</cell>
      <cell>1</cell>
      <cell>0</cell>
    </row>
    <!-- ... -->
  </table>
</div>

13. Global attributes

Some features (potentially) apply to everything:
  • identity
  • language
  • rendition
TEI provides global attributes for these:
  • @xml:id provides a unique identifier for any element;
  • @n provides a name or number for any element
  • @xml:lang specifies the language of any element, using an ISO standard code
  • @rend and @rendition provide ways of specifying the visual appearance (rendition) of any element
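
To illustrate the difference between the last two attributes: @rend carries a short description directly on the element, whereas @rendition points to a <rendition> element declared once in the header. The sketch below assumes a hypothetical identifier "sc" and a CSS description; neither comes from the Punch project.

```xml
<!-- in the TEI header, inside <encodingDesc> -->
<tagsDecl>
  <!-- hypothetical declaration: small capitals, described in CSS -->
  <rendition xml:id="sc" scheme="css">font-variant: small-caps;</rendition>
</tagsDecl>

<!-- in the text: point to the declaration instead of describing it each time -->
<hi rendition="#sc">Ed</hi>
```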

14. For example...

Egypt in Venice (on Page 5) begins with two headings, one in French....
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
  <head>Egypt in Venice.</head>
  <head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
  <p>Those who know the kind of attractions that the
    Russian ballet offers in so many of its themes ....</p>
</div>
Each stanza of the poem on page 10 has a last line which is significantly indented:
<lg>
  <l>There were eight pretty walkers who went up a hill;</l>
  <l>They were Jessamine, Joseph and Japhet and Jill,</l>
  <l>And Allie and Sally and Tumbledown Bill,</l>
  <l rend="indent">And Farnaby Fullerton Rigby.</l>
</lg>

15. Macrostructure 1

All the issues of Punch for one year make up a volume. We could regard the volume as a single <text>, and each issue as a <div> within it. Or we could use the <group> element:
<text xml:id="v147">
  <front>
    <!-- introductory materials for volume 147 here -->
  </front>
  <group>
    <text xml:id="I1914-07-01">
      <body>
        <!-- first issue (1 July) -->
      </body>
    </text>
    <text xml:id="I1914-07-15">
      <body>
        <!-- second issue (15 July) -->
      </body>
    </text>
    <!-- etc... -->
  </group>
  <back>
    <!-- volume index, appendix etc. -->
  </back>
</text>

16. Macrostructure 2

As well as the texts, we have detailed metadata about each volume, and images of its pages. These are the three parts of a canonical TEI document:
<TEI>
  <teiHeader>
    <!-- required; provides metadata -->
  </teiHeader>
  <facsimile>
    <!-- the text, represented in image form -->
  </facsimile>
  <text>
    <!-- the text, transcribed and marked up -->
  </text>
</TEI>

17. Macrostructure 3

If many such documents are grouped together to form a corpus (rather than a collection), it may be useful to factor out the metadata they have in common:
<teiCorpus>
  <teiHeader>
    <!-- shared metadata -->
  </teiHeader>
  <TEI>
    <teiHeader>
      <!-- specific metadata -->
    </teiHeader>
    <text>
      <!-- ... -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <!-- specific metadata -->
    </teiHeader>
    <text>
      <!-- ... -->
    </text>
  </TEI>
</teiCorpus>

18. What kinds of metadata?

For the Punch Project and for any other comparable project, we will need a place for such information as
  • identification of the resource itself ("what is this thing?")
  • statements of responsibility ("who did what when?")
  • indication of source ("what was this derived from?")
  • publication statement ("how is this item distributed and by whom?")
  • declaration of encoding practice ("what do the codes we added mean?")

The TEI Header supports all these, and more.

19. The TEI Header

The TEI header was designed with two goals in mind
  • needs of bibliographers and librarians trying to document ‘electronic books’
  • needs of text analysts trying to document ‘coding practices’ within digital resources
On the one hand, the Librarian's header
  • uses standard bibliographic concepts
  • respects established mappings to other such records (e.g. MARC)
  • has a preference for structured data over loose prose
On the other, Everyman's header
  • Supports a (potentially) huge range of very miscellaneous information, organized in fairly ad hoc ways
  • Unpredictable combinations of narrowly encoded documentation systems and loose prose descriptions

20. TEI Header Structure

The TEI header has four main components:
  • <fileDesc> (file description) contains a full bibliographic description of an electronic file.
  • <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
  • <revisionDesc> (revision description) summarizes the revision history for a file.
  • <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting (just about everything not covered in the other header elements).

Only <fileDesc> is required; the others are optional.
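
As a skeleton, the four components might be laid out as follows. This is a sketch of the overall shape only, with all content elided; note that in a valid header <revisionDesc> comes last, after <profileDesc>.

```xml
<teiHeader>
  <fileDesc>
    <!-- required, with three required parts -->
    <titleStmt> <!-- ... --> </titleStmt>
    <publicationStmt> <!-- ... --> </publicationStmt>
    <sourceDesc> <!-- ... --> </sourceDesc>
  </fileDesc>
  <encodingDesc> <!-- optional --> </encodingDesc>
  <profileDesc> <!-- optional --> </profileDesc>
  <revisionDesc> <!-- optional; always the last component --> </revisionDesc>
</teiHeader>
```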

21. Simple TEI Header for Punch Project

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Punch, or the London Charivari, Vol. 147, July 1, 1914</title>
    </titleStmt>
    <publicationStmt>
      <idno type="gutenberg">24357</idno>
      <availability>
        <p>This text is freely available for re-use
          under US and UK law, consult your local
          legal restrictions if elsewhere.</p>
      </availability>
    </publicationStmt>
    <sourceDesc>
      <p>This text is a TEI version of a Project Gutenberg
        text originally located at <ptr
        target="http://www.gutenberg.org/dirs/2/4/3/5/24357/"/>.
        As per their license agreement we have removed all
        references to the PG trademark.</p>
    </sourceDesc>
  </fileDesc>
  <revisionDesc>
    <change when="2008-07-26T23:49:55.968+01:00"/>
  </revisionDesc>
</teiHeader>

22. Below the paragraph...

Within the elements already introduced, TEI offers plenty of scope for mark-up of smaller components. For example:
  • boundaries, such as page, column, or line breaks
  • highlighting, emphasis and quotation
  • editorial changes such as correction, normalization etc.
  • names, numbers, dates, addresses...
  • links and cross-references
  • notes, annotation, indexing
  • graphics
  • bibliographic citations
  • words and other analyses

23. Highlighting

By highlighting we mean any combination of typographic features (font, size, hue, etc.) which distinguishes the highlighted text from its surroundings. This may be for many reasons...
  • to mark foreign, archaic, technical usages
  • for emphasis when spoken
  • to show something is not part of the text (e.g. cross references, titles, headings)
  • or is attributed to some other agency inside or outside the text (e.g. direct speech, quotation)

TEI provides both a generic <hi> tag and a large number of specific ones...

24. A few highlighting examples

  • <hi> (highlighted: reason unknown or unimportant)
    <p>[The rest of this communication is omitted owing to considerations of space.—<hi rend="sc">Ed</hi>.]</p>
  • <emph> (emphasized)
    <said>'E won't bite yer <emph>if you buy 'im</emph> guv'ner.</said>
  • <title> and <foreign>:
    <p>  <foreign xml:lang="fr">À propos</foreign> of Oxford, it is a question whether that extremely amusing book <title>Verdant Green</title> is still much read by freshers. </p>
  • <distinct> (linguistically marked)
    But then I remind myself that the Russian ballet is nothing if not <distinct>bizarre</distinct>

25. Quotation

Quotation marks can similarly be used to set off text for many reasons:
  • <q> (used if the reason is unknown or unimportant)
  • <said> (speech or thought)
  • <quote> (attributed to an external source)
  • <mentioned> and <soCalled> (nuances of narrative status)
<p>  <said who="#Celia">I know a lovely tin of potted    grouse,</said> said Celia, and she went off to cut some sandwiches. </p>
<head>How to utilise the art of <soCalled>suggestion</soCalled> </head> <head>The Doctor, six down at the turn, <soCalled>suggests</soCalled> to his opponent that they are playing croquet, and wins by two and one.</head>

26. Quotation (continued)

Note that these elements can nest within one another:
<p>The poet returned to his work. <said>   <quote>In      tooth and claw,</quote>  </said> he muttered to himself, <said>   <quote>In tooth and claw.</quote>  </said> </p>

27. Editorial intervention

As a simple example, consider: ‘Excuse me sir, but would you like to buy a nice little dawg?’ on page 6.

We can:
  • use <orig> to show that "dawg" is what it says, even though this is a nonstandard spelling
  • use <reg> to show that "dog" is an editorially-supplied regularisation of what it says
  • or provide both within a <choice> element to say either is a valid encoding:
...a nice little <choice>  <orig>dawg</orig>  <reg>dog</reg> </choice>?

28. Names of persons, places, things...

  • <name> (a name in the text, contains a proper noun or noun phrase)
  • <rs> (a general-purpose name or referencing string )
  • <title> (any form of title)

The @type attribute is useful for categorizing these, and they both also have @key, @ref, and @nymRef attributes.

29. Examples of names

Using @type to distinguish personal from geographic names:
<p>The scene opens at a party given by <name type="person">Potiphar</name> in <name type="place">Venice</name>. </p>
Using @key and @ref to de-reference names:
<p>  <label>Business done.</label>—The Commons still harping on the Budget. <name    type="person"    ref="http://en.wikipedia.org/wiki/Timothy_Michael_Healy">    Tim Healy</name> enlivened proceedings by vigorous personal attack on <q>the most reckless and incapable  <rs key="LLG">Chancellor of the Exchequer</rs>    that ever sat on the Treasury Bench.</q>  <name key="LLG">Lloyd George's</name> retort courteous looked forward to with interest. </p>

30. Dates

  • <date> contains a date and time in any format
  • For processing it is convenient to add a normalized version, using the @when attribute
  • Uncertain dates and times, and ranges, can be indicated by other attributes: @notBefore, @notAfter, @from, @to
<p>House of Commons, <date when="1914-06-22">Monday, June 22, 1914</date>.</p>
<p>
  <date notAfter="1914-06-01" notBefore="1914-03-01">Sunday, a month ago,</date> was hot.
</p>
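
The @from and @to attributes mark a known range rather than an uncertain point. As a sketch, the publication span of Punch mentioned earlier could be encoded like this; the sentence is an illustration written for this purpose, not a transcription from the journal.

```xml
<!-- Sketch only: illustrative sentence, not transcribed text -->
<p>Punch was published regularly
  <date from="1841" to="1992">between 1841 and 1992</date>.</p>
```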

31. Cross references

A cross reference is a link from one point in a text (the source) to another (the target).

TEI provides generic elements <ptr> and <ref> for this purpose. If the linking text can be automatically generated use <ptr>; otherwise use <ref>.

The source is the location of the <ptr> or <ref>; the target is specified by the @target attribute, in the form of a URI reference.

See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
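
For the link to resolve, some element in the document must carry the matching identifier. The sketch below shows both ends together; the <div> standing in for the target and its elided content are assumptions, since any element with xml:id="Section12" would do.

```xml
<!-- the target: an element whose xml:id matches the part after "#" -->
<div xml:id="Section12" type="prose">
  <head> ... </head>
  <p> ... </p>
</div>

<!-- the source, elsewhere in the same document -->
<p>See <ref target="#Section12">section 12 on page 34</ref>.</p>
```

A @target without the leading "#" would be read as a relative URI pointing to another resource rather than as a pointer within the same document.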

32. Bibliographic Citations

TEI provides special elements for bibliographic citations or references:
  • <bibl> (loosely structured)
  • <biblStruct> (standard bibliographic structure)
  • <listBibl> (encloses a bibliography)

These are typically used in preparing bibliographies, or in footnotes. But even in Punch, there are examples.
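
To show the difference in structure, here is a sketch of one citation encoded both ways, using the issue of Punch described in the header example. The exact breakdown chosen for the <biblStruct> is one reasonable possibility, not a prescribed form.

```xml
<!-- loosely structured: parts may appear in any order, mixed with text -->
<bibl>
  <title>Punch, or the London Charivari</title>, Vol. 147, July 1, 1914
</bibl>

<!-- structured: the same information in a fixed arrangement -->
<biblStruct>
  <monogr>
    <title level="j">Punch, or the London Charivari</title>
    <imprint>
      <biblScope unit="volume">147</biblScope>
      <date when="1914-07-01">July 1, 1914</date>
    </imprint>
  </monogr>
</biblStruct>
```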

33. Simple <bibl> Example

In Punch, bibliographic citations are usually associated with a quotation from another paper.

The <cit> element groups the two:
<cit>  <quote>It was the time when Henry III. was    batting with Simon de Montfort and his    Barons.</quote>  <bibl>   <title>Straits Times.</title>  </bibl> </cit>

34. Embedded notes

Notes, whether appearing in the original source, or added by an editor, can be marked using the <note> element.

We might use this to add biographical details to the Punch transcriptions:
<p>By-the-by, it is denied that Sir <name rend="sc">Joseph Beecham</name>
  <note>Sir Joseph Beecham, 1st Baronet
    (8 June 1848 - 23 October 1916)...</note> was in any way responsible for the
  Government's "Pills for Earthquakes," by which it was hoped to avert the
  Irish crisis.</p>

<note> has attributes @place and @resp
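
As a sketch of these two attributes in use, @place records where the note appears and @resp points to whoever is responsible for it. The values below are assumptions: "end" is one possible placement, and "#editor" is a hypothetical pointer to a person or role identified elsewhere, for example in the header.

```xml
<!-- Sketch only: place="end" and resp="#editor" are hypothetical values -->
<name rend="sc">Joseph Beecham</name>
<note place="end" resp="#editor">Sir Joseph Beecham, 1st Baronet
  (8 June 1848 - 23 October 1916)...</note>
```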

35. Linked notes

Since we have several references to the same person, it might be better to put the notes elsewhere and point to them from the names:
<div type="notes">
  <note xml:id="BEECHJO">Sir Joseph Beecham, 1st Baronet (8 June 1848 -
    23 October 1916) the eldest son of Thomas Beecham (1820-1907) played a
    large part in the growth and expansion of his father's medicinal pill
    business which he joined in 1866....</note>
  <!-- other notes -->
</div>
<div type="snippets">
  <p>... Both Earl <name rend="sc">Beauchamp</name>
    and <name>Sir <ref target="#BEECHJO">Joseph Beecham</ref>
    </name> appear
    in the recent Honours List.</p>
  <p>By-the-by, it is denied that Sir <name rend="sc" ref="#BEECHJO">Joseph
      Beecham</name> was in any way responsible...</p>
</div>

We could also use the specialised <person> element in this case.
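
A minimal sketch of that alternative, assuming the namesdates module is included in the schema: the biographical facts move out of a prose note into a structured record, which takes over the identifier so that names can still point to it with @ref. The dates are those given in the note text; the arrangement is illustrative only.

```xml
<!-- Sketch only: replaces the prose note, reusing its identifier -->
<listPerson>
  <person xml:id="BEECHJO">
    <persName>Sir Joseph Beecham</persName>
    <birth when="1848-06-08"/>
    <death when="1916-10-23"/>
  </person>
</listPerson>

<!-- in the text, names point to the record as before -->
<name rend="sc" ref="#BEECHJO">Joseph Beecham</name>
```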

"Elsewhere" can be anywhere on the Internet...