Using the basic TEI structural elements
1. TEI Infrastructure
- The TEI encoding scheme consists of a number of modules
- These declare XML elements and their attributes
- An element's declaration assigns it to one (or more) model classes
- Another part of the declaration states its possible content and attributes with
reference to these classes
- This indirection gives the scheme both strength and flexibility
- It makes it easy to add new elements, or exclude existing ones, by referencing existing classes
2. What is a module?
- A convenient way of grouping together a number of element declarations
- These are usually on a related topic or specific application
- Most chapters of the TEI Guidelines focus on elements drawn from a single module, which
that chapter then defines
- A TEI schema is created by selecting modules and adding or removing
elements from them as needed
Modules, and the titles of the Guidelines chapters that define them, include:
- analysis: Simple Analytic Mechanisms
- certainty: Certainty and Responsibility
- core: Elements Available in All TEI Documents
- figures: Tables, Formulae, and Graphics
- gaiji: Representation of Non-standard Characters and Glyphs
- header: The TEI Header
- linking: Linking, Segmentation, and Alignment
- namesdates: Names, Dates, People, and Places
- nets: Graphs, Networks, and Trees
- spoken: Transcriptions of Speech
- tei: The TEI Infrastructure
- textstructure: Default Text Structure
- transcr: Representation of Primary Sources
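Selecting modules and trimming their elements is usually expressed in a TEI ODD customization. The fragment below is a minimal sketch of that idea, not the Punch Project's actual schema: the identifier "punch", the particular modules chosen, and the decision to drop <formula> are all assumptions made for illustration.

```xml
<!-- Hypothetical customization: ident and module selection are illustrative only -->
<schemaSpec ident="punch" start="TEI">
  <!-- modules that almost every TEI schema includes -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- tables and figures, leaving out one element assumed not to be needed -->
  <moduleRef key="figures" except="formula"/>
  <!-- names, dates, people, and places -->
  <moduleRef key="namesdates"/>
</schemaSpec>
```

Each <moduleRef> pulls in every element of the named module unless @except (or its counterpart @include) narrows the selection.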
4. The Imaginary Punch Project
- Punch is a famous English humorous journal, published regularly between 1841 and 1992: see http://www.punch.co.uk/historyofpunch.html.
- A project plans to make available fully marked up texts of
the journal, in conjunction with page images...
- for social historians
- for librarians
- for linguists
- How will the TEI help? And which parts of the TEI will we need?
5. Looking at Punch, what do we need to mark up?
- issue information and page number for reference purposes
- "chunks" or divisions of text, which may contain a picture, a
poem, some prose, some drama, or a combination
- within the chunks, we can identify formal units such as
- a picture, a caption
- stanzas, lines
- speeches and stage-directions
- and more...
6. TEI tags for the high level structure
We will treat each issue as a single <text>
element, and each identifiable chunk within it as a <div>
element of a particular type (e.g. cartoon, verse, ...).
For example, page 1
also has two, of different types:
<head>The enchanted castle</head>....
7. Why divisions rather than pages?
Because a division can start on one page (page 5, for example) and finish on another (page 6).
We use an empty element <pb> to mark the
boundary between pages, rather than enclosing each page in a <div>.
<head>Egypt in Venice</head>...
The sequence in which divisions appear is rather
8. Divisions can contain divisions...
<div type="snippet">Men for the Antarctic... Canadians</div>
- TEI also provides division elements with names that
indicate their degree of nesting (<div1>, <div2> etc.)
which some people prefer
- Divisions must always tessellate: once "down" a level, you
cannot pop "up" again within the same division.
9. Floating text
As mentioned above, <div>s must tessellate over the entire text.
<p> ... </p>
<p> ... </p>
<p> ... </p>
is valid but
<p> ... </p>
<p> ... </p>
<p> ... </p>
is not valid
A special <floatingText> element is available for "interruptions"
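The tessellation rule and the role of <floatingText> can be sketched as follows. This is an illustration under stated assumptions: the @type values (snippet, prose, letter) are invented here rather than taken from the Punch encoding, and the rejected structure is kept inside a comment so that the fragment stays well-formed.

```xml
<!-- Valid: the outer division is completely covered by its sub-divisions -->
<div type="prose">
  <head> ... </head>
  <div type="snippet"><p> ... </p></div>
  <div type="snippet"><p> ... </p><p> ... </p></div>
</div>

<!-- Not valid: a loose paragraph follows a sub-division
<div>
  <div><p> ... </p></div>
  <p> ... </p>
  <div><p> ... </p></div>
</div>
-->

<!-- An interruption with its own internal structure goes in floatingText -->
<div type="prose">
  <p> ... </p>
  <floatingText type="letter">
    <body>
      <p> ... </p>
    </body>
  </floatingText>
  <p> ... </p>
</div>
```

Because <floatingText> sits at the same level as a paragraph, the surrounding division can resume after it without breaking the tessellation.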
10. What are divisions made of?
(apart from other smaller divisions)
- <head> (heading)
- <p> (paragraph)
- <sp> (speech, contains any of the foregoing, also
<stage> and <speaker>)
- <list> (contains <head>, <label>, <item>)
- <table> (contains <row> containing <cell>) ...
- <l> (verse line) optionally grouped into <lg> (line group)
- <figure> (contains <graphic>, <figDesc>, ...)
11. For example....
contains a figure and a
<head>When the ships come home</head>
<figDesc>A man in Turkish dress lounges on a sofa,
smoking a cigarette and consulting a book
labelled "Naval ledger". Another man, in
traditional Greek costume, stands beside him,
also reading a notebook.</figDesc>
<p> Isn't it time we started fighting again?</p>
<p> Yes, I daresay. How soon could you begin?</p>
<p> Oh, in a few weeks.</p>
<p> No good for me. Shan't be ready till
12. For example...
The militants' tariff (on page 15) contains headings, paragraphs, and a table:
<head>THE MILITANTS' TARIFF.</head>
<head rend="right">Etna Lodge, W.</head>
<p>Mrs. Bangham Smasher, having entered into partnership with the
Misses Burnham Blazer, as General Agents of Destruction, begs to
inform the public that the firm will be prepared to execute
commissions of all kinds, at the shortest notice, on the very
moderate terms given below : —
<cell>For breaking windows, per window ...</cell>
<cell>For howling, kicking, or biting during service
in church, per howl, kick, or bite ...</cell>
<cell>For sitting on doorsteps of obnoxious persons,
per hour, if fine ...</cell>
<cell>For sitting on doorsteps of obnoxious persons,
per hour, if wet ...</cell>
13. Global attributes
Some features (potentially) apply to everything:
TEI provides global attributes for these:
- @xml:id provides a unique identifier for any element;
- @n provides a name or number for any element;
- @xml:lang specifies the language of any element, using
a BCP 47 language tag built from ISO language codes (e.g. "en", "fr");
- @rend and @rendition provide ways of specifying
the visual appearance (rendition) of any element.
14. For example...
Egypt in Venice (on page 5) begins with two headings, one in French:
<div type="prose" xml:lang="en" xml:id="I1914-07-01_05_02">
<head>Egypt in Venice.</head>
<head xml:lang="fr" rend="it">"La Légende de Joseph."</head>
<p>Those who know the kind of attractions that the
Russian ballet offers in so many of its themes ....</p>
Each stanza of the poem on page 10 has a
last line which is significantly indented:
<l>There were eight pretty walkers who went up a hill;</l>
<l>They were Jessamine, Joseph and Japhet and Jill,</l>
<l>And Allie and Sally and Tumbledown Bill,</l>
<l rend="indent">And Farnaby Fullerton Rigby.</l>
15. Macrostructure 1
All the issues of Punch for one year make up a volume.
We could regard the volume as a single <text>, and each issue as a <div>
within it. Or we could use the <group> element.
16. Macrostructure 2
As well as the texts, we have detailed metadata about each volume,
and images of its pages. These are the three parts of a canonical TEI document.
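As a rough sketch, the three parts map onto <teiHeader> (metadata), <facsimile> (page images), and <text> (the transcription). The skeleton below elides all detail; the image file name is a made-up placeholder.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata about the volume -->
  </teiHeader>
  <facsimile>
    <!-- page images; "page001.jpg" is a hypothetical file name -->
    <graphic url="page001.jpg"/>
  </facsimile>
  <text>
    <body>
      <!-- the transcribed issues and their divisions -->
    </body>
  </text>
</TEI>
```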
17. Macrostructure 3
If many such documents are grouped together to form a corpus
(rather than a collection), it may be useful to factor out the
metadata they have in common:
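A minimal sketch of that arrangement, with all content elided: a <teiCorpus> carries one header for the shared metadata, and each <TEI> document inside it keeps a header of its own for whatever is specific to it.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata shared by every document in the corpus -->
  </teiHeader>
  <TEI>
    <teiHeader>
      <!-- metadata specific to the first volume -->
    </teiHeader>
    <text>
      <!-- first volume -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <!-- metadata specific to the second volume -->
    </teiHeader>
    <text>
      <!-- second volume -->
    </text>
  </TEI>
</teiCorpus>
```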
18. What kinds of metadata?
For the Punch Project, and for any other comparable project, we
will need a place for such information as
- identification of the resource itself ("what is this thing?")
- statements of responsibility ("who did what when?")
- indication of source ("what was this derived from?")
- publication statement ("how is this item distributed and by whom?")
- declaration of encoding practice ("what do the codes we added mean?")
The TEI Header supports all these, and more.
19. The TEI Header
The TEI header was designed with two goals in mind:
- needs of bibliographers and librarians trying to
document ‘electronic books’
- needs of text analysts trying to document ‘coding
practices’ within digital resources
On the one hand, the Librarian's header
- uses standard bibliographic concepts
- respects established mappings to other such records
- has a preference for structured data over loose prose
On the other, Everyman's header
- Supports a (potentially) huge range of very miscellaneous
information, organized in fairly ad hoc ways
- Unpredictable combinations of narrowly encoded documentation systems
and loose prose descriptions
20. TEI Header Structure
The TEI header has four main components:
- <fileDesc> (file description) contains a full bibliographic description of an
electronic file.
- <encodingDesc> (encoding description) documents the relationship between an
electronic text and the source or sources from which it was derived.
- <revisionDesc> (revision description) summarizes the revision history for a
file.
- <profileDesc> (text-profile description) provides a detailed description of
non-bibliographic aspects of a text, specifically the languages and sublanguages used,
the situation in which it was produced, the participants and their setting (just
about everything not covered in the other header elements).
Only <fileDesc> is required; the others are optional.
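The four components can be sketched as a skeleton header. The title is the one used for the Punch example; everything in parentheses is placeholder text standing in for real content, and the language declaration is an assumption.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Punch, or the London Charivari, Vol. 147, July 1, 1914</title>
    </titleStmt>
    <publicationStmt>
      <p>(how the file is distributed, and by whom)</p>
    </publicationStmt>
    <sourceDesc>
      <p>(what the file was derived from)</p>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <p>(encoding practices applied)</p>
  </encodingDesc>
  <profileDesc>
    <langUsage>
      <language ident="en">English</language>
    </langUsage>
  </profileDesc>
  <revisionDesc>
    <change>(what was changed, when, and by whom)</change>
  </revisionDesc>
</teiHeader>
```

Within <fileDesc>, the title, publication, and source statements are the mandatory parts. When <revisionDesc> is present, the schema expects it as the last child of the header.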
21. Simple TEI Header for Punch Project
<title>Punch, or the London Charivari, Vol. 147, July 1, 1914</title>
<p>This text is freely available for re-use
under US and UK law, consult your local
legal restrictions if elsewhere.</p>
<p>This text is a TEI version of a Project Gutenberg
text originally located at <ptr
As per their license agreement we have removed all
references to the PG trademark.</p>
22. Below the paragraph...
Within the elements already introduced, TEI offers plenty of scope
for mark-up of smaller components. For example:
- boundaries, such as page, column, or line breaks
- highlighting, emphasis and quotation
- editorial changes such as correction, normalization etc.
- names, numbers, dates, addresses...
- links and cross-references
- notes, annotation, indexing
- bibliographic citations
- words and other analyses
23. Highlighting
By highlighting we mean any combination of typographic features
(font, size, hue, etc.) which distinguishes the highlighted text
from its surroundings. This may be for many reasons:
- to mark foreign, archaic, or technical usages
- for emphasis when spoken
- to show something is not part of the text
(e.g. cross references, titles, headings)
- or is attributed to some other agency inside or outside the text (e.g.
direct speech, quotation)
TEI provides both a generic <hi> tag and a
large number of specific ones...
24. A few highlighting examples
- <hi> (highlighted: reason unknown or unimportant):
<p>[The rest of this communication is
omitted owing to considerations of
<said>'E won't bite yer <emph>if you buy 'im</emph> guv'ner.</said>
- <title> and <foreign>:
<foreign xml:lang="fr">À propos</foreign> of Oxford, it is a
question whether that extremely amusing book
<title>Verdant Green</title> is still much read by freshers.
- <distinct> (linguistically marked):
But then I remind myself
that the Russian ballet is nothing if not
25. Quotation
Quotation marks can similarly be used to set off text for many reasons:
- <q> (used if the reason is unknown or unimportant)
- <said> (speech or thought)
- <quote> (attributed to an external source)
- <mentioned> and <soCalled> (nuances of
<said who="#Celia">I know a lovely tin of potted
grouse,</said> said Celia, and she went off to cut some sandwiches.
<head>How to utilise the art of <soCalled>suggestion</soCalled>
<head>The Doctor, six down at the turn,
<soCalled>suggests</soCalled> to his opponent that
they are playing croquet, and wins by two and one.</head>
26. Quotation (continued)
Note that these elements can nest within one another:
<p>The poet returned to his work. <said>
tooth and claw,</quote>
</said> he muttered to himself,
<quote>In tooth and claw.</quote>
27. Editorial intervention
As a simple example, consider:
‘Excuse me sir, but would you like to buy a nice little dawg?’ on page 6.
- use <orig> to show that "dawg" is what it says, even
though this is a nonstandard spelling
- use <reg> to show that "dog" is an editorially-supplied
regularisation of what it says
- or provide both within a <choice> element to say either is
a valid encoding:
...a nice little
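A minimal sketch of the <choice> encoding for this sentence; wrapping the speech in <said> is an assumption about how the surrounding text is marked up.

```xml
<said>Excuse me sir, but would you like to buy a nice little
<choice><orig>dawg</orig><reg>dog</reg></choice>?</said>
```

A stylesheet can then show either reading: <orig> for a diplomatic view of the page, <reg> for a normalized or searchable text.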
28. Names of persons, places, things...
- <name> (a name in the text; contains a proper noun or noun phrase)
- <rs> (a general-purpose name or referencing string)
- <title> (any form of title)
The @type attribute is useful for categorizing these; <name> and <rs> also both have @key,
@ref, and @nymRef attributes.
29. Examples of names
Use @type to distinguish personal from geographic names:
<p>The scene opens at a party given by
<name type="person">Potiphar</name> in
<name type="place">Venice</name>. </p>
Use @key to de-reference names:
<label>Business done.</label>—The Commons
still harping on the Budget.
Tim Healy</name> enlivened proceedings by vigorous personal attack
on <q>the most reckless and incapable
<rs key="LLG">Chancellor of the Exchequer</rs>
that ever sat on the Treasury Bench.</q>
<name key="LLG">Lloyd George's</name>
retort courteous looked forward to with interest.
30. Dates and times
- <date> contains a date and time in any format
- For processing it is convenient to add a normalized
version, using the @when attribute
- Uncertain dates and times, and ranges, can be indicated by
other attributes: @notBefore, @notAfter, @from, @to
<p>House of Commons, <date when="1914-06-22">Monday, June 22, 1914</date>.</p>
<date notAfter="1914-06-01" notBefore="1914-03-01">Sunday, a month ago,</date> was hot.
31. Cross references
A cross reference is a link from one point in a text (the source) to
another (the target).
TEI provides generic elements <ptr> and <ref> for
this purpose. If the linking text can be automatically generated, use
<ptr>; otherwise use <ref>.
The source is the location of the <ptr> or <ref>; the
target is specified by the @target attribute, in the form of
a URI reference.
See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
32. Bibliographic Citations
TEI provides special elements for bibliographic citations or references:
- <bibl> (loosely structured)
- <biblStruct> (standard bibliographic structure)
- <listBibl> (encloses a bibliography)
These are typically used in preparing bibliographies, or in
footnotes. But even in Punch, there are examples.
33. Simple <bibl> Example
In Punch, bibliographic citations are usually associated with
a quotation from another source. The <cit>
element groups the two:
<quote>It was the time when Henry III. was
batting with Simon de Montfort and his
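The general shape of a cited quotation can be sketched as below. The parenthesised text is placeholder content, not a reconstruction of the Punch passage or of its source.

```xml
<cit>
  <quote>(the quoted passage)</quote>
  <bibl>
    <title>(title of the work quoted from)</title>
  </bibl>
</cit>
```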
34. Embedded notes
Notes, whether appearing in the original source, or added by an
editor, can be marked using the <note> element.
We might use this to add biographical details to the Punch
<p>By-the-by, it is denied that
Sir <name rend="sc">Joseph Beecham</name>
<note>Sir Joseph Beecham, 1st Baronet
(8 June 1848 - 23 October 1916)...</note>
was in any way responsible for the Government's
"Pills for Earthquakes," by which it was hoped to
avert the Irish crisis.</p>
<note> has attributes @place and
35. Linked notes
Since we have several references to the same person, it might be
better to put the notes elsewhere and point to them from the names:
<note xml:id="BEECHJO">Sir Joseph Beecham, 1st Baronet (8 June 1848 -
23 October 1916) the eldest son of Thomas Beecham (1820-1907) played a
large part in the growth and expansion of his father's medicinal pill
business which he joined in 1866....</note>
<p>... Both Earl <name rend="sc">Beauchamp</name>
and <name>Sir <ref target="#BEECHJO">Joseph Beecham</ref>
in the recent Honours List.</p>
<p>By-the-by, it is denied that Sir <name rend="sc" ref="#BEECHJO">Joseph
Beecham</name> was in any way responsible...</p>
Could also use specialised <person> element, in this
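A sketch of the <person> alternative, reusing the biographical details from the note above. It assumes the namesdates module is included in the schema and that the <person> entry replaces the <note>, so that the identifier BEECHJO is declared only once.

```xml
<listPerson>
  <person xml:id="BEECHJO">
    <persName>Sir Joseph Beecham, 1st Baronet</persName>
    <birth when="1848-06-08"/>
    <death when="1916-10-23"/>
    <note>Eldest son of Thomas Beecham (1820-1907); played a large part in the
    growth and expansion of his father's medicinal pill business, which he
    joined in 1866.</note>
  </person>
</listPerson>

<p>By-the-by, it is denied that Sir <name rend="sc" ref="#BEECHJO">Joseph
Beecham</name> was in any way responsible...</p>
```

A <listPerson> is commonly kept in the header (inside <profileDesc>) or in a separate file, with each <name> in the text pointing to it through @ref.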
"Elsewhere" can be anywhere on the Internet...