Indexing with Xaira
1. Introduction
This document describes in some detail the operation of the Xaira indexer only, and in particular how the information needed by the indexer is managed and passed to it.
The indexer makes as much use as possible of the tagging found in a corpus, but does not require the use of any particular DTD. For this reason, part of the input to the indexer must be a description of the tagging actually used in the corpus to be indexed, together with information about how specific elements are to be treated during the indexing process.
The indexer therefore requires the following input:
- one or more valid or well-formed XML files. There is no requirement for a DTD, but if one is invoked, the file must be valid against it.
- a single TEI-conformant corpus header containing a Xaira Specification, as documented in 4. The Xaira Specification of this document. This must be a distinct file, unlike the documents making up the corpus, which may be spread across several different files, or contained in a single one.
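As an illustration of the first requirement, a file can be checked for well-formedness before it is handed to the indexer. The following is a minimal sketch using Python's standard library; the function name `is_well_formed` and the sample data are assumptions made for this example and are not part of Xaira. It tests well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return (True, None) if xml_bytes parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # The parser reports the first well-formedness error it meets.
        return False, str(err)
    return True, None


# Hypothetical sample inputs, for illustration only.
good = b"<text><p>A short corpus text.</p></text>"
bad = b"<text><p>An unclosed paragraph.</text>"

print(is_well_formed(good))
print(is_well_formed(bad)[0])
```

A file that fails this check would need to be repaired before indexing, whether or not a DTD is used.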
The indexer expects to process only XML, using an appropriate character encoding. This may be UTF-16 or UTF-8, indicated by the appropriate byte order mark; or some other permitted encoding, as indicated by means of an XML declaration. Different files may use different encodings, provided that each has an appropriate declaration. A pre-processor is available in the Indextools utility which can be used to convert character encoding or file format as necessary.
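The encoding rules just described can be sketched as a small detection routine. This is an assumption-laden illustration in Python, not the logic of the indexer or of Indextools: the name `sniff_encoding` is hypothetical, only the UTF-8 and UTF-16 byte order marks are examined, and a file with neither a byte order mark nor an encoding declaration is treated as UTF-8, which is the XML default.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of a file.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data):
    """Guess the character encoding of an XML file from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        # Some other encoding, named in the XML declaration.
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML default when nothing else is indicated


print(sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

Because the check is made per file, different files of the same corpus can carry different encodings, as the paragraph above allows.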
A note on terminology may be helpful. We use the term
file to refer to anything stored and managed by an
operating system. It has a name (Xaira calls this its
#sysid) and may contain one or more parts of a
corpus. A corpus contains a corpus header, which
must be contained in a single file, and a corpus body, which consists of one
or more logical units called documents. Each document is
a well-formed XML structure, possibly including a DTD: it may be
stored as a single file, or it may be spread across several files. In
XML terms, each of these files is a distinct entity, and they are
combined to form a document by standard XML mechanisms such as
XInclude or external entity reference. Xaira regards a corpus as being
composed of one or more corpus texts, each of which has a
distinct bibliographic reference or identifier. In the simplest case,
each document contains a single corpus text; it is however possible to
define multiple texts within a single document. In TEI terms, a
document is normally equivalent to a single <TEI.2> element
(comprising a TEI Header and a <text> element). A TEI
<text> element may however contain multiple corpus texts.
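The relationships between these terms can be summarised as a small data model. The sketch below is purely illustrative: the class names, field names and file names are assumptions chosen for this example and do not correspond to any Xaira data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorpusText:
    identifier: str  # distinct bibliographic reference or identifier


@dataclass
class Document:
    sysids: List[str]  # one or more files (XML entities) making up the document
    texts: List[CorpusText] = field(default_factory=list)


@dataclass
class Corpus:
    header_sysid: str  # the corpus header is always a single file
    documents: List[Document] = field(default_factory=list)

    def all_texts(self):
        """Return every corpus text, in document order."""
        return [t for d in self.documents for t in d.texts]


# Hypothetical corpus: one single-file document holding one text, and one
# document spread across two files that holds two texts.
corpus = Corpus(
    header_sysid="corpus_header.xml",
    documents=[
        Document(sysids=["novel.xml"], texts=[CorpusText("NOV1")]),
        Document(
            sysids=["letters_a.xml", "letters_b.xml"],
            texts=[CorpusText("LET1"), CorpusText("LET2")],
        ),
    ],
)

print([t.identifier for t in corpus.all_texts()])
```

The point of the model is that the number of files, the number of documents and the number of corpus texts are independent of one another.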
The indexer works by first tokenizing the input stream, that is, by recognising as word forms individual character strings in the text files (2. Tokenization). These word forms may be associated with one or more additional keys which use additional information to group related word forms, or distinguish amongst homonyms (3. Additional keys and lemmatization). Each occurrence of the resulting key is associated with its location in the corpus, and assigned a reference (4.4. Referencing occurrences) which the client can display when it is retrieved. Individual texts can be classified according to different descriptive taxonomies for use by Xaira's partition mechanism (4.6.2. Taxonomy definition), and codebooks may be defined to provide more comprehensible labels for any analytic codes used in the corpus (4.6.1. Codebook definition). The natural languages and the subset of Unicode characters deployed in the corpus may also be defined (4.8. Language and character set issues).
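The general idea of tokenizing a text and recording where each word form occurs can be sketched as follows. This is a deliberately simplified illustration in Python and not Xaira's algorithm: the regular expression used as a tokenizer, the lower-casing of keys, the function name `build_index` and the sample texts are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed tokenization rule: a word form is any run of word characters.
TOKEN = re.compile(r"\w+")


def build_index(texts):
    """Map each lower-cased word form to a list of (text_id, token_position)."""
    index = defaultdict(list)
    for text_id, content in texts.items():
        for position, match in enumerate(TOKEN.finditer(content)):
            index[match.group().lower()].append((text_id, position))
    return index


# Hypothetical corpus texts keyed by identifier.
texts = {"T1": "The cat sat.", "T2": "The dog sat on the cat."}
index = build_index(texts)

print(index["cat"])
```

Each entry pairs a key with the locations of its occurrences, which is the kind of association that the referencing scheme then turns into a displayable reference.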
In what follows, we discuss each of these topics in general terms, and then state how the various options are specified in the TEI-conformant corpus header.
