Indexing with Xaira

1. Introduction

The Xaira system has the following components:
  • an indexer, which creates index files and lexica derived from a large collection of XML documents
  • a server, which handles queries against the collection, using the index files and lexica
  • one or more client programs, enabling access to the server from different platforms or software environments.

This document describes in some detail the operation of the first of these components only, and in particular how the information needed by the indexer is managed and passed to it.

The indexer may be run from the command line, under Windows or Unix, or it may be invoked using the Indextools Windows utility described below (6. The indextools utility). Its behaviour is determined by information read from a number of input files, chiefly a corpus header file in which the following information is provided:
  • names and descriptions for all the XML elements and attributes marked up in a corpus
  • how index entries are to be constructed from parsed element content and attribute values
  • how these indexed substrings are to be referenced when displayed to the user

The indexer makes as much use as possible of the tagging found in a corpus, but does not require the use of any particular DTD. For this reason, part of the input to the indexer must be a description of the tagging actually used in the corpus to be indexed, together with information about how specific elements are to be treated during the indexing process.

The indexer expects to process:
  • one or more valid or well-formed XML files. There is no requirement for a DTD, but if one is invoked, the file must be valid against it.
  • a single TEI-conformant corpus header containing a Xaira Specification, as documented in 4. The Xaira Specification. This must be a distinct file, unlike the documents making up the corpus, which may be spread across several different files or contained in a single one.
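
Before running the indexer, it can be useful to confirm that each input file is at least well-formed. The following is a minimal sketch in Python, not part of Xaira; the function name is_well_formed is hypothetical. It uses the standard library parser, which checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if xml_bytes parses as well-formed XML.

    Hypothetical helper for illustration: this checks well-formedness
    only. It does NOT validate the document against any DTD it invokes.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(b"<text><p>Hello</p></text>"))  # properly nested
    print(is_well_formed(b"<text><p>Hello</text>"))      # <p> is never closed
```

A file that invokes a DTD would additionally need a validating parser to confirm that it meets the requirement stated above.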

The indexer expects to process only XML, using an appropriate character encoding. This may be UTF-16 or UTF-8, indicated by the appropriate byte order mark; or some other permitted encoding, as indicated by means of an XML declaration. Different files may use different encodings, provided that each has an appropriate declaration. A pre-processor is available in the Indextools utility which can be used to convert character encoding or file format as necessary.
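
The encoding rules above can be sketched as a simple check: look for a byte order mark first, then for an encoding named in the XML declaration. This is an illustrative Python sketch under stated assumptions, not the indexer's own detection code; the name sniff_encoding is hypothetical, and the fallback to UTF-8 reflects the XML default when neither a byte order mark nor a declaration is present.

```python
import codecs
import re

# Matches an encoding pseudo-attribute in an XML declaration at the very
# start of the file (ASCII-compatible encodings only).
_DECLARATION = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_encoding(data: bytes) -> str:
    """Guess the character encoding of an XML file from its first bytes.

    Hypothetical helper: handles only the cases described in the text
    (UTF-8 or UTF-16 byte order mark, or an XML declaration).
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declared encoding: XML defaults to UTF-8.
    return "utf-8"


if __name__ == "__main__":
    print(sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

Because each file is inspected separately, a check of this kind is compatible with a corpus whose files use different encodings.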

A note on terminology may be helpful. We use the term file to refer to anything stored and managed by an operating system. It has a name (Xaira calls this its #sysid) and may contain one or more parts of a corpus. A corpus contains a corpus header, which must be contained in a single file, and a corpus body, which consists of one or more logical units called documents. Each document is a well-formed XML structure, possibly including a DTD: it may be stored as a single file, or it may be spread across several files. In XML terms, each of these files is a distinct entity, and they are combined to form a document by standard XML mechanisms such as XInclude or external entity reference. Xaira regards a corpus as being composed of one or more corpus texts, each of which has a distinct bibliographic reference or identifier. In the simplest case, each document contains a single corpus text; it is however possible to define multiple texts within a single document. In TEI terms, a document is normally equivalent to a single <TEI.2> element (comprising a TEI Header and a <text> element). A TEI <text> element may however contain multiple corpus texts.

The indexer works by first tokenizing the input stream, that is, by recognising as word forms individual character strings in the text files (2. Tokenization). These word forms may be associated with one or more additional keys which use additional information to group related word forms, or distinguish amongst homonyms (3. Additional keys and lemmatization). Each occurrence of the resulting key is associated with its location in the corpus, and assigned a reference (4.4. Referencing occurrences) which the client can display when it is retrieved. Individual texts can be classified according to different descriptive taxonomies for use by Xaira's partition mechanism (4.6.2. Taxonomy definition), and codebooks may be defined to provide more comprehensible labels for any analytic codes used in the corpus (4.6.1. Codebook definition). The natural languages and the subset of Unicode characters deployed in the corpus may also be defined (4.8. Language and character set issues).
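
The general idea of associating each key with the locations of its occurrences can be illustrated with a toy example. The Python sketch below is not Xaira's algorithm or file format: the function build_index, the regular expression used for tokenizing, and the use of lower-casing as a stand-in for an additional key are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Simplifying assumption: a word form is any run of word characters.
# Xaira's real tokenization rules are configurable and more elaborate.
WORD = re.compile(r"\w+")


def build_index(texts: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each key to a list of (text reference, token position) pairs.

    `texts` maps a reference (for example a text identifier) to the
    character content of that text. The key here is simply the
    lower-cased word form, which groups "The" and "the" together.
    """
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for ref, content in texts.items():
        for position, match in enumerate(WORD.finditer(content)):
            key = match.group(0).lower()
            index[key].append((ref, position))
    return dict(index)


if __name__ == "__main__":
    index = build_index({"T1": "The cat saw the dog", "T2": "A dog barked"})
    print(index["dog"])  # [('T1', 4), ('T2', 1)]
    print(index["the"])  # [('T1', 0), ('T1', 3)]
```

Each entry pairs a reference with a position, which is the minimum a client would need in order to locate an occurrence and display where it came from.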

In what follows, we discuss each of these topics in general terms, and then state how the various options are specified in the TEI-conformant corpus header.


Sections in this document: