ELRA Work Package 3: first draft

Validation of Linguistic Corpora
Tony McEnery, Lou Burnard, Andrew Wilson & Paul Baker

1. Introduction

This document forms the chief deliverable for Work Package 3 of the ELRA contract for validation of language corpora. It discusses the theoretical basis underlying our approach to the formal validation of language corpora, and makes some recommendations about relevant techniques and practices which may be of assistance in performing such evaluations, and documenting their results. Particular attention is paid to the specific case of morpho-syntactically annotated corpora.

1.1. Tagsets and Annotation

Some confusion exists about the terminology associated with linguistically annotated corpora. This is partly because the term tagset is used differently by two different communities. For the traditional corpus linguist, a tagset is the set of possible values used to explicitly annotate a text with a linguistic analysis; for example, the CLAWS tagset comprises a set of values such as NN1, VVD etc., each of which has a specific significance (singular common noun, verb past tense, etc.) For the mark-up specialist however, the term tagset refers to any kind of annotation, in particular the collection of SGML tags corresponding with the elements defined in a particular DTD: for example, the TEI defines a number of tagsets, each containing definitions for specific SGML elements and attributes.

Both usages reflect the fact that all markup introduced into a text is identical, at some level of analysis, in the sense that it serves to record or assert an association between stretches of text and values taken from some externally defined set of interpretations. However most people seem to categorize an analysis such as ‘this is a paragraph’ differently from the formally equivalent judgement ‘this is a noun’. The former judgement is said to be ‘structural’ and the latter ‘interpretative’. This kind of categorization also underlies the notion of ‘level’ of annotation as exemplified by (inter alia) the Corpus Encoding Specification (Ide 1998), where the distinction is further justified by the observation that the addition of so-called ‘structural’ markup is generally easier to automate than that of ‘interpretive’ markup, since the latter (almost) invariably requires human judgement and knowledge, while the former rarely does. Particularly in the case of textual markup, interpretative judgements tend to be more controversial than structural ones, if only because the latter relate to aspects of a text which are accepted as intrinsic to its substance by the community of text readers. Structural interpretations form part of the ‘contracts of literacy’ (Snow and Ninio, 1986) which form the precondition of a text's recognition as meaningful by the members of a particular community of readers.

For purposes of validation, however, the distinction seems unhelpful. All markup introduced into a corpus should be validated in the same way, and the validity of the corpus overall is equally affected by each type of markup used. Nevertheless, we have subdivided our discussion into two parts, reflecting the division currently made by most practitioners between structural and interpretative markup, a division which is consequently reflected in actual practice. Structural markup is most generally to be validated with reference to an abstract model of textual components and features which is either entirely intuitive and ‘common sense’ based, or defined in terms of some consensus-based model such as that of the TEI, restated as an SGML DTD. Interpretative markup may be similarly theory-free (see, for example, Leech 1993), but it is more customary to define it with reference to some explicitly stated analytic model, and hence to facilitate both automatic validation of the corpus itself (to check that it is valid in its own terms) and comparison of two corpora using different markup schemes derived from a common abstract model.

In section 2. Validation of Structural Analyses we discuss the process by which the structural markup defined for a given corpus may be validated. The formal mechanism used for this purpose is an SGML document type definition. In section 3. Validation of Morphosyntactic Analyses we discuss in more detail one particular kind of interpretative markup: that which seeks to make explicit morpho-syntactic analysis of a text. We present here an SGML scheme for the formal expression of an abstract model that may be used to validate such analyses both internally and externally. Finally, in section 4. Representation of Validation we suggest some ways in which the result of either validation exercise may be formally documented. We begin, however, by describing the model of formal validation which underlies both descriptions. (For a more detailed discussion of the principles adumbrated here, see Sperberg-McQueen and Burnard 1995).

1.2. Principles of Textual Markup

We begin by positing the existence of textual features or abstractions, instances of which are predicated at various positions within a document. The function of markup is to indicate unambiguously the presence of instances of such features. For example, a document may contain instances of the feature ‘segment’, whose presence might be signalled by such markup conventions as:
  • the start of a new input record;
  • the presence of some distinguishing code or sign such as a star, not otherwise present in the text;
  • the presence of some predefined symbol such as the tag <s>.

As noted above, the presence and scope of a feature such as ‘singular noun’ may be predicated in exactly the same way.

We further assume that it is possible to define a grammar for such markup symbols: that is, a grammar which defines which combinations of such symbols in a document are to be regarded as legal. Such grammars generally have regard only to the markup language itself, rather than its extension to the underlying feature set. A markup grammar may simply enumerate all legal markup tokens, or merely specify an algorithm for the identification of markup tokens, with no consideration of which markup tokens might be permitted. A more complex grammar (such as SGML) may also be used, enabling, in addition to rules of these kinds, the formulation of contextual rules such as ‘the tag X is only legal within the scope of the component identified by tag Y’. Note however that legality is still defined here in terms of syntax: only informal legislation can determine whether the content of an SGML element is ‘correct’ with reference to some semantic model. Publications such as the TEI Guidelines typically extend the syntactic definitions embodied in their DTDs by more or less detailed discussion of the intended semantics of elements, but rarely provide a formally verifiable abstract model of such semantics, nor is it entirely clear what such a model might resemble. Nevertheless, throughout our discussion we will use the term feature (and derivatives) to refer to components of such a model, and the term tag (and derivatives) to refer to components of the markup system used to assert their existence.
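By way of illustration, a contextual rule of the kind just described can be checked by a very simple program. The following sketch (in Python, with wholly invented tag names and containment rules) verifies a flat stream of opening and closing tags against a toy grammar; a real SGML parser does a great deal more, but the principle is the same:

```python
# A toy contextual markup grammar: each tag may only appear directly inside
# the parent named for it (None = document top level). The tag names and
# containment rules here are invented for illustration.
ALLOWED_PARENT = {"text": None, "p": "text", "s": "p", "w": "s"}

def validate_context(tokens):
    """Check a flat stream of <x> / </x> tokens against the contextual rules.

    Returns a list of error messages; an empty list means the stream is legal.
    """
    errors, stack = [], []
    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1]
            if not stack or stack[-1] != name:
                errors.append(f"unexpected close tag {tok}")
            else:
                stack.pop()
        else:
            name = tok[1:-1]
            parent = stack[-1] if stack else None
            if name not in ALLOWED_PARENT:
                errors.append(f"unknown tag {tok}")
            elif ALLOWED_PARENT[name] != parent:
                errors.append(f"{tok} not legal inside <{parent}>")
            else:
                stack.append(name)
    if stack:
        errors.append(f"unclosed tags: {stack}")
    return errors
```

A legal stream such as `<text><p><s><w></w></s></p></text>` yields no errors, while an `<s>` appearing directly inside `<text>` is reported.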

This distinction seems to us crucial to the feasibility of validation: ‘A corpus is a collection of utterances, and therefore a sample of actual linguistic behaviour. However, even if we do not believe that the distinction between competence and performance is valid, a corpus is not itself the behaviour, but a record of this behaviour’ (Stubbs, 1996). The function of the markup in the corpus is to make explicit, and hence accessible to comparative study, the recording process for both structural and interpretative encoding in a corpus text. Without this, neither comparative studies of different corpora, nor any assessment of the validity of the corpus ‘record’ with respect to what it ‘records’ will be possible.

We define the process of validation as follows:
  1. for each feature of interest, does the document contain any tagging?
  2. is the tagging of the document syntactically correct?
  3. is the tagging of the document consistently applied (i.e. is every occurrence of a given feature tagged in the same way)?
  4. is the tagging of a document correctly applied, with reference to some externally (or internally) defined abstract model?
  5. if correct, is the tagging of a document complete, with reference to some externally (or internally) defined list of mandatory features?

Taking these in reverse order, it is clear that, in the general case, the last two of these stages are automatable only to the extent that an abstract model can be formally specified for both the feature system itself and for the intended correspondence between that and the tagging employed. We present in section 5.1. A Feature System Declaration for the EAGLES morphosyntactic Guidelines below one such abstract model, the EAGLES Guidelines for morpho-syntactic annotation (Leech and Wilson, 1994), re-expressed as a TEI-conformant feature system, against which any other set of morpho-syntactic annotations using the same representation may be validated, without necessarily having to conform to the EAGLES model. We also discuss the somewhat simpler abstract model proposed by EAGLES itself in section 3.2. Understanding the Markup below.

Equally clearly, however, neither the third nor the first of the stages above can in principle be automated, since both depend on a human judgement to the effect that such and such a feature is in fact present, whether or not it is signalled by the tagging in a text. Such text-comprehension abilities still seem to be somewhat beyond the state of the art in NLP, despite some advances.

The second of these stages is however automatable, to the extent that the tagging syntax of the document is fully specified. In an SGML context, this implies the existence of a DTD against which candidate documents can be verified using an SGML parser. For other forms of markup, validation may involve other forms of verification, some of which may be intimately tied in to the behaviour of particular application software. For example, a document marked up in RTF or LaTeX may be considered valid so long as Microsoft Word or LaTeX does not reject it, irrespective of its output. Technical documentation will often specify what markup should be found in a document: where the markup syntax is arbitrary or application specific, clearly special purpose software must be developed to validate it.

2. Validation of Structural Analyses

2.1. Corpus Composition

Language corpora are made by combining together whole texts or extracts from pre-existing documents, usually according to some specific design criteria. The structure of the corpus itself may thus be described (and hence marked up) at two levels: internal, relating to the way the parts of the corpus fit together, and external, relating to compositional features of the originals. This distinction holds good whether the corpus under consideration is a fixed document or a dynamic or ‘monitor’ corpus; in the latter case, as well as generally dictating the use of whole texts rather than extracts, the internal design criteria may be further extended to include such topics as the rate at which new documents enter the corpus, the criteria for determining when they should be discarded from it, etc.

The internal structural features of a corpus are largely self-evident, and require little validation: common practice demands only that individual text fragments be clearly delimited, and that each be associated with an appropriate level of description or metadata. In the TEI model, the former constitutes the text proper, and the latter its header. In older corpora, it was common practice to provide such metadata (if at all) as a separate documentary component, with only an informal association between the two, often depending on such artifices as file-naming conventions or sequencing to identify descriptive features of each component. The TEI model uses the power of SGML (in particular, its hierarchic structure and the consequent ability to specify property inheritance) to build more sophisticated structures. (For an account of some of these, see the discussion in e.g. Chapter 23 of the TEI Guidelines.)

The scope of the external features to be found marked up in language corpora varies greatly, depending both on the diverse nature of the materials they include and the diversity of applications envisaged for them. In large corpora, economic considerations alone preclude any attempt at modelling in the markup the full diversity of structures which a detailed textual feature analysis might indicate as possible: in the earliest corpora, for example the Lancaster/Oslo/Bergen corpus, even such basic organizational features as paragraphs or subheadings are rarely distinguished as such. Even today, the corpus designer is always forced to make pragmatic decisions about which structural features will have sufficient usefulness in the intended applications to warrant the expense of identifying them consistently and correctly. For many purposes, division into discrete segments, corresponding with identifiable locations in the original source, is adequate. For other purposes (for example, the study of discourse-related phenomena or text-grammar) a richer approach will be desirable.

Standards such as the CES provide a rich set of feature descriptions from which the corpus builder can select, together with specific tagging rules about how the presence of selected features can be made explicit. There is, however, considerable (and understandable) reluctance to make recommendations about which particular selections are appropriate or mandatory, since this will inevitably depend on the intended application for the corpus.

To validate such corpora therefore, a necessary first step is to identify the intentions of the designer. A corpus which does not mark up paragraph divisions is not necessarily less valid or useful than one which does; a corpus which claims to mark such divisions but which does so inconsistently or inaccurately is. Unfortunately, as WP2 demonstrates, it is often hard for corpus builders to specify their intentions in this respect, and harder for the validator to determine the extent to which these intentions have been carried out. Documentation and the provision of a DTD go some way to simplifying the task, as further discussed below.

2.2. Syntactic Consistency

As noted above, the extent to which the syntactic consistency of the structural markup in a corpus can be validated depends on the extent to which that markup uses a formally verifiable syntax. The great merit of SGML as a markup language is precisely that it makes this automatic verification simply a matter of defining an appropriate grammar (a document type definition) and checking the corpus against it. The most widely used software for this purpose is currently the freely available SGML parser SP, particularly its DOS incarnation NSGMLS [SP]. With the growing take up of SGML and of its simplified version XML, the number and sophistication of such systems is likely to increase greatly.

SP and similar programs typically perform a number of other functions on a document, but for validation purposes, the key functions may be summarized as follows:
  • are the tags present in the corpus all defined in its DTD?
  • are the tags in the corpus all present in syntactically correct contexts?
  • do all attributes specified for the tags in the corpus conform to the value ranges specified for them in the DTD?
  • are any cross references specified by the SGML markup satisfied?
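The last of these checks can be illustrated in miniature. The sketch below (in Python; the attribute names and the crude regular expressions are illustrative only, and no substitute for a real SGML parser such as SP) verifies that every identifier referenced in a feats or target attribute is declared somewhere as an id:

```python
import re

# Crude patterns for declared identifiers and for attributes whose values
# are (space-separated lists of) identifier references. A real SGML parser
# derives this information from the DTD's ID/IDREF declarations.
ID_PATTERN = re.compile(r"\bid=(\w+)")
IDREF_PATTERN = re.compile(r'\b(?:feats|target)="([^"]+)"')

def unresolved_references(document):
    """Return the set of referenced identifiers that are never declared."""
    ids = set(ID_PATTERN.findall(document))
    refs = set()
    for value in IDREF_PATTERN.findall(document):
        refs.update(value.split())
    return refs - ids
```

Given a fragment declaring only `id=FCN` but referring to both `FCN` and `FPM`, the function reports `FPM` as unresolved.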

The output from an SGML parser is thus typically either simply confirmation that the document does in fact conform to the DTD, or a list of instances where it does not conform. At the risk of stating the obvious, it should be emphasized that a corpus which does not conform to its DTD, or which lacks a DTD, cannot be validated, no matter how closely its markup appears to be modelled on that of the SGML standard. The notion ‘SGML-like’ or ‘unvalidated SGML’ is not a helpful one in this context.

For corpora which do not use SGML markup, validation will require the provision of some DTD-like set of formal rules, and the production of some parser-like software to check them against the corpus itself. Such procedures are eminently feasible, and for simple markup schemes may be considered preferable to the expense of converting the markup to true SGML. For a variety of reasons which need not be summarized here, however, we do not recommend this approach: in the long run, the use of a widely accepted standardized markup language should always be less expensive than the maintenance of an idiosyncratic or application-limited scheme.

2.3. Structural Correctness

The list of questions to which an SGML parser will provide answers given in the previous section falls some way short of what we would like to know before deciding that a given corpus is suitable for our purposes in the general case. In particular, a parser cannot tell us
  • whether every item tagged as an instance of some feature is in fact such an instance
  • whether every instance of some feature is in fact tagged as such

To a large extent, however, these are limitations inherent in the whole markup enterprise; they also touch on fundamental problems of naming and ontology which have exercised philosophers since the time of Aristotle, and for which it would be unreasonable to expect immediate answers. Nevertheless, it is possible to make some pragmatic observations, additional to those provided in section 3.3.1. Semantic Correctness below concerning the semantic validation of analytic tagging.

Although not formally presented as such, pre-defined feature lists such as those provided by the TEI and CES may be regarded as constituting a kind of abstract model for the structural components of texts. They thus provide a useful reference point against which the validator may check both that the objects tagged as representing some feature appear to conform with the definitions supplied there, and conversely that no features conformant with those definitions are present but untagged or tagged inappropriately. This remains however an entirely manual process.

Few corpora are small enough to permit the luxury of a close reading, and so in the general case this kind of manual validation can only be done by sampling. The typical procedure is thus to inspect some random sample of the corpus for the presence of specific tagged features, for example paragraph boundaries or headings. Provided that the location of these samples within the original documents is known, an attempt can then be made to assess the accuracy with which the tagging of structural features has been carried out across the corpus with respect to the original source. In the absence of an original source, such accuracy can be assessed only in statistical terms, for example by comparing the distribution of certain tagged features in the sample with their distribution across the whole, where a ‘correct’ distribution can be hypothesized on the basis of a priori reasoning (e.g. the number of paragraphs per text of a given type should be reasonably stable) or by applying other statistically derived heuristics.
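The statistical heuristic just described can be sketched as follows. This is a deliberately simple Python illustration; the tolerance figure and the choice of paragraph density as the test statistic are assumptions made for the example, not a recommendation:

```python
# Compare the density of a tagged feature (e.g. <p> elements per 1000
# tokens) in a sample against the corpus as a whole, flagging samples that
# deviate from the corpus-wide figure by more than a chosen tolerance.
def density(tag_count, token_count):
    """Occurrences of a tagged feature per 1000 tokens."""
    return 1000.0 * tag_count / token_count

def sample_is_plausible(sample, corpus, tolerance=0.5):
    """sample and corpus are (tag_count, token_count) pairs.

    Returns True if the sample's feature density lies within
    tolerance * (corpus density) of the corpus-wide figure.
    """
    expected = density(*corpus)
    observed = density(*sample)
    return abs(observed - expected) <= tolerance * expected
```

A sample of 2000 tokens containing only one paragraph tag, in a corpus averaging five such tags per 1000 tokens, would thus be flagged for manual inspection.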

3. Validation of Morphosyntactic Analyses

In this section we discuss the possibilities for automatic or semi-automatic validation of one particular form of interpretative markup: that which seeks to mark up the result of a morphosyntactic analysis.

3.1. Presence

Whatever form of markup is employed, morphosyntactic tagging is usually supplied at the level of individual tokens in a text and is thus usually self-evident. In the absence of any documentation, it is generally a simple matter to extract from a document all the unique tokens constituting the markup, and also to identify the lexemes to which they are attached, as was done, for example, by Garside and McEnery 1993. In this example, annotations were separated from words by underscore characters. Other schemes place the markup and lexeme in separate ‘fields’, or on alternate lines within the text proper. In SGML documents, annotations may be represented as attribute values, or as distinct elements, and the association between lexical item and annotation may be made by means of a pointer or link.
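For the underscore convention, the extraction just described amounts to a few lines of code. The following Python sketch collects the unique annotation tokens in a text, together with the lexical items each is attached to:

```python
from collections import defaultdict

# Collect the inventory of annotation tokens from an underscore-annotated
# text (the word_TAG convention), mapping each annotation to the set of
# lexical items it is attached to.
def annotation_inventory(text):
    inventory = defaultdict(set)
    for item in text.split():
        if "_" in item:
            word, tag = item.rsplit("_", 1)
            inventory[tag].add(word)
    return dict(inventory)
```

Applied to a fragment such as `the_AT dog_NN1 barked_VVD`, this yields the token set {AT, NN1, VVD} and the lexemes associated with each.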

3.2. Understanding the Markup

It will be rather less easy (in the absence of documentation) to determine what feature or combination of features each markup token is intended to represent. The list of all markup tokens, together with an index of their occurrences, and the associated lexical item, might be collated with an annotated corpus in which the same lexical items are associated with annotations whose feature equivalences are known, thus providing a kind of latter-day Rosetta Stone for the purpose. Such a process is hardly likely to be easily automated. This is one good reason for insisting on the availability of such documentation, preferably in a form which can be readily mapped to agreed standards.

Such mapping requires the predefinition of an agreed set of morphosyntactic features, independent of markup. Such a set is provided in the context of several western European languages (such as Danish, English, French, German, Greek and Spanish) by the EAGLES morphosyntactic annotation guidelines (Leech and Wilson, 1994), which we have therefore adopted as a test case for our recommendations. The procedures described here and the conclusions we reach would be equally applicable to any other set of Guidelines. However, as the EAGLES guidelines have been published on the basis of a wide-ranging review of corpus builders, recommendations derived from it are likely both to reflect, and potentially have a wide impact on, current practice.

The EAGLES recommendations have a dual focus: as well as providing an abstract model of the feature sets against which any particular combination of the features tagged in some corpus may be validated, the Recommendations specify explicitly a subset of ‘recommended’ features which it is assumed should always be marked. Validation at this level thus becomes a matter of simply checking that the recommended features are in fact present: in the terms we introduced in section 1.2. Principles of Textual Markup above, validation that the tagging is not only syntactically correct, but also complete.

3.2.1. Representation of Features in EAGLES

EAGLES provides an ‘intermediate representation’ for the encoding of feature sets. This operates as follows:
  • a one- or two-letter code is used for some ‘obligatory features’ (the basic parts of speech): for example, AJ indicates the feature ‘Adjective’, N indicates the feature ‘Noun’, and so on;
  • each recommended feature that is assigned to an obligatory feature occupies one place in the representation: thus, if, as for the obligatory feature ‘Noun’, there are four associated recommended features, then there will be a four-place representation. ‘Recommended’ features are not mandatory, but come with a strong suggestion that any system of morphosyntactic annotation for the languages covered by EAGLES should include them;
  • in each place or ‘slot’ in the representation a number is inserted according to the value represented. For instance, the first slot in the representation for ‘Noun’ is assigned to the recommended feature ‘Type’: this has two possible values, ‘Common’ (represented by 1) and ‘Proper’ (represented by 2). So the representation for a proper noun would begin N2, and that for a common noun N1. If a recommended feature is not represented for whatever reason, a 0 is placed in the appropriate slot instead of an actual feature value.
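A decoder for such representations is easily sketched. In the Python illustration below, only the first slot (Type: 1 = Common, 2 = Proper) is taken from the description above; the order and value codes assumed for the remaining slots (Gender, Number, Case) are hypothetical and would need to be checked against the EAGLES report itself:

```python
# Hypothetical slot table for the EAGLES noun representation. Slot 1 (Type)
# follows the text above; the remaining slots and their value codes are
# assumptions for the purpose of illustration.
NOUN_SLOTS = [
    ("Type",   {"1": "Common", "2": "Proper"}),
    ("Gender", {"1": "Masculine", "2": "Feminine", "3": "Neuter"}),
    ("Number", {"1": "Singular", "2": "Plural"}),
    ("Case",   {"1": "Nominative", "2": "Genitive",
                "3": "Dative", "4": "Accusative"}),
]

def decode_noun(representation):
    """Decode e.g. 'N1010' into {feature: value}; 0 means 'not represented'."""
    assert representation[0] == "N"
    decoded = {}
    for digit, (feature, values) in zip(representation[1:], NOUN_SLOTS):
        decoded[feature] = None if digit == "0" else values[digit]
    return decoded
```

Under these assumptions, N1010 would decode as a common noun, singular, with gender and case not represented.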

Here are some examples of complete intermediate representations for nouns:

  • common noun, singular; gender and case not represented
  • common noun, singular, genitive; gender not represented
  • proper noun, feminine, singular, accusative

This representation provides a convenient means of facilitating validation against a standard list of features. By comparing intermediate representations from the corpus with the representation of the master list of features, it may easily be ascertained what features and values are or are not represented. Even where the intermediate representation is not used, a mapping list can still be produced showing for each corpus tag the EAGLES feature which it encodes. This latter kind of list is also essential for non-EAGLES-conformant corpora and, on a smaller scale, for any additional optional features used within the EAGLES remit. In section 5.2. Sample Mapping Lists for the EAGLES Obligatory Features we present examples of mapping lists for a non-EAGLES-conformant tagset (in this case, Lancaster University's Claws C7 tagset as used in the part-of-speech annotation of the British National Corpus).

Two problems arise, however, when attempting such mappings. The tagset under consideration may under-specify in relation to the EAGLES master list; that is, a single annotation may map onto more than one feature combination. For example, the CLAWS 7 tagset uses the tag VV0 to denote any non third person singular form of a regular present tense verb, thus blurring the distinction between the imperative, first person singular, second person singular and first, second or third person plural.

The opposite situation, where the tagset over-specifies, is also possible, particularly where the boundary between morphosyntax and semantics is blurred and the tagset makes distinctions between sets of features regarded as equivalent by EAGLES. For example, CLAWS includes a ‘Noun of Style’ tag (NNB) to mark English honorifics such as ‘Mr’, ‘Dame’, ‘Professor’ etc., for which no equivalent feature is identified by EAGLES, and which therefore cannot be distinguished from other parts of proper names.
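Both kinds of mismatch can be detected mechanically once a mapping list is available. The following Python sketch flags under-specified tags (those mapping onto more than one feature combination) and over-specified tags (those encoding a combination for which the master list has no equivalent); the sample data is illustrative:

```python
# mapping: corpus tag -> list of feature combinations it may encode.
# master_features: the set of feature combinations defined by the
# standard (e.g. the EAGLES master list).

def underspecified_tags(mapping):
    """Tags that blur a distinction made by the master list."""
    return {tag for tag, combos in mapping.items() if len(combos) > 1}

def overspecified_tags(mapping, master_features):
    """Tags encoding a distinction for which the master list has no feature."""
    return {tag for tag, combos in mapping.items()
            if any(c not in master_features for c in combos)}
```

On this view, VV0 above is under-specified (several EAGLES combinations map to one tag), while NNB is over-specified (its feature combination has no EAGLES equivalent).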

It should be noted that EAGLES does allow for arbitrary extensions to cover language-specific features. However, to stay with the previous example, honorifics are to be found in most European languages, and hence to treat them as language-specific is not appropriate. Extensibility of the basic features and their sub-categorizations will clearly be essential to any general purpose representation scheme for feature systems, and some such systems may require something more complex than a simple two-level categorization of this kind. EAGLES, itself the product of a consensus amongst corpus analysts at a particular point in time, was designed with the changing needs and practices of that community in mind. It is anticipated that revisions to both the list of recommended features and the sets of features they summarize will occur steadily, particularly as the field of application extends beyond the relatively frequently studied Western European languages.

In the general case, what is needed is a representation scheme which maximizes the flexibility of the annotation scheme without compromising the need to validate instances of its use. We discuss such a scheme in the next section.

3.2.2. Representation using Feature Structures

A more powerful and discriminating representation is provided by the TEI tagset for feature structure analysis. This has two parts, a set of tags for the direct representation of feature structures, which can be linked to instances of textual objects so analysed, and a set of tags for documenting the feature system itself, that is, the constraints, allowable feature-value pairs etc. which are to be regarded as valid in a given analysis.

The feature system representation is defined in chapter 26 of the TEI Guidelines; Langendoen and Simons 1995 provides a useful introduction. A feature, in this scheme, is defined as a pair, comprising a name and a value. The latter may be one of a defined set of value types, including Boolean (plus or minus), numeric, string (an open-ended set of values), symbol (one of a defined set), a feature structure, or a reference to one. A feature structure is a named combination of such features, ordered or unordered.

For example, in an analysis of nouns, we might identify the features number and proper, with values singular or plural, and plus or minus respectively. (The decision as to the appropriate domain for a value is inevitably arbitrary: we have here chosen to regard number as being a symbolic value to allow for the possibility of additional values such as dual or uncountable). These features may be combined to form feature structures corresponding to part-of-speech annotations such as NP1 or NP2 as follows:
<fs id=NP1>
  <f name=class><sym value=noun></f>
  <f name=number><sym value=singular></f>
  <f name=proper><plus></f>
</fs>
<fs id=NP2>
  <f name=class><sym value=noun></f>
  <f name=number><sym value=plural></f>
  <f name=proper><plus></f>
</fs>
To reduce the redundancy of this representation, one may specify the individual features making up a given feature structure by reference. This requires that the features to be used are first specified independently of the structures in which they are to be combined, using a construct known as a feature library, represented by a <fLib> element, each one being given a unique identifier, as follows:
<fLib>
  <f name=class id=FCN><sym value=noun></f>
  <f name=number id=FN1><sym value=singular></f>
  <f name=number id=FN2><sym value=plural></f>
  <f name=proper id=FPP><plus></f>
  <f name=proper id=FPM><minus></f>
</fLib>
Each of the feature structures attested can now be represented by reference to these underlying primitives, using the feats attribute, as follows:
<fs id=NN1 feats="FCN FPM FN1">
<fs id=NN2 feats="FCN FPM FN2">
<fs id=NP1 feats="FCN FPP FN1">
<fs id=NP2 feats="FCN FPP FN2">

It should be apparent how this approach permits an SGML-aware processor to identify automatically linguistic analyses where features such as number or properness are marked, independently of the actual category code (the NN1 or NP2) used to mark the analysis. In addition, of course, the use of the SGML ID/IDREF mechanism allows for simple validation of the codes used. For more sophisticated validation, for example to ensure that the feature properness cannot be both plus and minus in the same analysis, the TEI specifies an additional declarative mechanism, known as a feature system declaration (FSD).
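The kind of constraint which an FSD makes mechanically checkable can be illustrated as follows. This Python sketch uses the feature library of the preceding example; a real implementation would of course read the declarations from the SGML source rather than hard-coding them:

```python
# Feature library contents, transcribed from the <fLib> example:
# identifier -> (feature name, value).
FLIB = {
    "FCN": ("class", "noun"),
    "FN1": ("number", "singular"),
    "FN2": ("number", "plural"),
    "FPP": ("proper", "plus"),
    "FPM": ("proper", "minus"),
}

def validate_fs(feats):
    """Check a feats list: every id must be declared, and no feature may
    receive two different values in the same analysis."""
    errors, seen = [], {}
    for ref in feats:
        if ref not in FLIB:
            errors.append(f"undeclared feature id {ref}")
            continue
        name, value = FLIB[ref]
        if name in seen and seen[name] != value:
            errors.append(f"conflicting values for {name}")
        seen[name] = value
    return errors
```

A feats list combining FPP and FPM is rejected, since properness cannot be both plus and minus in a single analysis.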

Full details of the FSD are provided in chapter 26 of the TEI Guidelines; its relevance for our present purposes is that it provides a mechanism, intermediate in constraining power between a full document type definition (which requires that all possible annotations or tags be specified in advance) and the kind of limited validation possible with the EAGLES mapping list. A fully elaborated feature system declaration for the EAGLES morphosyntactic classification scheme is presented in section 5.1. A Feature System Declaration for the EAGLES morphosyntactic Guidelines below. This more general solution makes possible a form of internal validation, whereby the contents of the corpus are validated against feature lists produced specifically for that corpus, or where the feature list used is a super- or sub-set of the EAGLES feature list, without losing the ability to validate that part of the feature set which does coincide with EAGLES' recommendations.

3.2.3. Documenting the Feature Set

Returning for the moment to the utility of the original EAGLES report for validation, as a first step for languages covered by the report, corpus designers would be foolish to ignore the relevance of the EAGLES obligatory and recommended features, since these now form an agreed cross-linguistic EU standard. Any internal validation should thus be regarded as secondary to an EAGLES validation. Adoption of a feature-based system for validation makes possible the application of identical validation techniques in either case.

The process of deriving a feature set from documentation is also a convenient way of checking the thoroughness and consistency of the documentation itself. Anomalies such as the presence of undocumented tags in the corpus, or the presence of unused or ‘phantom’ features in the documentation are often only found by such a process.

The former are easily handled by rectifying the documentation, but the latter are slightly more problematic. Phantom features may occur for any of three reasons:

  1. they are present for the sake of completeness but simply did not occur in the text corpus being examined;
  2. their presence is a historical accident, representing for example a change in the design of the feature analysis;
  3. they should have been applied to the corpus but were not.

Clearly, the most serious case is (3): here the annotation does not validate against the intended features and needs to be rectified. Such a deficiency, at least at the EAGLES obligatory and recommended levels, should be immediately evident when the corpus annotation is checked against the feature list. In case (2), only the documentation needs correcting. In case (1), the matter should simply be documented, for the information of corpus users. Phantom tags can also be introduced as the result of typographic errors; the use of an automatic system for the introduction of tags, together with their automatic validation against the agreed corpus tagset, does away with this form of error entirely.
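As a rough illustration, undocumented tags and phantom features can both be detected by a simple set comparison between the tags used in a corpus and those listed in its documentation; the word_TAG format and the sample data below are invented for the example.

```python
# Compare the tags actually used in a corpus with the documented tagset:
# the two set differences yield undocumented tags and 'phantom' features.
import re

def audit_tagset(corpus_text, documented_tags):
    """Return (undocumented, phantom): tags used but not documented, and
    tags documented but never used."""
    used = set(re.findall(r"_(\S+)", corpus_text))
    return used - documented_tags, documented_tags - used

corpus = "the_AT0 cat_NN1 sat_VVD"
documented = {"AT0", "NN1", "VVD", "ZZ9"}
undocumented, phantom = audit_tagset(corpus, documented)
print(undocumented, phantom)   # ZZ9 is a phantom feature here
```

Which of the three explanations applies to each phantom feature found must, of course, still be decided by inspection.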

3.3. Syntactic Correctness and Consistency

The aim of this level of validation is to ensure that the form of tags is consistent. Specifically, it should check that:
  • each appropriate lexical item receives an appropriate annotation;
  • each appropriate lexical item receives a single annotation;
  • each annotation used is documented and corresponds with a known feature, i.e. there are no typographic errors;
  • the annotation is presented using a consistent and correct syntax.

We use the phrase ‘lexical item’ above to indicate that the tokens to which annotation is attached need not correspond with orthographic words. Although many commonly used annotation schemes for English do in fact attempt to make this correspondence, it is unnecessary where a single formalism such as SGML or something of equivalent power is used to represent both structure and analysis.

Thus, the CLAWS scheme uses a special form of annotation known as ‘ditto’ tags to indicate that the annotation for one token applies also to another. For example, the English conjunction ‘so that’ should properly be regarded as a single conjunction, although it is orthographically represented as two tokens. Early versions of CLAWS tagged this phrase as so_CS21 that_CS22 or, using the equivalent SGML formalism, as
<w CS21>so <w CS22>that.
The actual annotation for conjunction is CS, the following digit 2 indicates the number of tokens to which it is to be attached, and the final 1 and 2 indicate the number of this token within the sequence. A more natural approach would be to revise the tokenization rules so that the token so that might be treated as a single unit, tagging it as
<w CS2>so that.
Uncoupling the annotation structure from the orthographic structure also enables a consistent approach to be taken for the case where the morphosyntactic units to be tagged are smaller than orthographic words.
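By way of illustration, the following sketch resolves CLAWS-style ditto tags into single multi-token units along the lines suggested above; the tag-parsing convention (base tag, then unit length, then position within the unit) is an assumption based on the CS21/CS22 example.

```python
# Resolving CLAWS-style ditto tags (CS21, CS22, ...) into single
# multi-token units. The parsing convention is an illustrative assumption.
import re

DITTO = re.compile(r"^([A-Z]+)(\d)(\d)$")   # base tag, unit length, position

def merge_ditto(tokens):
    """tokens: list of (word, tag) pairs; merge ditto sequences into one unit."""
    out, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        m = DITTO.match(tag)
        if m and m.group(3) == "1":                 # first token of a sequence
            base, length = m.group(1), int(m.group(2))
            words = [tokens[i + k][0] for k in range(length)]
            out.append((" ".join(words), base))     # e.g. ('so that', 'CS')
            i += length
        else:
            out.append((word, tag))
            i += 1
    return out

print(merge_ditto([("so", "CS21"), ("that", "CS22"), ("he", "PPHS1")]))
```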

We recommend above that a single annotation be attached to each lexical token, recognizing that in production systems it may be necessary to retain deliberately ambiguous or polyvalent annotations to avoid incorrect deterministic disambiguation. Such exceptions to the ‘one word, one tag’ rule should be clearly documented to aid validation; ideally, each possible combination of multiple annotations can be represented as a distinct choice within the feature set. The FSD notation recommended below supports this possibility.

The majority of these tasks can be achieved using a series of procedures aided by simple Unix tools such as awk and grep. Checking SGML requires an SGML parser, and a number of these are available. As part of this workpackage, we reviewed the SGML validation that had been undertaken on the corpora covered in the WP2 review. For the most part, the results (summarized in section 5.3. Some current markup validation practice below) indicate that as yet only a few corpus builders are taking advantage of the availability of tools such as SGML parsers to validate formally-defined markup schemes.
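As an indication of how simple such procedures can be, the following sketch implements the per-token checks listed above (one annotation per token, well-formed tag syntax, membership in the documented tagset) in Python rather than awk or grep; the tag pattern and documented tagset are illustrative assumptions.

```python
# Illustrative per-token checks for word_TAG annotated text.
import re

TAG_PATTERN = re.compile(r"^[A-Z]{2,3}\d{0,2}$")   # CLAWS-like tag shape (assumed)
DOCUMENTED = {"AT0", "NN1", "VVD"}                  # toy documented tagset

def check_token(token):
    """Return a list of problems with one token (empty if it is clean)."""
    parts = token.split("_")
    if len(parts) != 2:
        return ["token must carry exactly one annotation"]
    word, tag = parts
    if not TAG_PATTERN.match(tag):
        return [f"malformed tag syntax: {tag!r}"]
    if tag not in DOCUMENTED:
        return [f"undocumented tag: {tag!r}"]
    return []

assert check_token("cat_NN1") == []
print(check_token("cat_NN1_VVD"))   # double annotation
print(check_token("cat_ZZ9"))       # well-formed but undocumented
```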

This is unsurprising, given that such schemes have only begun to gain wide acceptance in the last few years. However, it does seem strange that the topic of validation is rarely touched on in the extant literature concerning corpus design and construction; where it is, it appears to relate almost exclusively to the statistical validity of a given sample as representative of some aspect of language (see for example Clear 1992, Atkins et al 1990). Corpora such as the LOB and Brown have been so exhaustively studied and analysed that it would be surprising if such errors as they contain had not come to light; where they have, however, corpus designers and builders seem to have been uninterested in their status or implications. A plausible reason for this is that it is only with the advent of really large corpora, often produced by automatic or semi-automatic methods of data capture such as optical character recognition or as a by-product of electronic typesetting, that questions of accuracy and authenticity have arisen.

3.3.1. Semantic Correctness

As stated above, an accurate assessment of the semantic validity of any markup in a corpus is an inherently intractable problem. Where the function of the markup is to assert the existence of a human interpretation of the data, it is probably the case that this can only be validated manually, although some control over variability may be derived by the application of some rough heuristics to assess semantic conformance to a pre-established norm. For example, if we know the statistical distribution of specific nouns, verbs etc in a general corpus like the BNC, then we may be able to check future corpora on the basis of these rough distributions. However, this is clearly a rough and ready process.
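Such a heuristic might be sketched as follows: compare the relative tag frequencies of a new corpus against reference proportions (which might be derived from a large corpus such as the BNC) and flag large deviations. All figures and the tolerance below are invented for illustration.

```python
# Rough distributional heuristic: flag tags whose relative frequency
# deviates widely from reference proportions. Figures are invented.
from collections import Counter

REFERENCE = {"NN1": 0.14, "VVD": 0.04, "AT0": 0.09}   # illustrative proportions

def flag_deviant_tags(tags, tolerance=0.5):
    """Flag reference tags whose observed proportion differs from the
    expected one by more than `tolerance` (as a fraction of the expectation)."""
    counts = Counter(tags)
    total = len(tags)
    flagged = []
    for tag, expected in REFERENCE.items():
        observed = counts[tag] / total
        if abs(observed - expected) > tolerance * expected:
            flagged.append((tag, expected, round(observed, 3)))
    return flagged

sample = ["NN1"] * 50 + ["VVD"] * 4 + ["AT0"] * 46
print(flag_deviant_tags(sample))   # NN1 and AT0 are over-represented here
```

A flagged tag is not necessarily wrong, of course; the deviation may simply reflect a genuine difference in text type, which is why this remains a rough and ready screen rather than a validation.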

Let us turn now to hand validation. Even where human checking occurs, a validation cannot be considered 100% accurate, since frequently there is scope for error or genuine disagreement, even within a single set of guidelines (see for example Baker 1997). One possibly automated check would be to see whether an assigned tag is allowed for a given word, by checking the word's entry in a lexicon. However, this only makes sense when (a) a lexicon has been used to tag the text and (b) manual correction has taken place; otherwise we can already be sure that the tag is permissible, unless there is something very seriously wrong with the operation of the tagging program. Limitations on this method of checking are (a) the fact that often a suffix list, etc., rather than an exhaustive lexicon, is used for tag assignment and (b) the presence of new tags, i.e., permissible and correct tags added by human annotators because a new contextual reading is missing from the lexicon.
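The lexicon check described above might be sketched as follows; the toy lexicon and token format are invented, and words absent from the lexicon are deliberately skipped, since they may legitimately carry new tags added by human annotators.

```python
# Toy lexicon check: after manual correction, flag (word, tag) pairs whose
# tag the lexicon does not license. The lexicon is an invented example.
LEXICON = {
    "run": {"NN1", "VV0"},
    "the": {"AT0"},
}

def impermissible(tagged_tokens):
    """Return the (word, tag) pairs not licensed by the lexicon.
    Words absent from the lexicon are skipped: their tags may be new,
    correct additions by human annotators."""
    return [(w, t) for w, t in tagged_tokens
            if w in LEXICON and t not in LEXICON[w]]

print(impermissible([("the", "AT0"), ("run", "VVD"), ("cat", "NN1")]))
```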

3.4. Other forms of Annotation

In addition to the strictly morphosyntactic analysis discussed so far, the EAGLES Guidelines also envisage two generic forms of syntactic analysis: phrase structure and dependency. Phrase structure grammars require the ability to model well-balanced trees in a markup language, while structural dependency grammar requires the ability to describe directed acyclic graphs.

Both abilities are intrinsic to the SGML abstract model, and the tasks of first representing, and then validating the correctness of such structures, is thus comparatively trivial. Furthermore, it is clear that the fundamental problems of semantic validation are the same whether analyses are attached to high level structural units such as those identified by syntactic analysis or to lower level word-like tokens.

The generality of the SGML model leads to its being suitable for the tagging of a semantically highly diverse set of textual features. For example, the TEI recommendations propose that SGML tagging be applied to mark inter alia the following features:
  • orthographic and presentational features of the transcription
  • links to corresponding objects (for example digitized recordings of transcribed speech, digitised page images of transcribed writing etc.)
  • explicit disambiguation of features such as proper nouns, dates, times, etc.
  • part-of-speech and morphology
  • syntactic analysis
  • discourse analysis
  • contextual, bibliographic, and topically related features
  • editorial correction, normalization, commentary, or annotation

While there is no doubt that an SGML encoding can cope with all of these forms of analysis individually, the difficulty of distinguishing them in combination rapidly increases, particularly if they are all located in the same data stream. There is an increasing tendency therefore towards so-called ‘out-of-line’ annotation, in which potentially many, possibly contradictory, annotations or analytic interpretations are stored independently of the text itself, but linked to it by means of hypertext pointers. Similar techniques are required for the alignment of the structural components of multilingual or multimedia corpora.

Such techniques have much to recommend them, but place additional constraints on the ease with which the semantic and syntactic correctness of any one analysis can be validated. As well as checking that the analysis is internally consistent, it must be possible to check that the targets of each link are correctly specified. This may be difficult if a non-portable or non-robust method has been used to specify them, or entirely impossible if the corpus text has been changed. Reliable standards for the specification of robust and application-independent linking mechanisms (e.g. HyTime, XLL) have a degree of acceptance within the computing sector, but are not yet widely accepted or understood within the community of corpus creators. An obvious exception to this generalization is the special case of multilingual or multimedia aligned corpora, where such mechanisms are essential.
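A minimal sketch of link-target checking for out-of-line annotation: verify that every pointer target in a standoff analysis exists among the identifiers of the base text. The identifier scheme and data shapes are invented for illustration.

```python
# Minimal link-target check for standoff ('out-of-line') annotation:
# every pointer must target an identifier present in the base text.
def check_links(text_ids, annotations):
    """annotations: (analysis, target_id) pairs; return dangling links."""
    known = set(text_ids)
    return [(a, t) for a, t in annotations if t not in known]

text_ids = ["w1", "w2", "w3"]
standoff = [("NN1", "w2"), ("VVD", "w9")]   # w9 does not exist in the text
print(check_links(text_ids, standoff))
```

Note that such a check only establishes that the targets resolve; it says nothing about whether the analysis attached to each target is itself correct.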

We have restricted ourselves primarily to morphosyntax and syntax, partly because these are the most widely encountered forms of annotation and are also the only ones for which, at present, EAGLES guidelines exist. Other forms of annotation are sparser and more diverse, with insufficient examples of each type to make generally acceptable recommendations, even where consensus exists as to the scope or application of such analyses. This situation is likely to change over time, and consideration should be given on a rolling basis to validation procedures as the application of new annotation types spreads and the development of standards proceeds.

With this said, it is likely that many of the issues for validation of, say, pragmatic annotation, will be similar to those for morphosyntax. While the precise details of the scope of annotations and the interpretative nature of the schemes may differ, basic issues such as idiosyncratic v. widely accepted annotation schemes and questions of rigid v. fluid analysis schemes will most likely remain the same. So future work on the validation of such further annotations will be able to refer to this document for guidance, if not a complete solution.

4. Representation of Validation

The TEI Guidelines provide for the recording of some aspects of the validation process by specialised documentation within the TEI Header, but do not include elements for all the aspects touched on in our discussion. We list here the relevant elements from section 5.3 of the Guidelines, and also make preliminary suggestions for some additional elements which might usefully be added in a future revision of the TEI scheme.

The <encodingDesc> element in the TEI header is intended to ‘document the relationship between an electronic text and the source or sources from which it was derived’. As such it is the natural location for statements about the results of the validation process. The following elements, each of which is described in more detail in the Guidelines, seem of particular relevance:
  • <projectDesc> describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
  • <samplingDecl> contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
  • <editorialDecl> provides details of editorial principles and practices applied during the encoding of a text.
  • <tagsDecl> provides detailed information about the tagging applied to an SGML document.
  • <refsDecl> specifies how canonical references are constructed for this text.
  • <classDecl> contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
  • <fsdDecl> identifies the feature system declaration which contains definitions for a particular type of feature structure.
Some of these elements, for example <projectDesc> and <samplingDecl>, are purely documentary, in that they are defined as containing only a prose description. For others, however, a more detailed substructure is proposed. The <tagsDecl> element, for example, is defined as containing a series of <tagUsage> elements, each of which specifies the number of occurrences found within a document for each SGML tag used. Such elements can thus be used to record the result of any structural validation carried out, simply as a count of the number of elements, optionally extended by any desired usage notes, as in the following example:
<tagsDecl>
 <tagUsage gi=DIV1 occurs=20></tagUsage>
 <tagUsage gi=P occurs=2043>Used for typographic paragraphs and also for individual list components</tagUsage>
</tagsDecl>
Use of this element provides a useful way of documenting actual SGML tagging practice within a text, and can readily be automatically generated during the validation process. If usage notes are supplied, they need only specify information not already implicit in the definition of the element's syntax, as in the example above, where the <p> tag has been used for something which might more properly have been encoded using a different TEI element.
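Automatic generation of such counts might be sketched as follows; a regular expression stands in here for a real SGML parser, which a production validator should of course use (so markup minimization, comments and marked sections are not handled).

```python
# Sketch: count SGML element occurrences and emit a <tagsDecl> fragment.
# A regex stands in for a real SGML parser (illustrative simplification).
import re
from collections import Counter

def tag_usage(sgml):
    """Return a <tagsDecl> fragment recording occurrence counts per element."""
    counts = Counter(re.findall(r"<([A-Za-z][\w.]*)[\s>]", sgml))
    lines = [f"<tagUsage gi={gi} occurs={n}></tagUsage>"
             for gi, n in sorted(counts.items())]
    return "<tagsDecl>\n" + "\n".join(lines) + "\n</tagsDecl>"

doc = "<div1><p>One</p><p>Two</p></div1>"
print(tag_usage(doc))
```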
A more elaborate scheme is defined by the TEI <classDecl> element for the documentation of the classification scheme applied to a corpus, which permits (for example) a formal specification of any descriptive taxonomy or typology applied to the texts. The Guidelines suggest, as an example, the following way of representing the Brown corpus typology:
<taxonomy id=B>
 <bibl>Brown Corpus</bibl>
 <category id=B.A><catDesc>Press Reportage
  <category id=B.A1><catDesc>Daily</category>
  <category id=B.A2><catDesc>Sunday</category>
  <category id=B.A3><catDesc>National</category>
  <category id=B.A4><catDesc>Provincial</category>
  <category id=B.A5><catDesc>Political</category>
  <category id=B.A6><catDesc>Sports</category>
  <!-- ... -->
 </category>
 <category id=B.D><catDesc>Religion
  <category id=B.D1><catDesc>Books</category>
  <category id=B.D2><catDesc>Periodicals and tracts</category>
 </category>
 <!-- ... -->
</taxonomy>
This method does not however allow for documentation of the extent to which the taxonomy has been applied, i.e. the coverage associated with each category within it. One way of filling this gap might be to define an additional element <coverage> as additional content for the existing <category> element, with attributes such as unit and extent to specify the proportion of the corpus which has been assigned this descriptive category. One might then specify, for example, that 7 texts or 3000 words in a given corpus have been assigned to the ‘Provincial Press’ class as follows:
<category id=B.A4><catDesc>Provincial</catDesc>
 <coverage unit=text extent=7>
 <coverage unit=word extent=3000>
</category>
Again, the <coverage> element can be automatically generated during the validation process. Its presence, like that of the <tagUsage> elements discussed above, enables the corpus user to tell at a glance whether a given corpus is relevant to a specific requirement, subject of course to the general proviso that the corpus under examination is marked up correctly. In other words, it enables us to satisfy the ‘completeness’ criterion, as well as the ‘syntactic correctness’ criterion (which must have been satisfied in the case of an SGML corpus).
With regard to recording the usage of feature structures within a TEI document, the TEI provides a <fsdDecl> element, the function of which is to associate each feature structure used in a document with the (externally defined) feature system declaration to which it belongs. For example:
<fsdDecl type=NN2 fsd=eaglesFSD>
<fsdDecl type=NN1 fsd=eaglesFSD>
indicates that the feature structures NN1 and NN2 are defined by the feature system which is contained in an external entity named eaglesFSD. (The use of an external SGML entity is a consequence of technical aspects of the way the TEI document type definition is implemented, which need not concern us here). As with the <tagUsage> element, each feature structure actually used within the corpus should be specified in this way. This mechanism allows for multiple analyses (using different FSDs) to co-occur within a given corpus, which may be of interest. However, there is no scope for inclusion of coverage or validation information, which might arguably be more useful. A simple way of rectifying this might be to define a new <fsUsage> element, analogous to the <tagUsage> element, with similar attributes and semantics. One might then include in the Header statements such as
<fsUsage type=NN2 occurs=1234>
<fsUsage type=NN1 occurs=164538>
Alternatively, given the need to supply a <fsdDecl> element, it would be more economical to combine the function of the latter into the new element and write:
<fsUsage type=NN2 occurs=1234 fsd=eaglesFSD>
<fsUsage type=NN1 occurs=164538 fsd=eaglesFSD>

As with the other elements discussed so far, the <fsUsage> elements for a given corpus should be automatically generated during the validation process, rather than manually added, and would therefore provide an automatic degree of consistency checking, as well as providing an explicit record of actual tagging practice within the text, rather than what is implicitly claimed for it. This in turn implies a further requirement for the documentation of the results of any manual or semi-automatic validation performed. (Automatic generation precludes, for example, the explicit identification of features which are defined by the FSD but missing from the corpus.)

Such information might be provided as running text within an <interpretation> element, one of the subcomponents of the <editorialDecl> element, although the definition provided for it suggests that it is intended for the corpus creator to record his or her intentions in this regard, rather than for the corpus validator to record actual practice or to assess the extent to which such intentions have been realised. The only example cited in the Guidelines is as follows:
<interpretation> <p>The part of speech analysis applied throughout section 4 was added by hand and has not been validated
As an initial step, we recommend including within this element statements on such topics as
  • what type of annotation the corpus is claimed to include (none, morphosyntactic, etc.);
  • whether the annotation is consistently applied (as implied by the coverage elements);
  • whether the annotation is judged semantically correct, and by what criteria.

Where a finer grained validation is required, for example at the level of individual features or tags, it may be preferable to add further attributes to the <tagUsage> or <fsUsage> elements discussed above. For example, a check attribute, with values such as NONE, SOME, or ALL, might be used to record the status of validation for each <fsUsage> element to which it applies. This might be useful where a corpus is initially morphosyntactically tagged by a program and then manually corrected on a piecemeal basis: the value for this attribute would then be changed as validation and hand correction progressed, on a feature-by-feature basis. Attaching validation information at this level of granularity also has the advantage that certain categories (for example definite articles in English) are far easier to validate with confidence than others.

Clearly, there is a need for more formalization of the validation process, and a greater degree of consensus on what it is feasible or desirable to include by way of metrics before more specific recommendations can be made. This document is intended to provide a basis for such discussion.

5. Appendixes

5.1. A Feature System Declaration for the EAGLES morphosyntactic Guidelines

This is a complete FSD for the EAGLES Guidelines for morphosyntactic analysis, using the formalism defined in chapter 26 of the TEI Guidelines. It consists of a series of declarations for feature structures, each represented as a <fsDecl> element, and each corresponding with an EAGLES recommended feature. Each <fsDecl> contains a series of <fDecl> elements, each corresponding with a set of the feature-value pairs defined for that feature structure in the EAGLES scheme. The values (<vRange>) are specified as a set of alternate values using the <vAlt> element, indicating that EAGLES does not permit multi-valued features, but a system-dependent default value (<dft>) is permitted for use in cases where none of the specified values is applicable.
<!DOCTYPE teiFsd2 system "teifsd2.dtd"> <TEIfsd2> <teiHeader> <fileDesc> <titleStmt> <title>Feature System Declaration for the EAGLES tagset</title> </titleStmt> <publicationstmt> <p>Prepared for ELRA WP3 </publicationstmt> <sourcedesc><p>No source: this is an original work</sourcedesc> </filedesc> <revisionDesc> <change><date>2 apr 1997</date> <respstmt><resp>ed</resp><name>LB</name></respstmt> <item>Minor changes for validation; added header</item> </change> <change> <date>31 mar 1997</date> <respstmt><resp></resp><name>APM</name></respstmt> <item>First complete draft</item> </change> </revisionDesc> </teiHeader> <!-- Feature system for Nouns --> <fsDecl type = Noun> <fDecl name = Type> <fDescr>Range types associated with a noun</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Common --> <sym value=2><!-- Proper --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Gender> <fDescr>Range genders associated with a noun</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Masculine --> <sym value=2><!-- Feminine --> <sym value=3><!-- Neuter --> <sym value=4><!-- Common FOR USE WITH DUTCH AND DANISH ONLY --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Number> <fDescr>Range number associated with a noun</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Singular --> <sym value=2><!-- Plural --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Case> <fDescr>Range case associated with a noun</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Nominative --> <sym value=2><!-- Genitive --> <sym value=3><!-- Dative --> <sym value=4><!-- Accusative --> <sym value=5><!-- Vocative --> <sym value=6><!-- Indeclinable VALUE FOR GREEK ONLY --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Countability> <fDescr>Optional 
attribute countability</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Count --> <sym value=2><!-- Mass --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Definiteness> <fDescr>Language Specific Attribute Definiteness for Danish</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Definite --> <sym value=2><!-- Indefinite --> <sym value=3><!-- Unmarked --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> </fsDecl> <!-- Feature system for Verbs --> <fsDecl type = Verb> <fDecl name = Person> <fDescr>Range person associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- First Person --> <sym value=2><!-- Second person --> <sym value=3><!-- Third person --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Gender> <fDescr>Range genders associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Masculine --> <sym value=2><!-- Feminine --> <sym value=3><!-- Neuter --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Number> <fDescr>Range number associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Singular --> <sym value=2><!-- Plural --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Finiteness> <fDescr>Range finiteness associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Finite --> <sym value=2><!-- Non Finite --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = FormOrMood> <fDescr>Range form/mood associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Indicative --> <sym value=2><!-- Subjunctive --> <sym value=3><!-- Imperative --> <sym value=4><!-- Conditional 
--> <sym value=5><!-- Infinitive --> <sym value=6><!-- Participle --> <sym value=7><!-- Gerund --> <sym value=8><!-- Supine --> <sym value=9><!-- Ing Form VALID FOR ENGLISH ONLY --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Tense> <fDescr>Range tense associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Present --> <sym value=2><!-- Imperfect --> <sym value=3><!-- Future --> <sym value=4><!-- Past --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Voice> <fDescr>Range voice associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Active --> <sym value=2><!-- Passive --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Status> <fDescr>Range status associated with a verb</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Main --> <sym value=2><!-- Auxiliary --> <sym value=3><!-- Optional Attribute Semi Auxiliary --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Aspect> <fDescr>Optional Aspect attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Perfective --> <sym value=2><!-- Imperfective --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Separability> <fDescr>Optional Separability Attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Non Separable --> <sym value=2><!-- Separable --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Reflexivity> <fDescr>Optional Reflexivity Attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Reflexive --> <sym value=2><!-- Non reflexive --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Auxiliary> <fDescr>Optional Auxiliary Attribute</fDescr> <vRange><vAlt> 
<sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Have --> <sym value=2><!-- Be --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = AuxiliaryFunction> <fDescr>Auxiliary Function Attribute Applicable ONLY TO ENGLISH</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Primary --> <sym value=2><!-- Modal --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> </fsDecl> <!-- Feature system for Adjectives --> <fsDecl type = Adjective> <fDecl name = Degree> <fDescr>Range degree associated with an adjective</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Positive --> <sym value=2><!-- Comparative --> <sym value=3><!-- Superlative --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Gender> <fDescr>Range genders associated with an adjective</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Masculine --> <sym value=2><!-- Feminine --> <sym value=3><!-- Neuter --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Number> <fDescr>Range number associated with an adjective</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Singular --> <sym value=2><!-- Plural --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Case> <fDescr>Range case associated with an adjective</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Nominative --> <sym value=2><!-- Genitive --> <sym value=3><!-- Dative --> <sym value=4><!-- Accusative --> <sym value=5><!-- Vocative GREEK ONLY--> <sym value=6><!-- Indeclinable GREEK ONLY --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = InflectionType> <fDescr>Optional Inflection Type Attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Weak flection --> <sym 
value=2><!-- Strong flection --> <sym value=3><!-- Mixed --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Use> <fDescr>Optional Use Attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Attributive--> <sym value=2><!-- Predicative --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = NPFunction> <fDescr>Optional NP Function Attribute</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Premodifying --> <sym value=2><!-- Postmodifying --> <sym value=3><!-- Head function --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> </fsDecl> <!-- Feature system for Pronoun-Determiners --> <fsDecl type = PronounDeterminer> <fDecl name = Person> <fDescr>Range person associated with a pronoun/determiner</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- First Person --> <sym value=2><!-- Second person --> <sym value=3><!-- Third person --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Gender> <fDescr>Range genders associated with a pronoun/determiner</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Masculine --> <sym value=2><!-- Feminine --> <sym value=3><!-- Neuter --> <sym value=4><!-- Common DANISH ONLY --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Number> <fDescr>Range number associated with a pronoun/determiner</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Singular --> <sym value=2><!-- Plural --> </vAlt></vRange> <vDefault><dft></vDefault> </fDecl> <fDecl name = Case> <fDescr>Range case associated with a pronoun/determiner</fDescr> <vRange><vAlt> <sym value=0><!-- Value not relevant for a language --> <sym value=1><!-- Nominative --> <sym value=2><!-- Genitive --> <sym value=3><!-- Dative --> <sym value=4><!-- Accusative --> <sym value=5><!-- 
Non Genitive -->
  <sym value=6><!-- Oblique -->
  <sym value=7><!-- Prepositional case SPANISH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Category>
  <fDescr>Range category associated with a pronoun/determiner</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Pronoun -->
  <sym value=2><!-- Determiner -->
  <sym value=3><!-- Both Pronoun and Determiner -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = PronounType>
  <fDescr>Range pronoun type associated with a pronoun/determiner</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Demonstrative -->
  <sym value=2><!-- Indefinite -->
  <sym value=3><!-- Possessive -->
  <sym value=4><!-- Int/Rel -->
  <sym value=5><!-- Personal/Reflexive -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = DeterminerType>
  <fDescr>Range determiner type associated with a pronoun/determiner</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Demonstrative -->
  <sym value=2><!-- Indefinite -->
  <sym value=3><!-- Possessive -->
  <sym value=4><!-- Int/Rel -->
  <sym value=5><!-- Partitive -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Strength>
  <fDescr>Range strength associated with a pronoun/determiner FRENCH DUTCH AND GREEK ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Weak -->
  <sym value=2><!-- Strong -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = SpecialPronounType>
  <fDescr>Optional Special Pronoun Type Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Personal -->
  <sym value=2><!-- Reflexive -->
  <sym value=3><!-- Reciprocal -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = WHType>
  <fDescr>Optional WH Type Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Interrogative -->
  <sym value=2><!-- Relative -->
  <sym value=3><!-- Exclamatory -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Politeness>
  <fDescr>Optional Politeness Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Polite -->
  <sym value=2><!-- Familiar -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Articles -->
<fsDecl type = Articles>
<fDecl name = ArticleType>
  <fDescr>Range types associated with an article</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Definite -->
  <sym value=2><!-- Indefinite -->
  <sym value=3><!-- Partitive FRENCH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Gender>
  <fDescr>Range genders associated with an article</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Masculine -->
  <sym value=2><!-- Feminine -->
  <sym value=3><!-- Neuter -->
  <sym value=4><!-- Common DANISH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Number>
  <fDescr>Range number associated with an article</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Singular -->
  <sym value=2><!-- Plural -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Case>
  <fDescr>Range case associated with an article</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Nominative -->
  <sym value=2><!-- Genitive -->
  <sym value=3><!-- Dative -->
  <sym value=4><!-- Accusative -->
  <sym value=5><!-- Vocative GREEK ONLY -->
  <sym value=6><!-- Indeclinable GREEK ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Adverbs -->
<fsDecl type = Adverbs>
<fDecl name = Degree>
  <fDescr>Range degree associated with an adverb</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Positive -->
  <sym value=2><!-- Comparative -->
  <sym value=3><!-- Superlative -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = AdverbType>
  <fDescr>Optional Adverb Type Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- General -->
  <sym value=2><!-- Degree -->
  <sym value=3><!-- Particle ENGLISH GERMAN DUTCH ONLY -->
  <sym value=4><!-- Pronominal ENGLISH GERMAN DUTCH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Polarity>
  <fDescr>Optional Polarity Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- WH Type -->
  <sym value=2><!-- Non WH Type -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = WHType>
  <fDescr>Range WH type associated with an adverb</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Interrogative -->
  <sym value=2><!-- Relative -->
  <sym value=3><!-- Exclamatory -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Adpositions -->
<fsDecl type = Adposition>
<fDecl name = Type>
  <fDescr>Range types associated with an adposition</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Preposition -->
  <sym value=2><!-- Optional Fused Prepositional Article Value -->
  <sym value=3><!-- Postposition ENGLISH GERMAN ONLY -->
  <sym value=4><!-- Circumposition ENGLISH GERMAN ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Conjunctions -->
<fsDecl type = Conjunction>
<fDecl name = Type>
  <fDescr>Range types associated with a conjunction</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Coordinating -->
  <sym value=2><!-- Subordinating -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = CoordType>
  <fDescr>Optional Coordination Type Attribute</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Simple -->
  <sym value=2><!-- Correlative -->
  <sym value=3><!-- Initial -->
  <sym value=4><!-- Non Initial -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = SubordType>
  <fDescr>Subordination Type GERMAN ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- With finite -->
  <sym value=2><!-- With infinite -->
  <sym value=3><!-- Comparative -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Numerals -->
<fsDecl type = Numerals>
<fDecl name = Type>
  <fDescr>Range types associated with a numeral</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Cardinal -->
  <sym value=2><!-- Ordinal -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Gender>
  <fDescr>Range genders associated with a numeral</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Masculine -->
  <sym value=2><!-- Feminine -->
  <sym value=3><!-- Neuter -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Number>
  <fDescr>Range number associated with a numeral</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Singular -->
  <sym value=2><!-- Plural -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Case>
  <fDescr>Range case associated with a numeral</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Nominative -->
  <sym value=2><!-- Genitive -->
  <sym value=3><!-- Dative -->
  <sym value=4><!-- Accusative -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Function>
  <fDescr>Range function associated with a numeral</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Pronoun -->
  <sym value=2><!-- Determiner -->
  <sym value=3><!-- Adjective -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Unique tags -->
<fsDecl type = Unique>
<fDecl name = Interjection>
  <fDescr>Range of types associated with interjections</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Interjection -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = InfinitiveMarker>
  <fDescr>Range types associated with an infinitive marker</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- German marker zu GERMAN ONLY -->
  <sym value=2><!-- Danish marker at DANISH ONLY -->
  <sym value=3><!-- Dutch marker DUTCH ONLY -->
  <sym value=4><!-- English marker ENGLISH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = NegativeParticle>
  <fDescr>Negative particles ENGLISH ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- full form not -->
  <sym value=2><!-- contracted form of not -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = ExistentialMarker>
  <fDescr>Existential Markers</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- English existential marker ENGLISH ONLY -->
  <sym value=2><!-- Danish existential marker DANISH ONLY -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = SecondNegativeParticle>
  <fDescr>Second negative particles FRENCH ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- French pas -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Anticipatory>
  <fDescr>Anticipatory Marker er DUTCH ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- er -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Mediopassive>
  <fDescr>Mediopassive PORTUGUESE ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Mediopassive marker se -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = PreverbalParticle>
  <fDescr>Preverbal Particle GREEK ONLY</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Preverbal particle -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Residuals -->
<fsDecl type = Residual>
<fDecl name = Type>
  <fDescr>Range types associated with a residual</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Foreign Word -->
  <sym value=2><!-- Formula -->
  <sym value=3><!-- Symbol -->
  <sym value=4><!-- Acronym -->
  <sym value=5><!-- Abbreviation -->
  <sym value=6><!-- Unclassified -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Number>
  <fDescr>Range number associated with a residual</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Singular -->
  <sym value=2><!-- Plural -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Gender>
  <fDescr>Range genders associated with a residual</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Masculine -->
  <sym value=2><!-- Feminine -->
  <sym value=3><!-- Neuter -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>

<!-- Feature system for Punctuation -->
<fsDecl type = Punctuation>
<fDecl name = Period>
  <fDescr>Range types associated with a full stop</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Period -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Comma>
  <fDescr>Range types associated with a comma</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Comma -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
<fDecl name = Question>
  <fDescr>Range types associated with a question mark</fDescr>
  <vRange><vAlt>
  <sym value=0><!-- Value not relevant for a language -->
  <sym value=1><!-- Question mark -->
  </vAlt></vRange>
  <vDefault><dft></vDefault>
</fDecl>
</fsDecl>
</TEIfsd2>
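By way of illustration, declarations of this kind license feature-structure instances elsewhere in an annotated text. The following sketch is hypothetical (the feature names and symbol values follow the Articles declarations above, in the TEI feature-structure notation), showing how a definite singular masculine article might be encoded:

```sgml
<fs type = Articles>
  <f name = ArticleType><sym value=1></f> <!-- Definite -->
  <f name = Gender><sym value=1></f>      <!-- Masculine -->
  <f name = Number><sym value=1></f>      <!-- Singular -->
  <f name = Case><sym value=0></f>        <!-- Not relevant -->
</fs>
```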

5.2. Sample Mapping Lists for the EAGLES Obligatory Features

The following tables illustrate how a particular set of analytic tags, in this case the CLAWS7 tagset, can be re-expressed in terms of the EAGLES ‘intermediate representation’. In cases where the CLAWS7 tag underspecifies the analysis, each possible EAGLES value is given as an alternation.
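The alternation mechanism can be modelled directly in software. The sketch below (the tag-to-feature mappings shown are illustrative simplifications, not the full tables that follow) represents each CLAWS7 tag as a list of one or more candidate EAGLES-style feature bundles; an underspecified tag simply maps to more than one alternative.

```python
# Illustrative mapping from CLAWS7 tags to EAGLES-style feature bundles.
# An underspecified CLAWS7 tag maps to an alternation (a list of candidates).
CLAWS_TO_EAGLES = {
    # NN1: singular common noun -- fully specified, one candidate
    "NN1": [{"pos": "Noun", "Type": "Common", "Number": "Singular"}],
    # NN2: plural common noun
    "NN2": [{"pos": "Noun", "Type": "Common", "Number": "Plural"}],
    # NN: common noun, neutral for number -- underspecified, two alternatives
    "NN":  [{"pos": "Noun", "Type": "Common", "Number": "Singular"},
            {"pos": "Noun", "Type": "Common", "Number": "Plural"}],
}

def eagles_alternatives(claws_tag):
    """Return the list of candidate EAGLES feature bundles for a CLAWS7 tag."""
    return CLAWS_TO_EAGLES.get(claws_tag, [])
```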

The tables are organized as follows. Each table relates to an EAGLES obligatory feature, under which appear entries for all of the CLAWS tags grouped with that feature. These tags are then further analysed in terms of their recommended features.
Table 1. Mapping list for Nouns
Table 2. Mapping list for Verbs
Table 3. Mapping list for Pronoun-Determiners
Table 4. Mapping list for Adjectives
Table 5. Mapping list for Adverbs
Table 6. Mapping list for Articles
Table 7. Mapping list for Adposition tags
Table 8. Mapping list for Conjunctions
Table 9. Mapping list for Numerals
Table 10. Mapping list for Residuals
Table 11. Mapping list for Unique tags
UH    I     Interjection
EX    UE    Existential ‘there’
TO    UT    Infinitive marker
XX    UX    Negative particle
PUQ   R     Punctuation mark (quotation)
PUN   R     Punctuation mark (non-quotation)

5.3. Some current markup validation practice

In the following list, we summarize claims made by the builders of several of the corpora analysed in Work Package 2 regarding how the encoding of their corpus was validated. The information here is only partial, and has not been reviewed by our informants.
British National Corpus
SGML parser used to validate all markup against the CDIF (Corpus Document Interchange Format) DTD; all tagging errors reported are then hand-corrected. Some semantic validation (on a portion of each text) was also performed for errors such as incorrect or missing headings, with limited manual correction. All addition of analytic tagging was automatic, but its syntactic validity was checked, again using an SGML parser. As a separate exercise, a 2 percent sample of the corpus was hand-checked for accuracy of analytic tagging, and the results used to improve the original part-of-speech tagging. (Results of this are not yet publicly available, but are due in 1998.)
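A sampling exercise of the kind described above can be sketched as follows. This is a hypothetical illustration, not the project's actual procedure: draw a reproducible random 2 percent sample of tagged tokens for hand-checking, then estimate an error rate from the hand-corrected sample.

```python
import random

def sample_for_checking(tagged_tokens, proportion=0.02, seed=0):
    """Draw a reproducible random sample (by position) for hand-checking."""
    rng = random.Random(seed)
    k = max(1, round(len(tagged_tokens) * proportion))
    positions = rng.sample(range(len(tagged_tokens)), k)
    return [(i, tagged_tokens[i]) for i in sorted(positions)]

def estimated_error_rate(checked):
    """checked: list of (automatic_tag, hand_corrected_tag) pairs."""
    errors = sum(1 for auto, gold in checked if auto != gold)
    return errors / len(checked)
```

The estimated error rate over the sample gives a measure of tagger accuracy that can be fed back into improving the automatic tagging, as was done here.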
LOB and Brown
No SGML mark-up used, but structure indicated by means of a simple and automatically verifiable coding. Typographic errors are retained unchanged. Analytic coding performed using similar techniques to those of the BNC.
London Lund Corpus
No SGML mark-up used, but detailed indication of prosodic features using idiosyncratic markup scheme; no information available as to how this was verified.
Penn Treebank
No SGML mark-up used, but detailed indication of syntactic features using an idiosyncratic markup scheme; validated by the project's own analytic tools.
International Corpus of English (ICE)
Originally used its own SGML-like markup scheme, validated by a suite of WordPerfect macros which inserted text unit markup after full stops etc. This system ‘generally ensures that markup symbols are closed, and reminds users to do so should they try opening the same symbol again before closing it’ (Nelson 1996, pp. 65-66). After developing further software tools to check validity, the project has reportedly converted to an SGML system, but we have been unable to obtain further details of this.
Multext and CRATER
Where applicable, automatic conversion of preexisting header data was carried out. As for primary data, in most cases division- and/or paragraph-level markup of some kind already existed in the texts received, so obtaining <p> and <div> markup was a matter of conversion or automatic insertion; corrections to paragraph-level markup were, however, made by hand. Since the projects were dealing with issues of alignment, the accuracy of sentence-level (and above) tags was crucial: while automatic means were used for as many of the steps as practical, hand-checking was also performed on markup at sentence level and above (<p>, <quote>, <div> etc.). All texts were parsed against their respective DTDs.
According to our informant, ‘The corpora were produced all over Europe in various formats and by people with varying amounts of experience and expertise in such work. Many started with a paper text, which was then scanned or even keyboarded. So this was clearly an issue to be tackled, especially since we wanted to align the texts and needed the markup to be not just accurate and SGML-wise correct, but also similar enough to assist the aligner. Parsers (nsgmls/xemacs) were used to check and correct the SGML, and most of the hands-on dirty work was done recently at the workshop in Nancy with Laurent Romary and his team. Most of the TELRI-ers who had prepared texts came along and we had the chance to really check and compare the texts. Some of the texts were initially sliced into sentences using tools that had been developed at our sites and which, being SGML-aware, can base their work upon an existing <p> structure.’
The Lampeter Corpus
Originally prepared using word processor macros to insert minimal tagging for font changes and some structural features, use of different languages etc. The texts were then converted to true SGML by a combination of automatic and manual means, and have been proof read several times. Correction and validation carried out using emacs, PSGML, SP, and Author/Editor.
Validated against the TEI P3 DTD twice, once after proofreading, and then again after alignment to check that the values of the id and corresp attributes are unique and that the value of the corresp attribute points to an existing id in the parallel text. All validation performed by SP; project has developed its own SGML-aware software for further analysis.
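A cross-reference check of this kind is easy to automate. The sketch below follows the description above (the element extraction is deliberately naive and purely illustrative; a real validator would use an SGML parser such as SP/nsgmls): it verifies that id values in one text are unique and that every corresp value points at an existing id in the parallel text.

```python
import re

# Naive extraction of id/corresp attribute values from SGML start-tags.
ID_ATTR = re.compile(r'\bid\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)
CORRESP_ATTR = re.compile(r'\bcorresp\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)

def check_alignment(text_a, text_b):
    """Return a list of problems: duplicate id values in text_a, or
    corresp values in text_a that do not match an id in text_b."""
    problems = []
    ids_a = ID_ATTR.findall(text_a)
    if len(ids_a) != len(set(ids_a)):
        problems.append("duplicate id values")
    ids_b = set(ID_ATTR.findall(text_b))
    for target in CORRESP_ATTR.findall(text_a):
        if target not in ids_b:
            problems.append("corresp=%s has no matching id" % target)
    return problems
```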
Uses SGML-like coding for speaker identification and vocalic effects, but this was not validated during data capture; some subsequent SGML-based analysis and validation was performed.
Uses OCP-style markup only; validated solely by analytic tools.
Some use of SGML-style tagging, e.g. for anaphor markup. No formal validation, other than by analytic tools.
Speech Thought and Writing Presentation Corpus
Some use of SGML-style tagging, but no formal validation other than by analytic tools; all tagging was added manually.
A minimal TEI-conformant DTD was defined at the start of the project, against which all corpora are eventually to be validated. Considerable variation in encoding practices is reported amongst partners; no detailed information is currently available.

6. References

  1. Atkins, S., Clear, J. and Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), pp. 1-16.
  2. Baker, J.P. (1997). Consistency and accuracy in correcting automatically tagged data. In: R. Garside, G. Leech and A. McEnery (eds.), Corpus Annotation. Addison Wesley Longman.
  3. Clear, J.H. (1992). Corpus sampling. In: G. Leitner (ed.), New Directions in English Language Corpora. Berlin: Mouton de Gruyter.
  4. Garside, R.G. and McEnery, A.M. (1993). Treebanking: the compilation of a corpus of skeleton parsed sentences. In: E. Black, R. Garside and G. Leech (eds.), Statistically Driven Computer Grammars of English: The IBM-Lancaster Approach. Amsterdam: Rodopi.
  5. Ide, N. and Veronis, J. (eds.) (1995). Text Encoding Initiative: Background and Context. Dordrecht: Kluwer. ISBN 0-7923-3704-2.
  6. Ide, Nancy (coordinator) (1998). Corpus Encoding Specification. Forthcoming in Proceedings of the First International Conference on Language Resources and Evaluation; see also URL http://www.cs.vassar.edu/CES
  7. Langendoen, T.L. and Simons, G. (1995). Rationale for the TEI recommendations for feature-structure markup. In Ide and Veronis (1995).
  8. Leech, G. (1993). Corpus annotation systems. Literary and Linguistic Computing, 8(4), pp. 275-281.
  9. Leech, G. and Wilson, A. (1994). EAGLES Morphosyntactic Annotation. EAGLES Report EAG-CSG/IR-T3.1. Pisa: Istituto di Linguistica Computazionale.
  10. Nelson, G. (1996). Markup systems. In: S. Greenbaum (ed.), Comparing English Worldwide: The International Corpus of English, pp. 36-53. Oxford: Clarendon Press.
  11. Snow, C. and Ninio, A. (1986). The contracts of literacy: what children learn from reading books. In: W. Teal and E. Sulsky (eds.), Emergent Literacy, pp. 116-138. New Jersey: Ablex.
  12. Sperberg-McQueen, C.M. and Burnard, L. (1995). The design of the TEI encoding scheme. In Ide and Veronis (1995).
  13. Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Blackwell.
  14. Clark, James (1998). SP: An SGML System [software]. Available from URL http://www.jclark.com/sp/