ELRA Work Package 3: first draft
This document forms the chief deliverable for Work Package 3 of the ELRA contract for validation of language corpora. It discusses the theoretical basis underlying our approach to the formal validation of language corpora, and makes some recommendations about relevant techniques and practices which may be of assistance in performing such evaluations, and documenting their results. Particular attention is paid to the specific case of morpho-syntactically annotated corpora.
Some confusion exists about the terminology associated with
linguistically annotated corpora. This is partly because the term
tagset is used differently by two different communities.
For the traditional corpus linguist, a tagset is the set of possible
values used to explicitly annotate a text with a linguistic analysis;
for example, the CLAWS tagset comprises a set of values such as
NN1, VVD etc., each of which has a specific significance
(singular common noun, verb past tense, etc.). For the mark-up
specialist, however, the term tagset refers to any kind of annotation,
in particular the collection of SGML tags corresponding with the
elements defined in a particular DTD: for example, the TEI defines a
number of tagsets, each containing definitions for specific SGML
elements and attributes.
Both usages reflect the fact that all markup introduced into a text is identical, at some level of analysis, in the sense that it serves to record or assert an association between stretches of text and values taken from some externally defined set of interpretations.
However, most people seem to categorize an analysis such as ‘this is a paragraph’ differently from the formally equivalent judgement ‘this is a noun’. The former judgement is said to be ‘structural’ and the latter ‘interpretative’. This kind of categorization also underlies the notion of ‘level’ of annotation as exemplified by (inter alia) the Corpus Encoding Specification (Ide 1998), where the distinction is further justified by the observation that the addition of so-called ‘structural’ markup is generally easier to automate than that of ‘interpretive’ markup, since the latter (almost) invariably requires human judgement and knowledge, while the former rarely does. Particularly in the case of textual markup, interpretative judgements tend to be more controversial than structural ones, if only because the latter relate to aspects of a text which are accepted as intrinsic to its substance by the community of text readers. Structural interpretations form part of the ‘contracts of literacy’ (Snow and Ninio, 1986) which form the precondition of a text's recognition as meaningful by the members of a particular community of readers.
For purposes of validation, however, the distinction seems unhelpful. All markup introduced into a corpus should be validated in the same way, and the validity of the corpus overall is equally affected by each type of markup used. Nevertheless, we have subdivided our discussion into two parts, reflecting the division currently made by most practitioners between structural and interpretative markup, a division consequently reflected in actual practice. Structural markup is most generally validated with reference to an abstract model of textual components and features which is either entirely intuitive and ‘common sense’ based, or defined in terms of some consensus-based model such as that of the TEI, restated as an SGML DTD. Interpretative markup may be similarly theory-free (see, for example, Leech 1993), but it is more customary to define it with reference to some explicitly stated analytic model, and hence to facilitate both automatic validation of the corpus itself (to check that it is valid in its own terms) and comparison of two corpora using different markup schemes derived from a common abstract model.
In section 2. Validation of Structural Analyses we discuss the process by which the structural markup defined for a given corpus may be validated. The formal mechanism used for this purpose is an SGML document type definition. In section 3. Validation of Morphosyntactic Analyses we discuss in more detail one particular kind of interpretative markup: that which seeks to make explicit morpho-syntactic analysis of a text. We present here an SGML scheme for the formal expression of an abstract model that may be used to validate such analyses both internally and externally. Finally, in section 4. Representation of Validation we suggest some ways in which the result of either validation exercise may be formally documented. We begin, however, by describing the model of formal validation which underlies both descriptions. (For a more detailed discussion of the principles adumbrated here, see Sperberg-McQueen and Burnard 1995).
- the start of a new input record;
- the presence of some distinguishing code or sign such as a star, not otherwise present in the text;
- the presence of some predefined symbol such as the tag <s>.
We further assume that it is possible to define a grammar for such markup symbols: that is, a grammar which defines which combinations of such symbols in a document are to be regarded as legal. Such grammars generally have regard only to the markup language itself, rather than its extension to the underlying feature set. A markup grammar may simply enumerate all legal markup tokens, or simply specify an algorithm for the identification of markup tokens with no consideration of which markup tokens might be permitted. A more complex grammar (such as SGML) may also be used, enabling the formulation of contextual rules such as ‘the tag X is only legal within the scope of the component identified by tag Y’ in addition to these kinds of rules. Note however that legality is still defined here in terms of syntax: only informal legislation can determine whether the content of an SGML element is ‘correct’ with reference to some semantic model. Publications such as the TEI Guidelines typically extend the syntactic definitions embodied in their DTDs by more or less detailed discussion of the intended semantics of elements, but rarely provide a formally verifiable abstract model of such semantics, nor is it entirely clear what such a model might resemble. Nevertheless, throughout our discussion we will use the term feature (and derivatives) to refer to components of such a model, and the term tag (and derivatives) to refer to components of the markup system used to assert their existence.
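A contextual rule of the kind just described (‘the tag X is only legal within the scope of the component identified by tag Y’) can be checked even without a full SGML parser. The following sketch illustrates the idea in Python; the containment table and the fragment it checks are invented for illustration, not taken from any real DTD:

```python
# Sketch: check that each tag occurs only inside its permitted parent.
# The containment table below is a hypothetical illustration, not a real DTD.
import re

ALLOWED_PARENT = {"s": "p", "w": "s"}  # tag -> required enclosing tag

def check_containment(text):
    """Return a list of (tag, offset) violations for a simple SGML fragment."""
    stack, errors = [], []
    for m in re.finditer(r"<(/?)(\w+)[^>]*>", text):
        closing, tag = m.group(1), m.group(2)
        if closing:
            if stack and stack[-1] == tag:
                stack.pop()
        else:
            required = ALLOWED_PARENT.get(tag)
            if required is not None and (not stack or stack[-1] != required):
                errors.append((tag, m.start()))
            stack.append(tag)
    return errors

# A <w> occurring outside any <s> is reported as a violation.
print(check_containment("<p><s><w>ok</w></s></p> <w>stray</w>"))
```

A real SGML parser does far more than this (entity handling, minimization, attribute checking), but the stack-based containment check captures the essential contextual rule.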
This distinction seems to us crucial to the feasibility of validation: ‘A corpus is a collection of utterances, and therefore a sample of actual linguistic behaviour. However, even if we do not believe that the distinction between competence and performance is valid, a corpus is not itself the behaviour, but a record of this behaviour’ (Stubbs, 1996). The function of the markup in the corpus is to make explicit, and hence accessible to comparative study, the recording process for both structural and interpretative encoding in a corpus text. Without this, neither comparative studies of different corpora, nor any assessment of the validity of the corpus ‘record’ with respect to what it ‘records’ will be possible.
- for each feature of interest, does the document contain any tagging?
- is the tagging of the document syntactically correct?
- is the tagging of the document consistently applied (i.e. is every occurrence of a given feature tagged in the same way)?
- is the tagging of a document correctly applied, with reference to some externally (or internally) defined abstract model?
- if correct, is the tagging of a document complete, with reference to some externally (or internally) defined list of mandatory features?
Taking these in reverse order, it is clear that, in the general case, the last two of these stages are automatable only to the extent that an abstract model can be formally specified for both the feature system itself and for the intended correspondence between that and the tagging employed. We present in section 5.1. A Feature System Declaration for the EAGLES morphosyntactic Guidelines below one such abstract model, the EAGLES Guidelines for morpho-syntactic annotation (Leech and Wilson, 1994), re-expressed as a TEI-conformant feature system, against which any other set of morpho-syntactic annotations using the same representation may be validated, without necessarily having to conform to the EAGLES model. We also discuss the somewhat simpler abstract model proposed by EAGLES itself in section 3.2. Understanding the Markup below.
Equally clearly, however, neither the third nor the first of the stages above can in principle be automated, since both depend on a human judgement to the effect that such and such a feature is in fact present, whether or not it is signalled by the tagging in a text. Such text-comprehension abilities still seem to be somewhat beyond the state of the art in NLP, despite some advances.
The second of the stages above is, however, automatable to the extent that the tagging syntax of the document is fully specified. In an SGML context, this implies the existence of a DTD against which candidate documents can be verified using an SGML parser. For other forms of markup, validation may involve other forms of verification, some of which may be intimately tied to the behaviour of particular application software. For example, a document marked up in RTF or LaTeX may be considered valid so long as Microsoft Word or LaTeX does not reject it, irrespective of its output. Technical documentation will often specify what markup should be found in a document: where the markup syntax is arbitrary or application-specific, special-purpose software must clearly be developed to validate it.
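Where a command-line parser such as SP's NSGMLS is available, this check can be scripted. The sketch below (Python) assumes the nsgmls binary is installed and on the path; with the -s option the parser suppresses its normal output, so any validation errors appear on the standard error stream:

```python
# Sketch: run the SP parser (nsgmls) over a corpus file and report
# conformance. Assumes the nsgmls binary is installed and on the path.
import shutil
import subprocess

def sgml_valid(path):
    """Validate `path` with nsgmls; return (ok, error_messages).
    With -s, normal output is suppressed, so stderr carries only errors."""
    if shutil.which("nsgmls") is None:
        raise RuntimeError("nsgmls (from the SP package) is not installed")
    result = subprocess.run(["nsgmls", "-s", path],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stderr

# ok, errors = sgml_valid("corpus.sgml")
```

The same pattern applies to any batch validation run over a large corpus: the parser's exit status gives the yes/no answer, and the captured error stream can be filed as part of the validation report.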
Language corpora are made by combining whole texts or extracts from pre-existing documents, usually according to some specific design criteria. The structure of the corpus itself may thus be described (and hence marked up) at two levels: internal, relating to the way the parts of the corpus fit together, and external, relating to compositional features of the originals. This distinction holds good whether the corpus under consideration is a fixed document or a dynamic or ‘monitor’ corpus; in the latter case, as well as generally dictating the use of whole texts rather than extracts, the internal design criteria may be further extended to include such topics as the rate at which new documents enter the corpus, the criteria for determining that they should be discarded from it, etc.
The internal structural features of a corpus are largely self-evident, and require little validation: common practice requires only that individual text fragments be clearly delimited, and that each be associated with an appropriate level of description or metadata. In the TEI model, the former constitutes the text proper, and the latter its header. In older corpora, it was common practice to provide such metadata (if at all) as a separate documentary component, with only an informal association between the two, often depending on such artifices as file-naming conventions or sequencing to identify descriptive features of each component. The TEI model uses the power of SGML (in particular, its hierarchic structure and the consequent ability to specify property inheritance) to build more sophisticated structures. (For an account of some of these, see the discussion in e.g. Chapter 23 of the TEI Guidelines.)
The scope of the external features to be found marked up in language corpora varies greatly, depending both on the diverse nature of the materials they include and the diversity of applications envisaged for them. In large corpora, economic considerations alone preclude any attempt at modelling in the markup the full diversity of structures which a detailed textual feature analysis might indicate as possible: in the earliest corpora, for example the Lancaster/Oslo/Bergen corpus, even such basic organizational features as paragraphs or subheadings are rarely distinguished as such. Even today, the corpus designer is always forced to make pragmatic decisions about which structural features will have sufficient usefulness in the intended applications to warrant the expense of identifying them consistently and correctly. For many purposes, division into discrete segments, corresponding with identifiable locations in the original source, is adequate. For other purposes (for example, the study of discourse-related phenomena or text-grammar) a richer approach will be desirable.
Standards such as the CES provide a rich set of feature descriptions from which the corpus builder can select, together with specific tagging rules about how the presence of selected features can be made explicit. There is, however, considerable (and understandable) reluctance to make recommendations about which particular selections are appropriate or mandatory, since this will inevitably depend on the intended application for the corpus.
To validate such corpora therefore, a necessary first step is to identify the intentions of the designer. A corpus which does not mark up paragraph divisions is not necessarily less valid or useful than one which does; a corpus which claims to mark such divisions but which does so inconsistently or inaccurately is. Unfortunately, as WP2 demonstrates, it is often hard for corpus builders to specify their intentions in this respect, and harder for the validator to determine the extent to which these intentions have been carried out. Documentation and the provision of a DTD go some way to simplifying the task, as further discussed below.
As noted above, the extent to which the syntactic consistency of the structural markup in a corpus can be validated depends on the extent to which that markup uses a formally verifiable syntax. The great merit of SGML as a markup language is precisely that it makes this automatic verification simply a matter of defining an appropriate grammar (a document type definition) and checking the corpus against it. The most widely used software for this purpose is currently the freely available SGML parser SP, particularly its DOS incarnation NSGMLS [SP]. With the growing take up of SGML and of its simplified version XML, the number and sophistication of such systems is likely to increase greatly.
- are the tags present in the corpus all defined in its DTD?
- are the tags in the corpus all present in syntactically correct contexts?
- do all attributes specified for the tags in the corpus conform to the value ranges specified for them in the DTD?
- are any cross references specified by the SGML markup satisfied?
The output from an SGML parser is thus typically either simply confirmation that the document does in fact conform to the DTD, or a list of instances where it does not conform. At the risk of stating the obvious, it should be emphasized that a corpus which does not conform to its DTD, or which lacks a DTD, cannot be validated, no matter how closely its markup appears to be modelled on that of the SGML standard. The notion ‘SGML-like’ or ‘unvalidated SGML’ is not a helpful one in this context.
For corpora which do not use SGML markup, validation will require the provision of some DTD-like set of formal rules, and the production of some parser-like software to check them against the corpus itself. Such procedures are eminently feasible, and for simple markup schemes may appear preferable to the expense of converting the markup to true SGML. Nevertheless, for a variety of reasons it is not necessary to summarize here, we do not recommend this approach: in the long run, the use of a widely accepted standardized markup language should always be less expensive than the maintenance of an idiosyncratic or application-limited scheme.
- whether every item tagged as an instance of some feature is in fact such an instance
- whether every instance of some feature is in fact tagged as such
To a large extent, however, these are limitations inherent in the whole markup enterprise; they also touch on fundamental problems of naming and ontology which have exercised philosophers since the time of Aristotle, and for which it would be unreasonable to expect immediate answers. Nevertheless, it is possible to make some pragmatic observations, additional to those provided in section 3.3.1. Semantic Correctness below concerning the semantic validation of analytic tagging.
Although not formally presented as such, pre-defined feature lists such as those provided by the TEI and CES may be regarded as constituting a kind of abstract model for the structural components of texts. They thus provide a useful reference point against which the validator may check both that the objects tagged as representing some feature appear to conform with the definitions supplied there, and conversely that no features conformant with those definitions are present but untagged or tagged inappropriately. This remains however an entirely manual process.
Few corpora are small enough to permit the luxury of a close reading, and so in the general case this kind of manual validation can only be done by sampling. The typical procedure is thus to inspect some random sample of the corpus for the presence of specific tagged features, for example paragraph boundaries or headings. Provided that the location of these samples within the original documents is known, an attempt can then be made to assess the accuracy with which the tagging of structural features has been carried out across the corpus with respect to the original source. In the absence of an original source, such accuracy can be assessed only in statistical terms, for example by comparing the distribution of certain tagged features in the sample with their distribution across the whole, where a ‘correct’ distribution can be hypothesized on the basis of a priori reasoning (e.g. the number of paragraphs per text of a given type should be reasonably stable) or by applying other statistically derived heuristics.
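The paragraphs-per-text heuristic just mentioned can be sketched very simply: compute the mean paragraph count across texts of a given type and flag texts that deviate widely from it. The counts and the tolerance below are invented for illustration:

```python
# Sketch: flag corpus texts whose tagged-paragraph count deviates widely
# from the corpus-wide mean, as a crude proxy for mistagged structure.
def flag_outliers(para_counts, tolerance=0.5):
    """Return ids of texts whose paragraph count differs from the mean
    by more than `tolerance` (expressed as a fraction of the mean)."""
    mean = sum(para_counts.values()) / len(para_counts)
    return sorted(t for t, n in para_counts.items()
                  if abs(n - mean) > tolerance * mean)

counts = {"A01": 42, "A02": 39, "A03": 3, "A04": 45}  # invented counts
print(flag_outliers(counts))
```

A flagged text is not necessarily wrong, of course; it is simply a candidate for the kind of manual inspection described above.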
In this section we discuss the possibilities for automatic or semi-automatic validation of one particular form of interpretative markup: that which seeks to mark up the result of a morphosyntactic analysis.
Whatever form of markup is employed, morphosyntactic tagging is usually supplied at the level of individual tokens in a text and is thus usually self-evident. Even in the absence of any documentation, it is generally a simple matter to extract from a document all the unique tokens constituting the markup, and also to identify the lexemes to which they are attached, as was done, for example, by Garside and McEnery 1993. In this example, annotations were separated from words by underscore characters. Other schemes place the markup and lexeme in separate ‘fields’, or on alternate lines within the text proper. In SGML documents, annotations may be represented as attribute values, or as distinct elements, and the association between lexical item and annotation may be made by means of a pointer or link.
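For underscore-separated annotation of the kind used by Garside and McEnery, extracting the inventory of markup tokens is a one-line matter. A minimal sketch (the sample text and its tags are illustrative):

```python
# Sketch: extract the inventory of annotation codes from word_TAG text.
import re
from collections import Counter

def tag_inventory(text):
    """Count each annotation code attached to a token by an underscore."""
    return Counter(m.group(1) for m in re.finditer(r"\S+?_(\S+)", text))

sample = "the_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1"
print(tag_inventory(sample))
```

The resulting inventory, sorted by frequency, is often the first document a validator produces: unexpected or very rare codes in it are prime candidates for the typographic errors discussed later.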
It will be rather less easy (in the absence of documentation) to determine what feature or combination of features each markup token is intended to represent. The list of all markup tokens, together with an index of their occurrences, and the associated lexical item, might be collated with an annotated corpus in which the same lexical items are associated with annotations whose feature equivalences are known, thus providing a kind of latter-day Rosetta Stone for the purpose. Such a process is hardly likely to be easily automated. This is one good reason for insisting on the availability of such documentation, preferably in a form which can be readily mapped to agreed standards.
Such mapping requires the predefinition of an agreed set of morphosyntactic features, independent of markup. Such a set is provided in the context of several western European languages (such as Danish, English, French, German, Greek and Spanish) by the EAGLES morphosyntactic annotation guidelines (Leech and Wilson, 1994), which we have therefore adopted as a test case for our recommendations. The procedures described here and the conclusions we reach would be equally applicable to any other set of Guidelines. However, as the EAGLES guidelines have been published on the basis of a wide-ranging review of corpus builders, recommendations derived from it are likely both to reflect, and potentially have a wide impact on, current practice.
The EAGLES recommendations have a dual focus: as well as providing an abstract model of the feature sets against which any particular combination of the features tagged in some corpus may be validated, the Recommendations specify explicitly a subset of ‘recommended’ features which it is assumed should always be marked. Validation at this level thus becomes a matter of simply checking that the recommended features are in fact present; in the terms we introduced in section 1.2. Principles of Textual Markup above, this is validation that the tagging is not only syntactically correct, but also complete.
- a one- or two-letter code is used for some ‘obligatory features’ (the basic parts of speech); for example, AJ indicates the feature ‘Adjective’, N indicates the feature ‘Noun’, and so on;
- each recommended feature that is assigned to an obligatory feature occupies one place in the representation; thus, if, as for the obligatory feature ‘Noun’, there are four associated recommended features, then there will be a four-place representation. ‘Recommended’ features are not mandatory, but come with a strong suggestion that any system of morphosyntactic annotation for the languages covered by EAGLES should include them;
- in each place or ‘slot’ in the representation a number is inserted according to the value represented. For instance, the first slot in the representation for ‘Noun’ is assigned to the recommended feature ‘Type’: this has two possible values, ‘Common’ (represented by 1) and ‘Proper’ (represented by 2). So the representation for a proper noun would begin N2 and that for a common noun N1. If a recommended feature is not represented for whatever reason, a 0 is placed in the appropriate slot instead of an actual feature value.
- common noun, singular; gender and case not represented
- common noun, singular, genitive; gender not represented
- proper noun, feminine, singular, accusative
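Such slot-based codes can be decoded mechanically. The sketch below assumes, purely for illustration, the slot order Type, Gender, Number, Case for nouns, with invented value tables for the slots beyond Type; the actual assignments are those given in the EAGLES report:

```python
# Sketch: decode an EAGLES-style intermediate code such as "N1010".
# The slot order and value tables below are illustrative assumptions;
# consult the EAGLES report for the actual assignments.
NOUN_SLOTS = [
    ("Type",   {"1": "Common", "2": "Proper"}),
    ("Gender", {"1": "Masculine", "2": "Feminine", "3": "Neuter"}),
    ("Number", {"1": "Singular", "2": "Plural"}),
    ("Case",   {"1": "Nominative", "2": "Genitive"}),
]

def decode_noun(code):
    """Map e.g. 'N1010' to feature-value pairs; a '0' slot is skipped,
    since 0 means the feature is not represented."""
    assert code[0] == "N", "sketch handles noun codes only"
    features = {}
    for digit, (name, values) in zip(code[1:], NOUN_SLOTS):
        if digit != "0":
            features[name] = values[digit]
    return features

print(decode_noun("N1010"))  # common noun, singular; gender, case unmarked
```

Comparing the decoded feature sets from a corpus against the master list of features then reduces to a comparison of dictionaries, which is the basis of the validation procedure described next.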
This representation provides a convenient means of facilitating validation against a standard list of features. By comparing intermediate representations from the corpus with the representation of the master list of features, it may easily be ascertained what features and values are or are not represented. Even where the intermediate representation is not used, a mapping list can still be produced showing for each corpus tag the EAGLES feature which it encodes. This latter kind of list is also essential for non-EAGLES-conformant corpora and, on a smaller scale, for any additional optional features used within the EAGLES remit. In section 5.2. Sample Mapping Lists for the EAGLES Obligatory Features we present examples of mapping lists for a non-EAGLES-conformant tagset (in this case, Lancaster University's Claws C7 tagset as used in the part-of-speech annotation of the British National Corpus).
Two problems arise however when attempting such mappings. The tagset under consideration may under-specify with relation to the EAGLES master list, that is, some annotation may map onto more than one feature combination. For example, the CLAWS 7 tagset uses the tag VV0 to denote any non third person singular form of a regular present tense verb, thus blurring the distinction between the imperative, first person singular, second person singular and first, second or third person plural.
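A mapping list must therefore allow one corpus tag to map to several EAGLES feature combinations. A minimal sketch follows; the feature bundles shown are abbreviated illustrations of the VV0 ambiguity just described, not the official EAGLES encoding:

```python
# Sketch: a mapping list in which one corpus tag (here CLAWS VV0) maps
# to several candidate feature bundles. The bundles are abbreviated
# illustrations, not the official EAGLES representation.
MAPPING = {
    "VV0": [
        {"POS": "Verb", "Mood": "Imperative"},
        {"POS": "Verb", "Tense": "Present", "Person": "1", "Number": "Sing"},
        {"POS": "Verb", "Tense": "Present", "Person": "2", "Number": "Sing"},
        {"POS": "Verb", "Tense": "Present", "Number": "Plur"},
    ],
    "NN1": [{"POS": "Noun", "Type": "Common", "Number": "Sing"}],
}

def is_underspecified(tag):
    """A tag under-specifies if it maps to more than one feature bundle."""
    return len(MAPPING.get(tag, [])) > 1

print([t for t in MAPPING if is_underspecified(t)])
```

Listing the under-specified tags in this way makes explicit exactly which EAGLES distinctions cannot be recovered from the corpus annotation, which is itself useful validation documentation.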
The opposite situation, where the tagset over-specifies, is also possible, particularly where the boundary between morphosyntax and semantics is blurred and the tagset makes distinctions between sets of features regarded as equivalent by EAGLES. For example, CLAWS includes a ‘Noun of Style’ tag (NNB) to mark English honorifics such as ‘Mr’, ‘Dame’, ‘Professor’ etc., for which no equivalent feature is identified by EAGLES, and which therefore cannot be distinguished from other parts of proper names.
It should be noted that EAGLES does allow for arbitrary extensions to cover language-specific features. However, to stay with the previous example, honorifics are to be found in most European languages, and hence to treat them as language-specific is not appropriate. Extensibility of the basic features and their sub-categorizations will clearly be essential to any general purpose representation scheme for feature systems, and some such systems may require something more complex than a simple two-level categorization of this kind. EAGLES, itself the product of a consensus amongst corpus analysts at a particular point in time, was designed with the changing needs and practices of that community in mind. It is anticipated that revisions to both the list of recommended features and the sets of features they summarize will occur steadily, particularly as the field of application extends beyond the relatively frequently studied Western European languages.
In the general case, what is needed is a representation scheme which maximizes the flexibility of the annotation scheme without compromising the need to validate instances of its use. We discuss such a scheme in the next section.
A more powerful and discriminating representation is provided by the TEI tagset for feature structure analysis. This has two parts, a set of tags for the direct representation of feature structures, which can be linked to instances of textual objects so analysed, and a set of tags for documenting the feature system itself, that is, the constraints, allowable feature-value pairs etc. which are to be regarded as valid in a given analysis.
The feature system representation is defined in chapter 26 of the TEI Guidelines; Langendoen and Simons 1995 provides a useful introduction. A feature, in this scheme, is defined as a pair comprising a name and a value. The latter may be one of a defined set of value types, including Boolean (plus or minus), numeric, string (an open-ended set of values), symbol (one of a defined set), a feature structure, or a reference to one. A feature structure is a named combination of such features, ordered or unordered.
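The same model can be sketched outside SGML. The Python fragment below represents a feature structure as nested name-value pairs and checks symbolic values against their declared ranges; the declarations are illustrative inventions, not the TEI's:

```python
# Sketch: a feature structure as nested name/value pairs, with a check
# that each symbolic value belongs to its declared range. The declared
# ranges below are illustrative, not taken from the TEI Guidelines.
DECLARED = {"number": {"singular", "plural"},
            "properness": {"plus", "minus"}}

def valid_fs(fs):
    """Check every symbolic feature value against its declared range,
    recursing into embedded feature structures."""
    for name, value in fs.items():
        if isinstance(value, dict):          # embedded feature structure
            if not valid_fs(value):
                return False
        elif name in DECLARED and value not in DECLARED[name]:
            return False
    return True

print(valid_fs({"number": "singular", "properness": "plus"}))  # True
print(valid_fs({"number": "dual"}))                            # False
```

The TEI's SGML representation adds what this sketch lacks: a standard interchange syntax, typed values, and the ID/IDREF linkage between analyses and the text they annotate.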
It should be apparent how this approach permits an SGML-aware processor to identify automatically linguistic analyses where features such as number or properness are marked, independently of the actual category code (the NP2) used to mark the analysis. In addition, of course, the use of the SGML ID/IDREF mechanism allows for simple validation of the codes used. For more sophisticated validation, for example to ensure that the feature properness cannot be both plus and minus in the same analysis, the TEI specifies an additional declarative mechanism, known as a feature system declaration (FSD).
Full details of the FSD are provided in chapter 26 of the TEI Guidelines; its relevance for our present purposes is that it provides a mechanism intermediate in constraining power between a full document type definition (which requires that all possible annotations or tags be specified in advance) and the kind of limited validation possible with the EAGLES mapping list. A fully elaborated feature system declaration for the EAGLES morphosyntactic classification scheme is presented in section 5.1. A Feature System Declaration for the EAGLES morphosyntactic Guidelines below. This more general solution makes possible a form of internal validation, whereby the contents of the corpus are validated against feature lists produced specifically for that corpus, or where the feature list used is a super- or sub-set of the EAGLES feature list, without losing the ability to validate that part of the feature set which does coincide with EAGLES' recommendations.
Returning for the moment to the utility of the original EAGLES report for validation, as a first step for languages covered by the report, corpus designers would be foolish to ignore the relevance of the EAGLES obligatory and recommended features, since these now form an agreed cross-linguistic EU standard. Any internal validation should thus be regarded as secondary to an EAGLES validation. Adoption of a feature-based system for validation makes possible the application of identical validation techniques in either case.
The process of deriving a feature set from documentation is also a convenient way of checking the thoroughness and consistency of the documentation itself. Anomalies such as the presence of undocumented tags in the corpus, or the presence of unused or ‘phantom’ features in the documentation are often only found by such a process.
- (1) they are present for the sake of completeness but simply did not occur in the text corpus being examined;
- (2) their presence is a historical accident, representing for example a change in the design of the feature analysis;
- (3) they should have been applied to the corpus but were not.
Clearly, the most serious case is that of (3): here the annotation does not validate against the intended features and needs to be rectified. Such deficiency, at least at the EAGLES obligatory and recommended levels, should be immediately evident when the corpus annotation used is checked against the feature list. In the case of (2), only the documentation needs correcting. In the case of (1), the matter should simply be documented, for the information of corpus users. Phantom tags can be introduced as the result of typographic errors; the use of an automatic system for introduction of tags and their automatic validation against the agreed corpus tagset entirely does away with this form of error.
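Checking the tags actually used in a corpus against the documented tagset is a simple set comparison. A sketch (the tag lists are invented; NNI stands in for a plausible typographic error):

```python
# Sketch: compare the tags actually used in a corpus against the
# documented tagset, reporting discrepancies in both directions.
def audit_tags(used, documented):
    """Return undocumented tags (in the corpus but not the documentation,
    often typos) and 'phantom' tags (documented but never used)."""
    used, documented = set(used), set(documented)
    return {"undocumented": sorted(used - documented),
            "phantom":      sorted(documented - used)}

report = audit_tags(used=["NN1", "VVD", "NNI"],       # NNI: a likely typo
                    documented=["NN1", "NN2", "VVD"])
print(report)
```

Each phantom tag reported must then be classified by hand into one of the three cases above; the undocumented tags are candidates for correction in the corpus or addition to the documentation.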
- each appropriate lexical item receives an appropriate annotation;
- each appropriate lexical item receives a single annotation;
- each annotation used is documented and corresponds with a known feature, i.e. there are no typographic errors;
- the annotation is presented using a consistent and correct syntax.
We use the phrase ‘lexical item’ above to indicate that the tokens to which annotation is attached need not correspond with orthographic words. Although many commonly used annotation schemes for English do in fact attempt to make this correspondence, it is unnecessary where a single formalism such as SGML or something of equivalent power is used to represent both structure and analysis.
so_CS21 that_CS22 (or the equivalent SGML formalism). Here the initial letters indicate the basic tag CS, the following digit 2 indicates the number of tokens to which it is to be attached, and the final 1 and 2 indicate the position of this token within the sequence. A more natural approach would be to revise the tokenization rules so that the token ‘so that’ might be treated as a single unit, tagging it with a single conjunction tag.
We recommend above that a single annotation be attached to each lexical token, recognizing that in production systems it may be necessary to retain deliberately ambiguous or polyvalent annotations to avoid incorrect deterministic disambiguation. Such exceptions to the ‘one word, one tag’ rule should be clearly documented to aid validation; ideally, each possible combination of multiple annotations can be represented as a distinct choice within the feature set. The FSD notation recommended below supports this possibility.
The majority of these tasks can be achieved using a series of procedures aided by simple Unix tools such as awk and grep. Checking SGML requires an SGML parser, and a number of these are available. As part of this workpackage, we reviewed the SGML validation that had been undertaken on the corpora covered in the WP2 review. For the most part, the results (summarized in section 5.3. Some current markup validation practice below) indicate that as yet only a few corpus builders are taking advantage of the availability of tools such as SGML parsers to validate formally-defined markup schemes.
This is unsurprising, given that such schemes have only begun to gain wide acceptance in the last few years. However, it does seem strange that the topic of validation is rarely touched on in the extant literature concerning corpus design and construction; where it is, it relates almost exclusively to the statistical validity of a given sample as representative of some aspect of language (see for example Clear 1992, Atkins et al 1990). Corpora such as the LOB and Brown have been so exhaustively studied and analysed that it would be surprising if such errors as they contain had not come to light; where they have, however, corpus designers and builders seem to have been uninterested in their status or implications. A plausible reason for this is that it is only with the advent of really large corpora, often produced by automatic or semi-automatic methods of data capture such as optical character recognition or as a by-product of electronic typesetting, that questions of accuracy and authenticity have arisen.
As stated above, an accurate assessment of the semantic validity of any markup in a corpus is an inherently intractable problem. Where the function of the markup is to assert the existence of a human interpretation of the data, it can probably only be validated manually, although some control over variability may be gained by applying rough heuristics to assess semantic conformance to a pre-established norm. For example, if we know the statistical distribution of specific nouns, verbs, etc. in a general corpus such as the BNC, then we may be able to check future corpora against these rough distributions. However, this is clearly a rough and ready process.
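A heuristic of this kind might be sketched as follows. The reference proportions, the observed counts and the tolerance factor below are all invented for illustration; in practice the reference figures would be derived from a large general corpus such as the BNC:

```python
def flag_divergent_tags(observed_counts, reference_dist, tolerance=3.0):
    """Flag tags whose relative frequency in the corpus under validation
    differs from the reference proportion by more than a factor of
    `tolerance`, in either direction."""
    total = sum(observed_counts.values())
    flagged = []
    for tag, ref_p in reference_dist.items():
        obs_p = observed_counts.get(tag, 0) / total
        if obs_p > ref_p * tolerance or obs_p < ref_p / tolerance:
            flagged.append(tag)
    return flagged

# Hypothetical reference proportions (invented figures).
reference = {"NN1": 0.15, "VVD": 0.05, "AT1": 0.08}

# Tag counts observed in the corpus under validation (1000 tokens in all).
observed = {"NN1": 150, "VVD": 10, "AT1": 80, "OTHER": 760}

print(flag_divergent_tags(observed, reference))  # ['VVD']
```

A flagged tag is not necessarily an error, of course; the check merely directs human attention to the portions of the annotation most likely to repay inspection.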
Let us turn now to hand validation. Even where human checking occurs, a validation cannot be considered 100% accurate, since frequently there is scope for error or genuine disagreement, even within a single set of guidelines (see for example Baker 1997). One possibly automated check would be to see whether an assigned tag is allowed for a given word, by checking the word's entry in a lexicon. However, this only makes sense when (a) a lexicon has been used to tag the text and (b) manual correction has taken place; otherwise we can already be sure that the tag is permissible, unless there is something very seriously wrong with the operation of the tagging program. Limitations on this method of checking are (a) the fact that often a suffix list, etc., rather than an exhaustive lexicon, is used for tag assignment and (b) the presence of new tags, i.e., permissible and correct tags added by human annotators because a new contextual reading is missing from the lexicon.
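The lexicon-based check just described might be sketched as follows. The lexicon fragment is invented for illustration, and words absent from the lexicon are deliberately skipped, precisely because of limitations (a) and (b) above:

```python
# A lexicon mapping each word form to the set of tags permissible for it.
# (A tiny invented fragment; a real lexicon is far larger, and would still
# need supplementing for suffix-assigned and human-added tags.)
LEXICON = {
    "run": {"NN1", "VV0"},
    "runs": {"NN2", "VVZ"},
    "the": {"AT"},
}

def impermissible(tagged_tokens):
    """Return (word, tag) pairs where the tag is not permitted by the
    word's lexicon entry.  Words absent from the lexicon are skipped,
    since their tags may be legitimate additions by human annotators."""
    bad = []
    for word, tag in tagged_tokens:
        allowed = LEXICON.get(word.lower())
        if allowed is not None and tag not in allowed:
            bad.append((word, tag))
    return bad

print(impermissible([("The", "AT"), ("runs", "VVZ"), ("run", "JJ")]))
# [('run', 'JJ')]
```

Pairs reported by such a check are candidates for human review, not certain errors, for the reasons given in the text.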
In addition to the strictly morphosyntactic analysis discussed so far, the EAGLES Guidelines also envisage two generic forms of syntactic analysis: phrase structure and dependency. Phrase structure grammars require the ability to model well-balanced trees in a markup language, while structural dependency grammar requires the ability to describe directed acyclic graphs.
Both abilities are intrinsic to the SGML abstract model, and the tasks of first representing, and then validating the correctness of, such structures are thus comparatively trivial. Furthermore, it is clear that the fundamental problems of semantic validation are the same whether analyses are attached to high-level structural units such as those identified by syntactic analysis or to lower-level word-like tokens.
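As an indication of how little machinery such a structural check requires, the sketch below (ours, and not part of any EAGLES apparatus) verifies that a dependency analysis, represented here simply as a mapping from each token index to its head, contains no cycles:

```python
def is_acyclic(heads):
    """Check that a dependency analysis contains no cycles.
    `heads` maps each token index to its head index; root tokens
    map to None."""
    for start in heads:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return False       # we have revisited a token: a cycle
            seen.add(node)
            node = heads[node]     # follow the chain of heads upwards
    return True

# "the cat sat": 'the' depends on 'cat', 'cat' on 'sat', 'sat' is the root.
assert is_acyclic({0: 1, 1: 2, 2: None})

# A mutual dependency between tokens 0 and 1 is rejected.
assert not is_acyclic({0: 1, 1: 0})
```

An analogous check for phrase structure reduces to verifying that opening and closing brackets (or tags) balance, which an SGML parser performs for free.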
- orthographic and presentational features of the transcription
- links to corresponding objects (for example digitized recordings of transcribed speech, digitized page images of transcribed writing etc.)
- explicit disambiguation of features such as proper nouns, dates, times, etc.
- part-of-speech and morphology
- syntactic analysis
- discourse analysis
- contextual, bibliographic, and topically related features
- editorial correction, normalization, commentary, or annotation
While there is no doubt that an SGML encoding can cope with all of these forms of analysis individually, the difficulty of distinguishing them in combination rapidly increases, particularly if they are all located in the same data stream. There is an increasing tendency therefore towards so-called ‘out-of-line’ annotation, in which potentially many, possibly contradictory, annotations or analytic interpretations are stored independently of the text itself, but linked to it by means of hypertext pointers. Similar techniques are required for the alignment of the structural components of multilingual or multimedia corpora.
Such techniques have much to recommend them, but they place additional constraints on the ease with which the semantic and syntactic correctness of any one analysis can be validated. As well as checking that the analysis is internally consistent, it must be possible to check that the targets of each link are correctly specified. This may be difficult if a non-portable or non-robust method has been used to specify them, or entirely impossible if the corpus text has been changed. Reliable standards for the specification of robust and application-independent linking mechanisms (e.g. HyTime, XLL) have a degree of acceptance within the computing sector, but are not yet widely accepted or understood within the community of corpus creators. An obvious exception to this generalization is the special case of multilingual or multimedia aligned corpora, where such mechanisms are essential.
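For id-based pointers at least, checking that link targets are correctly specified can be automated along the following lines. The data structures are invented for illustration; real out-of-line annotation would use whatever linking mechanism the corpus adopts:

```python
def unresolved_targets(annotations, corpus_ids):
    """Return the ids of out-of-line annotations whose target id does
    not exist in the corpus text."""
    return [a["id"] for a in annotations if a["target"] not in corpus_ids]

# Ids actually present in the corpus text (e.g. on <s> or <w> elements).
corpus_ids = {"s1", "s2", "w1", "w2"}

# Stand-off annotations pointing back into the text.
annotations = [
    {"id": "a1", "target": "w1", "value": "NN1"},
    {"id": "a2", "target": "w9", "value": "VVD"},  # dangling pointer
]

print(unresolved_targets(annotations, corpus_ids))  # ['a2']
```

Note that this check is only as good as the stability of the ids themselves: if the corpus text is regenerated and ids reassigned, every link may resolve and yet point at the wrong material, which is precisely the robustness problem discussed above.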
We have restricted ourselves primarily to morphosyntax and syntax, partly because these are the most widely encountered forms of annotation, and partly because they are at present the only ones for which EAGLES guidelines exist. Other forms of annotation are sparser and more diverse, with insufficient examples of each type to make generally acceptable recommendations, even where consensus exists as to the scope or application of such analyses. This situation is likely to change over time, and consideration should be given on a rolling basis to validation procedures as the application of annotation types spreads and the development of standards proceeds.
With this said, it is likely that many of the issues for validation of, say, pragmatic annotation, will be similar to those for morphosyntax. While the precise details of the scope of annotations and the interpretative nature of the schemes may differ, basic issues such as idiosyncratic v. widely accepted annotation schemes and questions of rigid v. fluid analysis schemes will most likely remain the same. So future work on the validation of such further annotations will be able to refer to this document for guidance, if not a complete solution.
The TEI Guidelines provide for the recording of some aspects of the validation process by specialised documentation within the TEI Header, but do not include elements for all the aspects touched on in our discussion. We list here the relevant elements from section 5.3 of the Guidelines, and also make preliminary suggestions for some additional elements which might usefully be added in a future revision of the TEI scheme.
- <projectDesc> describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- <samplingDecl> contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- <editorialDecl> provides details of editorial principles and practices applied during the encoding of a text.
- <tagsDecl> provides detailed information about the tagging applied to an SGML document.
- <refsDecl> specifies how canonical references are constructed for this text.
- <classDecl> contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
- <fsdDecl> identifies the feature system declaration which contains definitions for a particular type of feature structure.
Such a declaration states that feature structures of a given type (for example NN2) are defined by the feature system contained in an external entity named eaglesFSD. (The use of an external SGML entity is a consequence of technical aspects of the way the TEI document type definition is implemented, which need not concern us here.) As with the <tagUsage> element, each feature structure actually used within the corpus should be specified in this way. This mechanism allows multiple analyses (using different FSDs) to co-occur within a given corpus, which may be of interest. However, there is no scope for the inclusion of coverage or validation information, which might arguably be more useful. A simple way of rectifying this might be to define a new <fsUsage> element, analogous to the <tagUsage> element, with similar attributes and semantics; one might then include corresponding <fsUsage> statements in the Header.
As with the other elements discussed so far, the <fsUsage> elements for a given corpus should be automatically generated during the validation process rather than manually added; this would provide a degree of automatic consistency checking, as well as an explicit record of actual tagging practice within the text, rather than of what is implicitly claimed for it. This in turn implies a further requirement to document the results of any manual or semi-automatic validation performed. (It precludes, for example, the explicit identification of features which are defined by the FSD but missing from the corpus.)
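The generation step can be very simple. The sketch below produces one usage record per feature structure actually found in the corpus; note that <fsUsage> is the element proposed in this document, not an existing TEI element, and the fsd, type and occurs attributes are our assumptions, modelled on <tagUsage>:

```python
from collections import Counter

def fs_usage_elements(tags_in_corpus):
    """Produce one <fsUsage>-style record, with an occurrence count,
    for each feature structure actually used in the corpus."""
    counts = Counter(tags_in_corpus)
    return [f'<fsUsage fsd="eaglesFSD" type="{tag}" occurs="{n}"/>'
            for tag, n in sorted(counts.items())]

# Tags harvested from the corpus during validation.
for line in fs_usage_elements(["NN1", "VVD", "NN1", "AT1"]):
    print(line)
```

Because the records are derived from the corpus itself, they can never claim a usage that does not occur; conversely, as noted above, FSD-defined features absent from the corpus simply produce no record.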
- what type of annotation is it claimed that the corpus includes (none, morphosyntactic, etc);
- whether the annotation is consistently applied (as implied by the coverage elements);
- whether the annotation is judged semantically correct, and by what criteria.
Where a finer-grained validation is required, for example at the level of individual features or tags, it may be preferable to add further attributes to the <tagUsage> or <fsUsage> elements discussed above. For example, a check attribute, with values such as ALL, might be used to record the status of validation for each <fsUsage> element to which it applies. This might be useful where a corpus is initially morphosyntactically tagged by a program and then manually corrected on a piecemeal basis: the value of this attribute would then be changed as validation and hand correction progressed, on a feature-by-feature basis. Attaching validation information at this level of granularity also has the advantage of recognizing that certain categories (for example definite articles in English) are far easier to validate with confidence than others.
Clearly, there is a need for more formalization of the validation process, and a greater degree of consensus on what it is feasible or desirable to include by way of metrics before more specific recommendations can be made. This document is intended to provide a basis for such discussion.
The following tables illustrate how a particular set of analytic tags, in this case the CLAWS7 tagset, can be re-expressed in terms of the EAGLES ‘intermediate representation’. In cases where the CLAWS7 tag underspecifies, each possible EAGLES value is given as an alternation.
- British National Corpus (BNC)
- SGML parser used to validate all markup against the CDIF (Corpus Document Interchange Format) DTD; all tagging errors reported are then hand-corrected. Some semantic validation (on a portion of each text) was also performed for errors such as incorrect or missing headings, with limited manual correction. All addition of analytic tagging was automatic, but its syntactic validity was checked again, using an SGML parser. As a separate exercise, a 2 percent sample of the corpus was hand-checked for accuracy of analytic tagging, and the results used to improve the original part-of-speech tagging. (Results of this are not yet publicly available, but are due in 1998.)
- LOB and Brown
- No SGML mark-up used, but structure indicated by means of a simple and automatically verifiable coding. Typographic errors are retained unchanged. Analytic coding performed using similar techniques to those of the BNC.
- London Lund Corpus
- No SGML mark-up used, but detailed indication of prosodic features using idiosyncratic markup scheme; no information available as to how this was verified.
- Penn Treebank
- No SGML mark-up used, but detailed indication of syntactic features using idiosyncratic markup scheme; validated by own analytic tools.
- ICE (International Corpus of English)
- Originally used own SGML-like markup scheme, validated by a suite of WordPerfect macros which inserted text unit markup after full stops etc. This system ‘generally ensures that markup symbols are closed, and reminds users to do so should they try opening the same symbol again before closing it’ (Nelson 1996, pp. 65-66). After developing further software tools to check validity, the project has reportedly converted to an SGML system, but we have been unable to obtain further details of this.
- Multext and CRATER
- Where applicable, automatic conversion of pre-existing header data was carried out. As for primary data, in most cases division- and/or paragraph-level markup of some kind already existed in the texts received, so supplying P and DIV elements was a matter of conversion or automatic insertion; corrections to P-level markup, however, were made by hand. Since these projects were dealing with issues of alignment, the accuracy of sentence-level (and higher) tags was crucial: so, while automatic means were used for as many of the steps as practical, hand-checking was also performed on sentence-level and higher (<p>, <quote>, <div> etc.) markup. All texts were parsed against their respective DTDs.
- TELRI
- According to our informant, ‘The corpora were produced all over Europe in various formats and by people with varying amounts of experience and expertise in such work. Many started with a paper text, which was then scanned or even keyboarded. So this was clearly an issue to be tackled, especially since we wanted to align the texts and needed the markup to be not just accurate and SGML-wise correct, but also similar enough to assist the aligner. Parsers (nsgmls/xemacs) were used to check and correct the SGML, and most of the hands-on dirty work was done recently at the workshop in Nancy with Laurent Romary and his team. Most of the TELRI-ers who had prepared texts came along and we had the chance to really check and compare the texts. Some of the texts were initially sliced into sentences using tools that had been developed at our sites and which, being SGML-aware, can base their work upon an existing <p> structure.’
- The Lampeter Corpus
- Originally prepared using word processor macros to insert minimal tagging for font changes and some structural features, use of different languages etc. The texts were then converted to true SGML by a combination of automatic and manual means, and have been proof read several times. Correction and validation carried out using emacs, PSGML, SP, and Author/Editor.
- Validated against the TEI P3 DTD twice, once after proofreading, and then again after alignment to check that the values of the id and corresp attributes are unique and that the value of the corresp attribute points to an existing id in the parallel text. All validation performed by SP; project has developed its own SGML-aware software for further analysis.
- Uses SGML-like coding for speaker identification and vocalic effects but not validated during data capture; some subsequent SGML-based analysis and validation.
- Uses simple OCP-style markup only; validated only by analytic tools.
- Some use of SGML-style tagging, e.g. for anaphor markup. No formal validation, other than by analytic tools.
- Speech Thought and Writing Presentation Corpus
- Some use of SGML-style tagging but no formal validation, other than by analytic tools. All tagging was added manually.
- Minimal TEI-conformant dtd defined at start of project against which all corpora are eventually to be validated. Considerable variation in encoding practices reported amongst partners, no detailed information currently available.
- Atkins, S., Clear, J. and Ostler, N.
Corpus design criteria. Literary and Linguistic Computing 7:1, 1-16.
- Baker, J.P. (1997)
Consistency and accuracy in correcting automatically tagged data. In: Garside, R., Leech, G. and McEnery, A.P. (eds.), Corpus Annotation. Addison Wesley Longman, 1997.
- Clear, J.H. (1992)
Corpus sampling. In: Leitner, G. (ed.), New Directions in English Language Corpora. Mouton de Gruyter, 1992.
- Garside, R.G. and McEnery, A.M.
Treebanking: the compilation of a corpus of skeleton parsed sentences. In: E. Black, R. Garside and G. Leech (eds.), Statistically Driven Computer Grammars of English: The IBM-Lancaster Approach. Amsterdam: Rodopi.
- Ide, N. and Veronis, J. (1995). Text Encoding Initiative: Background and Context. Kluwer. ISBN 0-7923-3704-2.
- Ide, Nancy (coordinator) (1998)
Corpus Encoding Specification (forthcoming in Proceedings of the First International Conference on Language Resources and Evaluation); see also URL http://www.cs.vassar.edu/CES
- Langendoen, T.L. and Simons G. (1995)
Rationale for the TEI Recommendations for Feature-structure Markup (in Ide and Veronis 1995).
- Leech, G. (1993).
Corpus Annotation Systems. Literary and Linguistic Computing, 8(4) pp. 275--281.
- Leech, G. and Wilson, A. (1994). EAGLES Morphosyntactic Annotation. EAGLES Report EAG-CSG/IR-T3.1. Pisa: Istituto di Linguistica Computazionale.
- Nelson, G. (1996).
Markup systems. In: S. Greenbaum (ed.), Comparing English Worldwide: The International Corpus of English, pp. 36--53. Oxford: Clarendon Press.
- Snow, C. and Ninio, A. (1986).
The Contracts of Literacy: What Children Learn from Reading Books. In: W. Teale and E. Sulzby (eds.), Emergent Literacy, pp. 116-138. New Jersey: Ablex.
- Sperberg-McQueen, C.M. and Burnard, L. (1995)
The design of the TEI Encoding Scheme (in Ide and Veronis 1995).
- Stubbs, M. (1996). Text and Corpus Analysis. Blackwell.
- Clark, James (1998) SP: An SGML system [software]. Available from URL http://www.jclark.com/sp/