ELRA Work Package 3: first draft

4. Representation of Validation

The TEI Guidelines provide for the recording of some aspects of the validation process by specialised documentation within the TEI Header, but do not include elements for all the aspects touched on in our discussion. We list here the relevant elements from section 5.3 of the Guidelines, and also make preliminary suggestions for some additional elements which might usefully be added in a future revision of the TEI scheme.

The <encodingDesc> element in the TEI header is intended to ‘document the relationship between an electronic text and the source or sources from which it was derived’. As such it is the natural location for statements about the results of the validation process. The following elements, each of which is described in more detail in the Guidelines, seem of particular relevance:
  • <projectDesc> describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
  • <samplingDecl> contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
  • <editorialDecl> provides details of editorial principles and practices applied during the encoding of a text.
  • <tagsDecl> provides detailed information about the tagging applied to an SGML document.
  • <refsDecl> specifies how canonical references are constructed for this text.
  • <classDecl> contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
  • <fsdDecl> identifies the feature system declaration which contains definitions for a particular type of feature structure.
Some of these elements, for example <projectDesc> and <samplingDecl>, are purely documentary, in that they are defined as containing only a prose description. For others, however, a more detailed substructure is proposed. The <tagsDecl> element, for example, is defined as containing a series of <tagUsage> elements, each of which specifies the number of occurrences within a document of one of the SGML tags used. Such elements can thus be used to record the result of any structural validation carried out, simply as a count of the number of elements, optionally extended by any desired usage notes, as in the following example:
<tagsDecl>
  <tagUsage gi=DIV1 occurs=20></tagUsage>
  <tagUsage gi=P occurs=2043>Used for typographic paragraphs and also for individual list components</tagUsage>
</tagsDecl>
Use of this element provides a useful way of documenting actual SGML tagging practice within a text, and can readily be automatically generated during the validation process. If usage notes are supplied, they need only specify information not already implicit in the definition of the element's syntax, as in the example above, where the <p> tag has been used for something which might more properly have been encoded using a different TEI element.
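As a minimal sketch of how such counts might be generated automatically, the following Python fragment counts start-tags with a regular expression and renders the result as a <tagsDecl>. The function names and the regular expression are illustrative assumptions, not part of the TEI scheme. A real validator would more plausibly take its counts from the SGML parser, since elements whose tags have been omitted by minimisation are invisible to simple pattern matching.

```python
import re
from collections import Counter

# Matches an SGML start-tag and captures its generic identifier (GI).
# Assumption: tags are written out in full (no minimisation) and no '<'
# occurs inside attribute values.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.\-]*)(?=[\s>])")

def count_tags(text):
    """Return a Counter mapping upper-cased GIs to their number of start-tags."""
    return Counter(m.group(1).upper() for m in START_TAG.finditer(text))

def tags_decl(counts, notes=None):
    """Render counts as a <tagsDecl>, with optional usage notes keyed by GI."""
    notes = notes or {}
    lines = ["<tagsDecl>"]
    for gi in sorted(counts):
        note = notes.get(gi, "")
        lines.append(f"  <tagUsage gi={gi} occurs={counts[gi]}>{note}</tagUsage>")
    lines.append("</tagsDecl>")
    return "\n".join(lines)

# Invented sample document, for illustration only.
sample = "<div1><p>One</p><p>Two</p></div1><div1><p>Three</p></div1>"
print(tags_decl(count_tags(sample)))
```

Usage notes, being editorial judgements, would still have to be supplied by hand through the hypothetical notes argument.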
A more elaborate scheme is defined by the TEI <classDecl> element for the documentation of the classification scheme applied to a corpus, which permits (for example) a formal specification of any descriptive taxonomy or typology applied to the texts. The Guidelines suggest, as an example, the following way of representing the Brown corpus typology:
<taxonomy id=B>
  <bibl>Brown Corpus</bibl>
  <category id=B.A><catDesc>Press Reportage
    <category id=B.A1><catDesc>Daily</category>
    <category id=B.A2><catDesc>Sunday</category>
    <category id=B.A3><catDesc>National</category>
    <category id=B.A4><catDesc>Provincial</category>
    <category id=B.A5><catDesc>Political</category>
    <category id=B.A6><catDesc>Sports</category>
    <!-- ... -->
  </category>
  <category id=B.D><catDesc>Religion
    <category id=B.D1><catDesc>Books</category>
    <category id=B.D2><catDesc>Periodicals and tracts</category>
  </category>
  <!-- ... -->
</taxonomy>
This method does not, however, allow for documentation of the extent to which the taxonomy has been applied, i.e. the coverage associated with each category within it. One way of filling this gap might be to define a new element <coverage> as additional content for the existing <category> element, with attributes such as unit and extent to specify how much of the corpus has been assigned this descriptive category. One might then specify, for example, that 7 texts or 3000 words in a given corpus have been assigned to the ‘Provincial Press’ class as follows:
<category id=B.A4><catDesc>Provincial</catDesc>
  <coverage unit=text extent=7>
  <coverage unit=word extent=3000>
</category>
Again, the <coverage> element can be automatically generated during the validation process. Its presence, like that of the <tagUsage> elements discussed above, enables the corpus user to tell at a glance whether a given corpus is relevant to a specific requirement, subject of course to the general proviso that the corpus under examination is marked up correctly. In other words, it enables us to satisfy the ‘completeness’ criterion, as well as the ‘syntactic correctness’ criterion (which must have been satisfied in the case of an SGML corpus).
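The automatic generation of <coverage> elements might be sketched as follows, assuming that the validator can supply one (category, word count) pair for each text in the corpus. The function names, the input format, and the figures are invented for illustration; only the unit and extent attributes come from the proposal above.

```python
from collections import defaultdict

def coverage_by_category(texts):
    """Sum texts and words per category.

    `texts` is an iterable of (category_id, word_count) pairs, one per text.
    Returns {category_id: {"text": n_texts, "word": n_words}}.
    """
    totals = defaultdict(lambda: {"text": 0, "word": 0})
    for category_id, word_count in texts:
        totals[category_id]["text"] += 1
        totals[category_id]["word"] += word_count
    return dict(totals)

def coverage_elements(totals, category_id):
    """Render the proposed <coverage> elements for one category."""
    counts = totals.get(category_id, {"text": 0, "word": 0})
    return [f"<coverage unit={unit} extent={counts[unit]}>"
            for unit in ("text", "word")]

# Invented figures: two texts classified B.A4, one classified B.D1.
texts = [("B.A4", 1200), ("B.A4", 1800), ("B.D1", 950)]
totals = coverage_by_category(texts)
print("\n".join(coverage_elements(totals, "B.A4")))
```

A category to which no text has been assigned is reported with an extent of zero rather than omitted, so that gaps in coverage remain visible.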
With regard to recording the usage of feature structures within a TEI document, the TEI provides a <fsdDecl> element, the function of which is to associate each feature structure used in a document with the (externally defined) feature system declaration to which it belongs. For example:
<fsdDecl type=NN2 fsd=eaglesFSD>
<fsdDecl type=NN1 fsd=eaglesFSD>
indicates that the feature structures NN1 and NN2 are defined by the feature system declaration contained in an external entity named eaglesFSD. (The use of an external SGML entity is a consequence of technical aspects of the way the TEI document type definition is implemented, which need not concern us here.) As with the <tagUsage> element, each feature structure actually used within the corpus should be specified in this way. This mechanism allows multiple analyses (using different FSDs) to co-occur within a given corpus, which may be of interest. However, there is no scope for the inclusion of coverage or validation information, which might arguably be more useful. A simple way of rectifying this might be to define a new <fsUsage> element, analogous to the <tagUsage> element, with similar attributes and semantics. One might then include in the Header statements such as
<fsUsage type=NN2 occurs=1234>
<fsUsage type=NN1 occurs=164538>
Alternatively, given the need to supply a <fsdDecl> element, it would be more economical to combine the function of the latter into the new element and write:
<fsUsage type=NN2 occurs=1234 fsd=eaglesFSD>
<fsUsage type=NN1 occurs=164538 fsd=eaglesFSD>

As with the other elements discussed so far, the <fsUsage> elements for a given corpus should be generated automatically during the validation process rather than added manually. They would therefore provide a degree of automatic consistency checking, as well as an explicit record of actual tagging practice within the text, as opposed to what is implicitly claimed for it. This in turn implies a further requirement for the documentation of the results of any manual or semi-automatic validation performed. (A record generated in this way precludes, for example, the explicit identification of features defined by the FSD but missing from the corpus.)
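The kind of automatic generation and consistency checking envisaged here might look something like the following sketch. It assumes that the validator can supply the list of feature structure types encountered in the corpus and the list of types for which a declaration exists. Both function names are hypothetical, and the combined form of <fsUsage> (carrying the fsd attribute) is used for the output.

```python
from collections import Counter

def fs_usage(observed_types, fsd_name):
    """Render the proposed <fsUsage> elements from the types found in a corpus."""
    counts = Counter(observed_types)
    return [f"<fsUsage type={fs_type} occurs={n} fsd={fsd_name}>"
            for fs_type, n in sorted(counts.items())]

def undeclared_types(observed_types, declared_types):
    """Consistency check: types used in the corpus but declared nowhere."""
    return sorted(set(observed_types) - set(declared_types))

# Invented data: one entry per feature structure encountered in the corpus.
observed = ["NN1", "NN2", "NN1", "NN1"]
declared = ["NN1", "NN2"]

print("\n".join(fs_usage(observed, "eaglesFSD")))
print("undeclared:", undeclared_types(observed, declared))
```

Because the counts are derived from the text itself, the output records actual practice. Any assessment of whether that practice is correct remains a matter for separate documentation.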

Such information might be provided as running text within an <interpretation> element, one of the subcomponents of the <editorialDecl> element. The definition provided for this element, however, suggests that it is intended for the corpus creator to record his or her intentions in this regard, rather than for the corpus validator to record actual practice or to assess the extent to which such intentions have been realised. The only example cited in the Guidelines is as follows:
<interpretation>
  <p>The part of speech analysis applied throughout section 4 was added by hand and has not been validated</p>
</interpretation>
As an initial step, we recommend including within this element a statement covering such topics as
  • what type of annotation the corpus is claimed to include (none, morphosyntactic, etc.);
  • whether the annotation is consistently applied (as implied by the coverage elements);
  • whether the annotation is judged semantically correct, and by what criteria.

Where finer-grained validation is required, for example at the level of individual features or tags, it may be preferable to add further attributes to the <tagUsage> or <fsUsage> elements discussed above. For example, a check attribute, with values such as NONE, SOME, or ALL, might be used to record the status of validation for each <fsUsage> element to which it applies. This might be useful where a corpus is initially morphosyntactically tagged by a program and then manually corrected on a piecemeal basis: the value of this attribute would then be changed, feature by feature, as validation and hand correction progressed. Attaching validation information at this level of granularity also has the advantage of accommodating the fact that certain categories (for example definite articles in English) are far easier to validate with confidence than others.
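One possible way of deriving the value of such a check attribute is sketched below, on the assumption that the validation process keeps, for each feature structure type, a count of the occurrences which have been checked by hand. That bookkeeping, and both function names, are hypothetical. Only the values NONE, SOME, and ALL are taken from the suggestion above.

```python
def check_status(validated, total):
    """Map a validated/total pair of counts to the proposed check value."""
    if not 0 <= validated <= total:
        raise ValueError("validated must lie between 0 and total")
    if validated == 0:
        return "NONE"
    return "ALL" if validated == total else "SOME"

def fs_usage_with_check(fs_type, occurs, validated):
    """Render an <fsUsage> element carrying the proposed check attribute."""
    status = check_status(validated, occurs)
    return f"<fsUsage type={fs_type} occurs={occurs} check={status}>"

# Invented figures: NN2 fully hand-checked, NN1 only partly.
print(fs_usage_with_check("NN2", 1234, 1234))
print(fs_usage_with_check("NN1", 164538, 5000))
```

Regenerating these elements after each round of hand correction would keep the recorded status in step with the work actually done.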

Clearly, there is a need for more formalization of the validation process, and a greater degree of consensus on what it is feasible or desirable to include by way of metrics before more specific recommendations can be made. This document is intended to provide a basis for such discussion.
