D1: Validation Manual for Written Language Resources
This report is a first draft version of deliverable D1.1b for the validation unit contract ELTA/0209/VAL-1: a manual for the evaluation of Written Language Resources. It should be read as a parallel report to deliverable D1.1a which describes the validation of lexica. The theoretical framework underlying this work was presented in  and ; the present document assumes an understanding of the issues discussed in those reports.
- Does the resource provide any automatic means of identifying particular linguistic or structural features of interest, such as descriptive markup?
- Is the markup of the resource syntactically valid?
- Is the markup of the resource semantically correct, with reference to some externally (or internally) defined abstract model?
- Is the markup of the resource consistently applied (i.e. is every occurrence of a given feature marked in the same unambiguous way)?
- If consistent and correct, is the markup of a resource complete, with reference to some externally (or internally) defined list of required or recommended features?
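As an illustration only, the syntactic validity and completeness questions above might be checked mechanically along the following lines. This is a minimal sketch, assuming the resource is encoded in XML; the function name `check_markup`, the sample document, and the list of required elements (`head`, `p`, `pb`) are hypothetical and are not prescribed by this manual.

```python
import xml.etree.ElementTree as ET


def check_markup(xml_text, required_elements):
    """Return (is_well_formed, missing_elements) for one corpus document.

    Well-formedness here means only that an XML parser accepts the text;
    it says nothing about semantic correctness or consistency.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Nothing can be said about completeness if parsing fails,
        # so every required element is reported as missing.
        return False, sorted(required_elements)
    present = {element.tag for element in root.iter()}
    return True, sorted(set(required_elements) - present)


# Hypothetical list of required features, for illustration only.
REQUIRED = {"head", "p", "pb"}

sample = "<text><head>Title</head><p>First paragraph.</p></text>"
well_formed, missing = check_markup(sample, REQUIRED)
print(well_formed, missing)
```

A check of this kind yields facts for the corpus description (well-formed or not, which listed features are absent) rather than a verdict on fitness for purpose.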
We use the word corpus throughout as a short form for written language resource, recognizing however that many corpora include both spoken and written materials. There is consequently some overlap between the procedures and requirements described here and those defined in .
There is significantly more variety amongst corpora than amongst the other types of resource validated by ELRA, which makes it correspondingly difficult to define normative guidelines for their evaluation. Language corpora typically have many users, and may have applications entirely unanticipated by their original creators. Since normative decisions are of necessity taken with reference to intended applications, reliable recommendations about (for example) which features of a text should be marked up may seem difficult or impossible.
A number of projects in Europe and North America have, nevertheless, made significant progress in defining both general principles and specific minimal proposals for feature-sets appropriate to a significant range of applications. In the field of written language corpora, these include Eagles and its successor ISLE (); for archival storage the OLAC proposals are also of specific relevance ().
Rather than attempt to define validation procedures appropriate to many different scenarios, the approach taken here is to define objective criteria by which data essential to such an evaluation can be assembled, by reference to a check list of identifiable characteristics. The intended outcome from a validation process should thus be an accurate description of a corpus, using standardized terminology; it is up to the individual user to determine the extent to which a corpus so described is likely to be fit for a given purpose.
The bulk of this report consists of a checklist of features and properties which the evaluator should investigate in the resource under consideration. In some cases, specific recommendations are made; for the most part, however, the goal of the evaluation is to determine whether or not information about the feature concerned is available and, where it is available, how reliable it is.
Assessment of reliability or ‘correctness’ will usually involve checking the information supplied in the corpus against the corresponding information provided by some independent source. For example, if the corpus cites a particular bibliographic source, it should be possible to validate the accuracy of the citation by referring to a source such as an online bibliographic catalogue. Similarly, if the corpus markup asserts that some source is classified as (say) ‘national daily newspaper’, it should be possible to check this classification against some external categorization. At a lower level, where the corpus markup states that a specific point in the encoded text corresponds to some page number in a source text, it is possible to verify this by reference to the source text. We term this external validation.
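External validation of a bibliographic citation can be pictured as a field-by-field comparison between the record held in the corpus and the record obtained from the independent source. The sketch below is illustrative only: the function `compare_citation`, the field names, and both sample records are invented for the example, and no particular catalogue or interchange format is assumed.

```python
def compare_citation(corpus_record, catalogue_record,
                     fields=("title", "author", "year")):
    """Return the fields on which the two records disagree.

    Values are compared after collapsing whitespace and ignoring case,
    so purely typographic differences are not reported as errors.
    """
    def normalise(value):
        return " ".join(str(value).split()).casefold()

    mismatches = {}
    for field in fields:
        corpus_value = corpus_record.get(field, "")
        catalogue_value = catalogue_record.get(field, "")
        if normalise(corpus_value) != normalise(catalogue_value):
            mismatches[field] = (corpus_value, catalogue_value)
    return mismatches


# Invented records, for illustration only.
corpus_record = {"title": "An  Example Title", "author": "Author, A.", "year": 1990}
catalogue_record = {"title": "An Example Title", "author": "Author, A.", "year": 1991}

mismatches = compare_citation(corpus_record, catalogue_record)
print(mismatches)
```

In practice the evaluator would still need to decide which of the two records is in error; the comparison only locates the disagreement.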
Often, however, the information supplied in a corpus is not available elsewhere, and cannot therefore be validated by reference to any external source. In such a case, all that can be checked is that the information is provided consistently, and does not contradict reasonable expectations. For example, assertions about the morpho-syntactic status of specific tokens in a corpus cannot (usually) be validated against any independently defined source. They can however be checked against a language model, stated explicitly in the corpus itself or informally derived from real-world knowledge. We term this internal validation.
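A simple form of internal validation for morpho-syntactic annotation can be sketched as follows, assuming the corpus declares its tagset explicitly and the annotation is available as (token, tag) pairs. The function name `internal_validation`, the tagset, and the sample tokens are hypothetical. Note that a word form carrying more than one tag is not necessarily an error, since many forms are genuinely ambiguous; such cases are merely listed for human review.

```python
from collections import defaultdict


def internal_validation(tagged_tokens, declared_tagset):
    """Check (token, tag) pairs against the corpus's own declared tagset.

    Returns a pair:
      - tags used in the text but absent from the declared tagset;
      - word forms that occur with more than one tag (candidates for
        review, not necessarily mistakes).
    """
    undeclared = sorted({tag for _, tag in tagged_tokens
                         if tag not in declared_tagset})
    tags_by_form = defaultdict(set)
    for token, tag in tagged_tokens:
        tags_by_form[token.lower()].add(tag)
    multiply_tagged = {form: sorted(tags)
                       for form, tags in tags_by_form.items()
                       if len(tags) > 1}
    return undeclared, multiply_tagged


# Hypothetical tagset and annotation, for illustration only.
TAGSET = {"DET", "NOUN", "VERB", "ADJ", "PRON", "ADV"}
tokens = [("The", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
          ("they", "PRON"), ("run", "VERB"), ("fast", "ADVB")]

undeclared, multiply_tagged = internal_validation(tokens, TAGSET)
print(undeclared, multiply_tagged)
```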
Evaluation can (and probably should) be performed both as part of the creation of a resource and subsequently by an independent assessor. As noted above, the creator of a resource may have a different set of objectives from its subsequent users, so it is unwise to rely solely on evaluations carried out during resource production. On the other hand, only the creator of a resource may be able to supply some of the less readily apparent information required for a complete description of it. The two stages are thus complementary, and wherever possible both should be attempted.