Below I am sending a section slated to be included in the current version of the TEI SGML guidelines document, specifically addressing the question of the use of attributes.
I have been struggling with this section for months now, wavering between recommending that attributes be allowed and recommending against them. My current position is to recommend against them. I know that this is not a popular position, but I request that you read my arguments below with an open mind.
There was one major argument that was swaying me toward allowing them. This was Michael's analysis of the `tag explosion' problem in an earlier message (sorry if not everyone has seen it). Basically, I believe that this argument is flawed because the explosion occurs only when the VALUES of attributes are considered as names of tags, and not just the attribute names themselves. If you simply do a one-to-one mapping from Michael's suggested attribute names to their equivalent tag names, there is no explosion at all, but an equal number of names using either approach.
For example, take nouns and the features number, gender, case and declension. In Michael's case, he has one tagged element (noun) and four attributes (number, gender, case, declension). Each of the attributes is assigned certain VALUES, e.g., `number' can assume the value singular or plural, etc. I would rewrite this making noun an element, and making number, gender, case and declension elements as well. So, we have five items in Michael's case and five items in my case. The issue of the values of the tags should not be addressed in the DTD (see my argument below).
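The count can be checked with a minimal sketch. The feature names come from the example above; the value lists other than those for `number' are illustrative assumptions of mine, not taken from Michael's message, and "one tag per value" and "one tag per combination of values" are only two possible readings of how values might be turned into tag names.

```python
from math import prod

# Feature names are from the text. Apart from "number", the value lists
# are illustrative assumptions, not part of the original analysis.
features = {
    "number": ["singular", "plural"],
    "gender": ["masculine", "feminine", "neuter"],
    "case": ["nominative", "genitive", "dative", "accusative"],
    "declension": ["first", "second", "third"],
}

# Attribute approach: one element name (noun) plus one name per attribute.
attribute_names = 1 + len(features)

# Element approach with a one-to-one mapping: noun plus one element per feature.
element_names = 1 + len(features)

# Reading 1 of the "explosion": every VALUE becomes its own tag name.
per_value_tags = sum(len(values) for values in features.values())

# Reading 2: every COMBINATION of values becomes its own tag name.
per_combination_tags = prod(len(values) for values in features.values())

print(attribute_names, element_names, per_value_tags, per_combination_tags)
```

Under these assumed value lists, the name count only grows once values are folded into the tag names; the one-to-one mapping leaves it unchanged.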
Also, in the meantime I have spent two hours talking to Frank Tompa and Tim Bray from the OED project at Waterloo. They now have 540 megabytes of tagged data that presents the OED and they did not find a need to use a single attribute. This evidence, along with my own experience in the publishing domain where we also found no need for attributes, has pretty well convinced me that they are not necessary. I have tried to convey this conviction below.
I have to make a decision about the recommendations for the June meeting. I see two choices: 1) I can include what is below and try to have it accepted officially by the ML committee, or 2) I can say we have not yet reached a consensus. (I see this as the only alternative approach. I cannot write guidelines for using attributes since I don't think they should be used. If such guidelines are to be forthcoming, someone else is going to have to write them and convince the rest of us that they make sense.)
Let me know what you think. If you want to respond but can't do so by next Friday, May 5, please let me know when I can expect some input from you. I want to make a decision about the draft to be distributed for the June meeting by May 5.
Recommendation. We recommend that attributes not be used in the declaration of DTDs or their instances.
Detailed Justification. SGML provides two mechanisms for marking information in a DTD: tags and attributes. Tags are used to mark the beginning and end of each element that is defined in a DTD. Attributes are associated with particular elements. Their notation is different from that of tags. They cannot be decomposed hierarchically, and they can have certain values assigned to them.
The use of two mechanisms to mark information is necessarily more complex than the use of one. It has already been established that DTDs are complicated to define and use. Thus, two separate mechanisms can only be justified if there is considerable advantage to be gained from having both. We argue that there is no advantage to be gained, either semantically or syntactically, and that only tags, the better of the two mechanisms, should be used.
From a semantic point of view, there is no universal and unambiguous way to distinguish tags from attributes. At best, distinguishing criteria have been suggested that are application dependent. These criteria are not generally applicable across all domains and even lead to contradictory distinctions within a single domain.
For example, in the publishing domain, the concept of `hierarchy' has been proposed as a distinguishing one for tags and attributes. All those pieces of information in a manuscript that are clearly hierarchical are to be marked using tags. All other information is to be marked using attributes. Thus, the manuscript components frontmatter, body and rearmatter would be tagged. But nonhierarchical items like references and citations would be marked using attributes.
The problem with the `hierarchy' criterion is that the concept very quickly blurs as DTDs are defined in other domains. For example, suppose in the linguistics domain it is desired to define a DTD to mark parts of speech, like verb, noun, adverb and so on. Further, assume that for each part it is desired to mark information like gender, person, number, tense and so on. In this case, neither the parts of speech themselves, nor the information about them, is hierarchical in any clear sense.
Another concept that has been applied in the publishing domain to distinguish tags from attributes is that of `content'. All those substantive pieces of information that will actually appear on a printed page, i.e., the content of a manuscript, are to be marked with tags. All else, e.g., formatting or rendering information, is to be marked using attributes. Note that using this criterion, references and citations would now be marked using tags, while using the hierarchy criterion they would be marked using attributes, even though both criteria stem from the same application.
The concept of `content' also quickly blurs as one moves outside the domain of publishing. For example, assume that an analysis is to be done on a table in a manuscript to evaluate visual cues that readers might use to distinguish items in a table. From the point of view of this analysis, column separators would be considered to be `content'. However, suppose for the same table that one wishes to find the average of all the values in the columns of the table. From this point of view, column separators would be considered a nonsubstantive part of the table.
From the syntactic point of view, tags are more powerful than attributes in the sense that tags can be further decomposed into subcomponents. So, from this point of view, they should always be chosen over attributes. On the other hand, certain values, or ranges of allowable values, can be associated with attributes, but not with tags. So, a case can be made for using attributes only if a case can be made that it is desirable to specify allowable values of certain strings when a DTD is defined.
In practice, the value of any terminal string in a DTD is strictly a matter of interest or concern only to a particular application that may be processing instances of the DTD. This is equally true for those strings that may be delimited using tags or those that may be delimited using the attribute mechanism. Since it is in the application that the interest originates, it should rightly be in the application that the necessary work to pursue the interest, i.e., specifying values and constraints on them, is done.
For example, consider the `value' of zip code information. One application may wish to generate mailing lists only for those entries with postal zones indicating residence in Oregon and Washington. Another may wish to verify that all the zip codes are valid based on an official U.S. Postal Service listing of such codes. Or, alternatively, an application may search for all zip codes that are invalid based on such a listing. Still another may wish to analyze the postal codes in order to determine a geographic distribution of entries in the southern portion of the U.S. Each of these applications is interested in a constrained `value' of the string that is used to represent the zip code. However, there is no obvious way to impose constraints on the value of the zip code when the DTD is defined such that all of the interests of these varied applications would be served. Further, it would be bad design policy to burden the DTD itself with a set of constraints that are of interest to only one or a few applications.
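As a rough sketch of what "constraints in the application" might look like, the following treats the zip code as a plain string and lets two applications impose different constraints on it. The sample entries, the `OFFICIAL_LISTING` set and the function names are all hypothetical, and the three-digit prefix range used for Oregon and Washington is an illustrative assumption rather than an authoritative postal rule.

```python
# Hypothetical entries extracted from tagged data; the zip code is just a string.
entries = [
    {"name": "Adams", "zip": "97201"},
    {"name": "Baker", "zip": "98101"},
    {"name": "Clark", "zip": "30303"},
    {"name": "Davis", "zip": "0000X"},
]

# Hypothetical stand-in for an official listing of valid codes.
OFFICIAL_LISTING = {"97201", "98101", "30303"}


def in_oregon_or_washington(zip_code):
    """Application 1: regional test on the leading three digits.

    Assumption for illustration: prefixes 970-994 cover Oregon and Washington.
    """
    prefix = zip_code[:3]
    return prefix.isdigit() and 970 <= int(prefix) <= 994


def is_listed(zip_code):
    """Application 2: validity test against the (hypothetical) listing."""
    return zip_code in OFFICIAL_LISTING


# Each application applies only the constraint it cares about.
mailing_list = [e["name"] for e in entries if in_oregon_or_washington(e["zip"])]
invalid_codes = [e["zip"] for e in entries if not is_listed(e["zip"])]

print(mailing_list)
print(invalid_codes)
```

Neither constraint needs to be known when the markup is defined, and adding a third application would not require changing the tagged data.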
Or, consider the `value' of a `column separator' associated with a table element. Suppose a list of allowable values for the separator has been specified in the DTD. This list is likely to be based on some application for instances of the DTD, say a formatter. Now assume that someone is creating an instance of this DTD by marking an already existing manuscript. Suppose that the manuscript has a value for a column separator that does not occur on the list of allowable values. The definer is faced with the choice of illegal markup or no markup at all. On the other hand, if `column separator' had been marked using tags, the definer would simply fill in the string corresponding to the true value of the column separator.
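The dilemma can be sketched as follows. The list of allowable separators, the element and attribute name `colsep`, and both function names are hypothetical and chosen only for illustration; the code merely builds markup strings and does not model a real SGML parser.

```python
# Hypothetical list of separator values that a DTD might enumerate.
ALLOWED_SEPARATORS = {"single", "double", "blank"}


def mark_as_attribute(value):
    """Attribute approach: the value must come from the declared list."""
    if value not in ALLOWED_SEPARATORS:
        raise ValueError(f"separator {value!r} is not an allowable value")
    return f'<table colsep="{value}">'


def mark_as_element(value):
    """Tag approach: the true value is recorded as element content."""
    return f"<colsep>{value}</colsep>"


# A manuscript whose separator is not on the list can still be marked with tags.
print(mark_as_element("dotted"))
```

With the attribute approach, the same manuscript value raises an error, which corresponds to the choice between illegal markup and no markup at all.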
Even if a case could be made for assigning values in the DTD, the mechanism provided by SGML is flawed and would likely prove inadequate for the need. Attributes are declared in SGML using an attribute-declaration statement that specifies the element name to which the attribute applies, the attribute name, a `declared value' for the attribute and a default value. The declared value is intended to specify the possible range of values from which the actual value of the attribute may be chosen.
The SGML standard has attempted to anticipate the possible range of values that would be used by definers of DTDs by restricting the `declared-value' parameter of the attribute declaration to a prespecified set of classes. For example, an attribute can have a declared value of NAME, which restricts the attribute value to a string in which the first character is a letter (say) and the rest are either letters, digits, a period or a hyphen. Or an attribute can have a declared value of NAMES, which restricts the value to a list of NAMEs. All of the prespecified classes allow attributes to take on only one type of value, or a list of one type of value.
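To make the two classes concrete, here is a minimal sketch of the checks as described in the paragraph above. It follows that description only: it assumes ASCII letters and whitespace-separated lists, and it ignores any length limit or other detail the concrete syntax may impose. The function names are hypothetical.

```python
import re

# A NAME as described above: a letter, then letters, digits, periods or hyphens.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_name(value):
    """Return True if the whole string has the shape of a NAME."""
    return NAME_PATTERN.fullmatch(value) is not None


def is_names(value):
    """Return True if the string is a non-empty, space-separated list of NAMEs."""
    tokens = value.split()
    return bool(tokens) and all(is_name(token) for token in tokens)


print(is_name("double-line"), is_name("2col"))
print(is_names("single double blank"), is_names("1 single"))
```

Every token in a list is checked against the same class, which is the sense in which a list can hold only one type of value.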
The AAP in its DTD for mathematical formulas found these classes to be inadequate to fully specify the desired restrictions on some of the attributes they wanted to declare. For example, to describe the value of a `column separators' attribute for an array, they wished to specify a list of ordered pairs as the attribute value. The first element of each pair defines the column number and the second element defines the allowable separator, e.g., single line, double line, blank, and so on. Since the standard does not provide sufficient expressive power for this specification, the AAP specified its desired restriction in a lengthy comment following the formal declaration for the attribute (see AAP86c, p. 49). Clearly, such a description is outside the standard, and no software that has been designed to validate or analyze this DTD will be able to deal with this type of declaration.
Two separate mechanisms for marking information in manuscripts are not needed unless each provides unique, necessary functionality. Tags are more powerful than attributes in that tags can be further decomposed. By this criterion, the tag mechanism should be chosen in favor of attributes. Attributes can be specified to assume only constrained values, while tags cannot. But constraints on values cannot be specified in a general, reasonable way when defining a DTD. Such constraint issues are rightly placed within the purview of the applications that need them. So attributes serve no unique, necessary function, they further complicate an already complex task, and they should not be used.