XML Vocabularies and their Management

In general XML parlance a "vocabulary" is a set of element types and attributes designed to be used together for some purpose. A given vocabulary may be "encompassing," meaning that it is intended to be used as the main or only vocabulary for a given document, or "enabling," meaning it is intended to be integrated into and used with encompassing vocabularies.

Examples of encompassing vocabularies are DocBook, XHTML, and JATS (nee NLM). Examples of enabling vocabularies are MathML and Dublin Core metadata.

An historical challenge with managing XML (and SGML) vocabularies has been that, while it's easy to define enabling vocabularies like MathML and possible to define "extensible" encompassing vocabularies like DocBook, there has been no standard, defined mechanism for managing how encompassing and enabling vocabularies should be combined to ensure understandability and interchange.¹

Before DITA, all mechanisms for combining or extending vocabularies were entirely syntactic—they provided no way to examine a document and know how that document's vocabulary related with any other vocabulary. Therefore, there was no way you could, for sure, whether your processing environment could handle that document, or if you could share content from that document with your documents.

For example, DocBook supports extension by providing the syntactic hooks needed to locally modify the content model. This allows you to define your own element and attribute types. However, once you have done that, you no longer have DocBook. In fact, the DocBook Definitive Guide specifically says, in bold type, "If You Change DocBook, It's Not DocBook Anymore!" There is no formal way, in the markup, to tell a processor or an observer how your new element types relate to any known element types (that is, to the element types defined in the DocBook standard). ²

In particular, there are no DocBook-defined constraints on how to extend DocBook, which means that a general-purpose processor cannot "fall back," like DITA can, to a usable default when it encounters an unknown element or attribute.

That is, using only the syntactic tools provided by DTDs or XSD schemas (or any other available form of XML document constraint, such as RelaxNG) extension and customization of XML vocabularies is inherently unmanageable because there is no machine-processable mechanism for communicating or understanding the relationship between any two vocabularies.

If customization is not manageable then the only way to ensure interchange is to disallow customization. This was the approach taken by "interchange" document types, such as ATA 2100. But of course invariant vocabularies suffer a number of serious and fatal problems. They tend to become very large because they must reflect a union of the requirements of all current and expected interchange partners. They tend to not satisfy key requirements of individual interchange partners because you can never put everything in or because local requirements are at odds with interchange requirements (for example, markup that is specific to a given company's internal business processes, which might be trade secrets). I think it's fair to say that, as an industry, we've conclusively proved over the last 20 years or so that monolithic interchange document types do not work.

¹ The HyTime standard, ISO/IEC 10744, provided a mechanism for vocabulary management, but it was not widely adopted before XML made it obsolete. You can view DITA's specialization feature as the XML reincarnation of the architectural form facility of HyTime. DITA's specialization feature was directly influenced by HyTime and the way we applied it within IBM to the IBMIDDoc SGML application.

² Norm Walsh has, informally, shown how this could be done (DITA for DocBook), but there is no formal, standards-based implementation.