This chapter describes the optional DOM Level 3 Content Model (CM) feature. This module provides a representation for XML content models, e.g., DTDs and XML Schemas, together with operations on the content models, and describes how the information within the content models can be applied to XML documents used in both the document-editing and CM-editing worlds. It also provides additional tests for well-formedness of XML documents, including Namespace well-formedness. A DOM application can use the hasFeature method of the DOMImplementation interface to determine whether a given DOM supports these capabilities. The feature string for all the interfaces listed in this section is "CM".
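The feature test described above can be sketched with Python's standard xml.dom.minidom, which implements hasFeature. This is an illustration only: the version string "3.0" is an assumption (this chapter fixes only the feature name "CM"), and minidom does not implement the CM feature, so the query reports it as unsupported.

```python
from xml.dom.minidom import getDOMImplementation

# Obtain the implementation object and query it for features.
impl = getDOMImplementation()

# "3.0" is an assumed version string; only the name "CM" comes from this chapter.
supports_cm = impl.hasFeature("CM", "3.0")
supports_core = impl.hasFeature("Core", "2.0")

if not supports_cm:
    # An application would fall back to editing without content-model support.
    print("CM feature not available in this DOM")
```

An application would normally perform this check once, before calling any of the CM interfaces.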
This chapter interacts strongly with the Load and Save chapter, which is also under development in DOM Level 3. Not only will that code serialize/deserialize content models, but it may also wind up defining its well-formedness and validity checks in terms of what is defined in this chapter. In addition, the CM and Load/Save functional areas will share a common error-reporting mechanism allowing user-registered error callbacks. Note that this may not imply that the parser actually calls the DOM's validation code -- it may be able to achieve better performance via its own -- but the appearance to the user should probably be "as if" the DOM has been asked to validate the document, and parsers should probably be able to validate newly loaded documents in terms of a previously loaded DOM CM.
Finally, this chapter will have separate sections to address the needs of the document-editing and CM-editing worlds, along with a section that details overlapping areas such as validation. In this manner, the document-editing world's focus on editing and on using the information in the CM is kept distinct from the CM-editing world's focus on defining and manipulating that information.
In the October 9, 1997 DOM requirements document, the following appeared: "There will be a way to determine the presence of a DTD. There will be a way to add, remove, and change declarations in the underlying DTD (if available). There will be a way to test conformance of all or part of the given document against a DTD (if available)." In later discussions, the following was added, "There will be a way to query element/attribute (and maybe other) declarations in the underlying DTD (if available)," supplementing the primitive support for these in Level 1.
That work was deferred past Level 2, in the hope that XML Schemas would be addressed as well. Presently, it is still unclear whether XML Schemas will be ready in time to be supported in DOM Level 3, but it is anticipated that lowest common denominator general APIs generated in this chapter can support both DTDs and XML Schemas, and other XML content models down the road.
The kinds of information that a Content Model must make available are mostly self-evident from the definitions of Infoset, DTDs, and XML Schemas. However, some kinds of information on which the DOM already relies, e.g., default values for attributes, will finally be given a visible representation here.
The content model referenced in these use cases/requirements is an abstraction and does not refer to DTDs or XML Schemas or any transformations between the two.
For the CM-editing and document-editing worlds, the following use cases and requirements are common to both and could be labeled as the "Validation and Other Common Functionality" section:
Use Cases:
Requirements:
Specific to the CM-editing world, the following are use cases and requirements and could be labeled as the "CM-editing" section:
Use Cases:
Requirements:
Specific to the document-editing world, the following are use cases and requirements and could be labeled as the "Document-editing" section:
Use Cases:
Requirements:
General Issues:
An isNamespaceAware attribute has been added to the generic CM object to help applications determine if qualified names are important. Note that this should not be interpreted as helping identify what the underlying content model is. A MathML example to show how namespaced documents will be validated will be added later.
A list of the proposed Content Model data structures and functions follows, starting off with the data structures.
CMObject is an abstract object that could map to a DTD, an XML Schema, a database schema, etc. It's a generalized content model object that has both an internal and an external subset. The internal subset would always exist, even if empty, with a link to the external subset, i.e., zero or more CMExternalObjects linked together. It is possible, however, that none of these CMExternalObjects are active. An isNamespaceAware attribute will be available to determine if qualified names are important.
interface CMObject { };
CMExternalObject is an abstract object that could map to a DTD, an XML Schema, a database schema, etc. It's a generalized content model object that is not bound to a particular XML document. Opaque.
interface CMExternalObject { };
CMNode, or a CMObject Node, is analogous to a node in the parse tree, e.g., an element declaration. This can exist for both CMExternalObject (include/ignore must be handled here) and CMObject. It should handle the following:
interface CommentsPIsDeclaration {
attribute ProcessingInstruction pis;
attribute Comment comments;
};
interface ConditionalDeclaration {
attribute boolean includeIgnore;
};
Opaque.
interface CMNode { };
CMNodeList is the CM analogue to NodeList; ordering is important, as opposed to NamedCMNodeMap. Opaque.
interface CMNodeList { };
NamedCMNodeMap is the CM analogue to NamedNodeMap. Ordering is not important. Opaque.
interface NamedCMNodeMap { };
CMDataType is a string for now, as in "int" or "float", hence no typechecking.
interface CMDataType { };
CMType is a CMNode's node type. For example, one type could be ElementDeclaration, composed of a tagname, content-type, etc. Others could be ElementCMModel and AttributeDeclaration.
interface CMType { };
The element name along with a description: empty, any, mixed, elements, PCDATA, in the context of a CMNode.
interface ElementDeclaration {
  readonly attribute DOMString elementName;
  attribute DOMString contentType;
  attribute NamedCMNodeMap attributes;
};
attributes
of type NamedCMNodeMap
NamedNodeMap
.
contentType
of type DOMString
elementName
of type DOMString
, readonly
An element in the context of a CMNode.
interface ElementCMModel {
  attribute DOMString listOperator;
  attribute int multiplicity;
  attribute int lowValue;
  attribute int highValue;
  attribute NamedCMNodeMap subModels;
  attribute CMNodeList definingElement;
};
definingElement
of type CMNodeList
highValue
of type int
listOperator
of type DOMString
lowValue
of type int
multiplicity
of type int
subModels
of type NamedCMNodeMap
CMNode
s in which the element can be defined.
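The lowValue and highValue attributes above suggest a simple occurrence range for an element within a content model. The following is a minimal sketch of how such a range could be represented and tested; the class name Cardinality, the use of None for "unbounded", and the mapping to the DTD indicators are assumptions made for illustration, not part of the proposed interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cardinality:
    """Hypothetical stand-in for the lowValue/highValue pair of ElementCMModel."""
    low: int                     # minimum number of occurrences
    high: Optional[int] = None   # maximum number; None is taken to mean unbounded

    def allows(self, count: int) -> bool:
        """Return True if `count` occurrences fall inside the range."""
        if count < self.low:
            return False
        return self.high is None or count <= self.high


# DTD occurrence indicators expressed as (low, high) pairs.
OPTIONAL = Cardinality(0, 1)     # "?"
ZERO_OR_MORE = Cardinality(0)    # "*"
ONE_OR_MORE = Cardinality(1)     # "+"
```

Under these assumptions, a validator would count the matching children of an element and call allows() with that count.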
An attribute in the context of a CMNode.
interface AttributeDeclaration {
  readonly attribute DOMString attrName;
  attribute CMDataType attrType;
  attribute DOMString defaultValue;
  attribute DOMString enumAttr;
  attribute CMNodeList ownerElement;
};
attrName of type DOMString, readonly
attrType of type CMDataType
defaultValue of type DOMString
enumAttr of type DOMString
ownerElement of type CMNodeList
As in current DOM.
interface EntityDeclaration { };
This section contains "Validation and Other" methods common to both the document-editing and CM-editing worlds (includes Document, DOMImplementation, and ErrorHandler methods).
This interface extends the Document interface with additional methods for both document and CM editing.
interface DocumentCM : Document {
  boolean isValid();
  int numCMs();
  CMObject getInternalCM();
  CMExternalObject * getCMs();
  CMObject getActiveCM();
  void addCM(in CMObject cm);
  void removeCM(in CMObject cm);
  boolean activateCM(in CMObject cm);
};
activateCM
CMObject
active. Note that a user who wants to activate one CM to get default attribute values and then activate another to do validation can do that; however, only one CM is active at a time.
cm
of type
CMObject
CMObject
points to a list of CMExternalObject
s; with this call, only the specified CM will be active.
|
True if the |
addCM
CMObject
with a document. Can be invoked multiple times to result in a list of CMExternalObject
s. Note that only one sole internal CMObject
is associated with the document, however, and that only one of the possible list of CMExternalObject
s is active at any one time.
cm
of type
CMObject
No return.
getActiveCM
CMExternalObject
for a document.
|
getCMs
CMExternalObject
s associated with a document from the CMObject
. This list arises when addCM()
is invoked.
|
A list of |
getInternalCM
isValid
|
Valid or not. |
numCMs
CMExternalObject
s associated with the document. Only one CMObject
can be associated with the document, but it may point to a list of CMExternalObject
s.
|
Non-negative number of external CM objects. |
removeCM
CMExternalObject
. Can be invoked multiple times to remove a number of these in the list of CMExternalObject
s.
cm
of type
CMObject
No return.
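The bookkeeping implied by addCM, removeCM, activateCM, numCMs, and getActiveCM can be sketched as follows. This is only an illustration of the "many external CMs, at most one active" rule stated above; the class names are hypothetical, and the behavior when the active CM is removed is an assumption, since the draft does not specify it.

```python
class ExternalCM:
    """Hypothetical stand-in for CMExternalObject."""

    def __init__(self, name):
        self.name = name


class DocumentCMSketch:
    """Tracks a list of external CMs, of which at most one is active."""

    def __init__(self):
        self._external = []   # list built up by repeated add_cm() calls
        self._active = None   # the single active external CM, if any

    def add_cm(self, cm):
        if cm not in self._external:
            self._external.append(cm)

    def remove_cm(self, cm):
        if cm in self._external:
            self._external.remove(cm)
        if self._active is cm:
            # Assumption: removing the active CM leaves no CM active.
            self._active = None

    def activate_cm(self, cm):
        # Only a CM already associated with the document can be activated.
        if cm not in self._external:
            return False
        self._active = cm     # replaces any previously active CM
        return True

    def num_cms(self):
        return len(self._external)

    def get_active_cm(self):
        return self._active
```

A caller could activate one CM to obtain default attribute values, then activate another for validation, as the activateCM description allows.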
This interface extends the DOMImplementation interface with additional methods.
interface DomImplementationCM : DOMImplementation {
  boolean validate();
  CMObject createCM();
  CMExternalObject createExternalCM();
  CMObject cloneCM(in CMObject cm);
  CMExternalObject cloneExternalCM(in CMExternalObject cm);
};
cloneCM
cloneExternalCM
CMExternalObject
to another CMExternalObject
. The CMExternalObject
returned wouldn't be associated with a document.
cm
of type
CMExternalObject
CMObject
to be cloned.
Cloned |
createCM
A NULL return indicates failure. |
createExternalCM
A NULL return indicates failure. |
validate
|
Is the CM valid? |
Basic interface for CM or Load/Save error handlers. If an application needs to implement customized error handling for CM or Load/Save, it must implement this interface and then register an instance using the setErrorHandler method. All errors and warnings will then be reported through this interface. Application writers can override the methods in a subclass to take user-specified actions.
interface ErrorHandler {
  void warning(in DOMString where, in DOMString how, in DOMString why) raises(DOMException2);
  void fatalError(in DOMString where, in DOMString how, in DOMString why) raises(DOMException2);
  void error(in DOMString where, in DOMString how, in DOMString why) raises(DOMException2);
};
error
DOMString
of type
where
DOMString
of type
how
DOMString
of type
why
|
A subclass of DOMException. |
fatalError
DOMString
of type
where
DOMString
of type
how
DOMString
of type
why
|
A subclass of DOMException. |
warning
DOMString
of type
where
DOMString
of type
how
DOMString
of type
why
|
A subclass of DOMException. |
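An application-side handler in the spirit of the ErrorHandler interface could look like the sketch below. The three methods mirror warning, error, and fatalError with their where/how/why arguments; the class name, the check_not_empty helper, and the choice to raise RuntimeError on a fatal error are all assumptions for illustration, and no registration method is modeled.

```python
class CollectingErrorHandler:
    """Hypothetical handler that records every report it receives."""

    def __init__(self):
        self.reports = []

    def warning(self, where, how, why):
        self.reports.append(("warning", where, how, why))

    def error(self, where, how, why):
        self.reports.append(("error", where, how, why))

    def fatal_error(self, where, how, why):
        self.reports.append(("fatalError", where, how, why))
        # Assumption: a fatal error stops processing by raising.
        raise RuntimeError(f"{where}: {why}")


def check_not_empty(name, value, handler):
    """Toy check that reports through the handler instead of printing."""
    if value == "":
        handler.warning(name, "empty-value", "attribute value is empty")
        return False
    return True
```

Routing all reports through one object is what lets the CM and Load/Save areas share a single error-reporting mechanism.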
This section contains "CM-editing" methods (includes CMObject, CMNode, ElementDeclaration, and ElementCMModel methods).
CMObject is an abstract object that could map to a DTD, an XML Schema, a database schema, etc. It's a generalized content model object that has both an internal and an external subset. The internal subset would always exist, even if empty, with a link to the external subset, i.e., zero or more CMExternalObjects linked together. It is possible, however, that none of these CMExternalObjects are active. An isNamespaceAware attribute will be available to determine if qualified names are important.
interface CMObject {
  readonly attribute boolean isNamespaceAware;
  nsElement getCMNamespace();
  NamedCMNodeMap getCMElements();
  boolean removeCMNode(in CMNode node);
  boolean insertbeforeCMNode(in CMNode newnode, in CMNode parentnode);
};
isNamespaceAware
of type boolean
, readonly
getCMElements
getCMNamespace
CMObject
.
|
Namespace of |
insertbeforeCMNode
removeCMNode
CMNode, or a CMObject Node, is analogous to a node in the parse tree, e.g., an element declaration. This can exist for both CMExternalObject (include/ignore must be handled here) and CMObject. It should handle the following:
interface CMNode { CMType getCMNodeType(); };
The element name along with a description: empty, any, mixed, elements, PCDATA, in the context of a CMNode.
interface ElementDeclaration {
  int getContentType();
  ElementCMModel getCMElement();
  NamedCMNodeMap getCMAttributes();
  NamedCMNodeMap getCMElementsChildren();
};
getCMAttributes
getCMElement
Content model of element. |
getCMElementsChildren
getContentType
CMNode
.
|
Content type constant. |
An element in the context of a CMNode.
interface ElementCMModel {
  ElementCMModel setCMElementCardinality(in CMNode node, in int high, in int low);
  ElementCMModel getCMElementCardinality(in CMNode node, out int high, out int low);
};
getCMElementCardinality
Element in the context of a |
setCMElementCardinality
Element in the context of a |
This section contains "Document-editing" methods (includes Node, Element, Text and Document methods).
This interface extends the Node interface with additional methods for document editing.
interface NodeCM : Node {
  boolean canInsertBefore();
  boolean canRemoveChild();
  boolean canReplaceChild();
  boolean canAppendChild();
};
canAppendChild
AppendChild
.
|
Success or failure. |
canInsertBefore
InsertBefore
.
|
Success or failure. |
canRemoveChild
RemoveChild
.
|
Success or failure. |
canReplaceChild
ReplaceChild
.
|
Success or failure. |
An element in the context of a CMNode.
interface ElementCMModel {
  boolean isValid();
  int contentType();
  boolean canSetAttribute(in DOMString attrname, in DOMString attrval);
  boolean canSetAttributeNode();
};
canSetAttribute
attrname
of type
DOMString
attrval
of type
DOMString
|
Success or failure. |
canSetAttributeNode
|
Success or failure. |
contentType
|
Constant for mixed, empty, any, etc. |
isValid
|
Success or failure. |
This interface extends the Text interface with additional methods for document editing.
interface TextCM : Text {
  boolean isWhitespaceOnly();
  boolean canSetData();
  boolean canAppendData();
  boolean canReplaceData();
  boolean canInsertData();
};
canAppendData
|
Success or failure. |
canInsertData
|
Success or failure. |
canReplaceData
|
Success or failure. |
canSetData
|
Success or failure. |
isWhitespaceOnly
|
True if the content is only whitespace; false if it contains non-whitespace. Applies to a text node in element content. |
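A whitespace-only test of the kind isWhitespaceOnly describes could be sketched as below. It uses the four whitespace characters of the XML 1.0 S production (space, tab, carriage return, line feed); treating an empty string as whitespace-only is an assumption, and the function name is hypothetical.

```python
# The characters matched by the XML 1.0 "S" (white space) production.
XML_WHITESPACE = " \t\r\n"


def is_whitespace_only(data):
    """Return True if every character of `data` is XML white space.

    An empty string yields True (an assumption of this sketch).
    """
    return all(ch in XML_WHITESPACE for ch in data)
```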
This interface extends the Document interface with additional methods for document editing.
interface DocumentCM : Document {
  boolean isElementDefined(in DOMString elemTypeName);
  boolean isAttributeDefined(in DOMString elemTypeName, in DOMString attrName);
  boolean isEntityDefined(in DOMString entName);
};
isAttributeDefined
elemTypeName
of type
DOMString
attrName
of type
DOMString
|
Success or failure. |
isElementDefined
elemTypeName
of type
DOMString
|
Success or failure. |
isEntityDefined
entName
of type
DOMString
|
Success or failure. |
Editing and generating a content model falls in the CM-editing world. The most obvious need for this functionality comes from tools that author content models, either under user control, i.e., explicitly designed document types, or generated from other representations. The latter class includes transcoding tools, e.g., synthesizing an XML representation to match a database schema.
It's important to note here that a DTD's "internal subset" is part of the Content Model, yet is loaded, stored, and maintained as part of the individual document instance. This implies that even tools which do not want to let users change the definition of the Document Type may need to support editing operations upon this portion of the CM. It also means that our representation of the CM must be aware of where each portion of its content resides, so that when the serializer processes this document it can write out just the internal subset. A similar issue may arise with external parsed entities, or if schemas introduce the ability to reference other schemas. Finally, the internal-subset case suggests that we may want at least a two-level representation of content models, so a single DOM representation of a DTD can be shared among several documents, each potentially also having its own internal subset; it's possible that entity layering may be represented the same way.
The API for altering the content model may also be the CM's official interface with parsers. One of the ongoing problems in the DOM is that there is some information which must currently be created via completely undocumented mechanisms, which limits the ability to mix and match DOMs and parsers. Given that specialized DOMs are going to become more common (sub-classed, or wrappers around other kinds of storage, or optimized for specific tasks), we must avoid that situation and provide a "builder" API. Particular pairs of DOMs and parsers may bypass it, but it's required as a portability mechanism.
Note that several of these applications require that a CM be able to be created, loaded, and manipulated without/before being bound to a specific Document. A related issue is that we'd want to be able to share a single representation of a CM among several documents, both for storage efficiency and so that changes in the CM can quickly be tested by validating it against a set of known-good documents. Similarly, there is a known problem in DOM Level 2 where we assume that the DocumentType will be created before the Document, which is fine for newly-constructed documents but not a good match for the order in which an XML parser encounters this data; being able to "rebind" a Document to a new CM, after it has been created may be desirable.
As noted earlier, it remains an open question whether one can alter the content of the CM via its syntax, via higher-level abstractions, or both. It's also worth noting that many of the editing concepts from the Document tree still apply; users should probably be able to clone part of a CM, remove and re-insert parts, and so on.
In addition to using the content model to validate a document instance, applications would like to be able to use it to guide construction and editing of documents, which falls into the document-editing world. Examples of this sort of guided editing already exist, and are becoming more common. The necessary queries can be phrased in several ways, the most useful of which may be a combination of "what does the DTD allow me to insert here" and "if I insert this here, will the document still be valid". The former is better suited to presentation to humans via a user interface, and when taken together with sub-tree validation may subsume the latter.
It has been proposed that in addition to asking questions about specific parts of the content model, there should be a reasonable way to obtain a list of all the defined symbols of a given type (element, attribute, entity) independent of whether they're valid in a given location; that might be useful in building a list in a user-interface, which could then be updated to reflect which of these are relevant for the program's current state.
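The "list all defined symbols of a given type" query could be sketched over a toy in-memory content model as follows. The declarations dictionary, its contents, and both function names are hypothetical; a real CM would derive this information from its declarations.

```python
# Hypothetical content model: declared symbols grouped by kind.
declarations = {
    "element": {"book", "title", "chapter"},
    "attribute": {("book", "isbn"), ("chapter", "id")},
    "entity": {"copy"},
}


def defined_symbols(kind):
    """Return every declared symbol of `kind`, regardless of where it is valid."""
    return sorted(declarations.get(kind, ()))


def is_element_defined(name):
    """Analogue of isElementDefined for the toy model above."""
    return name in declarations["element"]
```

A user interface could show the full list from defined_symbols and then enable only the entries that are relevant at the current position.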
Remember that namespaces also weigh in on this issue. In the case of attributes, a "can-this-go-there" query may prompt a namespace-well-formedness check and warn you if you're about to conflict with or overwrite another attribute with the same NSURI/localname but a different prefix, or with the same nodename but a different NSURI.
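The two attribute conflicts just described can be detected with a small comparison. The sketch below represents each attribute as a (namespace URI, prefix, local name) tuple; that representation, the function name, and the returned labels are assumptions for illustration.

```python
def attribute_conflicts(existing, new):
    """Compare `new` against `existing` attributes on the same element.

    Each attribute is a (ns_uri, prefix, local_name) tuple.
    Returns a list of conflict labels, empty if there is no conflict.
    """
    ns_uri, prefix, local = new
    qname = f"{prefix}:{local}" if prefix else local
    problems = []
    for e_uri, e_prefix, e_local in existing:
        e_qname = f"{e_prefix}:{e_local}" if e_prefix else e_local
        if (e_uri, e_local) == (ns_uri, local) and e_prefix != prefix:
            # Same NSURI/localname, different prefix.
            problems.append("same-name-different-prefix")
        elif e_qname == qname and e_uri != ns_uri:
            # Same nodename, different NSURI.
            problems.append("same-nodename-different-namespace")
    return problems
```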
As mentioned above, we have to deal with the fact that the shortest distance between two valid documents may be through an invalid one. Users may want to know several levels of detail (all the possible children, those which would be valid given what precedes this point, those which would be valid given both preceding and following siblings). Also, once XML Schemas introduce context-sensitive validity, we may have to consider the effect of children as well as the individual node being inserted.
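The three levels of detail can be illustrated with a toy content model written as a regular expression over child names. Everything here is an assumption made for the sketch: the model (title, para+), the space-terminated encoding, and in particular the bounded search used for the "given what precedes" case, which only looks for completions of up to max_extra further children.

```python
import re
from itertools import product

# Toy content model for (title, para+); each child name is followed by a space.
MODEL = re.compile(r"(title )(para )+")
NAMES = ["title", "para"]   # level 1: every child the model mentions


def encode(names):
    """Encode a list of child names as the string the model is matched against."""
    return "".join(name + " " for name in names)


def valid_between(before, candidate, after):
    """Level 3: valid given both preceding and following siblings."""
    return MODEL.fullmatch(encode(before + [candidate] + after)) is not None


def valid_after(before, candidate, max_extra=3):
    """Level 2: valid given what precedes this point.

    True if some continuation of at most `max_extra` further children
    completes the model (a bounded search, not a general algorithm).
    """
    for extra in range(max_extra + 1):
        for tail in product(NAMES, repeat=extra):
            if valid_between(before, candidate, list(tail)):
                return True
    return False
```

Note that inserting title into an empty element passes the level 2 test even though the result is not yet valid, which is the "through an invalid document" situation described above.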
The most obvious use for a content model (DTD or XML Schema or any Content Model) is to use it to validate that a given XML document is in fact a properly constructed instance of the document type described by this CM. This again falls into the document-editing world. The XML spec only discusses performing this test at the time the document is loaded into the "processor", which most of us have taken to mean that this check should be performed at parse time. But it is obviously desirable to be able to revalidate a document -- or selected subtrees -- at other times. One such case would be validating an edited or newly constructed document before serializing it or otherwise passing it to other users. This issue also arises if the "internal subset" is altered -- or if the whole Content Model changes.
In the past, the DOM has allowed users to create invalid documents, and assumed the serializer would accept the task of detecting problems and announcing/repairing them when the document was written out in XML syntax... or that they would be checked for validity when read back in. We considered adding validity checks to the DOM's existing editing operations to prevent creation of invalid documents, but are currently inclined against this for several reasons. First, it would impose a significant amount of computational overhead on the DOM, which might be unnecessary in many situations, e.g., if the change is occurring in a context where we know the result will be valid. Second, "the shortest distance between two good documents may be through a bad document". Preventing a document from becoming temporarily invalid may impose a considerable amount of additional work on higher-level code and users. Hence our current plan is to continue to permit editing to produce invalid DOMs, but provide operations which permit a user to check the validity of a node on demand.
Note that validation includes checking that ID attributes are unique, and that IDREFs point to IDs which actually exist.
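The ID/IDREF part of validation can be sketched with Python's standard xml.dom.minidom. Which attributes are of type ID or IDREF is determined by the content model; since this sketch has no content model, it assumes attributes named "id" and "ref" play those roles, and the function name is hypothetical.

```python
from xml.dom.minidom import parseString


def check_ids(xml_text, id_attr="id", idref_attr="ref"):
    """Return (duplicate_ids, dangling_refs) for a document.

    `id_attr` and `idref_attr` are assumed attribute names standing in
    for the ID and IDREF attribute types a real content model would declare.
    """
    doc = parseString(xml_text)
    seen, duplicates, refs = set(), [], []
    for element in doc.getElementsByTagName("*"):
        if element.hasAttribute(id_attr):
            value = element.getAttribute(id_attr)
            if value in seen:
                duplicates.append(value)   # ID values must be unique
            seen.add(value)
        if element.hasAttribute(idref_attr):
            refs.append(element.getAttribute(idref_attr))
    # Every IDREF must point at an ID that actually exists.
    dangling = [ref for ref in refs if ref not in seen]
    return duplicates, dangling
```

References are resolved only after the whole document has been scanned, so a forward reference to a later ID is not reported as dangling.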
XML defined the "well-formed" (WF) state for documents which are parsed without reference to their DTDs. Knowing that a document is well-formed may be useful by itself even when a DTD is available. For example, users may wish to deliberately save an invalid document, perhaps as a checkpoint before further editing. Hence, the CM feature will permit both full validity checking (see next section) and "lightweight" WF checking, as requested by the caller. This falls within the document-editing world.
While the DOM inherently enforces some of XML's well-formedness conditions (proper nesting of elements, constraints on which children may be placed within each node), there are some checks that are not yet performed. These include:
In addition, Namespaces introduce their own concepts of well-formedness. Specifically:
namespaceNormalize operation, which would create the implied declarations and reconcile conflicts in some reasonably standardized manner. This may be a major undertaking, since some DOMs may be using the namespace to direct subclassing of the nodes or similar special treatment; as with the existing normalize method, you may be left with a different-but-equivalent set of node objects.
In the past, the DOM has allowed users to create documents which violate these rules, and assumed the serializer would accept the task of detecting problems and announcing/repairing them when the document was written out in XML syntax. We considered adding WF checks to the DOM's existing editing operations to prevent WF violations from arising, but are currently inclined against this for two reasons. First, it would impose a significant amount of computational overhead on the DOM, which might be unnecessary in many situations (for example, if the change is occurring in a context where we know the illegal characters have already been prevented from arising). Second, "the shortest distance between two good documents may be through a bad document" -- preventing a document from becoming temporarily ill-formed may impose a considerable amount of additional work on higher-level code and users. (Note possible issue for Serialization: In some applications, being able to save and reload marginally poorly-formed DOMs might be useful -- editor checkpoint files, for example.) Hence our current plan is to continue to permit editing to produce ill-formed DOMs, but provide operations which permit a user to check the well-formedness of a node on demand, and possibly expose some of the primitive (e.g., string-checking) functions directly.
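One primitive string-checking function of the kind mentioned above is a test for characters that XML does not permit. The sketch below follows the Char production of XML 1.0; the function names are hypothetical, and how such a check would be exposed through the DOM is not decided by this chapter.

```python
def is_xml_char(ch):
    """Return True if `ch` matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text):
    """Return the index of the first character XML forbids, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None
```

Because the check is on demand, an editor can call it before serialization rather than on every character-data change.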