public abstract class LexicalPreservationBase extends XMLFilterImpl3
The LexicalPreservationBase filter provides a means of retaining, for round-trip processing, information that a parser
usually discards. For example, it can extract and store DOCTYPE declarations, element declarations, attribute
declarations, entity declarations, default attribute expansions, and entity references. It can also extract and store some of
the XML declaration data (i.e. the XML version and character encoding attributes), so long as an input stream is used to read the
incoming data. If a reader (character stream) is used, the original character encoding is lost (at least with a Xerces SAX
parser).
The XML declaration data is stored in a preserve:xmldecl
element, which is a child of the root element. The
internal subset data is stored in a preserve:doctype
element, which is also a child of the root element. Storage of the XML declaration and of the DOCTYPE (including its
internal subset) can be turned off by removing the
LexicalPreservationBase.PreserveItem.XML_DECL
and LexicalPreservationBase.PreserveItem.DOCTYPE
items, respectively, from the set of items to preserve.
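As a rough illustration of "removing items from the set of items to preserve", the sketch below builds such a set with `EnumSet`. The nested `PreserveItem` enum is a hypothetical stand-in that lists only the constants named in this description; the real enum may contain more, and how the set is handed to the filter is not shown on this page, so that step is left out.

```java
import java.util.EnumSet;
import java.util.Set;

final class PreserveItemSelection {
    /** Hypothetical stand-in for LexicalPreservationBase.PreserveItem (assumption: constants as named in the text). */
    enum PreserveItem {
        XML_DECL, DOCTYPE, COMMENT, PROCESS_INST, ATTRIBUTES,
        ENTITY_REF, ENTITY_REPLACEMENT_TEXT, INNER_ENTITY_REF
    }

    /** Starts from every item, then removes the two that control XML declaration and DOCTYPE storage. */
    static Set<PreserveItem> withoutDeclarations() {
        Set<PreserveItem> items = EnumSet.allOf(PreserveItem.class);
        items.remove(PreserveItem.XML_DECL);
        items.remove(PreserveItem.DOCTYPE);
        return items;
    }

    public static void main(String[] args) {
        System.out.println(withoutDeclarations());
    }
}
```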
The LexicalPreservationBase.PreserveItem.COMMENT
and LexicalPreservationBase.PreserveItem.PROCESS_INST
items indicate whether XML comments and processing
instructions (PIs) should be encoded into preserve:comment
and pi:tag
elements, where
tag
is replaced by the name of the PI. If they are not encoded, the XML comments and PIs are left in situ.
Note that unencoded comments and PIs are lost during the comparison process, so you may wish to encode them with an
alternative filter, such as the xml2pi.xsl
filter. Encoded XML comments and PIs that appear
outside the root element are contained in a preserve:pi-and-comment
element, which is a child of the root element.
There can be up to three preserve:pi-and-comment
elements, which are distinguished by their region
attribute value:
LexicalPreservationBase.Region.BEFORE_DTD - PIs and comments before the DOCTYPE/internal subset declaration.
LexicalPreservationBase.Region.AFTER_DTD - PIs and comments after the DOCTYPE/internal subset declaration, but before the root element.
LexicalPreservationBase.Region.AFTER_BODY - PIs and comments after the root element (XML body has been completed).
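The three regions can be pictured with a plain SAX handler that tracks where the parser currently is. This is only a sketch of the idea, not the filter's implementation: the `RegionSketch` class, its `Region` enum (including the extra `IN_DTD` and `IN_BODY` values) and the `classify` helper are all hypothetical, and it assumes that a document without a DOCTYPE simply stays in `BEFORE_DTD` until the root element starts.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class RegionSketch extends DefaultHandler2 {
    /** Hypothetical region names; IN_DTD and IN_BODY are additions for this sketch. */
    enum Region { BEFORE_DTD, IN_DTD, AFTER_DTD, IN_BODY, AFTER_BODY }

    private Region region = Region.BEFORE_DTD;
    private int depth = 0;
    final List<String> seen = new ArrayList<>();

    @Override public void startDTD(String name, String publicId, String systemId) { region = Region.IN_DTD; }
    @Override public void endDTD() { region = Region.AFTER_DTD; }

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        region = Region.IN_BODY;
    }

    @Override public void endElement(String uri, String localName, String qName) {
        // Closing the root element ends the body.
        if (--depth == 0) region = Region.AFTER_BODY;
    }

    @Override public void comment(char[] ch, int start, int length) {
        seen.add(region + ":" + new String(ch, start, length).trim());
    }

    @Override public void processingInstruction(String target, String data) {
        seen.add(region + ":?" + target);
    }

    /** Parses the document and returns "REGION:text" for each comment and "REGION:?target" for each PI. */
    static List<String> classify(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            RegionSketch handler = new RegionSketch();
            reader.setContentHandler(handler);
            // Comments and DTD boundaries are only reported through the lexical handler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e); // checked exceptions wrapped to keep the sketch short
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(
            "<!--a--><!DOCTYPE r [<!ELEMENT r (#PCDATA)>]><?p d?><r>x</r><!--c-->"));
    }
}
```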
The preserve:defaultAttributes
attribute is added to any element that contains defaulted attributes. Its job is to
record the names of those attributes, so that they can be stripped out later. Default attribute
identification can be turned off by removing the LexicalPreservationBase.PreserveItem.ATTRIBUTES
item from the set of items to preserve.
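Defaulted attributes can be told apart from specified ones through the SAX2 `Attributes2.isSpecified` method. The sketch below collects them in the `{namespace}localName` form that appears in the preserve:defaultAttributes example later on this page. The `DefaultedAttributesSketch` class and its `find` helper are hypothetical and are not part of this filter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

final class DefaultedAttributesSketch extends DefaultHandler {
    final List<String> defaulted = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!(atts instanceof Attributes2)) {
            return; // this parser cannot say which attributes were defaulted
        }
        Attributes2 atts2 = (Attributes2) atts;
        for (int i = 0; i < atts2.getLength(); i++) {
            if (!atts2.isSpecified(i)) {
                // Not written in the document, so it came from a DTD default.
                defaulted.add("{" + atts2.getURI(i) + "}" + atts2.getLocalName(i));
            }
        }
    }

    /** Returns the defaulted attributes of every element, in document order. */
    static List<String> find(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getURI/getLocalName to be populated
            DefaultedAttributesSketch handler = new DefaultedAttributesSketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.defaulted;
        } catch (Exception e) {
            throw new IllegalStateException(e); // checked exceptions wrapped to keep the sketch short
        }
    }

    public static void main(String[] args) {
        System.out.println(find(
            "<!DOCTYPE simpara [<!ATTLIST simpara level CDATA 'unknown'>]><simpara id='s1'/>"));
    }
}
```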
Entity references can be marked for later round-trip preservation by adding the LexicalPreservationBase.PreserveItem.ENTITY_REF
item to the set of
items to preserve. Further, the amount of information that is retained when preserving an entity
application can be controlled via the LexicalPreservationBase.PreserveItem.ENTITY_REPLACEMENT_TEXT
and LexicalPreservationBase.PreserveItem.INNER_ENTITY_REF
items.
If LexicalPreservationBase.PreserveItem.ENTITY_REPLACEMENT_TEXT
is in the set of preserved items, the content of the entity application is
retained. If LexicalPreservationBase.PreserveItem.INNER_ENTITY_REF
is in the set of preserved items, inner entity applications are also
marked. Note that it is possible, though not recommended, to omit both the LexicalPreservationBase.PreserveItem.ENTITY_REF
and
LexicalPreservationBase.PreserveItem.ENTITY_REPLACEMENT_TEXT
items from the set of items to be preserved; this results in entity references being
omitted from the output altogether.
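The marking of an entity reference together with its replacement text can be sketched with the SAX `LexicalHandler.startEntity` and `endEntity` events. The `EntityRefSketch` class below is a hypothetical illustration only: it serialises element names and text, drops attributes, does no re-escaping, and wraps each general entity expansion inside the root element in an `er:` element named after the entity.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class EntityRefSketch extends DefaultHandler2 {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        out.append('<').append(qName).append('>'); // attributes are dropped in this sketch
    }

    @Override public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
        depth--;
    }

    @Override public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override public void startEntity(String name) {
        // Only entities referenced inside the root element are wrapped.
        if (depth > 0) out.append("<er:").append(name).append('>');
    }

    @Override public void endEntity(String name) {
        if (depth > 0) out.append("</er:").append(name).append('>');
    }

    /** Returns the document body with each entity expansion wrapped in an er:name element. */
    static String mark(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EntityRefSketch handler = new EntityRefSketch();
            reader.setContentHandler(handler);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e); // checked exceptions wrapped to keep the sketch short
        }
    }

    public static void main(String[] args) {
        System.out.println(mark(
            "<!DOCTYPE para [<!ENTITY genEnt 'an entity.'>]><para>This paragraph references &genEnt;</para>"));
    }
}
```

Dropping the replacement text while keeping the wrapper, or dropping the wrapper while keeping the text, would correspond loosely to leaving out ENTITY_REPLACEMENT_TEXT or ENTITY_REF respectively; that mapping is an interpretation of the description above, not confirmed behaviour.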
Note that the underpinning XML parser must provide Locator2
and Attributes2
objects in order for character
encodings and default attribute expansion to be detected. Further, it needs to support the
"http://xml.org/sax/features/resolve-dtd-uris" feature, which is set to false, to ensure that relative references are
retained. This feature is implemented by the Xerces_2 parser.
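A minimal sketch of these parser requirements, using only standard JAXP and SAX2 calls, is given below. It switches the resolve-dtd-uris feature off and reads the XML version and encoding through `Locator2` while parsing a byte stream. The `ParserSetupSketch` class and its method names are hypothetical, and the values reported depend on the parser in use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

final class ParserSetupSketch {
    static final String RESOLVE_DTD_URIS = "http://xml.org/sax/features/resolve-dtd-uris";

    /** Returns a reader with resolve-dtd-uris switched off; fails if the parser lacks the feature. */
    static XMLReader newReader() {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(RESOLVE_DTD_URIS, false); // keep relative system identifiers as written
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException("Parser cannot be configured as required", e);
        }
    }

    /** Returns {version, encoding} seen at the root element, or null if no Locator2 is available. */
    static String[] declaration(byte[] xml) {
        final String[] result = new String[2];
        final boolean[] found = {false};
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override public void setDocumentLocator(Locator l) { locator = l; }

            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!found[0] && locator instanceof Locator2) {
                    Locator2 l2 = (Locator2) locator;
                    result[0] = l2.getXMLVersion();
                    result[1] = l2.getEncoding();
                    found[0] = true;
                }
            }
        };
        try {
            XMLReader reader = newReader();
            reader.setContentHandler(handler);
            // A byte stream is used so that the parser can see the original encoding.
            reader.parse(new InputSource(new ByteArrayInputStream(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found[0] ? result : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><article/>".getBytes(StandardCharsets.UTF_8);
        String[] decl = declaration(doc);
        System.out.println(decl == null ? "no Locator2" : decl[0] + " " + decl[1]);
    }
}
```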
Example File
(01) <?xml version="1.0" encoding="UTF-8"?>
(02) <!-- A pre DOCTYPE comment -->
(03) <!DOCTYPE article SYSTEM "http://www.docbook.org/xml/4.5/docbookx.dtd"
(04) [ <!ENTITY % paramEnt "
(05) <!ATTLIST simpara level (unknown|novice|trainee|practioner|expert) 'unknown'>
(06) ">
(07) <!ELEMENT exampleElement (#PCDATA)>
(08) <!ATTLIST exampleElement yesNo (yes|no) 'no'>
(09) %paramEnt;
(10) <!ENTITY genEnt "an <emphasis role='bold'>internal (parsed) general</emphasis> entity.">
(11) ]>
(12) <?myPI Content of the processing instruction.?>
(13) <article>
(14) <title>Lexical Preservation Filter Demo</title>
(15) <!-- In the following paragraph we reference the entity &genEnt; -->
(16) <para>This paragraph references &genEnt;</para>
(17) <para><![CDATA[Content of the CDATA Section text]]></para>
(18) <simpara>An overridden simpara with a defaulted level attribute.</simpara>
(19) </article>
(20) <!-- A post XML body comment -->
The results of applying this filter to the above file are discussed in stages. In each stage, the input lines under consideration are reproduced, followed by the output produced when everything is being preserved. For clarity, the whitespace aspect of the preservation is not maintained.
XML Declaration and Root Element
(01) <?xml version="1.0" encoding="UTF-8"?>
(13) <article>
In order to compare and preserve the XML declaration, it is added to the body of the document, as illustrated by the line labelled (01b) below. Further, the namespaces that are used by this lexical preservation filter (13b to 13e) are attached to the root element (13a).
(01a) <?xml version="1.0" encoding="UTF-8"?>
(13a) <article
(13b) xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
(13c) xmlns:preserve="http://www.deltaxml.com/ns/preserve"
(13d) xmlns:er="http://www.deltaxml.com/ns/entity-references"
(13e) xmlns:pi="http://www.deltaxml.com/ns/processing-instructions">
(01b) <preserve:xmldecl xml-version="1.0" encoding="UTF-8"/>
Processing Instructions and Comments before the DOCTYPE declaration
(02) <!-- A pre DOCTYPE comment -->
In order to compare and preserve comments and processing instructions that occur before the DOCTYPE declaration, a
preserve:pi-and-comment
block is introduced, with region BEFORE_DTD.
(02a) <preserve:pi-and-comment region="BEFORE_DTD">
(02b) <preserve:comment> A pre DOCTYPE comment </preserve:comment>
(02c) </preserve:pi-and-comment>
DOCTYPE and Internal Subset Declaration
(03) <!DOCTYPE article SYSTEM "http://www.docbook.org/xml/4.5/docbookx.dtd"
(04) [ <!ENTITY % paramEnt "
(05) <!ATTLIST simpara level (unknown|novice|trainee|practioner|expert) 'unknown'>
(06) ">
(07) <!ELEMENT exampleElement (#PCDATA)>
(08) <!ATTLIST exampleElement yesNo (yes|no) 'no'>
(09) %paramEnt;
(10) <!ENTITY genEnt "an <emphasis role='bold'>internal (parsed) general</emphasis> entity.">
(11) ]>
In order to compare and preserve the DOCTYPE and internal subset declaration it is added to the preserve:doctype
block.
The content of the entity declarations is escaped using the ASCII exclamation mark (!) characters, where '!!' represents the
exclamation mark character. This special form of escaping ensures that it does not interfere with standard XML entity encoding
mechanisms. Note that the entity reference, in line (09), is transformed into four lines (09a), (05b), (05c), and (09b); the
key point is that the definition of the entity has been expanded, and so can be compared.
(03a) <preserve:doctype name="article" systemId="http://www.docbook.org/xml/4.5/docbookx.dtd">
(04a) <preserve:internalParsedParameterEntityDecl name="paramEnt" deltaxml:key="entity_par_paramEnt"
(05a) value=" !(*lt!)!!ATTLIST simpara level (unknown|novice|trainee|practioner|expert) !(*apos!)unknown!(*apos!)!(*gt!) "
(06a) />
(07a) <preserve:elementDecl name="exampleElement" deltaxml:key="element_exampleElement" model="(#PCDATA)"/>
(08a) <preserve:attributeDecl name="yesNo" deltaxml:key="attribute(exampleElement,yesNo)"
(08b) eName="exampleElement" type="(yes|no)" value="no"/>
(09a) <er:paramEnt parameter="yes">
(05b) <preserve:attributeDecl name="level" deltaxml:key="attribute(simpara,level)"
(05c) eName="simpara" type="(unknown|novice|trainee|practioner|expert)" value="unknown"/>
(09b) </er:paramEnt>
(10a) <preserve:internalParsedGeneralEntityDecl name="genEnt" deltaxml:key="entity_gen_genEnt"
(10b) value="an !(*lt!)emphasis role=!(*apos!)bold!(*apos!)!(*gt!)internal (parsed) general!(*lt!)/emphasis!(*gt!) entity."/>
(03b) </preserve:doctype>
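The exclamation-mark escaping visible in the value attributes above can be approximated as follows. This sketch covers only the four substitutions that actually appear in the example output ('!', '<', '>' and the apostrophe). How other characters, such as '&' or the double quote, are treated is not shown on this page, so they are passed through unchanged here. The `EntityValueEscapeSketch` class is hypothetical.

```java
final class EntityValueEscapeSketch {
    /** Escapes an entity declaration value using the '!'-based tokens seen in the example output. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '!':  sb.append("!!"); break;         // a literal exclamation mark is doubled
                case '<':  sb.append("!(*lt!)"); break;
                case '>':  sb.append("!(*gt!)"); break;
                case '\'': sb.append("!(*apos!)"); break;
                default:   sb.append(c);                   // assumption: everything else is copied as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<!ATTLIST simpara level (unknown) 'unknown'>"));
    }
}
```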
Processing instructions and comments between DOCTYPE and body
(12) <?myPI Content of the processing instruction.?>
In order to compare and preserve comments and processing instructions that occur between the DOCTYPE declaration and the body,
a preserve:pi-and-comment
block is introduced, with region AFTER_DTD.
(12a) <preserve:pi-and-comment region="AFTER_DTD">
(12b) <pi:myPI>Content of the processing instruction.</pi:myPI>
(12c) </preserve:pi-and-comment>
The body of the document
(14) <title>Lexical Preservation Filter Demo</title>
(15) <!-- In the following paragraph we reference the entity &genEnt; -->
(16) <para>This paragraph references &genEnt;</para>
(17) <para><![CDATA[Content of the CDATA Section text]]></para>
(18) <simpara>An overridden simpara with a defaulted level attribute.</simpara>
The following 5 lines illustrate how comments, entity references, CDATA sections, and defaulted attributes are encoded so that they can be compared and preserved within the body of the document.
(14a) <title>Lexical Preservation Filter Demo</title>
(15a) <preserve:comment> In the following paragraph we reference the entity &genEnt; </preserve:comment>
(16a) <para>This paragraph references <er:genEnt>an <emphasis role="bold">internal (parsed) general</emphasis> entity.</er:genEnt></para>
(17a) <para><preserve:cdata>Content of the CDATA Section text</preserve:cdata></para>
(18a) <simpara level="unknown" preserve:defaultAttributes="{}level">An overridden simpara with a defaulted level attribute.</simpara>
Ending the body of the document and the post body Processing Instructions and Comments.
(19) </article>
(20) <!-- A post XML body comment -->
In order to compare and preserve comments and processing instructions that occur after the body of the document, a
preserve:pi-and-comment
block is introduced, with region AFTER_BODY.
(20a) <preserve:pi-and-comment region="AFTER_BODY">
(20b) <preserve:comment> A post XML body comment </preserve:comment>
(20c) </preserve:pi-and-comment>
(19a) </article>
Modifier and Type | Class and Description |
---|---|
static class | LexicalPreservationBase.PreserveItem: An enum used to specify which items should be preserved. |
static class | LexicalPreservationBase.Region: An enumeration that marks the region of the document that is being parsed. |
XMLFilterImpl3.SaxEventItem
PROCESS_ADDITIONAL_INFO, PROCESS_ALL, PROCESS_BODY, PROCESS_DATA, PROCESS_DECLS, PROCESS_ELEM_AND_ATTRIB_DECLS, PROCESS_ENTITY_AND_NOTATION_DECLS, PROCESS_ENTITY_DECLS, PROCESS_ENTITY_REFS, PROCESS_EXCEPTIONS, PROCESS_INTERNAL_SUBSET, PROCESS_NORMAL_BODY
Constructor and Description |
---|
LexicalPreservationBase(): Constructs a new LexicalPreservationBase. |
Modifier and Type | Method and Description |
---|---|
void | attributeDecl(java.lang.String eName, java.lang.String aName, java.lang.String type, java.lang.String mode, java.lang.String value): Implementation of the attributeDecl SAX event handler. |
void | characters(char[] ch, int start, int length): Implementation of the characters SAX event handler. |
void | comment(char[] ch, int start, int length): Implementation of the comment SAX event handler. |
void | elementDecl(java.lang.String name, java.lang.String model): Implementation of the elementDecl SAX event handler. |
void | endCDATA(): Implementation of the endCDATA SAX event handler. |
void | endDocument(): Implementation of the SAX endDocument event handler. |
void | endDTD(): Implementation of the endDTD SAX event handler. |
void | endElement(java.lang.String uri, java.lang.String localName, java.lang.String name): Implementation of the SAX endElement event handler. |
void | endEntity(java.lang.String name): Implementation of the endEntity SAX event handler. |
void | externalEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId): Implementation of the externalEntityDecl SAX event handler. |
void | ignorableWhitespace(char[] ch, int start, int length): Implementation of the ignorableWhitespace SAX event handler. |
void | internalEntityDecl(java.lang.String name, java.lang.String value): Implementation of the internalEntityDecl SAX event handler. |
void | notationDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId): Implementation of the notationDecl SAX event handler. |
void | processingInstruction(java.lang.String target, java.lang.String data): Implementation of the processingInstruction SAX event handler. |
void | setDocumentLocator(org.xml.sax.Locator locator): Implementation of the SAX setDocumentLocator event handler. |
void | setParent(org.xml.sax.XMLReader parent) |
void | skippedEntity(java.lang.String name): Implementation of the skippedEntity SAX event handler. |
void | startCDATA(): Implementation of the startCDATA SAX event handler. |
void | startDocument(): Implementation of the SAX startDocument event handler. |
void | startDTD(java.lang.String name, java.lang.String publicId, java.lang.String systemId): Implementation of the startDTD SAX event handler. |
void | startElement(java.lang.String uri, java.lang.String localName, java.lang.String name, org.xml.sax.Attributes atts): Implementation of the SAX startElement event handler. |
void | startEntity(java.lang.String name): Implementation of the startEntity SAX event handler. |
void | unparsedEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId, java.lang.String notationName): Implementation of the unparsedEntityDecl SAX event handler. |
endPrefixMapping, error, fatalError, getProcessFilter, processEvent, setProcessFilter, setProcessFilter, setProcessFilter, startPrefixMapping, warning
getProperty, parse, parse, setProperty
public LexicalPreservationBase()
public void setParent(org.xml.sax.XMLReader parent)
Specified by: setParent in interface org.xml.sax.XMLFilter
Overrides: setParent in class org.xml.sax.helpers.XMLFilterImpl
public void setDocumentLocator(org.xml.sax.Locator locator)
Specified by: setDocumentLocator in interface org.xml.sax.ContentHandler
Overrides: setDocumentLocator in class XMLFilterImpl3
Parameters:
locator - The SAX locator data.
public void startDocument() throws org.xml.sax.SAXException
Specified by: startDocument in interface org.xml.sax.ContentHandler
Overrides: startDocument in class XMLFilterImpl3
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void endDocument() throws org.xml.sax.SAXException
Specified by: endDocument in interface org.xml.sax.ContentHandler
Overrides: endDocument in class XMLFilterImpl3
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String name, org.xml.sax.Attributes atts) throws org.xml.sax.SAXException
Specified by: startElement in interface org.xml.sax.ContentHandler
Overrides: startElement in class XMLFilterImpl3
Parameters:
uri - The namespace of the element.
localName - The element's local name.
name - The element's qualified name.
atts - The element's attributes.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String name) throws org.xml.sax.SAXException
Specified by: endElement in interface org.xml.sax.ContentHandler
Overrides: endElement in class XMLFilterImpl3
Parameters:
uri - The namespace of the element.
localName - The element's local name.
name - The element's qualified name.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void startDTD(java.lang.String name, java.lang.String publicId, java.lang.String systemId) throws org.xml.sax.SAXException
Specified by: startDTD in interface org.xml.sax.ext.LexicalHandler
Overrides: startDTD in class XMLFilterImpl3
Parameters:
name - The name of the root element.
publicId - The doctype's public identifier.
systemId - The doctype's system identifier.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.startDTD(String, String, String)
public void endDTD() throws org.xml.sax.SAXException
Specified by: endDTD in interface org.xml.sax.ext.LexicalHandler
Overrides: endDTD in class XMLFilterImpl3
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.startDTD(String, String, String)
public void startEntity(java.lang.String name) throws org.xml.sax.SAXException
Specified by: startEntity in interface org.xml.sax.ext.LexicalHandler
Overrides: startEntity in class XMLFilterImpl3
Parameters:
name - The entity's name.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.startEntity(String)
public void endEntity(java.lang.String name) throws org.xml.sax.SAXException
Specified by: endEntity in interface org.xml.sax.ext.LexicalHandler
Overrides: endEntity in class XMLFilterImpl3
Parameters:
name - The entity's name.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.endEntity(String)
public void skippedEntity(java.lang.String name) throws org.xml.sax.SAXException
Specified by: skippedEntity in interface org.xml.sax.ContentHandler
Overrides: skippedEntity in class XMLFilterImpl3
Parameters:
name - The entity's name.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void elementDecl(java.lang.String name, java.lang.String model) throws org.xml.sax.SAXException
Specified by: elementDecl in interface org.xml.sax.ext.DeclHandler
Overrides: elementDecl in class XMLFilterImpl3
Parameters:
name - The name of the element being declared.
model - The model used to define what content the element is allowed to contain.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
DeclHandler.elementDecl(String, String)
public void attributeDecl(java.lang.String eName, java.lang.String aName, java.lang.String type, java.lang.String mode, java.lang.String value) throws org.xml.sax.SAXException
Specified by: attributeDecl in interface org.xml.sax.ext.DeclHandler
Overrides: attributeDecl in class XMLFilterImpl3
Parameters:
eName - The name of the element that the attribute being declared belongs to.
aName - The name of the attribute that is being declared.
type - The attribute's type.
mode - The attribute's mode (e.g. required, optional, fixed).
value - The attribute's default value.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
DeclHandler.attributeDecl(String, String, String, String, String)
public void internalEntityDecl(java.lang.String name, java.lang.String value) throws org.xml.sax.SAXException
Specified by: internalEntityDecl in interface org.xml.sax.ext.DeclHandler
Overrides: internalEntityDecl in class XMLFilterImpl3
Parameters:
name - The name of the internal parsed entity being declared.
value - The definition of the entity's content (replacement text).
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
DeclHandler.internalEntityDecl(String, String)
public void externalEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId) throws org.xml.sax.SAXException
Specified by: externalEntityDecl in interface org.xml.sax.ext.DeclHandler
Overrides: externalEntityDecl in class XMLFilterImpl3
Parameters:
name - The name of the external parsed entity being declared.
publicId - The public identity of the entity.
systemId - The system identity of the entity.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
DeclHandler.externalEntityDecl(String, String, String)
public void unparsedEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId, java.lang.String notationName) throws org.xml.sax.SAXException
Specified by: unparsedEntityDecl in interface org.xml.sax.DTDHandler
Overrides: unparsedEntityDecl in class XMLFilterImpl3
Parameters:
name - The name of the unparsed entity being declared.
publicId - The public identity of the entity.
systemId - The system identity of the entity.
notationName - The entity's notation.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void notationDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId) throws org.xml.sax.SAXException
Specified by: notationDecl in interface org.xml.sax.DTDHandler
Overrides: notationDecl in class XMLFilterImpl3
Parameters:
name - The name of the notation entity being declared.
publicId - The public identity of the notation.
systemId - The system identity of the notation.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void comment(char[] ch, int start, int length) throws org.xml.sax.SAXException
Specified by: comment in interface org.xml.sax.ext.LexicalHandler
Overrides: comment in class XMLFilterImpl3
Parameters:
ch - The text.
start - The start position of the text to be extracted.
length - The number of characters to extract.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.comment(char[], int, int)
public void processingInstruction(java.lang.String target, java.lang.String data) throws org.xml.sax.SAXException
Specified by: processingInstruction in interface org.xml.sax.ContentHandler
Overrides: processingInstruction in class XMLFilterImpl3
Parameters:
target - The name of the processing instruction.
data - The content of the processing instruction.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void startCDATA() throws org.xml.sax.SAXException
Specified by: startCDATA in interface org.xml.sax.ext.LexicalHandler
Overrides: startCDATA in class XMLFilterImpl3
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.startCDATA(), ContentHandler.characters(char[], int, int)
public void endCDATA() throws org.xml.sax.SAXException
Specified by: endCDATA in interface org.xml.sax.ext.LexicalHandler
Overrides: endCDATA in class XMLFilterImpl3
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
See Also:
LexicalHandler.endCDATA(), ContentHandler.characters(char[], int, int)
public void characters(char[] ch, int start, int length) throws org.xml.sax.SAXException
Specified by: characters in interface org.xml.sax.ContentHandler
Overrides: characters in class XMLFilterImpl3
Parameters:
ch - The text.
start - The start position of the text to be extracted.
length - The number of characters to extract.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.
public void ignorableWhitespace(char[] ch, int start, int length) throws org.xml.sax.SAXException
Specified by: ignorableWhitespace in interface org.xml.sax.ContentHandler
Overrides: ignorableWhitespace in class XMLFilterImpl3
Parameters:
ch - The text.
start - The start position of the text to be extracted.
length - The number of characters to extract.
Throws:
org.xml.sax.SAXException - when there is a problem with the SAX event stream.