public final class OrphanedWordOutfilter extends WordBufferingFilter
This is an optional fixup outfilter that is part of the Word by Word pipeline. Its operation and application are discussed in more detail in the Word by Word Text Comparison guide.
Converts unchanged deltaxml:word elements into modified deltaxml:word elements if they are surrounded
by a large sequence of changed deltaxml:word elements.
Example
The following delta (pretty printed) contains a single unchanged word, "common-word". Although this word has been matched correctly, in most cases such a match is coincidental: within a large change, the word simply happens to appear in both the old and new versions.

    <root deltaxml:deltaV2="A!=B">
      <deltaxml:word deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">word1</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">word3</deltaxml:text>
        </deltaxml:textGroup>
      </deltaxml:word>
      <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
      <deltaxml:word deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">word2</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">word8</deltaxml:text>
        </deltaxml:textGroup>
      </deltaxml:word>
      <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
      <deltaxml:word deltaxml:deltaV2="A=B">common-word</deltaxml:word>
      <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
      <deltaxml:word deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">word4</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">word6</deltaxml:text>
        </deltaxml:textGroup>
      </deltaxml:word>
      <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
      <deltaxml:word deltaxml:deltaV2="A!=B">
        <deltaxml:textGroup deltaxml:deltaV2="A!=B">
          <deltaxml:text deltaxml:deltaV2="A">word5</deltaxml:text>
          <deltaxml:text deltaxml:deltaV2="B">word7</deltaxml:text>
        </deltaxml:textGroup>
      </deltaxml:word>
    </root>

The filter identifies the orphaned word, and changes it into a modified word. So our common word becomes:

    <deltaxml:word deltaxml:deltaV2="A!=B">
      <deltaxml:textGroup deltaxml:deltaV2="A!=B">
        <deltaxml:text deltaxml:deltaV2="A">common-word</deltaxml:text>
        <deltaxml:text deltaxml:deltaV2="B">common-word</deltaxml:text>
      </deltaxml:textGroup>
    </deltaxml:word>

Properties
The filter has two properties, as follows:
orphanedLengthLimit sets a hard limit on the number of unchanged words that would make up an "orphaned" section.
Default value is 2.
orphanedThresholdPercentage sets the percentage threshold for calculating whether a group is orphaned. If the equation:
unchanged words / (changed words before + unchanged words + changed words after) * 100
is less than the value of orphanedThresholdPercentage then the orphaned words are incorporated into the changed words.
Default value is 20.
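
The two properties can be illustrated with a small, self-contained sketch. It is not the filter's actual implementation: the class and method names are hypothetical, and it assumes that a run of unchanged words is treated as orphaned only when it is no longer than orphanedLengthLimit (taken here as an inclusive limit) and its percentage is strictly below orphanedThresholdPercentage.

```java
public final class OrphanedSectionSketch {

    // Defaults quoted from the property descriptions above.
    static final int DEFAULT_LENGTH_LIMIT = 2;
    static final int DEFAULT_THRESHOLD_PERCENTAGE = 20;

    /** Computes unchanged / (changedBefore + unchanged + changedAfter) * 100. */
    static double orphanedPercentage(int changedBefore, int unchanged, int changedAfter) {
        int total = changedBefore + unchanged + changedAfter;
        if (total <= 0) {
            throw new IllegalArgumentException("at least one word is required");
        }
        return 100.0 * unchanged / total;
    }

    /**
     * Assumption: both conditions must hold. The run must fit within the
     * length limit AND its percentage must be below the threshold.
     */
    static boolean isOrphaned(int changedBefore, int unchanged, int changedAfter,
                              int lengthLimit, int thresholdPercentage) {
        if (unchanged <= 0 || unchanged > lengthLimit) {
            return false;
        }
        return orphanedPercentage(changedBefore, unchanged, changedAfter) < thresholdPercentage;
    }

    public static void main(String[] args) {
        // One unchanged word between 4 and 5 changed words: 1 / 10 * 100 = 10%.
        System.out.println(orphanedPercentage(4, 1, 5));
        System.out.println(isOrphaned(4, 1, 5, DEFAULT_LENGTH_LIMIT, DEFAULT_THRESHOLD_PERCENTAGE));
        // One unchanged word between single changed words: about 33%, above the threshold.
        System.out.println(isOrphaned(1, 1, 1, DEFAULT_LENGTH_LIMIT, DEFAULT_THRESHOLD_PERCENTAGE));
    }
}
```

Under these assumptions, raising orphanedThresholdPercentage makes the check more aggressive (more unchanged words are folded into the surrounding change), while lowering orphanedLengthLimit restricts it to shorter runs.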
See also: WordInfilter

| Modifier and Type | Field and Description |
|---|---|
| static int | DEFAULT_ORPHANED_THRESHOLD_PERCENTAGE: The default value for orphanedThresholdPercentage, set at 20. |
| static int | DEFAULT_ORPHANED_WORD_LIMIT: The default value for orphanedLengthLimit, set at 2. |

| Constructor and Description |
|---|
| OrphanedWordOutfilter(): Creates a new instance of OrphanedWordOutfilter. |

| Modifier and Type | Method and Description |
|---|---|
| void | setorphanedLengthLimit(java.lang.String orphanedLengthLimit): Sets the maximum number of words that can be included in an orphaned section. |
| void | setorphanedThresholdPercentage(java.lang.String orphanedThresholdPercentage): Sets the percentage value used when calculating orphaned sections. |

Methods inherited from WordBufferingFilter and its superclasses:

characters, endElement, isBufferModifiedOnly, setwrapUnchangedText, startElement, startPrefixMapping

endDocument, getClosestAttributeValueFromAncestor, getClosestAttributeValueFromAnyAncestor, getGrandParentLocalName, getParentLocalName, getParentQName, getProperty, hasAncestor, hasAncestor, hasAncestorWithAttr, hasAncestorWithAttrValue, hasAncestorWithAttrValues, outputCharacters, outputEndElement, outputStartElement, popAncestorStack, pushAncestorStack, setProperty, stackDepth, startDocument

attributeDecl, comment, elementDecl, endCDATA, endDTD, endEntity, externalEntityDecl, internalEntityDecl, parse, parse, startCDATA, startDTD, startEntity

endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, skippedEntity, unparsedEntityDecl, warning

public static final int DEFAULT_ORPHANED_WORD_LIMIT
public static final int DEFAULT_ORPHANED_THRESHOLD_PERCENTAGE
public OrphanedWordOutfilter()
Creates a new instance of OrphanedWordOutfilter.
public void setorphanedLengthLimit(java.lang.String orphanedLengthLimit)
throws org.xml.sax.SAXException
Parameters: orphanedLengthLimit - the maximum number of words allowed in an orphaned section
Throws: org.xml.sax.SAXException - if the supplied value is invalid
public void setorphanedThresholdPercentage(java.lang.String orphanedThresholdPercentage)
throws org.xml.sax.SAXException
The percentage for a candidate section is calculated as:
possibly_orphaned_words_count / (preceding_changed_words_count + possibly_orphaned_words_count + following_changed_words_count) * 100
Parameters: orphanedThresholdPercentage - the percentage threshold used to decide whether unchanged words lying between two changed sections are orphaned
Throws: org.xml.sax.SAXException - if the supplied value is invalid
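
Both setters take their values as strings and report invalid input through org.xml.sax.SAXException. The following sketch shows one way such a string-valued property could be validated. It is an illustration only: the class and method names are hypothetical, and the notion of "invalid" used here (null, non-integer, or negative input) is an assumption, since the documentation above does not define it.

```java
import org.xml.sax.SAXException;

public final class PropertyParsingSketch {

    /**
     * Parses a property value supplied as a String.
     * Assumption: null, non-numeric, and negative values count as invalid.
     */
    static int parseNonNegativeInt(String name, String value) throws SAXException {
        if (value == null) {
            throw new SAXException(name + " must not be null");
        }
        final int parsed;
        try {
            parsed = Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            // Wrap the cause so the caller sees a single exception type.
            throw new SAXException(name + " is not an integer: " + value, e);
        }
        if (parsed < 0) {
            throw new SAXException(name + " must not be negative: " + value);
        }
        return parsed;
    }

    public static void main(String[] args) {
        try {
            System.out.println(parseNonNegativeInt("orphanedLengthLimit", "3"));
            parseNonNegativeInt("orphanedThresholdPercentage", "twenty");
        } catch (SAXException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because a string-based setter defers type errors to run time, code that configures the filter should be prepared to catch SAXException at the point where the properties are set.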