public final class OrphanedWordOutfilter extends WordBufferingFilter
This is an optional fixup outfilter which is part of the Word by Word pipeline. Its operation and application are discussed in more detail in the Word by Word Text Comparison guide.
Converts unchanged deltaxml:word elements into modified deltaxml:word elements if they are surrounded by a large sequence of changed deltaxml:word elements.
Example
The following delta (pretty printed) contains a single unchanged word, "common-word". Although this word has been correctly matched, in most cases the match is coincidental: within a large change, the word simply happens to appear in both the old and new versions.
<root deltaxml:deltaV2="A!=B">
  <deltaxml:word deltaxml:deltaV2="A!=B">
    <deltaxml:textGroup deltaxml:deltaV2="A!=B">
      <deltaxml:text deltaxml:deltaV2="A">word1</deltaxml:text>
      <deltaxml:text deltaxml:deltaV2="B">word3</deltaxml:text>
    </deltaxml:textGroup>
  </deltaxml:word>
  <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
  <deltaxml:word deltaxml:deltaV2="A!=B">
    <deltaxml:textGroup deltaxml:deltaV2="A!=B">
      <deltaxml:text deltaxml:deltaV2="A">word2</deltaxml:text>
      <deltaxml:text deltaxml:deltaV2="B">word8</deltaxml:text>
    </deltaxml:textGroup>
  </deltaxml:word>
  <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
  <deltaxml:word deltaxml:deltaV2="A=B">common-word</deltaxml:word>
  <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
  <deltaxml:word deltaxml:deltaV2="A!=B">
    <deltaxml:textGroup deltaxml:deltaV2="A!=B">
      <deltaxml:text deltaxml:deltaV2="A">word4</deltaxml:text>
      <deltaxml:text deltaxml:deltaV2="B">word6</deltaxml:text>
    </deltaxml:textGroup>
  </deltaxml:word>
  <deltaxml:space deltaxml:deltaV2="A=B"> </deltaxml:space>
  <deltaxml:word deltaxml:deltaV2="A!=B">
    <deltaxml:textGroup deltaxml:deltaV2="A!=B">
      <deltaxml:text deltaxml:deltaV2="A">word5</deltaxml:text>
      <deltaxml:text deltaxml:deltaV2="B">word7</deltaxml:text>
    </deltaxml:textGroup>
  </deltaxml:word>
</root>
The filter identifies the orphaned word, and changes it into a modified word. So our common word becomes:
<deltaxml:word deltaxml:deltaV2="A!=B">
  <deltaxml:textGroup deltaxml:deltaV2="A!=B">
    <deltaxml:text deltaxml:deltaV2="A">common-word</deltaxml:text>
    <deltaxml:text deltaxml:deltaV2="B">common-word</deltaxml:text>
  </deltaxml:textGroup>
</deltaxml:word>
Properties
The filter has two properties, as follows:
orphanedLengthLimit sets a hard limit on the number of unchanged words that would make up an "orphaned" section.
Default value is 2.
orphanedThresholdPercentage sets the percentage threshold for deciding whether a group of unchanged words is orphaned. If the value of the expression:
unchanged words / (changed words before + unchanged words + changed words after) * 100
is less than the value of orphanedThresholdPercentage, then the orphaned words are incorporated into the changed words.
Default value is 20.
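Read together, the two properties describe a simple decision rule. The following is a minimal sketch of that rule, not the filter's actual implementation: the class and method names are hypothetical, and it assumes that the length limit is inclusive and that both the length test and the percentage test must pass.

```java
/** Hypothetical helper illustrating the documented rule; not part of the DeltaXML API. */
final class OrphanCheckSketch {

    private OrphanCheckSketch() {
    }

    /**
     * Share of the whole run (changed before + unchanged + changed after)
     * taken up by the unchanged words, expressed as a percentage.
     */
    static double unchangedPercentage(int changedBefore, int unchanged, int changedAfter) {
        int total = changedBefore + unchanged + changedAfter;
        if (total == 0) {
            return 0.0; // nothing to measure
        }
        return unchanged * 100.0 / total;
    }

    /**
     * Assumption: a group is orphaned when it is no longer than the length
     * limit AND its percentage is strictly less than the threshold.
     */
    static boolean isOrphaned(int changedBefore, int unchanged, int changedAfter,
                              int lengthLimit, int thresholdPercentage) {
        if (unchanged <= 0 || unchanged > lengthLimit) {
            return false;
        }
        return unchangedPercentage(changedBefore, unchanged, changedAfter) < thresholdPercentage;
    }
}
```

Under these assumptions and the default values, one unchanged word between four and five changed words scores 10 and is treated as orphaned, while one unchanged word between two changed words on each side scores exactly 20 and, on a strict reading of "less than", is not.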
See Also: WordInfilter
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_ORPHANED_THRESHOLD_PERCENTAGE: the default value for orphanedThresholdPercentage, set at 20. |
static int | DEFAULT_ORPHANED_WORD_LIMIT: the default value for orphanedLengthLimit, set at 2. |
Constructor and Description |
---|
OrphanedWordOutfilter(): Creates a new instance of OrphanedWordOutfilter. |
Modifier and Type | Method and Description |
---|---|
void | setorphanedLengthLimit(java.lang.String orphanedLengthLimit): Sets the value for the maximum number of words that can be included in an orphaned section. |
void | setorphanedThresholdPercentage(java.lang.String orphanedThresholdPercentage): Sets the percentage value used when calculating orphaned sections. |
characters, endElement, isBufferModifiedOnly, setwrapUnchangedText, startElement, startPrefixMapping
endDocument, getClosestAttributeValueFromAncestor, getClosestAttributeValueFromAnyAncestor, getGrandParentLocalName, getParentLocalName, getParentQName, getProperty, hasAncestor, hasAncestor, hasAncestorWithAttr, hasAncestorWithAttrValue, hasAncestorWithAttrValues, outputCharacters, outputEndElement, outputStartElement, popAncestorStack, pushAncestorStack, setProperty, stackDepth, startDocument
attributeDecl, comment, elementDecl, endCDATA, endDTD, endEntity, externalEntityDecl, internalEntityDecl, parse, parse, startCDATA, startDTD, startEntity
endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, skippedEntity, unparsedEntityDecl, warning
public static final int DEFAULT_ORPHANED_WORD_LIMIT
public static final int DEFAULT_ORPHANED_THRESHOLD_PERCENTAGE
public OrphanedWordOutfilter()
Creates a new instance of OrphanedWordOutfilter.
public void setorphanedLengthLimit(java.lang.String orphanedLengthLimit) throws org.xml.sax.SAXException
Sets the value for the maximum number of words that can be included in an orphaned section.
Parameters:
orphanedLengthLimit - the limit to set, as a maximum number of words
Throws:
org.xml.sax.SAXException - is thrown when the supplied value is invalid
public void setorphanedThresholdPercentage(java.lang.String orphanedThresholdPercentage) throws org.xml.sax.SAXException
Sets the percentage value used when calculating orphaned sections. The threshold is compared against:
possibly_orphaned_words_count / (preceding_changed_words_count + possibly_orphaned_words_count + following_changed_words_count) * 100
Parameters:
orphanedThresholdPercentage - the percentage threshold that is used to decide whether words lying between two changed sections are indeed orphaned
Throws:
org.xml.sax.SAXException - is thrown when the supplied value is invalid
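Both setters take their value as a String and report an invalid value through a SAXException. The sketch below illustrates that contract with a hypothetical stand-in class; it is not the real filter. What counts as "invalid" is not specified here, so the checks shown (a parseable integer in the range 0 to 100) are assumptions.

```java
import org.xml.sax.SAXException;

/** Hypothetical stand-in that mimics the documented setter contract; not the real filter. */
final class PropertySetterSketch {

    // Starts at the documented default of 20.
    private int orphanedThresholdPercentage = 20;

    /** Parses the String value and rejects anything that is not an integer in 0..100 (assumed range). */
    void setorphanedThresholdPercentage(String value) throws SAXException {
        int parsed;
        try {
            parsed = Integer.parseInt(value.trim());
        } catch (NumberFormatException | NullPointerException e) {
            throw new SAXException("Invalid orphanedThresholdPercentage: " + value, e);
        }
        if (parsed < 0 || parsed > 100) {
            throw new SAXException("orphanedThresholdPercentage out of range: " + parsed);
        }
        orphanedThresholdPercentage = parsed;
    }

    int getOrphanedThresholdPercentage() {
        return orphanedThresholdPercentage;
    }
}
```

A caller that configures the filter from user-supplied settings would typically wrap each setter call in a try/catch for SAXException and fall back to the default value on failure.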