public class WordInfilter extends PCDataConcatenatingInfilter
This class provides a way of breaking or segmenting large sequences of text (such as paragraphs and sentences) into smaller segments or chunks so that change can be reported at finer granularity. It is designed to be used in combination with other pipeline filters as discussed in Word by Word Text Comparison.
A 'break iterator' is used to break XML text nodes into smaller nodes. This filter does not use the standard break iterator included in the Java runtime library. At the time of development, the break iterators supplied as part of the International Components for Unicode (ICU) project had a more active release schedule that responded to the needs of international users. We therefore include the icu4j libraries in the XML Compare product, monitor new releases, and propagate bug reports and fixes when supporting our customers.
The ICU break iterators are rule-driven state machines and are optionally dictionary based. This filter honours xml:lang attributes in XML files and uses a break iterator appropriate to the specified locale. We also provide mechanisms that allow users to add their own ICU-based break iterators, either for specific locales or for the entire file.
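To make the idea of word-level segmentation concrete, here is a minimal sketch of splitting one text node into segments at word boundaries. It is only an illustration: it uses the JDK's java.text.BreakIterator as a self-contained stand-in, whereas this filter uses the ICU iterators, and the class name WordSplitSketch and its split method are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSplitSketch {

    /** Splits text at the word boundaries reported for the given locale. */
    public static List<String> split(String text, Locale locale) {
        // Assumption: the JDK iterator stands in for the ICU one used by the filter.
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> segments = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; end = words.next()) {
            // Every span between two boundaries is kept, so no characters are lost.
            segments.add(text.substring(start, end));
            start = end;
        }
        return segments;
    }

    public static void main(String[] args) {
        // Each printed segment corresponds to one smaller text node.
        for (String segment : split("Hello, world!", Locale.ENGLISH)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Concatenating the segments always reproduces the original text, which is the property a comparison pipeline relies on when it later rejoins the pieces.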
Because ICU break iterators are expensive to create, we use a cache of iterators backed by a WeakHashMap. The first occurrence of a canonicalized xml:lang attribute in a file creates a new cache entry; when the locale is changed by a new xml:lang attribute, the cache is consulted to see whether an existing break iterator can be reused. The cache is empty when this class is constructed, and methods are provided to clear the cache explicitly or to clear it for each new document passing through a pipeline.
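The caching behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the filter's actual implementation: the class BreakIteratorCacheSketch, its method names, and the use of Locale.forLanguageTag for canonicalization are all hypothetical, and the JDK's java.text.BreakIterator again stands in for the ICU class.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Map;
import java.util.WeakHashMap;

class BreakIteratorCacheSketch {

    // Weak keys allow an entry to be dropped once its key is no longer referenced elsewhere.
    private final Map<String, BreakIterator> cache = new WeakHashMap<>();

    /** Assumed canonicalization: normalizes case so "FR-ca" and "fr-CA" share one key. */
    public static String canonicalize(String xmlLang) {
        return Locale.forLanguageTag(xmlLang).toLanguageTag();
    }

    /** Returns the cached iterator for the key, creating it only on first use. */
    public BreakIterator iteratorFor(String canonicalKey) {
        return cache.computeIfAbsent(canonicalKey,
                key -> BreakIterator.getWordInstance(Locale.forLanguageTag(key)));
    }

    /** Reports the keys currently held in the cache. */
    public String[] cachedLocales() {
        return cache.keySet().toArray(new String[0]);
    }

    /** Empties the cache. */
    public void resetCache() {
        cache.clear();
    }

    public static void main(String[] args) {
        BreakIteratorCacheSketch sketch = new BreakIteratorCacheSketch();
        String key = canonicalize("FR-ca");
        BreakIterator first = sketch.iteratorFor(key);
        BreakIterator second = sketch.iteratorFor(key);
        System.out.println(key + " reused: " + (first == second));
        sketch.resetCache();
        System.out.println("entries after reset: " + sketch.cachedLocales().length);
    }
}
```

Because the map holds its keys weakly, an entry is only guaranteed to survive while something else still references the key string, which is why the example keeps the key in a local variable.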
Constructor and Description |
---|
WordInfilter(): Construct a new WordInfilter. |
Modifier and Type | Method and Description |
---|---|
void | addBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi, com.ibm.icu.util.ULocale locale): Adds a user defined break iterator to those considered when selecting a new locale according to xml:lang settings. |
void | bufferedCharacters(char[] ch, int start, int length): Overrides the default bufferedCharacters method with break iterator specific splitting. |
void | endDocument(): Overrides the default endDocument method. |
java.lang.String[] | getCachedLocales(): Reports the locales used as keys in the break iterator cache. |
boolean | getCacheResetForNewDocument(): Reports the current cache reset setting. |
com.ibm.icu.text.RuleBasedBreakIterator | getDefaultBreakIterator(): Returns the current default Break Iterator. |
boolean | getIgnoreXMLLangAttributes(): Reports whether xml:lang attributes are currently ignored or considered when using BreakIterators. |
void | resetCache(): A reset or clear operation for the Break Iterator cache. |
void | setCacheResetForNewDocument(boolean resetNewDoc): Controls whether the cache of BreakIterators is cleared for each new document. |
void | setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi): Sets the break iterator to be used when xml:lang attributes do not specify a locale in the input XML. |
void | setIgnoreXMLLangAttributes(boolean ignore): Controls whether xml:lang attribute settings are used or ignored. |
void | startDocument(): Overrides the default startDocument method. |
characters, endElement, ignorableWhitespace, startElement, startPrefixMapping
getClosestAttributeValueFromAncestor, getClosestAttributeValueFromAnyAncestor, getGrandParentLocalName, getParentLocalName, getParentQName, getProperty, hasAncestor, hasAncestor, hasAncestorWithAttr, hasAncestorWithAttrValue, hasAncestorWithAttrValues, outputCharacters, outputEndElement, outputStartElement, popAncestorStack, pushAncestorStack, setProperty, stackDepth
attributeDecl, comment, elementDecl, endCDATA, endDTD, endEntity, externalEntityDecl, internalEntityDecl, parse, parse, startCDATA, startDTD, startEntity
endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, notationDecl, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, skippedEntity, unparsedEntityDecl, warning
public WordInfilter()
Construct a new WordInfilter.
The Break Iterator cache will be empty and the default BreakIterator will be determined from the JVM default locale.
public void setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi) throws java.lang.IllegalArgumentException
Sets the break iterator to be used when xml:lang attributes do not specify a locale in the input XML.
The default break iterator is used when no locale has been specified using xml:lang on an ancestor element, and also when the xml:lang attribute is set to the empty string, thereby overriding ancestor settings. It is also used when setIgnoreXMLLangAttributes has been called with a true value, so that xml:lang attributes are ignored.
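The selection rule in the previous paragraph can be sketched as a small pure function. This only illustrates the three cases described there; the class LangResolutionSketch, the USE_DEFAULT marker, and the list-based representation of ancestor xml:lang values are assumptions made for the example, not part of the WordInfilter API.

```java
import java.util.Arrays;
import java.util.List;

class LangResolutionSketch {

    /** Marker meaning "fall back to the default break iterator". */
    public static final String USE_DEFAULT = "(default)";

    /**
     * Decides which locale applies to some text.
     *
     * @param ancestorLangs xml:lang values from the innermost element outwards;
     *                      a null entry means the element has no xml:lang attribute
     * @param ignoreXmlLang true when xml:lang attributes are to be ignored
     */
    public static String resolve(List<String> ancestorLangs, boolean ignoreXmlLang) {
        if (ignoreXmlLang) {
            return USE_DEFAULT;
        }
        for (String lang : ancestorLangs) {
            if (lang == null) {
                continue; // no attribute on this element, so look further out
            }
            // The nearest attribute wins; an empty value cancels outer settings.
            return lang.isEmpty() ? USE_DEFAULT : lang;
        }
        return USE_DEFAULT; // no ancestor specified a locale
    }

    public static void main(String[] args) {
        System.out.println(resolve(Arrays.asList(null, "de"), false));
        System.out.println(resolve(Arrays.asList("", "de"), false));
        System.out.println(resolve(Arrays.asList("fr", "de"), true));
    }
}
```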
When this method is not used, the default break iterator is based on the default locale of the Java Virtual Machine (JVM). The JVM default locale is configured at JVM startup time and may be derived from system properties or from environment variables of the underlying operating system. For example, to start a JVM with the equivalent of the xml:lang="fr_CA" setting the following system properties would be used:
java -Duser.language=fr -Duser.country=CA ...
The operating system may also provide facilities to change locale settings, either on a per login session basis or at installation time. This typically involves environment variables such as LANG, LC_CTYPE or LC_ALL.
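To check which default locale a JVM has actually picked up, a small program such as the following can be run with and without the system properties shown above. It is a sketch using only the standard java.util.Locale API; the class name DefaultLocaleSketch and its helper method are hypothetical.

```java
import java.util.Locale;

class DefaultLocaleSketch {

    /** Returns the language tag of the JVM's current default locale. */
    public static String defaultLanguageTag() {
        return Locale.getDefault().toLanguageTag();
    }

    public static void main(String[] args) {
        // Try: java -Duser.language=fr -Duser.country=CA DefaultLocaleSketch.java
        System.out.println("JVM default locale: " + defaultLanguageTag());
    }
}
```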
Parameters:
rbbi - a break iterator to be used when no locale is specified by the XML.
Throws:
java.lang.IllegalArgumentException - if the supplied break iterator is null
public com.ibm.icu.text.RuleBasedBreakIterator getDefaultBreakIterator()
Returns the current default Break Iterator.
See Also:
WordInfilter.setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator)
public void setCacheResetForNewDocument(boolean resetNewDoc)
Controls whether the cache of BreakIterators is cleared for each new document.
When many similar documents are processed through a pipeline, keeping the BreakIterators for locales specified in previous documents in memory may improve performance. The cache is empty when the filter is initialized (normally when the pipeline is constructed). When this setting is true, the cache is cleared at the SAX startDocument event. It is false by default.
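The interaction between this setting and the SAX startDocument event can be sketched with a plain XMLFilterImpl subclass. This is an illustration of the described behaviour only, not the WordInfilter source: the class CacheResetFilterSketch, its remember and cacheSize helpers, and the use of a simple HashMap are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

class CacheResetFilterSketch extends XMLFilterImpl {

    // Placeholder cache; the values would be break iterators in a real filter.
    private final Map<String, Object> cache = new HashMap<>();
    private boolean resetForNewDocument = false; // same default as documented above

    public void setCacheResetForNewDocument(boolean reset) {
        resetForNewDocument = reset;
    }

    /** Hypothetical helper that records a locale as if it had been seen in a document. */
    public void remember(String locale) {
        cache.put(locale, new Object());
    }

    public int cacheSize() {
        return cache.size();
    }

    @Override
    public void startDocument() throws SAXException {
        if (resetForNewDocument) {
            cache.clear(); // a new document starts with an empty cache
        }
        super.startDocument();
    }

    /** Simulates a second document arriving after one locale has been cached. */
    public static int entriesAfterNextDocument(boolean reset) {
        CacheResetFilterSketch filter = new CacheResetFilterSketch();
        filter.setCacheResetForNewDocument(reset);
        filter.remember("fr-CA");
        try {
            filter.startDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return filter.cacheSize();
    }

    public static void main(String[] args) {
        System.out.println("entries kept when false:   " + entriesAfterNextDocument(false));
        System.out.println("entries kept when true:    " + entriesAfterNextDocument(true));
    }
}
```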
Parameters:
resetNewDoc - when true the cache is cleared for each new document; when false the cache is preserved between documents
See Also:
WordInfilter.getCacheResetForNewDocument(), WordInfilter.resetCache()
public boolean getCacheResetForNewDocument()
Reports the current cache reset setting.
See Also:
WordInfilter.setCacheResetForNewDocument(boolean)
public void resetCache()
A reset or clear operation for the Break Iterator cache.
This method calls the clear operation on the underlying WeakHashMap.
public java.lang.String[] getCachedLocales()
Reports the locales used as keys in the break iterator cache.
The keys correspond to the canonicalized values of the xml:lang attributes found in the XML file(s) that have been processed by the filter since the cache was last reset.
public void setIgnoreXMLLangAttributes(boolean ignore)
Controls whether xml:lang attribute settings are used or ignored.
When this setting is configured to true, xml:lang attributes in the document are ignored and the default Break Iterator is used for all of the content.
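Parameters:
ignore - when true xml:lang settings are ignored
public boolean getIgnoreXMLLangAttributes()
Reports whether xml:lang attributes are currently ignored or considered when using BreakIterators.
See Also:
WordInfilter.setIgnoreXMLLangAttributes(boolean)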
public void addBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi, com.ibm.icu.util.ULocale locale)
Adds a user defined break iterator to those considered when selecting a new locale according to xml:lang settings.
It is possible to use the BreakIterator.registerInstance(com.ibm.icu.text.BreakIterator, java.util.Locale, int) static method on the underlying ICU BreakIterator class. However, we expect to use RuleBasedBreakIterators (currently the only subtype of the abstract BreakIterator in the ICU 49.1 release), so this method has a more specific signature. This method also ensures that the registered iterator is a Word iterator.
Parameters:
rbbi - A user defined Break Iterator
locale - The locale which the iterator supports
See Also:
BreakIterator.registerInstance(com.ibm.icu.text.BreakIterator, java.util.Locale, int)
public void bufferedCharacters(char[] ch, int start, int length) throws org.xml.sax.SAXException
Overrides the default bufferedCharacters method with break iterator specific splitting.
Overrides:
bufferedCharacters in class PCDataConcatenatingInfilter
Parameters:
ch - an array of characters
start - the start point within the array
length - the length of the character segment
Throws:
org.xml.sax.SAXException - the superclass may throw an exception during processing
See Also:
PCDataConcatenatingInfilter.bufferedCharacters(char[], int, int)
public void endDocument() throws org.xml.sax.SAXException
Overrides the default endDocument method.
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class AncestorXMLFilter
Throws:
org.xml.sax.SAXException - the superclass may throw an exception during processing
See Also:
XMLFilterImpl.endDocument()
public void startDocument() throws org.xml.sax.SAXException
Overrides the default startDocument method.
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Overrides:
startDocument in class AncestorXMLFilter
Throws:
org.xml.sax.SAXException - the superclass may throw an exception during processing
See Also:
XMLFilterImpl.startDocument()