WordInfilter

java.lang.Object
- org.xml.sax.helpers.XMLFilterImpl
  - com.deltaxml.pipe.XMLFilterImpl2
    - com.deltaxml.pipe.AncestorXMLFilter
      - com.deltaxml.pipe.filters.PCDataConcatenatingInfilter
        - com.deltaxml.pipe.filters.dx2.wbw.WordInfilter

All Implemented Interfaces:

org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler, org.xml.sax.ext.DeclHandler, org.xml.sax.ext.LexicalHandler, org.xml.sax.XMLFilter, org.xml.sax.XMLReader

Direct Known Subclasses:

FormatAwareWordInfilter
```
public class WordInfilter
extends PCDataConcatenatingInfilter
```
This class breaks, or segments, large sequences of text (such as paragraphs and sentences) into smaller chunks so that changes can be reported at a finer granularity. It is designed to be used in combination with other pipeline filters, as discussed in Word by Word Text Comparison.

A 'break iterator' is used to break XML text nodes into smaller nodes. The filter does not use the standard break iterator included in the Java runtime library. At the time of development, the break iterators supplied by the International Components for Unicode (ICU) project had a more active release schedule that responded to the needs of international users. We therefore include the icu4j libraries in the XML Compare product, monitor new releases, and propagate bug reports and fixes when supporting our customers.
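
As a rough illustration of what word-level segmentation produces, the sketch below splits a string at word boundaries. It deliberately uses the JDK's `java.text.BreakIterator` as a stand-in so that it runs without extra libraries; the filter itself uses the ICU iterators, whose boundaries may differ. The class name `WordSegmentSketch` and the method `segment` are hypothetical and are not part of the XML Compare API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch: uses the JDK's java.text.BreakIterator, NOT the ICU
// iterator that WordInfilter uses internally.
class WordSegmentSketch {

    // Splits text at every word boundary reported by the iterator.
    // Runs of whitespace and punctuation come back as segments too.
    static List<String> segment(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> segments = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segment("The quick brown fox.", Locale.ENGLISH));
    }
}
```

Concatenating the returned segments reproduces the input text, so no characters are lost by the split.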

The ICU break iterators are rule-driven state machines and are optionally dictionary-based. This filter honors the xml:lang attributes used in XML files and selects a break iterator appropriate to the specified locale. We also provide mechanisms that allow users to add their own ICU-based break iterators, either for specific locales or for the entire file.

Because ICU break iterators are expensive to create, we use a cache of iterators backed by a WeakHashMap. The first occurrence of a canonicalized xml:lang attribute value in a file creates a new cache entry; when the locale is changed by a new xml:lang attribute, the cache is consulted to see whether an existing break iterator can be reused. The cache is empty when this class is constructed, and methods are provided to clear the cache explicitly or to clear it for each new document passing through a pipeline.
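
The caching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter's actual implementation: it keys the map by `java.util.Locale`, uses the JDK break iterator in place of the ICU one, and the names `BreakIteratorCacheSketch`, `forLocale`, `createdCount` and `reset` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical cache sketch; the real filter caches ICU iterators.
class BreakIteratorCacheSketch {
    // Entries may be dropped by the garbage collector once a key is
    // no longer strongly referenced elsewhere.
    private final Map<Locale, BreakIterator> cache = new WeakHashMap<>();
    private int created = 0;

    // Returns the cached iterator for the locale, creating it on first use.
    BreakIterator forLocale(Locale locale) {
        return cache.computeIfAbsent(locale, key -> {
            created++;
            return BreakIterator.getWordInstance(key);
        });
    }

    // Number of iterators constructed so far (cache misses).
    int createdCount() { return created; }

    int size() { return cache.size(); }

    // Explicit clear, analogous in spirit to resetCache().
    void reset() { cache.clear(); }
}
```

A second request for the same locale returns the same iterator instance without constructing a new one. Break iterators are stateful, so a cached instance should only be reused sequentially, not shared between concurrent callers.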

Since:

6.3

See Also:

Language Identification section of W3C XML Recommendation, Java SE Documentation - Internationalization Support, ICU Project Website

Constructor Summary

Constructors
Constructor and Description

WordInfilter()
Construct a new WordInfilter.

Method Summary

Modifier and Type	Method and Description
`void`	`addBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi, com.ibm.icu.util.ULocale locale)` Adds a user defined break iterator to those considered when selecting a new locale according to xml:lang settings.
`void`	`bufferedCharacters(char[] ch, int start, int length)` Overrides the default `bufferedCharacters` method with break iterator specific splitting.
`void`	`endDocument()` Overrides the default `endDocument` method.
`java.lang.String[]`	`getCachedLocales()` Reports the locales used as keys in the break iterator cache.
`boolean`	`getCacheResetForNewDocument()` Reports the current cache reset setting.
`com.ibm.icu.text.RuleBasedBreakIterator`	`getDefaultBreakIterator()` Returns the current default Break Iterator.
`boolean`	`getIgnoreXMLLangAttributes()` Reports whether xml:lang attributes are currently ignored or considered when using BreakIterators.
`void`	`resetCache()` A reset or clear operation for the Break Iterator cache.
`void`	`setCacheResetForNewDocument(boolean resetNewDoc)` Controls whether the cache of BreakIterators is cleared for each new document.
`void`	`setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi)` Sets the break iterator to be used when xml:lang attributes do not specify a locale in the input XML.
`void`	`setIgnoreXMLLangAttributes(boolean ignore)` Controls whether xml:lang attribute settings are used or ignored.
`void`	`startDocument()` Overrides the default `startDocument` method.

Methods inherited from class com.deltaxml.pipe.filters.PCDataConcatenatingInfilter
characters, endElement, ignorableWhitespace, startElement, startPrefixMapping

Methods inherited from class com.deltaxml.pipe.XMLFilterImpl2
attributeDecl, comment, elementDecl, endCDATA, endDTD, endEntity, externalEntityDecl, internalEntityDecl, parse, parse, startCDATA, startDTD, startEntity

Methods inherited from class org.xml.sax.helpers.XMLFilterImpl
endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, notationDecl, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, skippedEntity, unparsedEntityDecl, warning

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - WordInfilter
```
public WordInfilter()
```
    Construct a new WordInfilter.
    
    The Break Iterator cache will be empty and the default BreakIterator will be determined from the JVM default locale.
- Method Detail
  - setDefaultBreakIterator
```
public void setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi)
                             throws java.lang.IllegalArgumentException
```
    Sets the break iterator to be used when xml:lang attributes do not specify a locale in the input XML.
    
    The default break iterator is used when no locale has been specified using xml:lang on an ancestor element, and also when the xml:lang attribute is set to the empty string, thereby overriding ancestor settings. It is also used when setIgnoreXMLLangAttributes has been called with a true value, so that xml:lang attributes are ignored.
    
    When this method is not used, the default break iterator is based on the default locale of the Java Virtual Machine (JVM). The JVM default locale is configured at JVM startup time and may be derived from system properties or from environment variables of the underlying operating system. For example, to start a JVM with the equivalent of the xml:lang="fr-CA" setting, the following system properties would be used:
    java -Duser.language=fr -Duser.country=CA ...
    The operating system may also provide facilities to change locale settings, either on a per login session basis or at installation time; this typically involves environment variables such as LANG, LC_CTYPE or LC_ALL.
    
    Parameters:
    
    rbbi - a break iterator to be used when no locale is specified by the XML.
    
    Throws:
    
    java.lang.IllegalArgumentException - if the supplied break iterator is null
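
    To check which locale a given JVM would fall back to, a small sketch such as the following may help. It uses only JDK classes; `DefaultLocaleSketch` and its methods are hypothetical names, and the JDK break iterator stands in for the ICU one. Note that `Locale.toString()` uses an underscore (fr_CA) while the language tag form used by xml:lang uses a hyphen (fr-CA).

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper for inspecting the JVM default locale.
class DefaultLocaleSketch {

    // Language tag for a locale, in the hyphenated form used by xml:lang.
    static String languageTag(Locale locale) {
        return locale.toLanguageTag();
    }

    // Stand-in for the fallback: a word iterator for the JVM default locale.
    static BreakIterator defaultWordIterator() {
        return BreakIterator.getWordInstance(Locale.getDefault());
    }

    public static void main(String[] args) {
        // Try: java -Duser.language=fr -Duser.country=CA DefaultLocaleSketch
        System.out.println("JVM default locale: " + Locale.getDefault());
        System.out.println("As a language tag:  " + languageTag(Locale.getDefault()));
    }
}
```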
  - getDefaultBreakIterator
```
public com.ibm.icu.text.RuleBasedBreakIterator getDefaultBreakIterator()
```
    Returns the current default Break Iterator. This is either the iterator set using setDefaultBreakIterator or one determined from the default locale of the JVM.
    
    Returns:
    
    the current default break iterator
    
    See Also:
    
    WordInfilter.setDefaultBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator)
  - setCacheResetForNewDocument
```
public void setCacheResetForNewDocument(boolean resetNewDoc)
```
    Controls whether the cache of BreakIterators is cleared for each new document.
    
    When many similar documents are processed through a pipeline, there may be a performance benefit in keeping the BreakIterators for locales specified in previous documents in memory. The cache is empty when the filter is initialized (normally when the pipeline is constructed). When this setting is true, the cache is cleared at the SAX startDocument event. The setting is false by default.
    
    Parameters:
    
    resetNewDoc - when true, the cache is cleared for each new document; when false, the cache is preserved between documents
    
    See Also:
    
    WordInfilter.getCacheResetForNewDocument(), WordInfilter.resetCache()
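
    The interaction between this setting and the startDocument event can be sketched with a plain SAX filter. This is an assumption-laden illustration rather than the filter's code: `CacheResetFilterSketch`, `remember`, `cacheSize` and `entriesAfterNewDocument` are hypothetical, and an ordinary HashMap with placeholder values stands in for the iterator cache.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter showing a per-document cache reset.
class CacheResetFilterSketch extends XMLFilterImpl {
    private final Map<String, Object> cache = new HashMap<>();
    private boolean resetForNewDocument = false; // false by default, as documented

    void setCacheResetForNewDocument(boolean reset) { this.resetForNewDocument = reset; }

    void remember(String locale, Object iterator) { cache.put(locale, iterator); }

    int cacheSize() { return cache.size(); }

    @Override
    public void startDocument() throws SAXException {
        // Clear before forwarding the event so the new document starts clean.
        if (resetForNewDocument) {
            cache.clear();
        }
        super.startDocument();
    }

    // Simulates the start of a new document and reports how many
    // cache entries survive it.
    static int entriesAfterNewDocument(boolean reset) {
        CacheResetFilterSketch filter = new CacheResetFilterSketch();
        filter.setCacheResetForNewDocument(reset);
        filter.remember("fr-CA", new Object());
        try {
            filter.startDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return filter.cacheSize();
    }
}
```

    With the setting left at false the entry survives the new document; with true it is discarded.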
  - getCacheResetForNewDocument
```
public boolean getCacheResetForNewDocument()
```
    Reports the current cache reset setting.
    
    Returns:
    
    true if the cache is reset for each new document, false otherwise
    
    See Also:
    
    WordInfilter.setCacheResetForNewDocument(boolean)
  - resetCache
```
public void resetCache()
```
    A reset or clear operation for the Break Iterator cache.
    
    This method calls the clear operation on the underlying WeakHashMap.
    
    See Also:
    
    WordInfilter.setCacheResetForNewDocument(boolean)
  - getCachedLocales
```
public java.lang.String[] getCachedLocales()
```
    Reports the locales used as keys in the break iterator cache.
    
    The keys correspond to the canonicalized values of the xml:lang attributes found in the XML file(s) that have been processed by the filter since the cache was last reset.
    
    Returns:
    
    An array of strings corresponding to the locales supported by the cached break iterators
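
    The exact canonical form used for the keys is not specified here. As an illustration of the general idea, the sketch below normalizes raw xml:lang values with the JDK's `java.util.Locale`, so that values differing only in case collapse to one key. `CanonicalLangSketch` and `canonicalTags` are hypothetical names, and the filter's own canonicalization (performed with ICU) may differ in detail.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of canonicalizing xml:lang values.
class CanonicalLangSketch {

    // Returns the distinct canonical language tags, in first-seen order.
    static String[] canonicalTags(String... rawXmlLangValues) {
        Set<String> tags = new LinkedHashSet<>();
        for (String raw : rawXmlLangValues) {
            // forLanguageTag parses case-insensitively; toLanguageTag
            // emits the conventional casing (e.g. "en-US").
            tags.add(Locale.forLanguageTag(raw).toLanguageTag());
        }
        return tags.toArray(new String[0]);
    }
}
```

    For example, the inputs "en-US", "EN-us" and "fr" yield two keys, "en-US" and "fr".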
  - setIgnoreXMLLangAttributes
```
public void setIgnoreXMLLangAttributes(boolean ignore)
```
    Controls whether xml:lang attribute settings are used or ignored.
    
    When this setting is configured to true, xml:lang attributes in the document are ignored and the default Break Iterator is used for all of the content.
    
    Parameters:
    
    ignore - when true xml:lang settings are ignored
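
    The rules for when the default break iterator applies (no xml:lang on any ancestor, an empty xml:lang value overriding ancestors, or xml:lang being ignored) can be sketched as a small scope tracker. This is a hypothetical model, not the filter's code: `XmlLangScopeSketch` and its methods are invented for illustration, and an empty string result means "use the default break iterator".

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of xml:lang inheritance and the ignore setting.
class XmlLangScopeSketch {
    private final Deque<String> scopes = new ArrayDeque<>();
    private boolean ignoreXmlLang = false;

    void setIgnoreXmlLang(boolean ignore) { this.ignoreXmlLang = ignore; }

    // Call on each element start; pass null when the element has no
    // xml:lang attribute so that the parent's value is inherited.
    void startElement(String xmlLangOrNull) {
        String inherited = scopes.isEmpty() ? "" : scopes.peek();
        scopes.push(xmlLangOrNull != null ? xmlLangOrNull : inherited);
    }

    void endElement() { scopes.pop(); }

    // Language in force for the current element; "" means that the
    // default break iterator should be used.
    String effectiveLanguage() {
        if (ignoreXmlLang || scopes.isEmpty()) {
            return "";
        }
        return scopes.peek();
    }
}
```

    An element with xml:lang="" inside an English subtree reports the empty string, and its English parent is back in force once that element ends.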
  - getIgnoreXMLLangAttributes
```
public boolean getIgnoreXMLLangAttributes()
```
    Reports whether xml:lang attributes are currently ignored or considered when using BreakIterators.
    
    Returns:
    
    the current setting of the ignore xml:lang option
    
    See Also:
    
    WordInfilter.setIgnoreXMLLangAttributes(boolean)
  - addBreakIterator
```
public void addBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rbbi,
                             com.ibm.icu.util.ULocale locale)
```
    Adds a user defined break iterator to those considered when selecting a new locale according to xml:lang settings.
    
    It is possible to use the BreakIterator.registerInstance(com.ibm.icu.text.BreakIterator, java.util.Locale, int) static method on the underlying ICU BreakIterator class. However, we expect RuleBasedBreakIterators to be used (currently the only subtype of the abstract BreakIterator in the ICU 49.1 release), so this method has a more specific signature. This method also ensures that the registered iterator is a Word iterator.
    
    Parameters:
    
    rbbi - A user defined Break Iterator
    
    locale - The locale which the iterator supports
    
    See Also:
    
    BreakIterator.registerInstance(com.ibm.icu.text.BreakIterator, java.util.Locale, int)
  - bufferedCharacters
```
public void bufferedCharacters(char[] ch,
                               int start,
                               int length)
                        throws org.xml.sax.SAXException
```
    Overrides the default bufferedCharacters method with break iterator specific splitting.
    
    Overrides:
    
    bufferedCharacters in class PCDataConcatenatingInfilter
    
    Parameters:
    
    ch - an array of characters
    
    start - the start point within the array
    
    length - the length of the character segment
    
    Throws:
    
    org.xml.sax.SAXException - the superclass may throw an exception during processing
    
    See Also:
    
    PCDataConcatenatingInfilter.bufferedCharacters(char[], int, int)
  - endDocument
```
public void endDocument()
                 throws org.xml.sax.SAXException
```
    Overrides the default endDocument method.
    
    Specified by:
    
    endDocument in interface org.xml.sax.ContentHandler
    
    Overrides:
    
    endDocument in class AncestorXMLFilter
    
    Throws:
    
    org.xml.sax.SAXException - the superclass may throw an exception during processing
    
    See Also:
    
    XMLFilterImpl.endDocument()
  - startDocument
```
public void startDocument()
                   throws org.xml.sax.SAXException
```
    Overrides the default startDocument method.
    
    Specified by:
    
    startDocument in interface org.xml.sax.ContentHandler
    
    Overrides:
    
    startDocument in class AncestorXMLFilter
    
    Throws:
    
    org.xml.sax.SAXException - the superclass may throw an exception during processing
    
    See Also:
    
    XMLFilterImpl.startDocument()
