Text Analysis inside Lucene

Lucene (http://lucene.apache.org) is a well-known Information Retrieval (IR) library, implemented in Java, that lets you add powerful indexing and searching capabilities to your application.

Briefly, there are two steps in using Lucene. First, you “feed” it with text, which may come from plain text files or from compound documents, such as .pdf or .doc, after extracting their textual content. This process is called indexing, and it creates a data structure that allows fast random access to the words stored inside it. The concept is analogous to the index at the end of a book, which lets you quickly locate the pages that discuss certain topics. The second step is to actually use the previously created index, that is, to search for words and find the documents in which they appear. Lucene supports a wide range of queries, such as single-term and multi-term queries, phrase queries, and wildcards, along with result ranking and sorting.
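
To make these two steps concrete, here is a minimal sketch against the Lucene 2.3 API referenced below [2]. The index path, field name, and sample text are illustrative assumptions, not part of any real application:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class IndexAndSearchSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: indexing. Feed Lucene some text; the analyzer decides
        // how the text is broken into indexed terms.
        IndexWriter writer = new IndexWriter("/tmp/demo-index",   // hypothetical path
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("contents",
                "The quick brown fox jumped over the lazy dogs",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Step 2: searching. Parse a human-readable query and run it
        // against the index created above.
        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        Query query = new QueryParser("contents",
                new StandardAnalyzer()).parse("quick fox");
        Hits hits = searcher.search(query);
        System.out.println("Matching documents: " + hits.length());
        searcher.close();
    }
}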

Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation: terms. These terms are used to determine which documents match a query during searches. For example, if this sentence were indexed, the terms would begin with for and example, and so on, as separate terms in sequence. An analyzer is an encapsulation of the analysis process. An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into their basic form (lemmatization). This process is also called tokenization, and the chunks of text pulled from the stream are called tokens. Tokens, combined with their associated field name, are terms.

Analysis Implementation

The org.apache.lucene.analysis package defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Tokens. That is, it provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are three main classes in the package from which all analysis processes are derived. These are:

  • Analyzer – An Analyzer is responsible for building a TokenStream which can be consumed by the indexing and searching processes. Several implementations are provided, including WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer and the grammar-based StandardAnalyzer.
  • Tokenizer – A Tokenizer is a TokenStream and is responsible for breaking up incoming text into Tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
  • TokenFilter – A TokenFilter is also a TokenStream and is responsible for modifying Tokens that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.

A common usage style of Tokenizers and TokenFilters inside an Analyzer is a chaining pattern that lets you build complex analyzers from simple Tokenizer/TokenFilter building blocks. Tokenizers start the analysis process by demarcating the character input into tokens (mostly, these correspond to words in the original text). TokenFilters then take over the remainder of the analysis, initially wrapping the Tokenizer and then successively wrapping nested TokenFilters. For example, at the heart of StopAnalyzer the code looks like this:

public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(
            new LowerCaseTokenizer(reader),
            stopTable);
}

In StopAnalyzer, a LowerCaseTokenizer feeds a StopFilter. The LowerCaseTokenizer emits tokens that are runs of adjacent letters in the original text, lowercasing each character in the process. Non-letter characters form token boundaries and aren’t included in any emitted token. Following this word tokenizer and lowercaser, StopFilter removes any words that appear in a stop-word list (held in the stopTable reference).
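
The same chaining pattern extends to custom analyzers. Below is a hypothetical sketch (the class name is invented here) that wraps a WhitespaceTokenizer with two of Lucene’s stock filters, written in the same tokenStream() style as StopAnalyzer above:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The Tokenizer starts the chain: split the input on whitespace.
        TokenStream stream = new WhitespaceTokenizer(reader);
        // Each TokenFilter wraps the stream produced so far.
        stream = new LowerCaseFilter(stream);    // normalize case
        stream = new PorterStemFilter(stream);   // reduce words to root forms
        return stream;
    }
}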

Analysis Effects

Analysis occurs at two spots when using Lucene: during indexing and when using QueryParser (the class that parses human-readable query expressions into queries). The following example shows the output of the analysis process when using four built-in analyzer implementations on two phrases [1]:

Analyzing "The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]

Some key points to note for the example are as follows:

  • WhitespaceAnalyzer didn’t lowercase, left in the dash, and did the bare minimum of tokenizing at whitespace boundaries.
  • SimpleAnalyzer left in what may be considered irrelevant (stop) words, but it did lowercase and tokenize at nonalphabetic character boundaries.
  • Both SimpleAnalyzer and StopAnalyzer mangled the corporation name by splitting XY&Z and removing the ampersand.
  • StopAnalyzer and StandardAnalyzer threw away occurrences of the word the.
  • StandardAnalyzer kept the corporation name intact and lowercased it, removed the dash, and kept the e-mail address together. No other built-in analyzer is this thorough.
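
For reference, output in the bracketed format above can be produced with a small helper along the following lines. This is a sketch assuming the Lucene 2.3 token API, where TokenStream.next() returns a Token; the class name and field name are illustrative:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDemo {
    public static void displayTokens(Analyzer analyzer, String text)
            throws IOException {
        TokenStream stream =
                analyzer.tokenStream("contents", new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            // Print each emitted token surrounded by brackets.
            System.out.print("[" + token.termText() + "] ");
        }
        System.out.println();
    }
}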

Text analysis with Lucene can go further by looking inside tokens and accessing metadata such as offsets, position increments, and token types.
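
As a sketch of what that metadata looks like through the Lucene 2.3 Token API (startOffset(), endOffset(), getPositionIncrement(), and type() come from that API; the surrounding helper is illustrative), extending the displayTokens() idea above:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDetailsDemo {
    public static void displayTokenDetails(Analyzer analyzer, String text)
            throws IOException {
        TokenStream stream =
                analyzer.tokenStream("contents", new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText()
                    + " start=" + token.startOffset()            // offset of the first character
                    + " end=" + token.endOffset()                // offset one past the last character
                    + " posIncr=" + token.getPositionIncrement() // gap from the previous token
                    + " type=" + token.type());                  // e.g. "word" or "<ALPHANUM>"
        }
    }
}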

References:
[1] – “Lucene in Action”, Manning, 2004, http://www.manning.com/hatcher2/
[2] – Lucene 2.3.1 API Javadocs, http://lucene.apache.org/java/2_3_1/api/index.html
