web:search:lucene

About

Lucene ¹⁾ is a text search engine library.

The following applications are built on Lucene:

Solr
Elasticsearch
New Relic Logs
…

Structure

The text data model of Lucene is based on the following concepts ²⁾:

index,
document,
field
and term.

An index contains a sequence of documents.

A document is a sequence of fields.
A field is a named sequence of terms.
A term is a sequence of bytes. (The same sequence of bytes in two different fields is considered a different term. Thus terms are represented as a pair: the string naming the field, and the bytes within the field.)
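
The rule that a term is identified by the pair (field name, bytes) can be illustrated with a small plain-Java sketch. The class `Term` below is a hypothetical stand-in written for this page, not Lucene's own class; it only shows that the same bytes in two different fields compare as different terms.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TermIdentitySketch {

    /** Hypothetical stand-in for a term: the field name plus the bytes within the field. */
    static final class Term {
        final String field;
        final byte[] bytes;

        Term(String field, String text) {
            this.field = field;
            this.bytes = text.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Term)) {
                return false;
            }
            Term that = (Term) other;
            // Both the field name and the byte sequence must match.
            return field.equals(that.field) && Arrays.equals(bytes, that.bytes);
        }

        @Override
        public int hashCode() {
            return 31 * field.hashCode() + Arrays.hashCode(bytes);
        }
    }

    public static void main(String[] args) {
        Term inTitle = new Term("title", "go");
        Term inText = new Term("text", "go");
        Term inTextAgain = new Term("text", "go");

        System.out.println(inTitle.equals(inText));     // false: same bytes, different field
        System.out.println(inText.equals(inTextAgain)); // true: same field, same bytes
    }
}
```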

Document

A document is a basic unit of information that can be indexed.

For example, you can have a document for:

a single customer,
a single product,
a single order

Index

An index is a collection of documents that have somewhat similar characteristics.

Lucene's terms index falls into the family of indexes known as an inverted index because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.

For example, you can have an index for:

customer data,
product catalog,
order data.
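
The inverted-index idea described in this section (list, for a term, the documents that contain it) can be sketched in plain Java. This is a toy illustration, not how Lucene stores its index: the tokenization (lowercase, split on non-alphanumeric characters) and the names `build` and `lookup` are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    /** Builds term -> ids of the documents containing it; a document id is its position in the list. */
    static Map<String, SortedSet<Integer>> build(List<String> documents) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // Naive analysis (an assumption): lowercase, then split on non-alphanumeric characters.
            String[] terms = documents.get(docId).toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
                }
            }
        }
        return postings;
    }

    /** Lists the ids of the documents that contain the term (empty if none do). */
    static SortedSet<Integer> lookup(Map<String, SortedSet<Integer>> postings, String term) {
        SortedSet<Integer> empty = Collections.emptySortedSet();
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), empty);
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
            "The right way to go",
            "Do not go this way",
            "Abalone chowder");
        Map<String, SortedSet<Integer>> postings = build(documents);

        System.out.println(lookup(postings, "go"));      // [0, 1]
        System.out.println(lookup(postings, "chowder")); // [2]
    }
}
```

Going from a document to its terms only requires reading the document; the map above gives the inverse direction without scanning every document.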

Query

Lucene comes with a rich query language ³⁾.

Syntax:

[field:]expression

where:

field is the document field the expression applies to. It is optional and defaults to the query parser's default field (text in the examples below).

Cheat sheet:

Relation	Expression
equals	attribute:"value"
does not equal	attribute:-"value"
contains	attribute:value
does not contain	attribute:-value
starts with	attribute:value*
ends with	attribute:*value
has	has:attribute
missing	missing:attribute

Example:

Search for the term go in the field text (the default field):

text:go
# same as
go

Search for the phrase "The Right Way" in the field title and the term go in the field text (boolean operators such as AND must be written in uppercase):

title:"The Right Way" AND text:go
# same as
title:"The Right Way" AND go
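
How the optional field prefix of `[field:]expression` falls back to a default field can be sketched in plain Java. The helper `resolve` and the class `Clause` are hypothetical names invented for this illustration; this is not Lucene's query parser and it handles a single clause only (no boolean operators, wildcards, or escaping).

```java
public class FieldClauseSketch {

    /** A clause resolved to an explicit field and expression. */
    static final class Clause {
        final String field;
        final String expression;

        Clause(String field, String expression) {
            this.field = field;
            this.expression = expression;
        }

        @Override
        public String toString() {
            return field + ":" + expression;
        }
    }

    /** Splits "[field:]expression"; when no field prefix is present, defaultField is used. */
    static Clause resolve(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        int quote = clause.indexOf('"');
        // No usable colon, or the first colon sits inside a quoted phrase: no field prefix.
        if (colon <= 0 || (quote >= 0 && quote < colon)) {
            return new Clause(defaultField, clause);
        }
        return new Clause(clause.substring(0, colon), clause.substring(colon + 1));
    }

    public static void main(String[] args) {
        System.out.println(resolve("go", "text"));                      // text:go
        System.out.println(resolve("text:go", "text"));                 // text:go
        System.out.println(resolve("title:\"The Right Way\"", "text")); // title:"The Right Way"
    }
}
```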

Anatomy of a Lucene Application

To create a Lucene application, you should ⁴⁾:

Create Documents by adding Fields;
Create an IndexWriter and add documents to it with addDocument();
Call QueryParser.parse() to build a query from a string; and
Create an IndexSearcher and pass the query to its search() method.

Example:

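// Imports for the snippet below; assertEquals is assumed to come from JUnit 4.
import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.IOUtils;

Analyzer analyzer = new StandardAnalyzer();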

Path indexPath = Files.createTempDirectory("tempIndex");
Directory directory = FSDirectory.open(indexPath);
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
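    // Recent Lucene releases deprecate IndexSearcher.doc(int) in favor of isearcher.storedFields().document(int).
    Document hitDoc = isearcher.doc(hits[i].doc);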
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
IOUtils.rm(indexPath);

Example of how to index and query

Simple examples in the repository ⁵⁾ are:

IndexFiles.java, which creates an index for all the files contained in a directory,
SearchFiles.java, which queries and searches an index.

Usage:

java -cp lucene-core.jar:lucene-demo.jar:lucene-analysis-common.jar \
    org.apache.lucene.demo.IndexFiles \
    -index index \
    -docs your/directory/path

adding rec.food.recipes/soups/abalone-chowder
      [ ... ]

java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analysis-common.jar \
   org.apache.lucene.demo.SearchFiles

Query: chowder
Searching for: chowder
34 total matching documents
...

¹⁾ https://lucene.apache.org/
²⁾ Lucene File Format
³⁾ New Relic Query Syntax
⁴⁾ This example comes from the package index - minimal application.
⁵⁾ Demo documentation
