About
An inverted index is an index data structure that stores a mapping:
- from a token (content), such as a word or a number,
- to its locations (in a database file, a document or a set of documents).
In text search:
- A forward index maps the documents in a data set to the tokens they contain. This is the natural relationship, in which documents list their tokens (terms).
- An inverted index supports the inverse mapping: for a given term, it lists the documents that contain it.
Given a set of documents, an inverted index is a dictionary where each word is associated with a list of the document identifiers in which that word appears.
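The relationship between the two indexes can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the document identifiers, the tokens and the `invert` helper are all made up for the example.

```python
from collections import defaultdict

# Hypothetical forward index: document id -> tokens it contains.
forward_index = {
    "doc1": ["apple", "banana"],
    "doc2": ["banana", "cherry"],
    "doc3": ["apple", "cherry", "banana"],
}


def invert(forward):
    """Build token -> sorted list of document ids from a forward index."""
    inverted = defaultdict(set)
    for doc_id, tokens in forward.items():
        for token in tokens:
            inverted[token].add(doc_id)
    # Sorted lists keep the posting lists deterministic and easy to read.
    return {token: sorted(doc_ids) for token, doc_ids in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["banana"])  # ['doc1', 'doc2', 'doc3']
```

A set is used while building so that a token repeated inside one document still yields a single entry for that document.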
Procedure
- The tokens are stemmed.
- Stop words are not included.
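The two steps above could look roughly like the sketch below. It rests on loud assumptions: `STOP_WORDS` is a tiny hand-picked list and `naive_stem` is a toy suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

# Assumed, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"i", "for", "to", "we", "what", "should", "the", "a"}


def naive_stem(token):
    """Toy stemmer (assumption): strips 'ing' or a trailing 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("I love pancakes for breakfast"))
# ['love', 'pancake', 'breakfast']
```

The same `analyze` function would have to be applied to search queries as well, so that a query term is reduced to the same form as the indexed token.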
Asynchronous Update
The full-text index is updated asynchronously (crawled) rather than being maintained transactionally.
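One way to picture this is a write path that only queues the document, and a separate crawl step that catches the index up later. The `CrawledIndex` class and its method names are hypothetical and only meant to show the idea; a real engine would schedule the crawl in the background.

```python
from collections import defaultdict, deque


class CrawledIndex:
    """Toy index whose writes are queued and applied later by a crawl."""

    def __init__(self):
        self.documents = {}               # primary store, updated immediately
        self.pending = deque()            # ids waiting to be (re)indexed
        self.postings = defaultdict(set)  # token -> document ids

    def save(self, doc_id, text):
        # The write only touches the store and the queue, not the index.
        self.documents[doc_id] = text
        self.pending.append(doc_id)

    def crawl(self):
        # Run later (for example on a timer) to catch the index up.
        while self.pending:
            doc_id = self.pending.popleft()
            # Drop stale postings for this document before re-adding it.
            for doc_ids in self.postings.values():
                doc_ids.discard(doc_id)
            for token in self.documents[doc_id].lower().split():
                self.postings[token].add(doc_id)

    def search(self, token):
        return sorted(self.postings.get(token, set()))


index = CrawledIndex()
index.save("doc1", "I love pancakes")
print(index.search("pancakes"))  # [] because the crawl has not run yet
index.crawl()
print(index.search("pancakes"))  # ['doc1']
```

The visible consequence is that a freshly saved document is not searchable until the next crawl, which is the trade-off of not maintaining the index inside the write transaction.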
Example
The index can be built with a MapReduce job that needs no reduce function: grouping the emitted pairs by key already yields an (iterator|list) of the tweets for each token.
Input:
tweet1, ("I love pancakes for breakfast")
tweet2, ("I dislike pancakes")
tweet3, ("What should we eat for breakfast?")
tweet4, ("I love to eat")
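A minimal sketch of such a map-only job over these four tweets is shown here. The `map_tweet` and `shuffle` functions are hypothetical names: `shuffle` stands in for the grouping by key that a MapReduce framework performs between the map and reduce phases. No stop-word filtering is applied, so the resulting index also holds entries such as "i" and "for".

```python
from collections import defaultdict

tweets = [
    ("tweet1", "I love pancakes for breakfast"),
    ("tweet2", "I dislike pancakes"),
    ("tweet3", "What should we eat for breakfast?"),
    ("tweet4", "I love to eat"),
]


def map_tweet(tweet_id, text):
    """Map step: emit one (token, tweet_id) pair per distinct token."""
    tokens = {word.strip("?!.,").lower() for word in text.split()}
    for token in sorted(tokens):
        yield token, tweet_id


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    grouped = defaultdict(list)
    for token, tweet_id in pairs:
        grouped[token].append(tweet_id)
    return dict(grouped)


pairs = (pair for tweet_id, text in tweets for pair in map_tweet(tweet_id, text))
index = shuffle(pairs)

for token in ("pancakes", "breakfast", "eat", "love"):
    print(token, index[token])
```

Because the grouped values are already the list of tweets per token, an identity reduce (or none at all) is enough.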
Desired output:
"pancakes", (tweet1, tweet2)
"breakfast", (tweet1, tweet3)
"eat", (tweet3, tweet4)
"love", (tweet1, tweet4)
Application
Reduce running time of comparison
An inverted index avoids the quadratic running time of pairwise token comparisons. It maps each token in the data set to the list of documents that contain that token. So, instead of comparing each token to every other token, record by record, to see whether they match, the inverted index is used to look up the records that match on a particular token.
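The lookup described above can be sketched as follows. The records and the `candidate_pairs` helper are invented for the illustration; the point is only that pairs are generated from posting lists rather than from an all-against-all loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: record id -> set of tokens.
records = {
    "r1": {"love", "pancakes", "breakfast"},
    "r2": {"dislike", "pancakes"},
    "r3": {"eat", "breakfast"},
    "r4": {"quiet", "evening"},
}


def candidate_pairs(records):
    """Return the record pairs that share at least one token."""
    index = defaultdict(list)
    for record_id, tokens in records.items():
        for token in tokens:
            index[token].append(record_id)
    pairs = set()
    for record_ids in index.values():
        # Only records in the same posting list can match on this token.
        for a, b in combinations(sorted(record_ids), 2):
            pairs.add((a, b))
    return pairs


print(sorted(candidate_pairs(records)))  # [('r1', 'r2'), ('r1', 'r3')]
```

Here only two of the six possible record pairs are ever examined. The saving shrinks when a token is very common, since its posting list then contains most records, which is one reason stop words are usually excluded.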
Management
Exclude
To keep a web page out of a search index, the page needs to carry the noindex directive, either in an HTTP response header (X-Robots-Tag) or in a meta tag.
Example:
- No index for all bots
<meta name="robots" content="noindex">
- No index for Googlebot
<meta name="googlebot" content="noindex">
Include
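To include a web page in a search index, the page can carry the index directive in an HTTP response header or in a meta tag. Search engines such as Google treat index as the default, so the directive is optional.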
Example:
- Index for all bots
<meta name="robots" content="index">
- Index for Googlebot
<meta name="googlebot" content="index">
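From the crawler's side, honoring these meta tags could look like the sketch below, built on Python's standard `html.parser` module. The `RobotsMetaParser` class and `is_indexable` function are hypothetical names, the sketch only reads meta tags (not HTTP headers), and it assumes that a page without any directive may be indexed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and bot-specific tags."""

    def __init__(self, bot_name="googlebot"):
        super().__init__()
        self.names = {"robots", bot_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.names:
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html, bot_name="googlebot"):
    """Assumption: indexable unless a matching meta tag says noindex."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return "noindex" not in parser.directives


print(is_indexable('<meta name="robots" content="noindex">'))  # False
print(is_indexable('<meta name="robots" content="index">'))    # True
```

A tag addressed to a specific bot is ignored by other bots in this sketch: `is_indexable('<meta name="googlebot" content="noindex">', bot_name="bingbot")` returns `True`.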