NLP - (Word Stem|Stemming)


About

Stemming is the process of reducing words (for instance, search tokens) to their root, or “stem”, by removing word endings such as -s and -ing. Each word is then replaced with its stem.

Because indexed terms and query terms are reduced to the same stem, a search for one form of a word still yields documents that contain its other forms.

Example

For instance:

  • “bikes” is replaced with “bike”; the query “bike” can then find documents containing “bike” as well as those containing “bikes”.
  • “search”, “searching” and “searched” can all be reduced to the stem “search”.
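
The two cases above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the suffix list and the names naive_stem, build_index and search are hypothetical, and a real stemmer applies many more rules than stripping three endings.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical, deliberately naive suffix list (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(token: str) -> str:
    """Strip one common English ending, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each stem to the ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[naive_stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Stem the query the same way as the documents, then look it up."""
    return index.get(naive_stem(query), set())


docs = {
    1: "a red bike",
    2: "two bikes",
    3: "searching the web",
    4: "she searched twice",
}
index = build_index(docs)
print(sorted(search(index, "bike")))    # [1, 2]
print(sorted(search(index, "search")))  # [3, 4]
```

The point of the sketch is that the same stemming function is applied at index time and at query time; that is what lets “bike” match a document that only contains “bikes”.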

Algorithm

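Snowball is a widely used family of stemming algorithms; its English stemmer is also known as Porter2.

  • See the attribute filter StringToWordVector in Weka, which accepts a stemmer as an option.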
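
For readers who do not use Weka, a Snowball stemmer can also be tried from Python. The sketch below assumes the third-party NLTK library is installed; NLTK is not mentioned in this page and is used here only as a convenient stand-in. The helper name snowball_stems is hypothetical, and it returns None when NLTK is missing.

```python
def snowball_stems(words, language="english"):
    """Return the Snowball stem of each word, or None if NLTK is unavailable."""
    try:
        # Assumption: NLTK is installed (pip install nltk).
        from nltk.stem.snowball import SnowballStemmer
    except ImportError:
        return None
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(word) for word in words]


print(snowball_stems(["bikes", "search", "searching", "searched"]))
```

With NLTK available, the words from the example section should come out as the stems “bike” and “search”.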

