  • Apache Nutch: open source web crawler (Nutch can crawl pages and post them to Apache Solr for indexing and search)
  • Apache Tika: detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)
  • Lucene Core: text search engine library
  • Apache Solr: search platform from the Apache Lucene™ project
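
To make "index and search" concrete, here is a minimal sketch of an inverted index, the general idea behind text search libraries such as Lucene. It is plain Python and does not use or reproduce the Lucene or Solr APIs; the function names (`tokenize`, `build_index`, `search`) and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query token (AND)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical sample documents, keyed by an arbitrary id.
docs = {
    "nutch": "open source web crawler",
    "tika": "extracts metadata and text from many file types",
    "lucene": "text search engine library",
}
index = build_index(docs)
print(search(index, "text"))         # ['lucene', 'tika']
print(search(index, "text search"))  # ['lucene']
```

Real engines add much more on top of this (relevance scoring, stemming, on-disk index segments), so treat the sketch only as a picture of the lookup structure.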

| Library | Language | Open Source | Note |
|---|---|---|---|
| NLTK | Python | Yes | |
| Gensim | Python | Yes | |
| Elasticsearch (Index and Search) | Java | Apache 2 (based on Lucene) | Guide, Crate (query / SQL layer on top of Elasticsearch) |
| Solr (Index and Search) | Java | Apache 2 (based on Lucene) | Solr |
| Apache OpenNLP | Java | Yes | |
| Deeplearning4j | Java, Scala | Yes | |
| Weka | Java | GPL | |
| Stanford NLP | Java | GPL | Demo (Part of Speech, Named Entity Recognition, Coreference, Basic dependencies, Collapsed dependencies, Collapsed CC-processed dependencies) |
| LingPipe | Java | No | Topic Classification, Named Entity Recognition (NER), Sentiment Analysis, … |
| tm | R | Yes | |
| RWeka | R | Yes | rJava via JNI |
| openNLP | R | Yes | rJava via JNI |
| Tesseract | | | OCR |
| TweetNLP | Java | Yes | Tokenizer, part-of-speech tagger, hierarchical word clusters, and dependency parser for tweets |
| Smile | Java | LGPL | Statistical Machine Intelligence and Learning Engine |
