(Natural|Human) Language - Text (Mining|Analytics)


See What is Unstructured data? Unstructured data is also known as structure-later, schema-later or schema-on-read data. Examples of text sources:

  • Tweets
  • Web site comments
  • Weblogs
  • Forum comments

Each source calls for its own analysis: a tweet is analyzed differently from a long blog post, and a blog comment is analyzed differently from a tweet.

If you want to apply any machine learning method to natural language, you have to pre-process your data first. Depending on your problem, this could mean, for example:

  • stemming
  • lemmatization
  • computing n-gram statistics
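As a rough illustration of these pre-processing steps, here is a minimal sketch using only the Python standard library. The helper names (`tokenize`, `toy_stem`, `ngrams`) and the sample sentence are assumptions made for this example. The stemmer is a toy suffix stripper, not the Porter algorithm, and lemmatization is not shown because it needs a dictionary of word forms.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def toy_stem(token):
    """Strip a few common English suffixes (toy rule set, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence.
text = "The cats are chasing the cats that chased the dog"
tokens = tokenize(text)
stems = [toy_stem(t) for t in tokens]
bigram_counts = Counter(ngrams(stems, 2))
```

Here "chasing" and "chased" both reduce to the stem "chas", so the bigram ("chas", "the") is counted twice. A real stemmer or lemmatizer from an NLP library would normally replace `toy_stem`.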

Training a classifier is the last step.
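To make this last step concrete, the sketch below trains a small multinomial Naive Bayes classifier on token lists that are assumed to be pre-processed already. Naive Bayes is only one possible choice of classifier. The `NaiveBayes` class, the training documents and the labels are all hypothetical and exist only for this example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # number of documents per label
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.vocabulary = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log P(label) + sum of log P(token | label), smoothed with add-one
            score = math.log(doc_count / total_docs)
            label_total = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical, already pre-processed training data.
train_docs = [
    ["great", "phone", "love", "it"],
    ["love", "the", "battery"],
    ["terrible", "screen", "hate", "it"],
    ["hate", "the", "battery"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["love", "the", "screen"])
```

With this toy data the token "love" outweighs "screen", so `prediction` is "positive". In practice a library implementation and far more training data would be used.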

Natural language processing is a field of artificial intelligence concerned with the interactions between computers and human languages. Computers can be trained to model a language.

Analysis Types

Categories of extraction or analysis, in order of importance:

Text Mining

Topics and Theme

document classification


Named Entity

Named entity extraction and disambiguation:

  • people extraction
  • company extraction
  • geographic location (Town, …)
  • author extraction


Abstract groups of entities. Concept tagging.


  • author extraction
  • publication date
  • language detection
  • title
  • headers

Other entities

  • phone number
  • part/product
  • e-mail
  • street/address
  • keyword extraction
  • quotation extraction
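Some of the entities listed above, such as e-mail addresses and phone numbers, follow surface patterns that a regular expression can approximate. The sketch below is one simple way to do it. The two patterns, the `extract_entities` helper and the sample sentence are assumptions for illustration, and the patterns will miss or over-match many real-world formats.

```python
import re

# Deliberately simple patterns (assumptions for this example, not a standard).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", then digits mixed with spaces, dots, hyphens or parentheses,
# starting and ending with a digit (at least eight characters in total).
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def extract_entities(text):
    """Return e-mail addresses and phone-like strings found in the text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }

# Hypothetical sample sentence.
sample = "Contact jane.doe@example.com or call +1 555-010-9999 before Friday."
entities = extract_entities(sample)
```

Entities without a fixed surface form, such as people, companies or street addresses, usually need dictionaries or trained models instead of patterns like these.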

Information Extraction (IE)

Relationships and facts


  • web page cleaning
  • intent mining
  • clustering


Unstructured Information Management applications

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.


See Natural language processing


  • Apache Unstructured Information Management Architecture (UIMA). The major goal of UIMA is to transform unstructured information into structured information by orchestrating analysis engines that detect entities or relations, and thus to bridge the unstructured and the structured world. UIMA is, by itself, an empty framework: it provides the orchestration, not the analysis engines.
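The idea of orchestrating analysis engines can be sketched in a few lines of Python. This is not the UIMA API (UIMA itself is a Java framework). The `Document`, `Annotation`, `dictionary_annotator`, `works_for_annotator` and `run_pipeline` names are hypothetical and only illustrate how an empty pipeline gains behavior from the engines plugged into it.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str   # e.g. "person", "organization", "works-for"
    begin: int  # character offset where the span starts
    end: int    # character offset where the span ends
    text: str

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

def dictionary_annotator(kind, names):
    """Build an engine that tags every occurrence of a known name (toy gazetteer)."""
    def engine(doc):
        for name in names:
            for match in re.finditer(re.escape(name), doc.text):
                doc.annotations.append(Annotation(kind, match.start(), match.end(), name))
    return engine

def works_for_annotator(doc):
    """Tag a works-for relation when 'works for' sits between a person and an organization."""
    people = [a for a in doc.annotations if a.kind == "person"]
    orgs = [a for a in doc.annotations if a.kind == "organization"]
    for person in people:
        for org in orgs:
            between = doc.text[person.end:org.begin]
            if between.strip() == "works for":
                span = doc.text[person.begin:org.end]
                doc.annotations.append(Annotation("works-for", person.begin, org.end, span))

def run_pipeline(doc, engines):
    """Run each analysis engine in order; later engines see earlier annotations."""
    for engine in engines:
        engine(doc)
    return doc

# Hypothetical input text and name lists.
doc = run_pipeline(
    Document("Ada Lovelace works for Acme Corp in London."),
    [
        dictionary_annotator("person", ["Ada Lovelace"]),
        dictionary_annotator("organization", ["Acme Corp"]),
        works_for_annotator,
    ],
)
relations = [a.text for a in doc.annotations if a.kind == "works-for"]
```

The pipeline function knows nothing about persons or organizations. All the knowledge lives in the engines, which is the sense in which such a framework is empty by itself.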
