(Natural|Human) Language - Text (Mining|Analytics)
About
See What is Unstructured data? (also known as structure-later, schema-later, or schema on read). Examples of unstructured text:
- Tweets
- Web site comments
- Weblogs
- Forum comments
- …
A tweet is analyzed differently than a long blog post and a blog comment is analyzed differently than a tweet.
If you want to apply any machine learning method to natural language, you first have to pre-process your data. Depending on your problem, this could mean, for example:
- stemming,
- lemmatization,
- computing n-gram statistics.
Training a classifier is the last step.
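As an illustration of the pre-processing steps listed above, here is a minimal sketch using only the Python standard library. The helper names (`tokenize`, `toy_stem`, `ngrams`) are hypothetical, and `toy_stem` is a deliberately crude suffix stripper, not the Porter algorithm or any library's stemmer. Lemmatization is not shown because it needs a dictionary or a trained model.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def toy_stem(token):
    """Strip a few common English suffixes (illustrative only, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The cats are chasing the cat that chased the mice"
stems = [toy_stem(t) for t in tokenize(text)]

# n-gram statistics: how often each pair of adjacent stems occurs.
bigram_counts = Counter(ngrams(stems, 2))
```

Stemming maps "chasing" and "chased" to the same token here, so the bigram counts merge variants that raw tokens would keep apart. The resulting counts could then serve as features for a classifier.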
Natural language processing is a field of artificial intelligence concerned with the interactions between computers and human languages. Computers can be trained to model a language.
Analyses Type
Categories to extract or analyze, in order of importance:
Topics and Theme
- Topic categorization
- Topic classification (Bayesian classifier)
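A Bayesian topic classifier can be sketched as a multinomial naive Bayes model with Laplace (add-one) smoothing. The code below is a minimal illustration under that assumption. The function names and the tiny training set are invented for the example, and a real system would train on far more data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents and words per topic from (tokens, topic) pairs."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, topic in labeled_docs:
        topic_counts[topic] += 1
        word_counts[topic].update(tokens)
        vocabulary.update(tokens)
    return topic_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the topic with the highest log posterior probability."""
    topic_counts, word_counts, vocabulary = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, doc_count in topic_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[topic].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[topic][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical toy training data.
training = [
    ("the team won the match".split(), "sports"),
    ("the player scored a goal".split(), "sports"),
    ("the market fell sharply today".split(), "finance"),
    ("the bank raised interest rates".split(), "finance"),
]
model = train_naive_bayes(training)
predicted = classify("the player won".split(), model)  # "sports"
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.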
Sentiment
- Opinion
- Attitudes
- Perceptions
- Intent
Named Entity
Named entity extraction and disambiguation:
- people extraction
- company extraction
- geographic location (Town, …)
- author extraction
Concept
Abstract groups of entities. Concept tagging.
Metadata
- author extraction
- publication date
- language detection
- title
- headers
Other entities
- phone number
- part/product
- e-mail
- street/address
- keyword extraction
- quotation extraction
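Entities with a regular surface form, such as e-mail addresses and phone numbers, are often extracted with pattern matching. The sketch below assumes two simplified regular expressions. They are illustrative, will miss valid values and accept some invalid ones, and the names `EMAIL_RE`, `PHONE_RE` and `extract_contacts` are hypothetical.

```python
import re

# Simplified patterns: good enough for a demo, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def extract_contacts(text):
    """Return the e-mail addresses and phone numbers found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = "Contact jane.doe@example.com or call +1 555-010-9999 before Friday."
contacts = extract_contacts(sample)
# contacts["emails"] -> ["jane.doe@example.com"]
# contacts["phones"] -> ["+1 555-010-9999"]
```

Street addresses and product names are far less regular, so they usually need dictionaries or trained models rather than a single pattern.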
Information Extraction (IE)
Relationships and facts
- Relation extraction
- Collapsed dependencies
- Basic dependencies
Others
- Document Structure
- Part-of-Speech Tagging
- Sentence Detection
- Coreference
- Collapsed CC-processed dependencies
- Web page cleaning
- Intent mining
- Clustering
- …
Statistics
Unstructured Information Management applications
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
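The example in the paragraph above (plain text in, entities and relations out) can be sketched as a small pipeline of annotators that enrich a shared document object. This is only an illustration of the idea: the `Document` class, the gazetteer, and the relation patterns are hypothetical, and none of it is the API of an actual UIM framework.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    entities: list = field(default_factory=list)   # (kind, surface form)
    relations: list = field(default_factory=list)  # (relation, subject, object)

# Hypothetical gazetteer; real systems use large dictionaries or models.
GAZETTEER = {"Alice": "person", "Acme": "organization", "Paris": "place"}

# Hypothetical surface patterns for two relation types.
RELATION_PATTERNS = {
    "works-for": re.compile(r"(\w+) works for (\w+)"),
    "located-at": re.compile(r"(\w+) is located in (\w+)"),
}

def annotate_entities(doc):
    """Record every gazetteer entry that appears as a whole word."""
    for name, kind in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", doc.text):
            doc.entities.append((kind, name))
    return doc

def annotate_relations(doc):
    """Record pattern matches whose two arguments are known entities."""
    known = {name for _, name in doc.entities}
    for relation, pattern in RELATION_PATTERNS.items():
        for subject, obj in pattern.findall(doc.text):
            if subject in known and obj in known:
                doc.relations.append((relation, subject, obj))
    return doc

def run_pipeline(text, annotators):
    """Apply each annotator in order to a fresh document."""
    doc = Document(text)
    for annotate in annotators:
        doc = annotate(doc)
    return doc

doc = run_pipeline(
    "Alice works for Acme. Acme is located in Paris.",
    [annotate_entities, annotate_relations],
)
# doc.relations -> [("works-for", "Alice", "Acme"), ("located-at", "Acme", "Paris")]
```

The order of the annotators matters: relation extraction relies on the entities recorded by the earlier step.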
Library
Application
- Apache Unstructured Information Management Architecture (UIMA). The major goal of UIMA is to transform unstructured information into structured information by orchestrating analysis engines that detect entities or relations, and thus to build a bridge between the unstructured and the structured world. UIMA is, by itself, an empty framework.