Natural Language - Crawler


A crawler is an application (bot) that reads documents (such as web pages, Word files, ...) and parses them to extract meaningful information.

More generally, a crawler is software that scans large bodies of text, such as collections of web pages, to find occurrences of words, phrases, or other patterns.
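As a minimal sketch of this kind of scanning, the Python snippet below searches a small in-memory collection of texts for a pattern and reports where each occurrence was found. The `scan` function, the `corpus` dictionary, and the page names are hypothetical and exist only for illustration; a real crawler would first fetch or read the documents.

```python
import re
from typing import Dict, List, Tuple


def scan(documents: Dict[str, str], pattern: str) -> List[Tuple[str, int, str]]:
    """Return (document name, character offset, matched text) for every match."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        for match in regex.finditer(text):
            hits.append((name, match.start(), match.group(0)))
    return hits


# Hypothetical corpus standing in for a collection of web pages.
corpus = {
    "page1": "A crawler reads a document. The crawler parses it.",
    "page2": "Web crawlers scan the web.",
}

for hit in scan(corpus, r"\bcrawlers?\b"):
    print(hit)
```

The regular expression stands in for "words, phrases or other patterns"; any pattern accepted by Python's `re` module could be passed instead.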

Such scanners are typically implemented as finite automata.
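To illustrate the idea, here is a small sketch of a deterministic finite automaton that recognizes a single word in a stream of characters. State `i` means "the last `i` characters read match the first `i` characters of the word", and the final state signals an occurrence. The function names `build_automaton` and `find_occurrences` are assumptions made for this example, and the table construction favors readability over speed.

```python
from typing import Dict, List


def build_automaton(word: str) -> List[Dict[str, int]]:
    """Build the transition table of a DFA that recognizes `word`."""
    if not word:
        raise ValueError("word must not be empty")
    alphabet = set(word)
    table: List[Dict[str, int]] = [dict() for _ in range(len(word) + 1)]
    for state in range(len(word) + 1):
        for ch in alphabet:
            # Next state = length of the longest prefix of `word`
            # that is also a suffix of what has been read so far.
            candidate = word[:state] + ch
            k = min(len(word), len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            table[state][ch] = k
    return table


def find_occurrences(text: str, word: str) -> List[int]:
    """Return the start offset of every (possibly overlapping) occurrence."""
    table = build_automaton(word)
    accept = len(word)
    state = 0
    positions = []
    for i, ch in enumerate(text):
        # Characters that never appear in the word send the automaton
        # back to the start state.
        state = table[state].get(ch, 0)
        if state == accept:
            positions.append(i - len(word) + 1)
    return positions


print(find_occurrences("the crawler crawls", "crawl"))  # [4, 12]
```

The text is read once, one character at a time, and the automaton only remembers its current state, which is why this approach suits scanning large bodies of text.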


The best-known crawlers are web crawlers.
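In a web context, the "read and parse" step can be sketched with Python's standard `html.parser` module. The example below extracts the title and the link targets from an HTML string. The `PageExtractor` class and the sample markup are hypothetical, and the page is hard-coded rather than fetched over the network.

```python
from html.parser import HTMLParser
from typing import List


class PageExtractor(HTMLParser):
    """Collect the page title and the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page content, used instead of a real HTTP request.
html = (
    "<html><head><title>Crawler</title></head><body>"
    "<a href='/automata'>Automata</a> <a href='/bot'>Bot</a>"
    "</body></html>"
)

extractor = PageExtractor()
extractor.feed(html)
extractor.close()
print(extractor.title, extractor.links)
```

The extracted links are what a web crawler would typically consider as candidates for the next documents to read.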

Recommended Pages
Automata - Finite Automata

A finite automaton is an automaton that has a set of states and its control moves from state to state in response to external inputs. It has a start and an end state and there are only a finite number...
Data Mining - Content Analysis and Acquisition

Apache Tika (content analysis toolkit) - The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents (PDF, OpenOffice, Word, ...) using existing parser...
Data Quality - Verification with an external directory website (Scrapping)

Data quality processes often must use an external database to validate the data. This is often the case with address cleaning. And what better tool than all the data that you can find on the...
Search Engine - Bot

A search engine bot is a bot that crawls the web to build an index queried by the search engine. Bing (IP range) ...
Search Engine - Search Index - (Postings|Inverted) (Index|File) - Natural Language Processing

An inverted index is an index data structure that stores a mapping from a token (content), such as a word or a number, to its locations (in a database file, a document, or a set of documents). In text search,...
Security - Abuse Detection

Abuse detection mechanisms are generally based on: rate limiting; behavioral analysis or machine learning, i.e. scoring every request by how different it is from the baseline (a sort of bot score)...
Web - Headless browser (Test automation)

A headless browser is an application/library that emulates a web browser but without a graphical user interface, i.e. (without DOM / without the Web API). They are the basis for building a web bot. Build...
Web - Robots (Wanderers | Crawlers | Spiders)

This page is in a web context. Web robots (also known as web wanderers, crawlers, or spiders) are crawler programs that scan the web, generally in order to: create a search engine. See or seo...
Web Robot - Crawler

A web crawler is a crawler application that reads web resources (mostly web pages) and parses them to extract meaningful information. A crawl cycle consists of 4 steps: selecting the URLs to fetch...
