Apache Tika (content analysis toolkit) - The Apache Tika™ toolkit detects and extracts metadata and structured text content from various document formats (PDF, OpenOffice, Word, …) using existing parser libraries.
Text mining
Natural Language - Crawler