Lexalytics - Entity extraction

About

Entity Extraction in Lexalytics

How is it implemented

Part of speech patterns

To recognize company names, Salience looks for words or phrases that fit certain part-of-speech patterns. At the lowest level, Salience tags individual words and phrases with parts of speech based on a tagging model.
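The idea of matching part-of-speech patterns can be sketched in a few lines of Python. This is only an illustration, not Salience's implementation: the tokens are tagged by hand with Penn Treebank style tags (an assumption), and the single pattern used here, "a run of consecutive proper nouns", is a hypothetical stand-in for whatever patterns Salience really applies. The function name find_proper_noun_runs is made up for this example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def find_proper_noun_runs(tagged: List[Token]) -> List[str]:
    """Return maximal runs of consecutive NNP-tagged words as candidate names."""
    candidates: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        elif current:
            # A non-NNP token closes the run collected so far.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


# Hand-tagged sample sentence (tags are assumed, not produced by Salience).
tagged_sentence = [
    ("I", "PRP"), ("bought", "VBD"), ("a", "DT"), ("laptop", "NN"),
    ("at", "IN"), ("Best", "NNP"), ("Buy", "NNP"), ("yesterday", "NN"),
]

print(find_proper_noun_runs(tagged_sentence))  # ['Best Buy']
```

A pattern like this only proposes candidates. Deciding whether a candidate is a company, a person, or something else is left to the model and the white list described in the following sections.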

Model

Additionally, it uses a statistical model trained on a large dataset annotated by humans.

Out of the box, Salience performs most entity extraction via a model which has been trained to detect major items such as the names of companies, people, places, and products.

It does not perform an exhaustive lookup through massive internal lists of every possible company in the world, for example. Instead, it relies on the model to deduce that a certain entity is a company, a person, etc. This works reasonably well if you have a large stream of content and you don’t know what you’re looking for.

White list

Non-case sensitivity

Finally, Salience leverages a white list of company names to give it extra clues.

Example:

  • white list CDL file mycompanies.cdl in data\user\salience\entities\companies:
Best Buy<tab>Best Buy

The two instances of Best Buy are tab separated – replace the <tab> with an actual tab. The line tells Salience to look for Best Buy, treat it as a company, and to normalize it to Best Buy.
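To make the file format concrete, here is a minimal Python sketch that writes the entry above with a real tab character and reads it back. It says nothing about how Salience itself parses CDL files: load_cdl is a hypothetical helper, the file is written to a temporary directory rather than the real data directory, and lowercasing the lookup key is just one assumed way to model a non-case-sensitive match.

```python
import os
import tempfile


def load_cdl(path):
    """Parse a tab-delimited list into {lowercased term: normalized form}."""
    entries = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            term, _, normalized = line.partition("\t")
            # Fall back to the term itself if no normalized form is given.
            entries[term.lower()] = normalized or term
    return entries


# "\t" is the actual tab that stands in for <tab> in the example entry.
cdl_path = os.path.join(tempfile.mkdtemp(), "mycompanies.cdl")
with open(cdl_path, "w", encoding="utf-8") as handle:
    handle.write("Best Buy\tBest Buy\n")

companies = load_cdl(cdl_path)
print(companies.get("best buy"))          # Best Buy
print(companies.get("BEST BUY".lower()))  # Best Buy
```

Writing the file from code is one way to avoid editors that silently turn tabs into spaces.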

Case sensitivity

CDL files that need case sensitivity use the .scdl extension instead (sensitive CDL). You’ll also need to create a new rules.ptn file in data\user\salience\entities\companies if you don’t already have one. In the new rules.ptn, enter this line, replacing the word <tab> with an actual tab.

**<tab>*.scdl<tab>label="Company",call("score.dat"),
score = 100.0, hashset(label, "*.scdl", 1, mention,false),
hashset(normalized, "*.scdl", 2, mention, false)

This is pretty daunting looking, I’ll admit. Let’s step through the pieces:

  • ** tells Salience to consider a series of words of any length
  • *.scdl tells Salience to look in files with a .scdl extension
  • label="Company" tells Salience to label these all as companies
  • call("score.dat") tells Salience to look in the file score.dat for any additional instructions
  • score=100.0 tells Salience to set the score to 100% for anything in the scdl file
  • The hashset operator tells Salience to pull in all .scdl files and treat the left hand side of entries as the label and the right hand side as the normalized form.

Note that we did not specify case sensitivity. That’s because by default, case sensitivity is on in pattern files.
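Since both the CDL entry and the pattern rule ask you to replace <tab> with an actual tab, a tiny helper can do the substitution for you. The sketch below is an assumption-laden convenience, not part of Salience: expand_tab_placeholders is a hypothetical name, and the rule text is copied from above with its line breaks kept as shown on this page (whether the engine expects it on a single physical line is not something this sketch decides).

```python
# Rule text as shown above, with the literal <tab> placeholders still in place.
RULE_TEMPLATE = (
    '**<tab>*.scdl<tab>label="Company",call("score.dat"),\n'
    'score = 100.0, hashset(label, "*.scdl", 1, mention,false),\n'
    'hashset(normalized, "*.scdl", 2, mention, false)\n'
)


def expand_tab_placeholders(text: str) -> str:
    """Replace every literal "<tab>" marker with a real tab character."""
    return text.replace("<tab>", "\t")


rule = expand_tab_placeholders(RULE_TEMPLATE)

# repr() makes the tab characters visible as \t when checking the result.
print(repr(rule.splitlines()[0]))
```

The resulting text can then be written to rules.ptn with an ordinary file write, which sidesteps editors that convert tabs to spaces.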

Customization

Word List

The data directory provided with Salience Engine provides numerous endpoints for tweaking and tuning your results. In the area of entity extraction, the items you’re most likely to work with are:

These are simple tab-delimited files that can be placed within your user directories to augment the model-based entity extraction.

Pattern

Pattern files regexp ?

Model

Tools exist for the development of custom entity extraction models.

Documentation / Reference

