Entity Extraction in Lexalytics
How it is implemented
Part of speech patterns
To recognize company names, Salience looks for words or phrases that fit certain part-of-speech patterns. At the lowest level, Salience tags individual words and phrases with parts of speech based on a tagging model.
Additionally, it uses a statistical model trained on a large dataset annotated by humans.
Out of the box, Salience performs most entity extraction via a model which has been trained to detect major items such as the names of companies, people, places, and products.
It does not perform an exhaustive lookup through massive internal lists of every possible company in the world, for example. Instead, it relies on the model to deduce that a certain entity is a company, a person, etc. This works reasonably well if you have a large stream of content and you don’t know what you’re looking for.
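The part-of-speech pattern idea described above can be illustrated with a minimal sketch. This is not Salience's actual implementation: the tokens are hand-tagged, the Penn Treebank tag names (NNP, NNPS) are an assumption, and the function name proper_noun_runs is hypothetical. It only shows how a run of proper nouns can be collected as a candidate entity without consulting any list of known companies.

```python
from typing import List, Tuple

# Hand-tagged tokens stand in for the output of a real part-of-speech tagger.
TaggedToken = Tuple[str, str]


def proper_noun_runs(tagged: List[TaggedToken]) -> List[str]:
    """Return each maximal run of consecutive proper-noun tokens as one phrase."""
    phrases: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS"):
            # Still inside a proper-noun run: keep collecting words.
            current.append(word)
        elif current:
            # The run just ended: emit it as one candidate phrase.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


sentence = [
    ("Best", "NNP"), ("Buy", "NNP"), ("opened", "VBD"), ("a", "DT"),
    ("store", "NN"), ("in", "IN"), ("Boston", "NNP"), (".", "."),
]
candidates = proper_noun_runs(sentence)
print(candidates)
```

A pattern like this only proposes candidates. Deciding whether a candidate is a company, a person, or a place is the job the text assigns to the trained model.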
Finally, Salience leverages a white list of company names to give it extra clues.
- White list CDL file: mycompanies.cdl in data\user\salience\entities\companies.
Best Buy<tab>Best Buy
The two instances of Best Buy are tab separated; replace the <tab> with an actual tab. The line tells Salience to look for Best Buy, treat it as a company, and normalize it to Best Buy.
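Because a missing tab is easy to overlook, it can help to check an entry file before dropping it into the data directory. The following is a minimal sketch, assuming the file contains only two-column lines of the form shown above. The helper name parse_cdl is hypothetical and is not part of Salience.

```python
from typing import Dict


def parse_cdl(text: str) -> Dict[str, str]:
    """Map each mention (left column) to its normalized form (right column)."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        mention, sep, normalized = line.partition("\t")
        if not sep:
            # No real tab was found, e.g. spaces or a literal "<tab>" were left in.
            raise ValueError(f"no tab separator in line: {line!r}")
        entries[mention.strip()] = normalized.strip()
    return entries


# "\t" is a real tab character, which is what the file must contain.
cdl_text = "Best Buy\tBest Buy\n"
entries = parse_cdl(cdl_text)
print(entries)
```

A line that still contains spaces or the literal text <tab> instead of a tab character raises ValueError, which catches the most likely editing mistake.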
CDL files that need case sensitivity use the .scdl extension instead (sensitive CDL), and you’ll need to create a new rules.ptn file in data\user\salience\entities\companies if you don’t already have one. In the new rules.ptn, enter this line, replacing the word <tab> with an actual tab.
**<tab>*.scdl<tab>label="Company", call("score.dat"), score = 100.0, hashset(label, "*.scdl", 1, mention, false), hashset(normalized, "*.scdl", 2, mention, false)
This is pretty daunting looking, I’ll admit. Let’s step through the pieces:
- ** tells Salience to consider a series of words of any length
- *.scdl tells Salience to look in files with a .scdl extension
- label="Company" tells Salience to label these all as companies
- call("score.dat") tells Salience to look in the file score.dat for any additional instructions
- score = 100.0 tells Salience to set the score to 100% for anything in the scdl file
- The hashset operator tells Salience to pull in all .scdl files and treat the left-hand side of entries as the label and the right-hand side as the normalized form.
Note that we did not specify case sensitivity. That’s because by default, case sensitivity is on in pattern files.
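Since the rule line only works with real tab characters, one way to avoid hand-editing mistakes is to generate it. The sketch below is an illustration, not a Salience tool: it takes the rule text from above (written with straight ASCII quotes), swaps each literal <tab> placeholder for a tab, and splits the result to confirm it has the three tab-separated fields the walkthrough describes. The names RULE_TEMPLATE and expand_tabs are hypothetical.

```python
# Rule text from the walkthrough, with <tab> still written out as a placeholder.
RULE_TEMPLATE = (
    '**<tab>*.scdl<tab>label="Company", call("score.dat"), score = 100.0, '
    'hashset(label, "*.scdl", 1, mention, false), '
    'hashset(normalized, "*.scdl", 2, mention, false)'
)


def expand_tabs(template: str) -> str:
    """Replace each literal <tab> placeholder with a real tab character."""
    return template.replace("<tab>", "\t")


rule_line = expand_tabs(RULE_TEMPLATE)

# Splitting on tabs should give: word pattern, file pattern, rule actions.
fields = rule_line.split("\t")
print(len(fields))
print(fields[0], fields[1])
```

The resulting rule_line is what would be saved as a line of rules.ptn. Writing the file is left out here so the sketch does not touch the data directory.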
The data directory provided with Salience Engine provides numerous endpoints for tweaking and tuning your results. In the area of entity extraction, the items you’re most likely to work with are:
- and normalization files.
These are simple tab-delimited files that can be placed within your user directories to augment the model-based entity extraction.
Pattern files regexp ?
Tools exist for the development of custom entity extraction models.