
Statistics - (Base rate fallacy|Bonferroni's principle)

About

Even an accurate (model|test) can be useless as a detection tool if the case it looks for is sufficiently rare in the general population.

In that situation, the model will produce far too many false positives (and still miss some cases as false negatives).

A rare event will always occur somewhere if the population is big enough.

If the model returns more events than you would expect in the actual population, you can assume that most of the reported events are bogus (false positives).

Conversely, if the data behaves randomly, then with enough data you will always discover some event or pattern that looks suspicious (i.e., non-random).
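As a minimal formalization (not in the original note), Bayes' rule makes this explicit: if the event has prevalence <math>p</math> and the test is correct with probability <math>a</math> on both positives and negatives, the probability that a flagged case is real is

<MATH> P(\text{real} \mid \text{alarm}) = \frac{a \, p}{a \, p + (1 - a)(1 - p)} </MATH>

which collapses towards zero as <math>p</math> gets small, no matter how close <math>a</math> is to 1. The base rate fallacy is to ignore <math>p</math>.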

Example

Face Recognition

Face Recognition as an identification mechanism

The idea

Automatic face-scanning systems for airports and other public gathering places like sports stadiums.

The idea is to put cameras at security checkpoints and have automatic face-recognition software continuously scan the crowd for suspected terrorists. When the software identifies a suspect, it alerts the authorities, who swoop down and arrest the miscreant.

Accuracy

Assume that some hypothetical face-scanning software is magically effective (much better than is possible today)—99.9 percent accurate.

That is: if the system scans a known terrorist, it will raise an alarm 99.9 percent of the time, and if it scans an ordinary attendee, it will correctly stay silent 99.9 percent of the time.

In other words, the defensive-failure rate and the usage-failure rate are both 0.1 percent.
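In symbols (a restatement of the same assumption):

<MATH> P(\text{alarm} \mid \text{terrorist}) = 0.999 \qquad P(\text{alarm} \mid \text{non-terrorist}) = 0.001 </MATH>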

Evaluation

Assume additionally that 1 in 10 million stadium attendees, on average, is a known terrorist. (This system won’t catch any unknown terrorists who are not in the photo database.)

Despite the high (99.9 percent) level of accuracy, because of the very small percentage of terrorists in the general population of stadium attendees, the hypothetical system will generate 10,000 false alarms for every one real terrorist.
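Worked out for a crowd of 10 million attendees, using the assumed 0.1 percent failure rates:

<MATH> 10^{7} \times 0.001 = 10^{4} \text{ false alarms} \qquad 1 \times 0.999 \approx 1 \text{ real alarm} </MATH>

So the chance that any single alarm points at a real terrorist is only about 1 in 10,000.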

Conclusion

That kind of usage-failure rate renders such a system almost worthless.

The guards who use this system will rapidly learn that it is always wrong and that every alarm from the face-scanning system is a false alarm. Eventually they'll just ignore it. When a real terrorist is flagged by the system, they'll be likely to treat it as just another false alarm.

Rhine Paradox

Joseph Rhine was a parapsychologist who, in the 1950s, ran the following experiment: subjects had to guess whether each of 10 hidden cards was red or blue. Rhine found that about 1 person in 1,000 had "extrasensory perception": they correctly guessed the color of all 10 cards. He then called the "psychic" subjects back and had them repeat the test, and they all failed. Rhine concluded that the act of telling psychics that they have psychic abilities causes them to lose those abilities.

This is the wrong conclusion, because probability alone tells us that a subject guessing at random has a <math>(1/2)^{10} = 1/1024</math> chance of getting all 10 cards right. In other words, roughly 1 person in 1,000 will pass the first test by pure luck, and almost none of them will pass it a second time.
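As an illustrative sketch (the subject count and random seed below are arbitrary assumptions, not part of Rhine's experiment), a short simulation shows that the first-round "psychics" are just the survivors of chance and fail on the retest:

<code python>
import random

random.seed(42)
N_SUBJECTS = 100000   # hypothetical number of subjects
N_CARDS = 10

def guesses_all_cards(n_cards):
    # A random guesser is right on each card with probability 1/2.
    return all(random.random() < 0.5 for _ in range(n_cards))

# First round: roughly N_SUBJECTS / 1024 subjects pass by pure luck.
psychics = [s for s in range(N_SUBJECTS) if guesses_all_cards(N_CARDS)]
print("First round:", len(psychics), "'psychics' out of", N_SUBJECTS)

# Second round: each "psychic" still has only a 1/1024 chance, so almost all fail.
repeat_passes = sum(guesses_all_cards(N_CARDS) for _ in psychics)
print("Second round:", repeat_passes, "of them pass again")
</code>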

Evil-doers in Hotel

The problem

A certain group of evil-doers meets occasionally in hotels to plot doing evil.

We want to find pairs of people who were both at the same hotel on the same day, on at least two different days.

If everyone behaves randomly (i.e., there are no evil-doers), will the data-mining model detect anything suspicious?

Data

  * <math>10^{9}</math> people are being tracked.
  * The data covers 1,000 days.
  * Each person stays in a hotel 1% of the time (one day out of 100).
  * Each hotel holds 100 people, so there are <math>10^{5}</math> hotels.

Probability Calculation

Probability that:

  * persons p and q both decide to visit a hotel on a given day d: <MATH> \frac{1}{100} \times \frac{1}{100} = 10^{-4} </MATH>
  * p and q both pick one particular hotel: <MATH> \frac{1}{10^{5}} \times \frac{1}{10^{5}} = 10^{-10} </MATH>
  * p and q pick the same hotel (any of the <math>10^{5}</math> hotels), given that both visit one: <MATH> \frac{1}{10^{5}} \times \frac{1}{10^{5}} \times 10^{5} = 10^{-5} </MATH>
  * p and q are at the same hotel on a given day d: <MATH> 10^{-4} \times 10^{-5} = 10^{-9} </MATH>
  * p and q are at the same hotel on two given days d1 and d2: <MATH> 10^{-9} \times 10^{-9} = 10^{-18} </MATH>
  * p and q are at the same hotel on some two different days (there are <math>5 \times 10^{5}</math> pairs of days): <MATH> ( 5 \times 10^{5} ) \times 10^{-18} = 5 \times 10^{-13} </MATH>

The expected number of suspicious pairs of people is then (there are <math>5 \times 10^{17}</math> pairs of people, see Mathematics - Combination (Binomial coefficient|n choose k)): <MATH> (5 \times 10^{17}) \times (5 \times 10^{-13}) = 250,000 </MATH>
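A short Python sketch, assuming the data above, reproduces the same arithmetic:

<code python>
from math import comb

people = 10**9
days = 1000
hotels = 10**5
p_visit = 0.01  # probability that a given person is in some hotel on a given day

# Probability that two given people are in the same hotel on one given day.
p_same_hotel_one_day = p_visit * p_visit * (1 / hotels)   # 10^-9

# Probability that this happens on two specific days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2          # 10^-18

pairs_of_people = comb(people, 2)  # ~5 * 10^17
pairs_of_days = comb(days, 2)      # ~5 * 10^5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print("Expected suspicious pairs:", round(expected_suspicious_pairs))  # ~250,000
</code>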

Conclusion

Suppose there really are (say) 10 pairs of evil-doers who stayed at the same hotel together on two different days. The analysts would then have to sift through 250,010 candidate pairs to find the 10 real cases, which is not going to happen.
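Put differently, only a tiny fraction of the flagged pairs would be genuine:

<MATH> \frac{10}{250,010} \approx 4 \times 10^{-5} </MATH>

i.e., about 0.004 percent, which is exactly the base-rate problem described in the About section.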

Documentation / Reference