
Statistics - (Base rate fallacy|Bonferroni's principle)

About

Even an accurate (model|test) can be useless as a detection tool if the case it looks for is sufficiently rare in the general population.

In that situation, the model will produce far too many false positives (and still miss some cases as false negatives).

A rare event will always occur somewhere if the population is big enough.

If the model returns more events than you would expect in the actual population, you can assume that most of the reported events are bogus (false positives).

Conversely, if the data behaves randomly, then with enough data you will always discover some event or pattern that looks suspicious (i.e., non-random).
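As a minimal formalization (not in the original note), Bayes' rule makes this explicit: if the event has prevalence <math>p</math> and the test is correct with probability <math>a</math> on both positives and negatives, the probability that a flagged case is real is

<MATH> P(\text{real} \mid \text{alarm}) = \frac{a \, p}{a \, p + (1 - a)(1 - p)} </MATH>

which collapses towards zero as <math>p</math> gets small, no matter how close <math>a</math> is to 1. The base rate fallacy is to ignore <math>p</math>.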

Example

Face Recognition

Face Recognition as an identification mechanism

The idea

Automatic face-scanning systems for airports and other public gathering places like sports stadiums.

The idea is to put cameras at security checkpoints and have automatic face-recognition software continuously scan the crowd for suspected terrorists. When the software identifies a suspect, it alerts the authorities, who swoop down and arrest the miscreant.

Accuracy

Assume that some hypothetical face-scanning software is magically effective (much better than is possible today)—99.9 percent accurate.

That is: if the system scans a known terrorist, it will raise an alarm 99.9 percent of the time, and if it scans an ordinary attendee, it will correctly stay silent 99.9 percent of the time.

In other words, the defensive-failure rate and the usage-failure rate are both 0.1 percent.
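In symbols (a restatement of the same assumption):

<MATH> P(\text{alarm} \mid \text{terrorist}) = 0.999 \qquad P(\text{alarm} \mid \text{non-terrorist}) = 0.001 </MATH>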

Evaluation

Assume additionally that 1 in 10 million stadium attendees, on average, is a known terrorist. (This system won’t catch any unknown terrorists who are not in the photo database.)

Despite the high (99.9 percent) level of accuracy, because of the very small percentage of terrorists in the general population of stadium attendees, the hypothetical system will generate 10,000 false alarms for every one real terrorist.
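Worked out for a crowd of 10 million attendees, using the assumed 0.1 percent failure rates:

<MATH> 10^{7} \times 0.001 = 10^{4} \text{ false alarms} \qquad 1 \times 0.999 \approx 1 \text{ real alarm} </MATH>

So the chance that any single alarm points at a real terrorist is only about 1 in 10,000.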

Conclusion

That kind of usage-failure rate renders such a system almost worthless.

The guards who use this system will rapidly learn that it is always wrong and that every alarm from the face-scanning system is a false alarm. Eventually they'll just ignore it. When a real terrorist is flagged by the system, they'll be likely to treat it as just another false alarm.

Rhine Paradox

Joseph Rhine was a parapsychologist who, in the 1950s, ran the following experiment: subjects had to guess whether each of 10 hidden cards was red or blue. Rhine found that about 1 person in 1,000 had "extrasensory perception": they correctly guessed the color of all 10 cards. He then called the "psychic" subjects back and had them repeat the test, and they all failed. Rhine concluded that the act of telling psychics that they have psychic abilities causes them to lose those abilities.

This is the wrong conclusion, because probability alone tells us that a subject guessing at random has a <math>(1/2)^{10} = 1/1024</math> chance of getting all 10 cards right. In other words, roughly 1 person in 1,000 will pass the first test by pure luck, and almost none of them will pass it a second time.
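As an illustrative sketch (the subject count and random seed below are arbitrary assumptions, not part of Rhine's experiment), a short simulation shows that the first-round "psychics" are just the survivors of chance and fail on the retest:

<code python>
import random

random.seed(42)
N_SUBJECTS = 100000   # hypothetical number of subjects
N_CARDS = 10

def guesses_all_cards(n_cards):
    # A random guesser is right on each card with probability 1/2.
    return all(random.random() < 0.5 for _ in range(n_cards))

# First round: roughly N_SUBJECTS / 1024 subjects pass by pure luck.
psychics = [s for s in range(N_SUBJECTS) if guesses_all_cards(N_CARDS)]
print("First round:", len(psychics), "'psychics' out of", N_SUBJECTS)

# Second round: each "psychic" still has only a 1/1024 chance, so almost all fail.
repeat_passes = sum(guesses_all_cards(N_CARDS) for _ in psychics)
print("Second round:", repeat_passes, "of them pass again")
</code>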

Evil-doers in Hotel

The problem

A certain group of evil-doers meets occasionally in hotels to plot doing evil.

We want to find pairs of people who were both at the same hotel on the same day, on at least two different days.

If everyone behaves randomly (i.e., there are no evil-doers), will the data-mining model detect anything suspicious?

Data

  * <math>10^{9}</math> people are being tracked.
  * The data covers 1,000 days.
  * Each person stays in a hotel 1% of the time (one day out of 100).
  * Each hotel holds 100 people, so there are <math>10^{5}</math> hotels.

Probability Calculation

Probability that:

  * persons p and q both decide to visit a hotel on a given day d: <MATH> \frac{1}{100} \times \frac{1}{100} = 10^{-4} </MATH>
  * p and q both pick one particular hotel: <MATH> \frac{1}{10^{5}} \times \frac{1}{10^{5}} = 10^{-10} </MATH>
  * p and q pick the same hotel (any of the <math>10^{5}</math> hotels), given that both visit one: <MATH> \frac{1}{10^{5}} \times \frac{1}{10^{5}} \times 10^{5} = 10^{-5} </MATH>
  * p and q are at the same hotel on a given day d: <MATH> 10^{-4} \times 10^{-5} = 10^{-9} </MATH>
  * p and q are at the same hotel on two given days d1 and d2: <MATH> 10^{-9} \times 10^{-9} = 10^{-18} </MATH>
  * p and q are at the same hotel on some two different days (there are <math>5 \times 10^{5}</math> pairs of days): <MATH> ( 5 \times 10^{5} ) \times 10^{-18} = 5 \times 10^{-13} </MATH>

The expected number of suspicious pairs of people is then (there are <math>5 \times 10^{17}</math> pairs of people, see Mathematics - Combination (Binomial coefficient|n choose k)): <MATH> (5 \times 10^{17}) \times (5 \times 10^{-13}) = 250,000 </MATH>
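A short Python sketch, assuming the data above, reproduces the same arithmetic:

<code python>
from math import comb

people = 10**9
days = 1000
hotels = 10**5
p_visit = 0.01  # probability that a given person is in some hotel on a given day

# Probability that two given people are in the same hotel on one given day.
p_same_hotel_one_day = p_visit * p_visit * (1 / hotels)   # 10^-9

# Probability that this happens on two specific days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2          # 10^-18

pairs_of_people = comb(people, 2)  # ~5 * 10^17
pairs_of_days = comb(days, 2)      # ~5 * 10^5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print("Expected suspicious pairs:", round(expected_suspicious_pairs))  # ~250,000
</code>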

Conclusion

Suppose there really are (say) 10 pairs of evil-doers who stayed at the same hotel together on two different days. The analysts would then have to sift through 250,010 candidate pairs to find the 10 real cases, which is not going to happen.
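Put differently, only a tiny fraction of the flagged pairs would be genuine:

<MATH> \frac{10}{250,010} \approx 4 \times 10^{-5} </MATH>

i.e., about 0.004 percent, which is exactly the base-rate problem described in the About section.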

Documentation / Reference