Table of Contents

About

Cosine similarity applied to document similarity.

Implementation

Each document becomes a vector in some high dimensional space. To compare two documents we compute the cosine of the angle between their two document vectors.

The dot product and norm computations are simple functions of the bag-of-words document representations.

The geometric interpretation is more intuitive. When the angle between two document vectors is small, they are pointing roughly the same direction because they share many tokens in common.

  • If the angle is small (they share many words in common), the cosine is large.
  • If the angle is large (and they have few words in common), the cosine is small.