# What is Similarity?

Similarity describes how close two objects in a set are: the smaller the distance between them, the more similar they are.
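As a minimal sketch of this idea, assume that objects are represented as numeric tuples and that plain Euclidean distance is an acceptable measure. Both are assumptions made for illustration, and the names `most_similar` and `items` are hypothetical. The object most similar to a query is then the one at the smallest distance:

```python
import math

def most_similar(query, items):
    """Return the item in `items` closest to `query`.

    Assumption: items are equal-length numeric tuples compared with
    Euclidean distance (math.dist requires Python 3.8+).
    """
    return min(items, key=lambda item: math.dist(query, item))

# Hypothetical data: three objects described by two numeric attributes.
items = [(1.0, 1.0), (4.0, 5.0), (9.0, 9.0)]
print(most_similar((3.0, 4.0), items))  # (4.0, 5.0)
```

Any other distance function could be swapped in for `math.dist` without changing the rest of the sketch.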

You can find similarities between objects by looking at:

• Creation time: were they created at roughly the same time?
• Ratings: do they tend to get the same ratings?
• User behavior: are they browsed, played, or searched for by the same users?

What’s similar depends on who you’re talking about. Take director Pedro Almodóvar. You might have four very different movies by Almodóvar, but he is such a strong voice that, by himself, he makes those movies similar to one another. For a different director, say Spielberg, that might not be the case.

Similarity is a symmetric function: the similarity of A to B is the same as the similarity of B to A.

## Function

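• Euclidean (“regular”) distance: the square root of the sum of squared differences. When distances are only compared with one another (for example, to find the nearest instance), the square root can be skipped because it does not change the ordering.
• Manhattan (“city‐block”) distance: the sum of absolute differences.
• Nominal attributes: the distance is 1 if the values are different, 0 if they are the same.
• Cosine similarity: a measure of the angle between two vectors, computed as their dot product divided by the product of their lengths. It is often used for string (text) similarity. Cosine distance is commonly defined as 1 minus the cosine similarity.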
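The functions listed above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, inputs are assumed to be equal-length sequences, and summing the per-attribute nominal distances into one total is an assumption that goes beyond the per-attribute rule stated above.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared differences (Euclidean distance without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of absolute differences ("city-block" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nominal(a, b):
    """Per attribute: 1 if the values differ, 0 if they are the same.

    Assumption: the per-attribute distances are simply added up.
    """
    return sum(0 if x == y else 1 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors.

    Undefined (division by zero) when either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(squared_euclidean((1, 2), (4, 6)))                # 25
print(manhattan((1, 2), (4, 6)))                        # 7
print(nominal(("red", "small"), ("red", "large")))      # 1
print(cosine_similarity((1, 0), (0, 1)))                # 0.0
```

Each function returns the same value when its two arguments are swapped, which matches the symmetry property stated earlier.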
