What is Similarity?


About

Similarity between two objects in a set is determined by the distance between them: the smaller the distance, the more similar the objects.

You can find similarities by looking at:

  • the metadata:
    • Were they created at roughly the same time?
    • Do they tend to get the same ratings?
  • user behavior (browsing, playing, searching)

What’s similar depends on who you’re talking about. Take director Pedro Almodóvar. You might have four very different movies by Almodóvar, but he has such a strong voice that his authorship alone makes those movies similar to one another. For a different director (say, Spielberg), that might not be the case.

Similarity is a symmetric function.

Function

  • Regular (“Euclidean”) distance: the square root of the sum of squared differences. When distances are only compared with one another (for example, to find the closest instance), the square root can be skipped because it does not change the ordering.
  • Manhattan (“city‐block”) distance: the sum of absolute differences.
  • Nominal attributes: distance = 1 if the values are different, 0 if they are the same.
  • Cosine similarity or cosine distance: a measure of the angle between two vectors. The dot product of two vectors equals the product of their lengths and the cosine of the angle between them.
  • String similarity
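
The distance functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`squared_euclidean`, `manhattan`, `nominal_distance`, `cosine_similarity`, `nearest`) are hypothetical, inputs are assumed to be equal-length sequences of numbers, and defining cosine distance as 1 minus cosine similarity is one common convention, assumed here.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared differences (Euclidean distance without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Regular Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nominal_distance(x, y):
    """Distance for one nominal attribute: 1 if the values differ, 0 if they match."""
    return 0 if x == y else 1


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product divided by the norms.

    Assumes neither vector is all zeros (the norm would be 0).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def nearest(query, candidates, distance):
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda c: distance(query, c))


if __name__ == "__main__":
    query = (0, 0)
    candidates = [(3, 4), (1, 1), (5, 0)]
    # The square root is monotonic, so both calls pick the same candidate.
    print(nearest(query, candidates, euclidean))
    print(nearest(query, candidates, squared_euclidean))
```

Because the square root preserves ordering, `nearest` picks the same candidate with `euclidean` and `squared_euclidean`, which is why the root can be dropped when distances are only compared. Each function also returns the same value when its two arguments are swapped, in line with similarity being symmetric.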






