“Nearest‐neighbor” learning is also known as “Instance‐based” learning.
K-Nearest Neighbors, or KNN, is a family of simple algorithms based on a similarity (distance) calculation between instances.
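As a rough sketch of what that similarity calculation looks like (plain Java, not tied to any particular library; the method name is just illustrative), the distance between two instances described by numeric attributes is typically the Euclidean distance:

    // Euclidean distance between two instances described by numeric attributes.
    // Smaller distance means more similar. Attributes are assumed to be on
    // comparable scales; in practice they are usually normalized first.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }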
Nearest Neighbor implements rote learning: it memorizes the training instances and bases its prediction on a local average calculation, so it acts as a smoother.
Some experts have written that k-nearest neighbours does best about one third of the time. It is so simple that, when doing classification, you always want to have it in your toolbox.
If you use the nearest neighbour algorithm, take into account that points near the boundary have fewer neighbours, because part of their neighbourhood lies outside the boundary. This bias needs to be corrected.
The prediction for a given X is the average of all Y values observed at that X. When there are too few data points at that X value to calculate the mean, a neighbourhood around X can be used instead. Nearest-neighbour averaging can be pretty good for a small number of variables and a large N (in order to get enough points to calculate the average).
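A minimal sketch of nearest-neighbour averaging for regression (plain Java; the array and method names are hypothetical): predict Y at a query point as the mean of the Y values of the k training points whose X is closest to it.

    // k-NN regression: average the responses of the k nearest training points.
    static double knnAverage(double[][] trainX, double[] trainY, double[] query, int k) {
        int n = trainX.length;
        double[] dist = new double[n];
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int c = 0; c < query.length; c++) {
                double diff = trainX[i][c] - query[c];
                s += diff * diff;
            }
            dist[i] = Math.sqrt(s);   // Euclidean distance to the query point
            idx[i] = i;
        }
        // rank training points by distance, then average the k nearest Y values
        java.util.Arrays.sort(idx, (a, b) -> Double.compare(dist[a], dist[b]));
        double sum = 0.0;
        for (int j = 0; j < k; j++) sum += trainY[idx[j]];
        return sum / k;
    }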
The nearest-neighbour method does not produce a single linear decision boundary; it is a little more complicated, producing a piecewise-linear decision boundary made up of (sometimes) a bunch of little linear pieces. Each piece lies on the perpendicular bisector of the line that joins the two closest points of opposite classes.
KNN assumes that all attributes are equally important. Remedy: attribute selection or attribute weights.
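One way to express attribute weights is directly in the distance calculation; a small sketch (the weight values themselves would have to come from attribute selection or domain knowledge, they are not computed here):

    // Weighted Euclidean distance: a weight of 0 effectively removes an attribute
    // (attribute selection); larger weights make an attribute count for more.
    static double weightedDistance(double[] a, double[] b, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += weights[i] * diff * diff;
        }
        return Math.sqrt(sum);
    }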
Noisy instances are instances with an incorrect target class.
If the dataset is noisy, then by accident we might find an incorrectly classified training instance as the nearest one to our test instance.
You can guard against that by using the k parameter: taking a majority vote over the k nearest neighbours, rather than relying on the single nearest one, is the first way to protect the method against a noisy dataset.
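A minimal sketch of that majority vote (plain Java; class labels are encoded as ints and the names are illustrative):

    // k-NN classification: find the k nearest training instances and return
    // the most frequent class label among them (majority vote).
    static int knnClassify(double[][] trainX, int[] trainLabels, double[] query, int k) {
        int n = trainX.length;
        double[] dist = new double[n];
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int c = 0; c < query.length; c++) {
                double diff = trainX[i][c] - query[c];
                s += diff * diff;
            }
            dist[i] = s;              // squared distance is enough for ranking
            idx[i] = i;
        }
        java.util.Arrays.sort(idx, (a, b) -> Double.compare(dist[a], dist[b]));
        java.util.Map<Integer, Integer> votes = new java.util.HashMap<>();
        for (int j = 0; j < k; j++) votes.merge(trainLabels[idx[j]], 1, Integer::sum);
        // with k > 1, a single mislabeled (noisy) neighbour gets outvoted
        return java.util.Collections.max(votes.entrySet(),
                java.util.Map.Entry.comparingByValue()).getKey();
    }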
An obvious issue with k nearest neighbour is how to choose a suitable value for the number of nearest neighbours used.
Weka uses cross-validation to select the best value.
The X axis of the plot shows 1/K: since K, the number of neighbours, is at least 1, we have 1/K ≤ 1.
If we set k to an extreme value, close to the size of the whole dataset, then we are effectively averaging over (voting across) almost all of the points in the dataset, which will probably give us something close to the baseline accuracy.
There is a theoretical guarantee that with a huge dataset and large values of k, you're going to get good results from nearest neighbour learning.
Nearest-neighbour methods can be lousy when p (the number of variables) is large, because of the curse of dimensionality: in high dimensions it is really difficult to stay local.
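A small, self-contained sketch of why it is hard to stay local in high dimensions (arbitrary sample sizes, uniformly random data): as p grows, the nearest point is barely closer than the farthest one.

    import java.util.Random;

    public class CurseOfDimensionality {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            int n = 1000;                               // random "training" points per dimension
            for (int p : new int[] {1, 10, 100, 1000}) {
                double[] query = randomPoint(rnd, p);
                double min = Double.MAX_VALUE, max = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = distance(randomPoint(rnd, p), query);
                    min = Math.min(min, d);
                    max = Math.max(max, d);
                }
                // as p grows, min/max tends towards 1: the "nearest" neighbour
                // is hardly any nearer than the farthest one
                System.out.printf("p=%4d nearest=%.3f farthest=%.3f ratio=%.3f%n",
                        p, min, max, min / max);
            }
        }

        static double[] randomPoint(Random rnd, int p) {
            double[] x = new double[p];
            for (int i = 0; i < p; i++) x[i] = rnd.nextDouble();
            return x;
        }

        static double distance(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return Math.sqrt(s);
        }
    }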
Prediction is slow, as you need to scan the entire training data to make each prediction.
Changing k can reveal noise: if the accuracy changes a lot between k = 1 and k = 5, the dataset may be noisy. On a noisy dataset, the accuracy improves as k gets a little larger, but then it starts to decrease again as k grows too large.
In Weka it's called IBk (instance-based learning with parameter k) and it's in the lazy classifier folder. The k is set through the KNN parameter: IBk's KNN parameter specifies the number of nearest neighbors to use when classifying a test instance, and the outcome is determined by majority vote.
Weka's IBk implementation also has a “cross-validation” option that helps by choosing the best value of KNN (which is the same as k) automatically, using cross-validation.
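A minimal sketch of driving IBk from Weka's Java API rather than the Explorer (the ARFF file name is hypothetical; IBk, setKNN and setCrossValidate are the actual Weka class and options):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IBkExample {
        public static void main(String[] args) throws Exception {
            // load a dataset (hypothetical file name) and mark the last attribute as the class
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);

            IBk knn = new IBk();
            knn.setKNN(5);               // the KNN parameter: number of neighbours k
            knn.setCrossValidate(true);  // let IBk pick the best k (up to KNN) by cross-validation

            // estimate accuracy with 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }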
R - K-Nearest Neighbors (KNN) Analysis