Understanding the KNN algorithm
To recognize different types of motion activities, we will train the KNN classifier. The idea of the method is to find k training samples closest to the sample with an unknown label, and predict the label as a most frequent class among those k. That's it:
In fact, the algorithm is so simple, that it's tempting to formulate it in more complicated terms. Let's do it. The secret sauce of a KNN is a distance metric: function, which defines how close to each other two samples are. We have discussed several of them already: Euclidean, Manhattan, Minkowski, edit distance, and DTW. Following the terminology, samples are points in some n-dimensional space, where n equals to the number of features in each sample. This space is called feature space, and samples are distributed in it as clouds of points. Classification of an unknown data point happens in three steps:
- Calculate distances from the point to all points in a training set
- Choose the k-closest neighbors to the unknown point
- Perform a majority vote among them
The surface that separates one class of points from another class is known as a decision boundary. The KNN algorithm creates piecewise linear decision boundaries that can approximate a decision boundary of any complexity by adding more and more training samples:
You may think that this algorithm is too simple to be used for anything but some toy tasks. But over the years, KNN has demonstrated to be successfully employed for a wide range of problems, such as handwriting recognition, and satellite photo classification. It's also worth noting that it's easy to turn this classification algorithm into regression—you just need to replace categorical labels with the real numbers, and add an interpolation function.
Many restrictions of the linear regressions come from the fact that it assumes that data is normally distributed. The class of statistical models which makes explicit assumptions about the statistical distribution underlying data is called parametric models.
Unlike linear regression, KNN makes no assumptions about the distribution from which samples are generated. That's why we call them non-parametric. This is the right tool to choose in situations where data has unusual distribution, and the decision boundary is irregular.