Machine Learning with Swift

Understanding the KNN algorithm

To recognize different types of motion activities, we will train a KNN classifier. The idea of the method is to find the k training samples closest to the sample with an unknown label, and predict the label as the most frequent class among those k neighbors. That's it:

Figure 3.5: KNN classification algorithm. The new data point marked with ? gets classified based on the classes of its neighbors.
Note how the choice of the number of neighbors, k, affects the result of the classification.

In fact, the algorithm is so simple that it's tempting to formulate it in more complicated terms. Let's do it. The secret sauce of KNN is a distance metric: a function that defines how close two samples are to each other. We have discussed several of them already: Euclidean, Manhattan, Minkowski, edit distance, and DTW. Following this terminology, samples are points in some n-dimensional space, where n equals the number of features in each sample. This space is called the feature space, and samples are distributed in it as clouds of points. Classification of an unknown data point happens in three steps (see the Swift sketch after this list):

  1. Calculate distances from the point to all points in a training set
  2. Choose the k-closest neighbors to the unknown point
  3. Perform a majority vote among them
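
The following is a minimal Swift sketch of these three steps. The LabeledSample type, the euclideanDistance helper, and the knnClassify function are illustrative names introduced only for this example, not the implementation we will build later:

import Foundation

struct LabeledSample {
    let features: [Double]
    let label: String
}

// Euclidean distance between two feature vectors of equal length.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    return sqrt(zip(a, b).map { pair in (pair.0 - pair.1) * (pair.0 - pair.1) }.reduce(0, +))
}

// Classify an unknown point by a majority vote among its k nearest neighbors.
func knnClassify(point: [Double], trainingSet: [LabeledSample], k: Int) -> String? {
    // 1. Calculate distances from the point to all points in the training set.
    let neighbors = trainingSet
        .map { (sample: $0, distance: euclideanDistance(point, $0.features)) }
        // 2. Choose the k closest neighbors.
        .sorted { $0.distance < $1.distance }
        .prefix(k)
    // 3. Perform a majority vote among them.
    var votes = [String: Int]()
    for neighbor in neighbors {
        votes[neighbor.sample.label, default: 0] += 1
    }
    return votes.max { $0.value < $1.value }?.key
}

Note that this brute-force version scans the whole training set for every query; that is the price of simply storing all the samples instead of building a model.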

The surface that separates one class of points from another is known as a decision boundary. The KNN algorithm creates piecewise linear decision boundaries that can approximate a decision boundary of any complexity as more and more training samples are added:

Figure 3.6: A Voronoi cell diagram shows, by color, the closest neighbor at each point. Depending on the distance metric you choose, the diagram looks quite different. From left to right: Manhattan (c = 1), Euclidean (c = 2), and Minkowski (c = 3) distance metrics.
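
To illustrate how these metrics relate, here is a small sketch of the generalized Minkowski distance in Swift; with c = 1 it reduces to the Manhattan distance and with c = 2 to the Euclidean distance. The function name and parameters are ours, chosen just for this example:

import Foundation

// Generalized Minkowski distance of order c between two feature vectors.
func minkowskiDistance(_ a: [Double], _ b: [Double], c: Double) -> Double {
    let sum = zip(a, b)
        .map { pair in pow(abs(pair.0 - pair.1), c) }
        .reduce(0, +)
    return pow(sum, 1.0 / c)
}

// The same pair of points under different metrics.
let p = [0.0, 0.0], q = [3.0, 4.0]
print(minkowskiDistance(p, q, c: 1)) // 7.0 (Manhattan)
print(minkowskiDistance(p, q, c: 2)) // 5.0 (Euclidean)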
Algorithms similar to KNN are also known as non-generalizing machine learning. In Chapter 6, Linear Regression and Gradient Descent, we will discuss linear regression, an algorithm that constructs a general representation of all data points (a straight line), because it assumes that all data points lie along a line. Unlike linear regression, KNN makes no assumptions about the underlying structure of the data; it just stores all the training samples. Both approaches have their advantages and downsides.

You may think that this algorithm is too simple to be used for anything but toy tasks. But over the years, KNN has been successfully employed for a wide range of problems, such as handwriting recognition and satellite photo classification. It's also worth noting that it's easy to turn this classification algorithm into regression: you just need to replace the categorical labels with real numbers and add an interpolation function, as in the sketch below.
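
As a rough sketch of that idea, the following hypothetical knnRegress function reuses the euclideanDistance helper from the earlier example and applies the simplest possible interpolation, an unweighted mean of the k nearest targets:

struct RegressionSample {
    let features: [Double]
    let target: Double
}

// Predict a real-valued target as the mean of the k nearest neighbors' targets.
func knnRegress(point: [Double], trainingSet: [RegressionSample], k: Int) -> Double {
    let nearest = trainingSet
        .map { (target: $0.target, distance: euclideanDistance(point, $0.features)) }
        .sorted { $0.distance < $1.distance }
        .prefix(k)
    guard !nearest.isEmpty else { return .nan }
    // The simplest interpolation: an unweighted mean of the k nearest targets.
    return nearest.map { $0.target }.reduce(0, +) / Double(nearest.count)
}

A distance-weighted mean is a common refinement, but the structure of the algorithm stays exactly the same as in classification.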

Parametric versus non-parametric models
Many restrictions of linear regression come from the fact that it assumes the data is normally distributed. The class of statistical models that makes explicit assumptions about the statistical distribution underlying the data is called parametric models.

Unlike linear regression, KNN makes no assumptions about the distribution from which the samples are generated. That's why we call it non-parametric. This is the right tool to choose in situations where the data has an unusual distribution and the decision boundary is irregular.