K-nearest neighbor algorithm
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
The training examples are mapped into a multidimensional feature space, which is partitioned into regions according to the class labels of the training samples. A point in the space is assigned to the class c if c is the most frequent class label among the k nearest training samples. Euclidean distance is usually used as the distance metric.
The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the actual classification phase, the same features as before are computed for the test sample (whose class is not known). Distances from the new vector to all stored vectors are computed and the k closest samples are selected. The test sample is then assigned to the most frequent class among these k neighbors.
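As an illustration, the following is a minimal brute-force sketch of this classification phase in Python; the function and variable names are purely illustrative and not taken from any particular library.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Brute-force k-NN: return the majority class among the k nearest
    training points, using Euclidean distance."""
    distances = []
    for point, label in zip(train_points, train_labels):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])          # closest first
    k_labels = [label for _, label in distances[:k]]  # labels of the k nearest
    return Counter(k_labels).most_common(1)[0][0]     # majority vote

# Example: two classes in a 2-D feature space
train_points = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
train_labels = ["a", "a", "b", "b"]
print(knn_classify(train_points, train_labels, (1.1, 0.9), k=3))  # -> "a"
```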
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by parameter optimization using, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
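A minimal sketch of such a parameter search, using scikit-learn's k-NN classifier and cross-validation utilities on the Iris data set (the library and data set are chosen here only for illustration; any implementation would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score several candidate values of k with 5-fold cross-validation
# and keep the one with the highest mean accuracy.
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```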
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their relevance. Van der Walt and Barnard (see reference section) have also shown that the optimal value of k is influenced by the amount of output noise in the data. They also show that the Achilles' heel of the k-NN classifier is the constant distance metric that it uses.[original research?] Much research effort has been placed into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
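A sketch of the latter idea follows; the exact weighting scheme varies between papers, and the scikit-learn estimator used here is only one possible way to estimate the mutual information.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate the mutual information between each feature and the class labels,
# then rescale each feature in proportion to that estimate, so that more
# informative features carry more weight in the Euclidean distance.
mi = mutual_info_classif(X, y, random_state=0)
X_scaled = X * mi  # broadcast: column j is multiplied by mi[j]
```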
The algorithm is easy to implement, but it is computationally intensive, especially as the size of the training set grows. Many optimizations have been proposed over the years; these generally seek to reduce the number of distances actually computed. Some optimizations involve partitioning the feature space and only computing distances within specific nearby volumes. Several types of nearest neighbor search algorithms exist, including the following (a query sketch using one of them is given after the list):
- Linear scan
- k-d trees
- Ball trees
- Metric trees
- Locality-sensitive hashing (LSH)
- Agglomerative-Nearest-Neighbor
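For example, a k-d tree answers nearest-neighbor queries without scanning every training point. A minimal sketch using SciPy's cKDTree (chosen purely for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_points = rng.random((1000, 3))         # 1000 training vectors in 3-D

tree = cKDTree(train_points)                 # build the space-partitioning tree once
query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5)  # the 5 nearest neighbors of the query
print(indices, distances)
```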
The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). k-nearest neighbor is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points).
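For the two-class case, this bound (due to Cover and Hart, 1967) is usually stated as follows, where $R^{*}$ is the Bayes error rate and $R_{1\mathrm{NN}}$ the asymptotic error rate of the 1-NN rule:

$$R^{*} \;\le\; R_{1\mathrm{NN}} \;\le\; 2R^{*}\left(1 - R^{*}\right) \;\le\; 2R^{*}$$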
The k-NN algorithm can also be used for estimating continuous variables. One such implementation uses an inverse-distance-weighted average of the k nearest multivariate neighbors. The algorithm works as follows (a sketch is given after the list):
1. Compute the Euclidean or Mahalanobis distance from the target plot to the sampled plots.
2. Order the samples by increasing distance.
3. Choose a heuristically optimal k based on the RMSE obtained by cross-validation.
4. Calculate an inverse-distance-weighted average over the k nearest multivariate neighbors.
For further details, see the references below.
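A minimal sketch of the final weighting step, assuming Euclidean distance and a k already chosen (all names here are illustrative):

```python
import math

def knn_regress(train_points, train_values, query, k=3, eps=1e-12):
    """Inverse-distance-weighted k-NN regression with Euclidean distance."""
    # Steps 1-2: compute the distance to every sampled point and sort.
    distances = sorted(
        (math.dist(point, query), value)
        for point, value in zip(train_points, train_values)
    )
    # Step 4: weight each of the k nearest values by the inverse of its distance.
    neighbors = distances[:k]
    weights = [1.0 / (d + eps) for d, _ in neighbors]  # eps avoids division by zero
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

# Example: estimate a value at (1.5, 1.5) from four sampled plots.
train_points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (0.0, 0.0)]
train_values = [10.0, 20.0, 30.0, 0.0]
print(knn_regress(train_points, train_values, (1.5, 1.5), k=3))
```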
See also
- Artificial intelligence
- Data mining
- Parzen window
- Machine learning
- Pattern recognition
- Predictive analytics
- Statistics
- Support vector machine
References
- C.M. van der Walt and E. Barnard, "Data characteristics that determine classifier performance", in Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp. 160-165, 2006. Available online: http://www.patternrecognition.co.za
- Belur V. Dasarathy, editor (1991) Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, ISBN 0-8186-8930-7
- Nearest-Neighbor Methods in Learning and Vision, edited by Shakhnarovish, Darrell, and Indyk, The MIT Press, 2005, ISBN 0-262-19547-X
- Helena Mäkelä and Anssi Pekkarinen, "Estimation of forest stand volumes by Landsat TM imagery and stand-level field-inventory data", Forest Ecology and Management, Volume 196, Issues 2-3, 26 July 2004, Pages 245-255.
- Hector Franco-Lopez, Alan R. Ek and Marvin E. Bauer, "Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method", Remote Sensing of Environment, Volume 77, Issue 3, September 2001, Pages 251-274.
External links
- Matlab Implementation of kNN classifier
- k-nearest neighbor algorithm in Visual Basic (includes executable and source code)
- k-nearest neighbor tutorial using MS Excel
- Classification of natural language based on character frequency using the k-nearest neighbor algorithm (http://www.cse.ohio-state.edu/~kerwin/language_classification.pdf)