Title: | Stabilized Nearest Neighbor Classifier |
---|---|
Description: | Implements the K-nearest neighbor classifier, weighted nearest neighbor classifier, bagged nearest neighbor classifier, optimal weighted nearest neighbor classifier, and stabilized nearest neighbor classifier, and performs model selection for them via 5-fold cross-validation. The package also provides functions for computing the classification error and classification instability of a classification procedure. |
Authors: | Wei Sun, Xingye Qiao, and Guang Cheng |
Maintainer: | Wei Sun <[email protected]> |
License: | GPL-3 |
Version: | 1.1 |
Built: | 2025-02-20 05:44:30 UTC |
Source: | https://github.com/cran/snn |
A package implementing various nearest neighbor classifiers, including the K-nearest neighbor classifier, the weighted nearest neighbor classifier, the bagged nearest neighbor classifier, the optimal weighted nearest neighbor classifier, and a new stabilized nearest neighbor classifier. The package also provides functions for computing the classification error and classification instability of a classification procedure.
Package: | snn |
Type: | Package |
Version: | 1.0 |
Date: | 2015-07-31 |
License: | GPL-3 |
The package "snn" provides 8 main functions: (1) the classification error. (2) the classification instability. (3) the K-nearest neighbor classifier. (4) the weighted neighbor classifier. (5) the bagged nearest neighbor classifier. (6) the optimal nearest neighbor classifier. (7) the stabilized nearest neighbor classifier. (8) the model selection via cross-validation for K-nearest neighbor classifier, bagged nearest neighbor classifier, optimal nearest neighbor classifier, and stabilized nearest neighbor classifier.
Wei Sun, Xingye Qiao, and Guang Cheng
Maintainer: Wei Sun <[email protected]>
W. Sun, X. Qiao, and G. Cheng (2015) Stabilized Nearest Neighbor Classifier and Its Statistical Properties. Available at arxiv.org/abs/1405.6642.
Implement the tuning procedure for the K-nearest neighbor classifier (knn), the bagged nearest neighbor classifier (bnn), the optimal weighted nearest neighbor classifier (ownn), and the stabilized nearest neighbor classifier (snn).
cv.tune(train, numgrid = 20, classifier = "snn")
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
numgrid |
Number of grid points used in the search. |
classifier |
The classifier for tuning. Possible choices are knn, bnn, ownn, snn. |
For the K-nearest neighbor classifier (knn), the search grid consists of equally spaced integers in [1, n/2].
Given the best k for the K-nearest neighbor classifier, the best parameter for the bagged nearest neighbor classifier (bnn) is computed via (3.5) in Samworth (2012).
Given the best k for the K-nearest neighbor classifier, the best parameter for Samworth's optimal weighted nearest neighbor classifier (ownn) is computed via (2.9) in Samworth (2012).
For the stabilized nearest neighbor classifier (snn), we first identify the set of lambda values whose estimated risks fall within the lowest 10th percentile, and then choose from this set the lambda with the minimal estimated classification instability. The grid of lambda values is chosen such that each lambda corresponds to a point of an evenly spaced grid of k in [1, n/2]. See Sun et al. (2015) for details.
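As a rough illustration of the grid construction described above (a hedged sketch only, not the package's internal code; the helper name is hypothetical):

# Sketch: the knn search grid is roughly `numgrid` equally spaced integers in [1, n/2],
# where n = nrow(train).
knn.grid.sketch = function(train, numgrid = 20) {
  n = nrow(train)
  unique(round(seq(1, floor(n / 2), length.out = numgrid)))
}
# For snn, each candidate lambda corresponds to one of these k values; among the lambdas
# whose estimated risk falls in the lowest 10th percentile, the one with the smallest
# estimated classification instability is selected.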
The returned list contains:
parameter.opt |
The best tuning parameter for the chosen classifier. For example, the best K for knn and ownn, the best ratio for bnn, and the best lambda for snn. |
parameter.list |
The list of parameters in the grid search for the chosen classifier. |
Wei Sun, Xingye Qiao, and Guang Cheng
R.J. Samworth (2012), "Optimal Weighted Nearest Neighbor Classifiers," Annals of Statistics, 40:5, 2733-2763.
W. Sun, X. Qiao, and G. Cheng (2015) Stabilized Nearest Neighbor Classifier and Its Statistical Properties. Available at arxiv.org/abs/1405.6642.
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

## Tuning procedure
out.tune = cv.tune(DATA, classifier = "knn")
out.tune
Implement the bagged nearest neighbor classification algorithm to predict the label of a new input using a training data set.
mybnn(train, test, ratio)
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
test |
Vector of a single test point. A matrix input is also accepted, with each row representing a new test point. |
ratio |
Resampling ratio. |
The bagged nearest neighbor classifier is asymptotically equivalent to a weighted nearest neighbor classifier whose i-th weight is a function of the resampling ratio, the sample size n, and i. See Hall and Samworth (2005) for details. The tuning parameter ratio can be selected via cross-validation; see the cv.tune function for the tuning procedure.
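For intuition, the bagging idea behind this classifier can be sketched in Monte Carlo form as below; this is a conceptual illustration only (the function name is hypothetical, and the package may instead use the asymptotically equivalent weighted form):

# Conceptual sketch of bagged 1-NN: draw B subsamples of size ratio * n,
# apply the 1-nearest-neighbor rule in each, and take a majority vote.
bagged.1nn.sketch = function(train, x, ratio, B = 200) {
  n = nrow(train)
  d = ncol(train) - 1
  votes = replicate(B, {
    idx = sample(n, size = max(1, floor(ratio * n)))          # subsample without replacement
    sub = train[idx, , drop = FALSE]
    dist2 = rowSums((sub[, 1:d, drop = FALSE] -
                     matrix(x, nrow(sub), d, byrow = TRUE))^2)
    sub[which.min(dist2), d + 1]                               # label of the nearest neighbor
  })
  as.numeric(names(which.max(table(votes))))                   # majority vote over subsamples
}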
It returns the predicted class label of the new test point. If the input is a matrix, it returns a vector containing the predicted class labels of all the new test points.
Wei Sun, Xingye Qiao, and Guang Cheng
Hall, P. and Samworth, R. (2005). Properties of Bagged Nearest Neighbor Classifiers. Journal of the Royal Statistical Society, Series B, 67, 363-379.
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]

# Bagged nearest neighbor classifier
mybnn(DATA, TEST.x, ratio = 0.5)
Compute the classification instability of a classification procedure.
mycis(predict1, predict2)
predict1 |
The list of predicted labels based on one training data set. |
predict2 |
The list of predicted labels based on another training data set. |
The classification instability (CIS) of a classification procedure is defined as the probability that the same object is assigned to two different classes by the procedure when it is trained on two i.i.d. data sets. Therefore, the arguments predict1 and predict2 should be generated on the same test data by the same classification procedure trained on two i.i.d. training data sets. CIS lies in [0,1], and a smaller CIS indicates a more stable classification procedure. See Section 2 of Sun et al. (2015) for details.
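In practice, CIS is estimated from the two prediction vectors; a minimal sketch of such an estimate (which may differ in detail from the package's implementation; the function name is hypothetical) is:

# Estimated CIS: proportion of test points on which the two sets of predictions disagree
cis.estimate = function(predict1, predict2) {
  mean(predict1 != predict2)
}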
Wei Sun, Xingye Qiao, and Guang Cheng
W. Sun, X. Qiao, and G. Cheng (2015) Stabilized Nearest Neighbor Classifier and Its Statistical Properties. Available at arxiv.org/abs/1405.6642.
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]

## Compute classification instability for knn, bnn, ownn, and snn with given parameters
nn = floor(n/2)
permIndex = sample(n)

predict1.knn = myknn(DATA[permIndex[1:nn], ], TEST.x, K = 5)
predict2.knn = myknn(DATA[permIndex[-(1:nn)], ], TEST.x, K = 5)

predict1.bnn = mybnn(DATA[permIndex[1:nn], ], TEST.x, ratio = 0.5)
predict2.bnn = mybnn(DATA[permIndex[-(1:nn)], ], TEST.x, ratio = 0.5)

predict1.ownn = myownn(DATA[permIndex[1:nn], ], TEST.x, K = 5)
predict2.ownn = myownn(DATA[permIndex[-(1:nn)], ], TEST.x, K = 5)

predict1.snn = mysnn(DATA[permIndex[1:nn], ], TEST.x, lambda = 10)
predict2.snn = mysnn(DATA[permIndex[-(1:nn)], ], TEST.x, lambda = 10)

mycis(predict1.knn, predict2.knn)
mycis(predict1.bnn, predict2.bnn)
mycis(predict1.ownn, predict2.ownn)
mycis(predict1.snn, predict2.snn)
Generate random data from a two-component Gaussian mixture distribution.
mydata(n, d, mu = 0.8, portion = 1/2)
n |
The number of observations (sample size). |
d |
The number of variables (dimension). |
mu |
In the Gaussian mixture model, the first component is Gaussian with zero mean and identity covariance matrix. The second component is Gaussian with mean equal to a d-dimensional vector whose entries are all mu, and identity covariance matrix. |
portion |
The prior probability for the first Gaussian component. |
Return the data matrix with n rows and d + 1 columns. Each row represents a sample generated from the mixture Gaussian distribution. The first d columns are features and the last column is the class label of the corresponding sample.
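A minimal sketch of this generating mechanism (an illustration only, not the package's code; it assumes the two classes are labeled 1 and 2, which is an assumption rather than documented behavior):

# Illustrative generator: label 1 with probability `portion`, label 2 otherwise;
# class 1 ~ N(0, I_d), class 2 ~ N(mu * 1_d, I_d).  The 1/2 label coding is assumed.
mydata.sketch = function(n, d, mu = 0.8, portion = 1/2) {
  y = ifelse(runif(n) < portion, 1, 2)
  x = matrix(rnorm(n * d), n, d)
  x[y == 2, ] = x[y == 2, ] + mu                 # shift the mean of the second class
  cbind(x, y)
}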
Wei Sun, Xingye Qiao, and Guang Cheng
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)
DATA.x = DATA[, 1:d]
DATA.y = DATA[, d+1]
Compute the classification error of the predicted labels given the true labels.
myerror(predict, true)
predict |
The list of predicted labels |
true |
The list of true labels |
It returns the classification error of the predicted labels produced by a classification algorithm.
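A minimal sketch of such an error estimate (presumably the proportion of misclassified test points; the package's exact implementation may differ, and the function name is hypothetical):

# Test error: proportion of predicted labels that disagree with the true labels
error.estimate = function(predict, true) {
  mean(predict != true)
}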
Wei Sun, Xingye Qiao, and Guang Cheng
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]
TEST.y = TEST[, d+1]

## Compute the errors for knn, bnn, ownn, and snn with given parameters
predict.knn = myknn(DATA, TEST.x, K = 5)
predict.bnn = mybnn(DATA, TEST.x, ratio = 0.5)
predict.ownn = myownn(DATA, TEST.x, K = 5)
predict.snn = mysnn(DATA, TEST.x, lambda = 10)

myerror(predict.knn, TEST.y)
myerror(predict.bnn, TEST.y)
myerror(predict.ownn, TEST.y)
myerror(predict.snn, TEST.y)
Implement the K-nearest neighbor classification algorithm to predict the label of a new input using a training data set.
myknn(train, test, K)
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
test |
Vector of a single test point. A matrix input is also accepted, with each row representing a new test point. |
K |
Number of nearest neighbors considered. |
The tuning parameter K can be selected via cross-validation; see the cv.tune function for the tuning procedure.
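For reference, the standard K-nearest neighbor rule can be sketched as follows, using Euclidean distance and a simple majority vote; the package's tie-breaking and distance conventions are not documented here, so treat this as an illustration with a hypothetical function name.

# Sketch of the K-NN rule for a single test point x (illustration only)
knn.sketch = function(train, x, K) {
  d = ncol(train) - 1
  dist2 = rowSums((train[, 1:d, drop = FALSE] -
                   matrix(x, nrow(train), d, byrow = TRUE))^2)
  nb = train[order(dist2)[1:K], d + 1]           # labels of the K nearest neighbors
  as.numeric(names(which.max(table(nb))))        # majority vote
}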
It returns the predicted class label of the new test point. If the input is a matrix, it returns a vector containing the predicted class labels of all the new test points.
Wei Sun, Xingye Qiao, and Guang Cheng
Fix, E. and Hodges, J. L., Jr. (1951). Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Randolph Field, Texas, Project 21-49-004, Report No.4.
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]

# K-nearest neighbor classifier
myknn(DATA, TEST.x, K = 5)
Implement Samworth's optimal weighted nearest neighbor classification algorithm to predict the label of a new input using a training data set.
myownn(train, test, K)
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
test |
Vector of a single test point. A matrix input is also accepted, with each row representing a new test point. |
K |
Number of nearest neighbors considered. |
The tuning parameter K can be selected via cross-validation; see the cv.tune function for the tuning procedure.
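For intuition, the optimal weights of Samworth (2012) place decreasing weight on the first K neighbors and zero weight beyond them. A hedged sketch of weights of this form is given below (the function name is hypothetical; consult (2.9) of Samworth (2012) for the authoritative expression); such a weight vector could in principle be passed to mywnn.

# Sketch of Samworth-type optimal weights for a given K, dimension d, and sample size n
ownn.weights.sketch = function(K, d, n) {
  i = 1:n
  ifelse(i <= K,
         (1 / K) * (1 + d / 2 - d / (2 * K^(2 / d)) *
                    (i^(1 + 2 / d) - (i - 1)^(1 + 2 / d))),
         0)
}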
It returns the predicted class label of the new test point. If the input is a matrix, it returns a vector containing the predicted class labels of all the new test points.
Wei Sun, Xingye Qiao, and Guang Cheng
R.J. Samworth (2012), "Optimal Weighted Nearest Neighbor Classifiers," Annals of Statistics, 40:5, 2733-2763.
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]

# Optimal weighted nearest neighbor classifier
myownn(DATA, TEST.x, K = 5)
Implement the stabilized nearest neighbor classification algorithm to predict the label of a new input using a training data set. The stabilized nearest neighbor classifier contains the K-nearest neighbor classifier and the optimal weighted nearest neighbor classifier as two special cases.
mysnn(train, test, lambda)
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
test |
Vector of a single test point. A matrix input is also accepted, with each row representing a new test point. |
lambda |
Tuning parameter controlling the degree of stabilization of the nearest neighbor classification procedure. The larger lambda, the more stable the procedure is. |
The tuning parameter lambda can be selected via cross-validation; see cv.tune for the tuning procedure.
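A minimal sketch combining cv.tune with mysnn, as a complement to the fixed-lambda example below:

set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)
TEST = mydata(n, d)

out.tune = cv.tune(DATA, classifier = "snn")    # select lambda by cross-validation
pred = mysnn(DATA, TEST[, 1:d], lambda = out.tune$parameter.opt)
myerror(pred, TEST[, d + 1])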
It returns the predicted class label of the new test point. If the input is a matrix, it returns a vector containing the predicted class labels of all the new test points.
Wei Sun, Xingye Qiao, and Guang Cheng
W. Sun, X. Qiao, and G. Cheng (2015) Stabilized Nearest Neighbor Classifier and Its Statistical Properties. Available at arxiv.org/abs/1405.6642.
# Training data
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

# Testing data
set.seed(2015)
ntest = 100
TEST = mydata(ntest, d)
TEST.x = TEST[, 1:d]

# Stabilized nearest neighbor classifier
mysnn(DATA, TEST.x, lambda = 10)
Implement the weighted nearest neighbor classification algorithm to predict the label of a new input using a training data set.
mywnn(train, test, weight)
train |
Matrix of training data sets. An n by (d+1) matrix, where n is the sample size and d is the dimension. The last column is the class label. |
test |
Vector of a single test point. |
weight |
The weight vector for all n nearest neighbors. |
It returns the predicted class label of the new test point.
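Note that the K-nearest neighbor classifier corresponds to a weight vector that puts weight 1/K on each of the first K neighbors and 0 on the rest; for instance, with n = 100 training points:

# Weights reproducing a 5-nearest-neighbor vote when n = 100
K = 5
n = 100
weight.knn5 = c(rep(1/K, K), rep(0, n - K))
# mywnn(DATA, rep(-5, d), weight = weight.knn5)  # usage analogous to the example below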
Wei Sun, Xingye Qiao, and Guang Cheng
set.seed(1)
n = 100
d = 10
DATA = mydata(n, d)

## Weighted nearest neighbor classifier
weight.vec = c(rep(0.02, 50), rep(0, 50))
mywnn(DATA, rep(-5, d), weight = weight.vec)