KNN Classification Explained, with Code Implementations

Posted by mudux on February 1, 2020

1. KNN Theory

 K-nearest neighbors (KNN) is a basic machine learning method that can be used for both classification and regression. The main difference lies in the decision rule at prediction time: for classification, KNN usually uses majority voting, i.e., among the K training samples closest to the query sample, the most frequent class is predicted; for regression, KNN usually uses averaging, i.e., the mean of the outputs of the K nearest samples is the predicted value. This post focuses on KNN classification. Note that KNN has no explicit training phase; it effectively partitions the feature space with the training set. The choice of k, the distance metric, and the classification decision rule are the three basic elements of KNN.
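 As a quick illustration of the two decision rules, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor; the one-dimensional toy data is made up for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data: six 1-d points with a class label and a numeric target each.
X = np.array([[1.0], [1.2], [1.4], [5.0], [5.2], [5.4]])
yCls = np.array([0, 0, 0, 1, 1, 1])              # labels for classification
yReg = np.array([1.1, 1.3, 1.5, 5.1, 5.3, 5.5])  # targets for regression

# Classification: the 3 nearest neighbors vote for the majority class.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, yCls)
print(clf.predict([[1.1]]))   # [0]: all three neighbors are class 0

# Regression: the prediction is the mean target of the 3 nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, yReg)
print(reg.predict([[1.1]]))   # [1.3]: mean of 1.1, 1.3, 1.5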

The Three Elements of KNN

  1. Choice of k
     There is no fixed rule for choosing k. In practice one starts from a fairly small value based on the distribution of the samples and selects a suitable k via cross-validation (a sketch follows this list).
     Choosing a small k means predicting with a small neighborhood of training instances. The approximation error of learning decreases, since only training instances close to the input affect the prediction, but the estimation error increases: the prediction becomes very sensitive to the neighboring points, and if a neighbor happens to be noise, the prediction will be wrong. In other words, a smaller k makes the overall model more complex and prone to overfitting.
     Choosing a large k means predicting with a large neighborhood. This reduces the estimation error, but the approximation error increases, because training instances far from the input also influence the prediction and can make it wrong. A larger k makes the overall model simpler.

  2. Distance metric
     The Euclidean distance is most common, but other metrics can be used, e.g., the Minkowski distance $L_p(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^p \right)^{1/p}$, which reduces to the Euclidean distance for $p = 2$ and the Manhattan distance for $p = 1$.
  3. Classification decision rule
     The decision rule in KNN is usually majority voting: the class of the input instance is determined by the majority class among its k nearest training instances.
     The majority voting rule can be justified as follows. If the loss function is the 0-1 loss, the classification function is

$$f: \mathbf{R}^n \rightarrow \{c_1, c_2, \dots, c_K\}$$

 and the probability of misclassification is

$$P(Y \neq f(X)) = 1 - P(Y = f(X))$$

 For a given instance $x \in {\rm X}$, let $N_k(x)$ be the set of its k nearest training instances. If the class assigned to the region covered by $N_k(x)$ is $c_j$, the misclassification rate is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

 To minimize the misclassification rate, i.e., the empirical risk, we must maximize $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, so the majority voting rule is equivalent to empirical risk minimization. For example, with $k = 5$ and neighbor labels $(c_1, c_1, c_1, c_2, c_2)$, predicting $c_1$ gives a misclassification rate of $2/5$, the smallest achievable.
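 As mentioned in item 1 above, k is usually selected by cross-validation. Below is a minimal sketch assuming scikit-learn's cross_val_score and its built-in iris loader; the candidate range 1-20 is an arbitrary choice for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score every candidate k with 5-fold cross-validation.
scores = {}
for k in range(1, 21):
    # metric defaults to Minkowski with p=2, i.e. Euclidean (see item 2).
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

bestK = max(scores, key=scores.get)
print(bestK, scores[bestK])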

2. C++ Implementation

2.1 C++ Code

The knn.h file

#include <iostream>
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <cmath>

// Brute-force KNN classifier with min-max feature normalization.
class KNN {
public:
	KNN(int k)
	{
		this->k = k;
	}

	// Min-max scale the training features in place. The guard keeps
	// repeated predict() calls from normalizing the data twice.
	void normalizeFeature()
	{
		if (normalized)
			return;
		normalized = true;
		for (int i = 0; i < numInstance; ++i)
		{
			for (int j = 0; j < numFeature; ++j)
			{
				mx[j] = std::max(mx[j], trainX[i][j]);
				mn[j] = std::min(mn[j], trainX[i][j]);
			}
		}

		for (int i = 0; i < numInstance; ++i)
		{
			for (int j = 0; j < numFeature; ++j)
				trainX[i][j] = (trainX[i][j] - mn[j]) / (mx[j] - mn[j]);
		}
	}

	// Scale test features using the min/max learned from the training set.
	void normalizeFeature(std::vector<std::vector<double>>& testX)
	{
		if ((int)testX[0].size() != numFeature) {
			std::cerr << "Number of test features does not equal number of train features!" << std::endl;
			exit(1);
		}
		for (int i = 0; i < testX.size(); ++i)
		{
			for (int j = 0; j < numFeature; ++j)
				testX[i][j] = (testX[i][j] - mn[j]) / (mx[j] - mn[j]);
		}
	}


	// Store the training data; all real work is deferred to predict().
	void fit(std::vector<std::vector<double>>& inputX, std::vector<int>& inputY)
	{
		if (inputX.size() != inputY.size()) {
			std::cerr << "fatal error! Sizes of input features and labels differ." << std::endl;
			exit(1);
		}
		trainX = inputX;
		trainY = inputY;
		mx = inputX[0];
		mn = inputX[0];
		numInstance = trainX.size();
		numFeature = trainX[0].size();
		normalized = false;   // fresh data, not yet normalized
	}

	// Leave-one-out prediction on the training set: each instance is
	// classified by its k nearest neighbors among the remaining instances.
	std::vector<int> predict()
	{
		normalizeFeature();
		std::vector<int> res;
		for (int i = 0; i < trainX.size(); ++i)
		{
			// std::cout << "test instance: " << i + 1 << std::endl;
			std::unordered_map<int, double> distanceMap = getAllDistance(trainX[i], i);
			std::unordered_map<int, int> labelMap = getLabel(distanceMap, k);
			std::vector<std::pair<int, int>> vecLabelMap(labelMap.begin(), labelMap.end());
			auto cmp = [](std::pair<int, int>& a, std::pair<int, int>& b) {return a.second > b.second; };
			sort(vecLabelMap.begin(), vecLabelMap.end(), cmp);
			res.push_back(vecLabelMap.begin()->first);
		}
		return res;
	}

	// Predict labels for a separate, unnormalized test set.
	std::vector<int> predict(std::vector<std::vector<double>>& testX)
	{
		normalizeFeature();
		normalizeFeature(testX);
		std::vector<int> res;
		for (int i = 0; i < testX.size(); ++i)
		{
			// std::cout << "test instance: " << i + 1 << std::endl;
			std::unordered_map<int, double> distanceMap = getAllDistance(testX[i]);
			std::unordered_map<int, int> labelMap = getLabel(distanceMap, k);
			std::vector<std::pair<int, int>> vecLabelMap(labelMap.begin(), labelMap.end());
			auto cmp = [](std::pair<int, int>& a, std::pair<int, int>& b) {return a.second > b.second; };
			sort(vecLabelMap.begin(), vecLabelMap.end(), cmp);
			res.push_back(vecLabelMap.begin()->first);
		}
		return res;
	}

	// Accuracy: the fraction of predictions matching the reference labels.
	double score(std::vector<int>& pred, std::vector<int>& y)
	{
		int cnt = 0;
		for (int i = 0; i < pred.size(); ++i)
		{
			if (pred[i] == y[i])
				++cnt;
		}
		return cnt / (double)y.size();
	}

private:
	// Euclidean distance between two feature vectors.
	double getDistance(std::vector<double>& a, std::vector<double>& b)
	{
		if (a.size() != b.size() || (int)a.size() != numFeature) {
			std::cerr << "fatal error! Unequal lengths of two vectors." << std::endl;
			exit(1);
		}
		double sum = 0.;
		for (int i = 0; i < numFeature; ++i)
		{
			sum += pow((a[i] - b[i]), 2);
		}
		return sqrt(sum);
	}

	// Distance from testX to every training instance, keyed by index.
	std::unordered_map<int, double> getAllDistance(std::vector<double>& testX)
	{
		std::unordered_map<int, double> distanceMap;
		for (int i = 0; i < numInstance; ++i)
		{
			double distance = getDistance(trainX[i], testX);
			distanceMap[i] = distance;
		}
		return distanceMap;
	}


	// Overload that skips instance idx, used for leave-one-out prediction.
	std::unordered_map<int, double> getAllDistance(std::vector<double>& testX, int idx)
	{
		std::unordered_map<int, double> distanceMap;
		for (int i = 0; i < numInstance; ++i)
		{
			if (i == idx)
				continue;
			double distance = getDistance(trainX[i], testX);
			distanceMap[i] = distance;
		}
		return distanceMap;
	}

	// Count the labels of the k nearest instances in distanceMap.
	std::unordered_map<int, int> getLabel(std::unordered_map<int, double>& distanceMap, int k)
	{
		auto cmp = [](std::pair<int, double>& a, std::pair<int, double>& b) {return a.second < b.second; };
		std::vector<std::pair<int, double>> vecDistanceMap(distanceMap.begin(), distanceMap.end());
		sort(vecDistanceMap.begin(), vecDistanceMap.end(), cmp);
		std::unordered_map<int, int> labelMap;
		for (auto iter = vecDistanceMap.begin(); iter != vecDistanceMap.begin() + k; ++iter)
		{
			labelMap[trainY[iter->first]]++;
		}
		return labelMap;
	}

	std::vector<std::vector<double>> trainX;  // training features (normalized in place)
	std::vector<int> trainY;                  // training labels
	std::vector<double> mx;                   // per-feature maximum of the training data
	std::vector<double> mn;                   // per-feature minimum of the training data
	int k;                                    // number of neighbors
	int numInstance = 0;
	int numFeature = 0;
	bool normalized = false;                  // guards against double normalization
};

2.2 C++ Test

 The iris dataset is used here; iris.txt holds space-separated rows of four features followed by an integer class label.

#include "knn.h"
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>

using namespace std;

int main()
{
    ifstream in(".\\iris.txt");   // space-separated: 4 features + integer label per row
    if (!in.is_open()){
        cerr << "fatal error! unable to open file" << endl;
        return 1;
    }
    vector<vector<double>> data;
    stringstream ins;
    string s;
    // Parse each line: whitespace-separated doubles, one row per instance.
    while (getline(in, s)){
        ins << s;
        vector<double> row;
        double t;
        while (ins >> t){
            row.push_back(t);
        }
        data.push_back(row);
        ins.clear();
    }
    in.close();

    // Split each row into features X and the trailing integer label Y.
    vector<vector<double>> X;
    vector<int> Y;
    for (int i = 0; i < data.size(); ++i)
    {
        Y.push_back((int)data[i].back());
        X.push_back(data[i]);
        X[i].pop_back();
    }
    cout << "train label:" << endl;
    for (int i = 0; i < Y.size(); ++i)
        cout << Y[i] << " ";
    cout << endl;

    KNN knn = KNN(3);
    knn.fit(X, Y);
    std::vector<int> ans = knn.predict(X);
    double acc = knn.score(ans, Y);
    cout << "Accuracy: " << acc << endl;
    return 0;
}  

 Output: [screenshot: cppTest]

3. Python Implementation

3.1 Python Code

 The knn.py file

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from collections import defaultdict

# Brute-force KNN classifier with optional min-max normalization.
class KNN():
    def __init__(self, k=3, disMeasure='Euc', normalize=True):
        self.k = k
        self.disMeasure = disMeasure   # only 'Euc' (Euclidean) is implemented
        self.normalize = normalize
        self.x = None
        self.y = None

    def scaler(self, data, fit=False):
        # Fit the min-max scaler once on the training data, then reuse the
        # learned min/max so test data is scaled consistently. (Previously a
        # new scaler was fit on the test data, giving inconsistent scaling.)
        if fit:
            self._scaler = MinMaxScaler()
            return self._scaler.fit_transform(data)
        return self._scaler.transform(data)

    def fit(self, x, y):
        if self.normalize:
            x = self.scaler(x, fit=True)
        self.x = x
        self.y = y
    
    def predict(self, testData):
        testNum = testData.shape[0]
        trainNum = self.x.shape[0]
        output = np.zeros((testNum, 1))
        for i in range(testNum):
            # Distance from test point i to every training point.
            dis = []
            for j in range(trainNum):
                if self.disMeasure == 'Euc':
                    dis.append(np.linalg.norm(testData[i] - self.x[j, :]))
            # Indices of training points sorted by ascending distance.
            index = sorted(range(len(dis)), key=dis.__getitem__)
            cntDict = defaultdict(int)
            for j in range(self.k):
                cntDict[self.y[index[j]]] += 1
            # Majority vote: sort by count, not by label value.
            a = sorted(cntDict.items(), key=lambda kv: kv[1], reverse=True)
            output[i] = a[0][0]
        self.label_ = output
        return output
    
    def score(self, testX, testY):
        if self.normalize:
            testX = self.scaler(testX)   # transform with the training min/max
        pred = self.predict(testX)
        self.label_ = pred
        testY = np.reshape(testY, (-1, 1))
        acc = np.mean(pred == testY)
        return acc

3.2 Python Test

import pandas as pd
from knn import KNN

# Space-separated iris data: four features followed by an integer label.
data = pd.read_csv('./knn/iris.txt', header=None, delimiter=' ')
data = data.values
trainX = data[:, :-1]
trainY = data[:, -1]

knn = KNN(3)
knn.fit(trainX, trainY)
# pred = knn.predict(trainX)
acc = knn.score(trainX, trainY)
print(acc)

 Output: [screenshot: pythonTest]

4. scikit-learn Implementation

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('./knn/iris.txt', header=None, delimiter=' ')
data = data.values
trainX = data[:, :-1]
trainY = data[:, -1]
scaler = MinMaxScaler()
trainX = scaler.fit_transform(trainX)

# Brute-force search matches the hand-written implementations above.
knn = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn.fit(trainX, trainY)
acc = knn.score(trainX, trainY)
print(acc)

 Output: [screenshot: skKnnTest]

All three implementations produce the same result, as expected.

5. Follow-up

 An introduction to kd-tree theory, which speeds up neighbor search beyond brute force; a preview sketch follows.
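 As a preview, here is a minimal sketch of building and querying a kd-tree with scikit-learn's KDTree; the random 4-d points are placeholders, not the iris data:

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((100, 4))    # 100 random points in 4 dimensions

tree = KDTree(points)            # build the tree once up front
# Distances and indices of the 3 nearest neighbors of the first point
# (the point itself is returned first, at distance 0).
dist, ind = tree.query(points[:1], k=3)
print(ind, dist)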

6. References

1. KNN Python implementation
2. KNN theory - Liu Jianping (刘建平)
3. Li Hang (李航), Statistical Learning Methods, Chapter 3