kmeans聚类算法详解与代码实现

kmeans

Posted by mudux on January 15, 2020

最近开始复习机器学习知识,这里总结下聚类里面的Kmeans算法。

1. 理论介绍

1.1 KMeans理论介绍

 K-平均算法(英文:k-means clustering)源于信号处理中的一种向量量化方法,现在则更多地作为一种聚类分析方法流行于数据挖掘领域。k-平均聚类的目的是:把n个点(可以是样本的一次观察或一个实例)划分到k个聚类中,使得每个点都属于离他最近的均值(此即聚类中心)对应的聚类,以之作为聚类的标准。这个问题将归结为一个把数据空间划分为Voronoi cells的问题。
 给定样本集 ,k-means算法针对聚类所得簇划分 最小化平方误差

 其中${\mu_i}=\frac{1}{\left|{C_i} \right|}\sum\nolimits_{x\in{C_i}}x$是簇$C_i$的均值向量。$E$刻画了簇内样本围绕簇均值向量的紧密程度,$E$越小则簇内样本相似度越高。 最小化$E$需要考察样本集$D$所有可能的簇划分,这是一个NP难问题。因此,k均值算法采用了贪心策略,通过迭代优化来近似求解。算法流程如下图所示: kmeansAlg

注:来自周志华《机器学习第九章p.203》

可以看出KMeans算法中有几个关键要素:

  1. K值选择:一般根据对数据的先验经验选择一个合适的K值,或者通过交叉验证选择一个合适的K值;
  2. 距离度量:常用欧式距离,曼哈顿距离等;
  3. 初始质心选择:初始质心对最后聚类结果有较大影响。

1.2 KMeans变体

1.2.1 KMeans初始化优化KMeans++

 k个初始化质心的位置的选择对最后的聚类结果和运行时间都有很大的影响,需要选择合适的质心。如果随机选择,有可能导致算法收敛很慢。KMeans++算法就是对KMeans随机初始化质心的方法的优化。  原理如下:

  • 从输入的数据点集合中随机选择一个点作为第一个聚类中心${\mu_1}$;
  • 对于数据集中的每一个点$x_i$,计算它与已选择的聚类中心中最近聚类中心的距离
  • 选择一个新的数据作为新的聚类中心,选择原则:$D(x)$较大的点被选取作为聚类中心的概率较大;
  • 重复2,3步直到选择出K个聚类中心;
  • 利用这K个质心作为初始化质心去运行标准的KMeans算法。

1.2.2 KMeans距离计算优化elkan KMeans

 在传统的KMeans算法中,每轮迭代中,要计算所有样本点到所有质心的距离,这个过程比较耗时,如何优化?。elkan KMeans算法就是从这块入手加以改进。它的目标是减少不必要的距离计算。何为不必要的距离计算?
 elkan KMeans利用了三角形两边之和大于等于第三边以及两边之差小于第三边的性质。

  • 对于一个样本点$x$和两个质心${\mu_i}$和${\mu_j}$,如果预先计算出了两个质心的距离$D(j_1, j_2)$,则如果计算发现$2D\left({x,j_1}\right) \le D\left({j_1,j_2} \right)$,就可知$D\left({x,j_1}\right) \le D\left( {x,{j_2}} \right)$。此时就不需要计算$D(x, j_2)$。
  • 对于一个样本点$x$和两个质心${\mu_i}$和${\mu_j}$。可以得到

 利用上边两个规律,elkan KMeans比起传统的KMeans迭代速度有了很大提高。但如果样本特征是稀疏的,有缺失值的话,这个方法就不适用了。

1.2.3 大样本优化Mini Batch KMeans

 在统的K-Means算法中,要计算所有的样本点到所有的质心的距离。如果样本量非常大,比如达到10万以上,特征有100以上,此时用传统的K-Means算法非常的耗时,就算加上elkan K-Means优化也依旧。在大数据时代,这样的场景越来越多。此时Mini Batch K-Means应运而生。
 顾名思义,Mini Batch,也就是用样本集中的一部分的样本来做传统的K-Means,这样可以避免样本量太大时的计算难题,算法收敛速度大大加快。当然此时的代价就是我们的聚类的精确度也会有一些降低。一般来说这个降低的幅度在可以接受的范围之内。

1.3 KMeans优缺点

1.3.1 优点

 KMeans的主要优点有:

  • 原理简单,实现容易,收敛速度快;
  • 聚类效果较优;
  • 算法可解释性比较强;
  • 主要调参的参数仅为簇数K

1.3.2 缺点:

 KMeans的主要缺点有:

  • K值得选取不好把握;
  • 对于不是凸的数据集比较难收敛;
  • 如果各隐含类别的数据不平衡,则聚类效果不佳;
  • 采用迭代方法,得到的结果可能只是局部最优;
  • 对噪声和异常值敏感。

2. c++实现

kmeans.h文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
#include <iostream>
#include <vector>
#include <cmath>
#include <fstream>
#include <sstream>
#include <string>
#include <algorithm>

class Point{
private:
    int pointId;
    int clusterId;
    int dimensions;
    std::vector<double> values;
public:
    Point(int id, std::string line)
    {
        dimensions = 0;
        pointId = id;
        std::stringstream is(line);
        double val;
        while (is >> val){
            values.push_back(val);
            ++dimensions;
        }
        clusterId = 0;  //Output id of each cluster is 1, 2, ...
    }

    int getDimensions()
    {
        return dimensions;
    }

    int getCluster()
    {
        return clusterId;
    }

    int getID()
    {
        return pointId;
    }

    void setCluster(int id)
    {
        clusterId = id;
    }

    double getVal(int pos)
    {
        return values[pos];
    }
};


class Cluster{
private:
    int clusterId;
    std::vector<double> centroid;
    std::vector<Point> points;
public:
    Cluster(int clusterId, Point centroid)
    {
        this->clusterId = clusterId;
        for (int i = 0; i < centroid.getDimensions(); ++i)
            this->centroid.push_back(centroid.getVal(i));
        this->addPoint(centroid);
    }
    
    void addPoint(Point p)
    {
        p.setCluster(this->clusterId);
        points.push_back(p);
    }

    bool removePoint(int pointId)
    {
        int size = points.size();
        for (int i = 0; i < size; ++i)
        {
            if (points[i].getID() == pointId)
            {
                points.erase(points.begin() + i);
                return true;
            }
        }
        return false;
    }

    int getId()
    {
        return clusterId;
    }

    Point getPoint(int pos)
    {
        return points[pos];
    }

    int getSize()
    {
        return points.size();
    }

    double getCentroidByPos(int pos)
    {
        return centroid[pos];
    }

    void setCentroidByPos(int pos, double val)
    {
        this->centroid[pos] = val;
    }
};


class KMeans{
private:
    int K;
    int iters;
    int dimensions;
    int total_points;
    std::vector<Cluster> clusters;

    /**
     * @description: Find the nearest cluster centroid, return it's id for input point 
     * @param point: point for clustering
     * @return: nearest cluster id
     */    
    int getNearestClusterId(Point point)
    {
        double sum = 0.0;
        double min_dist = 0;
        int nearestClusterId;
        for (int i = 0; i < dimensions; ++i)
            sum += pow(clusters[0].getCentroidByPos(i) - point.getVal(i), 2.0);
        min_dist = sqrt(sum);
        nearestClusterId = clusters[0].getId();

        for (int i = 1; i < K; ++i)
        {
            double dist = 0;
            sum = 0.0;
            for (int j = 0; j < dimensions; ++j)
                sum += pow(clusters[i].getCentroidByPos(j) - point.getVal(j), 2.0);
            dist = sqrt(sum);
            if (dist < min_dist){
                min_dist = dist;
                nearestClusterId = clusters[i].getId();
            }

        }
        return nearestClusterId;
    }

public:
    KMeans(int K, int iterations)
    {
        this->K = K;
        this->iters = iterations;
    }

    void run(std::vector<Point>& all_points)
    {
        total_points = all_points.size();
        dimensions = all_points[0].getDimensions();
        
        // Initializing Clusters;
        std::vector<int> used_pointIds;
        for (int i = 1; i <= K; ++i)
        {
            while (true){
                int index = rand() % total_points;   
                if (find(used_pointIds.begin(), used_pointIds.end(), index) == used_pointIds.end()){
                    used_pointIds.push_back(index);
                    all_points[index].setCluster(i);
                    Cluster cluster(i, all_points[index]);
                    clusters.push_back(cluster);
                    break;
                }
            }
            
        }
        std::cout << "Clusters initialized = " << clusters.size() << std::endl;

        std::cout << "Running KMeans Clustering.." << std::endl;

        int iter = 1;
        while (true){
            std::cout << "Iter - " << iter << "/" << iters << std::endl;
            bool done = true;

            // Add all points to their nearest cluster
            for (int i = 0; i < total_points; ++i)
            {
                int currentClusterId = all_points[i].getCluster();
                int nearestClusterId = getNearestClusterId(all_points[i]);
                if (currentClusterId != nearestClusterId){
                    // currentClusterId = 0 for the first running 
                    if (currentClusterId != 0){
                        for (int j = 0; j < K; ++j)
                        {
                            if (clusters[j].getId() == currentClusterId)
                                clusters[j].removePoint(all_points[i].getID());
                        }
                    }

                    for (int j = 0; j < K; ++j)
                    {
                        if (clusters[j].getId() == nearestClusterId)
                            clusters[j].addPoint(all_points[i]);
                    }
                    all_points[i].setCluster(nearestClusterId);
                    done = false;
                }
            }

            // Update the center of each cluster
            for (int i = 0; i < K; ++i)
            {
                int ClusterSize = clusters[i].getSize();
                for (int j = 0; j < dimensions; ++j)
                {
                    double sum = 0.0;
                    if (ClusterSize > 0){
                        for (int p = 0; p < ClusterSize; ++p)
                            sum += clusters[i].getPoint(p).getVal(j);
                        clusters[i].setCentroidByPos(j, sum / ClusterSize);
                    }
                }
            }

            if (done | iter > iters){
                std::cout << "Clustering completed in iteration : " << iter << std::endl;
                break;
            }
            ++iter;
        }
        // Print pointIds in each cluster
        for (int i = 0; i < K; ++i)
        {
            std::cout << "Points in cluster " << clusters[i].getId() << " : ";
            for (int j = 0; j < clusters[i].getSize(); ++j)
                std::cout << clusters[i].getPoint(j).getID() << " ";
            std::cout << std::endl;
        }
        // Write cluster centers to file
        std::ofstream outfile;
        outfile.open("clustersCenter.txt");
        if (outfile.is_open()){
            for (int i = 0; i < K; ++i)
            {
                std::cout << "Cluster " << clusters[i].getId() << " centroid: ";
                for (int j = 0; j < dimensions; ++j)
                {
                    std::cout << clusters[i].getCentroidByPos(j) << " ";
                    outfile << clusters[i].getCentroidByPos(j) << " ";
                }
                std::cout << std::endl;
                outfile << std::endl;
            }
            outfile.close();
        }
        else {
            std::cout << "Fatal error: Unable to write to clustersCenter.txt";
        }
    }
};

测试一下

kmeansTest.cpp主文件:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#include "kmeans.h"

int main(int argc, char **argv)
{
    if (argc != 3){
        std::cout << "Error: command-line argument count mismatch.";
        return 1;
    }

    int K = atoi(argv[2]);

    std::string filename = argv[1];
    std::ifstream infile(filename.c_str());

    if (!infile.is_open()){
        std::cout << "Error: Failed to open file." << std::endl;
        return 1;
    }

    //Fetching points from file
    int pointId = 1;
    std::vector<Point> all_points;
    std::string line;
    while (getline(infile, line)){
        Point point(pointId, line);
        all_points.push_back(point);
        ++pointId;
    }
    infile.close();
    std::cout << "\nData fetched successfully!" << std::endl;

    // Return if number of clusters > number of points
    if (all_points.size() < K){
        std::cout << "Error: Number of clusters greater than number of points." << std::endl;
        return 1;
    }

    int iters = 100;
    KMeans kmeans(K, iters);
    kmeans.run(all_points);
    return 0;
}

输入为data.txt,包括 9 9
1 1
-1 -1
3 3
10 10
-2 -2
7 8
0.2 0
-1 0
6 10
命令行输入,输出为 kmeansTest

3. python实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
import numpy as np

class KMeans():
    """class for kmeans clustering algorithm"""
    def __init__(self, n_clusters, filePath=None, max_iter=300, verbose=False):
        self.n_clusters = n_clusters
        self.filePath = filePath
        self.max_iter = max_iter
        self.verbose = verbose
    '''
    @description: read file to numpy matrix
    @param {filename: open file path} 
    @return: numpy matrix
    '''
    def loadDataSet(self, fileName):
        dataMat = []
        print(fileName)
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
        dataMat = np.array(dataMat)
        # self.dataSet = dataMat
        return dataMat

    def distEclud(self, vecA, vecB):
        return np.sqrt(sum(np.power(vecA - vecB, 2)))

    def randCent(self, dataSet, k):
        n = np.shape(dataSet)[1]
        centroids = np.zeros((k, n))
        for j in range(n):
            minJ = min(dataSet[:, j])
            maxJ = max(dataSet[:, j])
            rangeJ = float(maxJ - minJ)
            centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1).reshape(k,)
        return centroids

    def fit(self, dataSet=None, distMeasure=distEclud, createCent=randCent):
        if dataSet is None:
            assert self.filePath is not None, 'filePath is None and dataSet not provided'
            dataSet = self.loadDataSet(self.filePath)
        k = self.n_clusters
        m = np.shape(dataSet)[0]
        clusterAssment = np.zeros((m, 2))
        centroids = createCent(self, dataSet, k)
        clusterChanged = True
        iter = 0
        while clusterChanged and iter < self.max_iter:
            iter += 1
            clusterChanged = False
            for i in range(m):
                minDist = np.inf
                minIndex = -1
                for j in range(k):
                    distJI = distMeasure(self, centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist**2
            print(centroids)
            for cent in range(k):
                ptsInClust = dataSet[clusterAssment[:, 0] == cent]
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
        if iter == self.max_iter and self.verbose:
            print("Iteration times reach max_iter %d" % (self.max_iter))
        self.n_iter_ = iter
        self.cluster_centers_  = centroids
        self.labels_ = clusterAssment[:, 0]
        self.inertia_ = clusterAssment[:, 1]
        return centroids, clusterAssment

    def check_is_fitted(self, attributes, msg=None, all_or_any=all):
        if msg is None:
            msg = ("This %(name)s instance is not fitted yet. Call 'fit' with "
                   "appropriate arguments before using this method.")
        if not hasattr(self, 'fit'):
            raise TypeError("This is not an estimator instance.")
        if not isinstance(attributes, (list, tuple)):
            attributes = [attributes]
        if not all_or_any([hasattr(self, attr) for attr in attributes]):
            raise TypeError(msg % {'name': type(self).__name__})

    def predict(self, testData, disMeature=distEclud):
        self.check_is_fitted('cluster_centers_')
        if self.verbose:
            print("Computing label assignment and total inertia")
        n_samples = testData.shape[0]
        labels = np.full(n_samples, -1, np.int32)
        k = self.n_clusters
        distanceMat = np.zeros((n_samples, k))
        for i in range(n_samples):
            for j in range(k):
                distanceMat[i, j] = disMeature(self, testData[i], self.cluster_centers_[j])
            labels[i] = np.argmin(distanceMat[i, :])
        return labels

测试一下

1
2
3
4
5
6
from kmeans import KMeans
kmeans = KMeans(2, 'kmeansTest.txt')
myCentroids, clusterAssing = kmeans.fit()

print(kmeans.labels_)
print(kmeans.cluster_centers_)

输出为: kmeansTestMyPython

4. sklearn包实现

1
2
3
4
5
6
7
8
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[9, 9], [1, 1], [-1, -1], [3, 3], [10, 10],
            [-2, -2], [7, 8], [0.2, 0], [-1, 0], [6, 10]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
print(kmeans.cluster_centers_)

输出为: kmeansTestPython

可以看出三种实现方式运行结果一样,符合预期。

5. 后续

  • python版本KMeans实现-done[2020-01-16 18:46]
  • KMeans与GMM联系
  • 其余聚类算法实现

6. bisecting KMeans

 当KMeans初始质心选择不当时可能会收敛到局部极小值,可以对生成的簇进行后处理,此时可以选择:

  • 合并最近的质心;
  • 合并两个使得SSE增幅最小的质心。

 所以有人提出了二分K-均值(bisecting KMeans)算法,该算法首先将所有点作为一个簇,然后将该簇一分为二。之后选择其中一个簇继续进行划分,选择哪一个簇进行划分取决于对齐划分是否可以最大程度降低SSE的值。上述基于SSE的划分过程不断重复,直到得到用户指定的簇数目为止。
 python实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
import numpy as np

class BiKMeans():
    """class for bikmeans clustering algorithm"""
    def __init__(self, n_clusters, filePath=None, max_iter=300, verbose=False):
        self.n_clusters = n_clusters
        self.filePath = filePath
        self.max_iter = max_iter
        self.verbose = verbose
    '''
    @description: read file to numpy matrix
    @param {filename: open file path} 
    @return: numpy matrix
    '''
    def loadDataSet(self, fileName):
        dataMat = []
        if self.verbose:
            print(fileName)
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split()
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
        dataMat = np.array(dataMat)
        # self.dataSet = dataMat
        return dataMat

    def distEclud(self, vecA, vecB):
        return np.sqrt(sum(np.power(vecA - vecB, 2)))

    def randCent(self, dataSet, k):
        n = np.shape(dataSet)[1]
        centroids = np.zeros((k, n))
        for j in range(n):
            minJ = min(dataSet[:, j])
            maxJ = max(dataSet[:, j])
            rangeJ = float(maxJ - minJ)
            centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1).reshape(k,)
        return centroids

    def kmeans(self, dataSet, k, distMeasure=distEclud, createCent=randCent):
        m = np.shape(dataSet)[0]
        clusterAssment = np.zeros((m, 2))
        centroids = createCent(self, dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):
                minDist = np.inf
                minIndex = -1
                for j in range(k):
                    distJI = distMeasure(self, centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist**2
            if self.verbose:
                print(centroids)
            for cent in range(k):
                ptsInClust = dataSet[clusterAssment[:, 0] == cent]
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
        return centroids, clusterAssment

    def fit(self, dataSet=None, distMeasure=distEclud):
        if dataSet is None:
            assert self.filePath is not None, 'filePath is None and dataSet not provided'
            dataSet = self.loadDataSet(self.filePath)
        k = self.n_clusters
        m = np.shape(dataSet)[0]
        clusterAssment = np.zeros((m, 2))
        centroid0 = np.mean(dataSet, axis=0)
        cenList = [centroid0]
        for j in range(m):
            clusterAssment[j, 1] = distMeasure(self, centroid0, dataSet[j, :])**2
        while (len(cenList) < k):
            lowerestSSE = np.inf
            for i in range(len(cenList)):
                ptsInCurrCluster = dataSet[clusterAssment[:, 0]==i]
                centroidMat, splitClustAss = self.kmeans(ptsInCurrCluster, 2, distMeasure)
                sseSplit = np.sum(splitClustAss[:, 1])
                sseNotSplit = np.sum(clusterAssment[clusterAssment[:, 0] != i, 1])
                if self.verbose:
                    print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
                if (sseSplit + sseNotSplit < lowerestSSE):
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowerestSSE = sseSplit + sseNotSplit
            bestClustAss[bestClustAss[:, 0] == 1] = len(cenList)
            bestClustAss[bestClustAss[:, 0] == 0] = bestCentToSplit
            if self.verbose:
                print("The bestCentToSplit is: ", bestCentToSplit)
                print("The len of bestClustAss is: ", len(bestClustAss))
            cenList[bestCentToSplit] = bestNewCents[0, :]
            cenList.append(bestNewCents[1, :])
            clusterAssment[clusterAssment[:, 0]==bestCentToSplit] = bestClustAss
        cenArr = np.array(cenList)
        self.cluster_centers_ = cenArr
        self.labels_ = clusterAssment[:, 0]
        self.inertia_ = clusterAssment[:, 1]
        return cenArr, clusterAssment

    def check_is_fitted(self, attributes, msg=None, all_or_any=all):
        if msg is None:
            msg = ("This %(name)s instance is not fitted yet. Call 'fit' with "
                   "appropriate arguments before using this method.")
        if not hasattr(self, 'fit'):
            raise TypeError("This is not an estimator instance.")
        if not isinstance(attributes, (list, tuple)):
            attributes = [attributes]
        if not all_or_any([hasattr(self, attr) for attr in attributes]):
            raise TypeError(msg % {'name': type(self).__name__})

    def predict(self, testData, disMeature=distEclud):
        self.check_is_fitted('cluster_centers_')
        if self.verbose:
            print("Computing label assignment and total inertia")
        n_samples = testData.shape[0]
        labels = np.full(n_samples, -1, np.int32)
        k = self.n_clusters
        distanceMat = np.zeros((n_samples, k))
        for i in range(n_samples):
            for j in range(k):
                distanceMat[i, j] = disMeature(self, testData[i], self.cluster_centers_[j])
            labels[i] = np.argmin(distanceMat[i, :])
        return labels

7. reference

kmeans-github
刘建平Pinard - K-Means聚类算法原理
《机器学习-周志华》
《机器学习实战-第十章》