dbscan聚类算法详解与代码实现

Posted by mudux on February 1, 2020

1. DBSCAN理论介绍

 DBSCAN(Density-Based Spatial Clustering of Applicaions with Noise,具有噪声的基于密度的聚类方法)是一种很典型的聚类算法,和KMeans,BIRCH这些一般只适用于凸样本集的聚类相比,DBSCAN既可以适用于凸样本集,也可以适用于非凸样本集。

1.1 密度聚类原理

 DBSCAN是一种基于密度的聚类算法,这类密度聚类算法一般假定类别可以通过样本分布的紧密程度决定。同一类别的样本,他们之间是紧密相连的,也就是说,在该类别任意样本周围不远处一定有同类别的样本存在。

1.2 DBSCAN密度定义

 DBSCAN基于一组邻域参数$\left( {\varepsilon ,MinPts} \right)$来刻画样本分布的紧密程度。给定数据集,定义下面几个概念:

  • $\varepsilon$-邻域:对${x_j} \in D$,其$\varepsilon$-邻域包含样本集$D$中与$x_j$的距离不大于$\varepsilon$的样本,即 ;
  • 核心对象(core object): 若$x_j$的$\varepsilon$-邻域至少包含$MinPts$个样本,即 ,则$x_j$是一个核心对象;
  • 密度直达(directly density-reachable): 若$x_j$位于$x_i$的$\varepsilon$-邻域中,且$x_i$是核心对象,则称$x_j$由$x_i$密度直达;
  • 密度可达(density-reachable): 对$x_i$与$x_j$,若存在样本序列${p_1},{p_2},…,{p_n}$,其中$p_1=x_i,p_n=x_j$且$p_{i+1}$由$p_i$密度直达,则称$x_j$由$x_i$密度可达;
  • 密度相连(density-connected): 对$x_i$与$x_j$, 若存在$x_k$使得$x_i$与$x_j$均由$x_k$密度可达,则称x_i$与$x_j$密度相连。

 基于这些概念,DBSCAN将“簇”定义为:由密度可达关系导出的最大的密度相连样本集合。

 那么怎么才能找到这样的簇样本集合?DBSCAN使用的方法很简单,它任意选择一个没有类别的核心对象作为种子,然后找到所有这个核心对象密度可达的样本集合,即为一个聚类簇。接着继续选择另一个没有类别的核心对象去寻找密度可达的样本集合,这样就得到了另一个聚类簇。一直运行到所有核心对象都有类别为止。

 基本上这就是DBSCAN算法的主要内容,但还有三个问题要考虑。
 第一个是一些异常样本点或者说少量游离于簇外的样本点,这些点不在任何一个核心对象周围,在DBCSAN中,我们一般将这些样本点标记为噪音点。
 第二个是距离的度量问题,即如何计算某样本和核心对象样本的距离。在DBSCAN中,一般采样最近邻思想,采用某一种距离度量来衡量样本距离,比如欧式距离。
 第三个问题比较特殊,某些样本可能到两个核心对象的距离都小于$\varepsilon$,但是这两个核心对象由于不是密度直达,又不属于同一个聚类簇,那么如何界定这个样本的类别呢?一般来说DBSCAN采用先来后到,先进行聚类的类别簇会标记为这个样本的类别。也就是说DBSCAN算法不是完全稳定的算法。

1.3 DBSCAN算法

kmeansAlg

注:来自周志华《机器学习第九章p.213》

1.4 DBSCAN小结

 和传统的KMeans算法相比,DBSCAN最大的不同就是不需要输入类别数k,它最大的优势是可以发现任意形状的聚类簇,而不是像KMeans一般仅仅适用于凸的样本聚类。同时它在聚类的同时还可以找出异常点,这点和BIRCH算法类似。 那么我们什么时候需要用DBSCAN来聚类呢?一般来说,如果数据集是稠密的,并且数据集不是凸的,那么DBSCAN会比KMeans聚类效果好很多。如果数据集不是稠密的,则不推荐用DBSCAN来聚类。
 DBSCAN的主要优点有:

  • 可以对任意形状的稠密数据集进行聚类,相对的,KMeans之类的算法一般只适用于凸数据集;
  • 可以在聚类的同时发现异常点,对数据集中的异常点不敏感;
  • 聚类结果没有偏倚,相对的,KMeans之类的聚类算法初始值对聚类结果有很大影响。

 DBSCAN的主要缺点有:

  • 如果样本集的密度不均匀,聚类间距差相差很大是,聚类质量较差,这时用DBSCAN聚类一般不适合;
  • 如果样本集较大时,聚类收敛时间较长,此时可以对搜索最近邻时建立的KD树或者球树进行规模限制来改进;
  • 调参相对于传统的KMeans之类的聚类算法稍复杂,主要需要对距离阈值$\varepsilon$,邻域样本数阈值$MinPts$联合调参,不同的参数组合对最后的聚类效果有较大影响。

2. C++实现

2.1 C++代码

dbscan.h文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

class PointDB{
public:
    std::vector<double> feature;
    int pointCnt;
    int cluster;

    double getDistance(const PointDB& ot)
    {
        double sum = 0;
        for (int i = 0; i < feature.size(); ++i)
        {
            sum += (feature[i] - ot.feature[i]) * (feature[i] - ot.feature[i]);
        }
        return sqrt(sum);
    }
};

class DBSCAN{
public:
    DBSCAN(double eps, int minPts, std::vector<PointDB> &points);
    DBSCAN(double eps, int minPts, std::vector<std::vector<double>> &points);
    ~DBSCAN();
    std::vector<std::vector<double>> normalizeFeature(std::vector<std::vector<double>> &feature);
    void run();
    void dfs(int now, int clusterId);
    void checkNearPoints();
    bool isCoreObject(int idx);
    std::vector<std::vector<int>> getCluster();

    std::vector<int> getPointsCluster();
    int getNumCluster();

private:
    int minPts;
    double eps;
    std::vector<PointDB> points;
    int size;
    std::vector<std::vector<int>> adjPoints;
    std::vector<std::vector<int>> cluster;
    int clusterIdx;
    const int NOISE = -1;
    const int NOT_CLASSIFIED = -2;    
};


DBSCAN::DBSCAN(double eps, int minPts, std::vector<PointDB>& points)
{
	this->eps = eps;
	this->minPts = minPts;
	this->points = points;
	this->size = (int)points.size();
	adjPoints.resize(this->size);
	this->clusterIdx = -1;
}


DBSCAN::DBSCAN(double eps, int minPts, std::vector<std::vector<double>>& features)
{
	std::vector<PointDB> tPoints(features.size());
	std::vector<std::vector<double>> localFeature = normalizeFeature(features);
	for (int i = 0; i < localFeature.size(); ++i)
	{

		tPoints[i].feature = localFeature[i];
		tPoints[i].pointCnt = 0;
		tPoints[i].cluster = NOT_CLASSIFIED;
	}
	this->eps = eps;
	this->minPts = minPts;
	this->points = tPoints;
	this->size = (int)points.size();
	this->adjPoints.resize(size);
	this->clusterIdx = -1;
}


DBSCAN::~DBSCAN()
{

}

std::vector<std::vector<double>> DBSCAN::normalizeFeature(std::vector<std::vector<double>>& features)
{
	int instance_num = features.size();
	int feature_num = features[0].size();
	std::vector<std::vector<double>> tmp(feature_num, std::vector<double>({ {INT_MAX, INT_MIN} }));
	for (int i = 0; i < feature_num; ++i)
	{
		for (int j = 0; j < instance_num; ++j)
		{
			tmp[i][0] = std::min(tmp[i][0], features[j][i]);
			tmp[i][1] = std::max(tmp[i][1], features[j][i]);
		}
	}
	std::vector<std::vector<double>> res(features);
	for (int i = 0; i < instance_num; ++i)
	{
		for (int j = 0; j < feature_num; ++j)
			res[i][j] = (res[i][j] - tmp[j][0]) / (tmp[j][1] - tmp[j][0]);
	}
	return res;
}


void DBSCAN::checkNearPoints()
{
	for (int i = 0; i < size; ++i)
	{
		for (int j = 0; j < size; ++j)
		{
			if (i == j)
				continue;
			if (points[i].getDistance(points[j]) <= eps) {
				points[i].pointCnt++;
				adjPoints[i].push_back(j);
			}
		}
	}
}

bool DBSCAN::isCoreObject(int idx)
{
	return points[idx].pointCnt >= minPts;
}

void DBSCAN::dfs(int now, int clusterIdx)
{
	points[now].cluster = clusterIdx;
	if (!isCoreObject(now))
		return;
	for (auto& next : adjPoints[now])
	{
		if (points[next].cluster != NOT_CLASSIFIED)
			continue;
		dfs(next, clusterIdx);
	}
}

void DBSCAN::run()
{
	checkNearPoints();
	for (int i = 0; i < size; ++i)
	{
		if (points[i].cluster != NOT_CLASSIFIED)
			continue;
		if (isCoreObject(i))
			dfs(i, ++clusterIdx);
		else
			points[i].cluster = NOISE;
	}

	cluster.resize(clusterIdx + 1);
	for (int i = 0; i < size; ++i)
	{
		if (points[i].cluster != NOISE)
			cluster[points[i].cluster].push_back(i);
	}
	std::vector<int> noiseIdx;
	for (int i = 0; i < size; ++i)
	{
		if (points[i].cluster == NOISE)
			noiseIdx.push_back(i);
	}
	if (noiseIdx.size() > 0)
		cluster.push_back(noiseIdx);
	int cnt = 0;
	for (auto& row : cluster)
		cnt += row.size();
	std::cout << "total cnt " << cnt << std::endl;
}

std::vector<std::vector<int>> DBSCAN::getCluster()
{
	return cluster;
}

std::vector<int> DBSCAN::getPointsCluster()
{
	std::vector<int> pointCluster;
	for (int i = 0; i < size; ++i)
		pointCluster.push_back(points[i].cluster);
	return pointCluster;
}

int DBSCAN::getNumCluster()
{
	return clusterIdx + 1;
}

2.2 C++测试

main.cpp文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include "dbscan.h"

using namespace std;

int main()
{
	ifstream in(".\\dbscanTest2.txt");
	if (!in.is_open()) {
		cerr << "fatal error! unable to open file" << endl;
		return 0;
	}
	vector<vector<double>> data;
	stringstream ins;
	string s;
	while (getline(in, s)) {
		ins << s;
		vector<double> row;
		double t;
		while (ins >> t) {
			row.push_back(t);
		}
		data.push_back(row);
		ins.clear();
	}
	cout << "data size: " << data.size() << endl;
	in.close();
	DBSCAN dbscan = DBSCAN(0.1, 5, data);
    dbscan.run();
	std::vector<std::vector<int>> ans = dbscan.getCluster();
	for (auto& row : ans)
	{
		cout << "new cluster:" << endl;
		for (int& a : row)
			cout << a << " ";
		cout << endl;
	}
	return 0;
}

输出为 cppTest

3. python实现

3.1 python代码

dbscan.py文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
import numpy as np
import random
from sklearn.preprocessing import MinMaxScaler

class DBSCAN():
    """class for dbscan clustering algorithm"""
    def __init__(self, epsilon, minPts, filePath=None, verbose=False):
        self.epsilon = epsilon
        self.minPts = minPts
        self.filePath = filePath
        self.verbose = verbose
        self.dataSet = None
        self.nearNeighbors = None
    
    def loadDataSet(self, fileName):
        dataMat = []
        if self.verbose:
            print(fileName)
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split()
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
        dataMat = np.array(dataMat)
        return dataMat

    def calDist(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def getNeighbor(self, data, dataSet, epsilon):
        if self.verbose:
            print("Find neighbors of ", data)
        res = list()
        for i in range(len(dataSet)):
            if self.calDist(data, dataSet[i]) <= epsilon:
                res.append(i)
        return res

    def normalizeFeature(self, dataSet):
        scaler = MinMaxScaler()
        return scaler.fit_transform(dataSet)


    def checkNearObject(self, dataSet):
        res = []
        for i in range(len(dataSet)):
            newArr = []
            for j in range(len(dataSet)):
                if i == j:
                    continue
                if self.calDist(dataSet[i], dataSet[j]) <= self.epsilon:
                    newArr.append(j)
            res.append(newArr)
        self.nearNeighbors = np.array(res)

    def isCoreObject(self, idx):
        return len(self.nearNeighbors[idx]) >= self.minPts

    def dfs(self, idx, notAccess, l=[]):
        l.append(idx)
        notAccess[idx] = True
        if self.isCoreObject(idx):
            for i in self.nearNeighbors[idx]:
                if notAccess[i]:
                    continue
                self.dfs(i, notAccess, l)

        
    def fit(self, dataSet=None):
        if dataSet is None:
            assert self.filePath is not None, 'filePath is None and dataSet is not provide'
            dataSet = self.loadDataSet(self.filePath)
            self.dataSet = dataSet
            self.dataSet = self.normalizeFeature(self.dataSet)
        else:
            self.dataSet = self.normalizeFeature(dataSet)
        self.checkNearObject(self.dataSet)
        cluster = {}    
        instanceNum = len(self.dataSet)
        notAccess = np.full(instanceNum, False, np.bool)
        clusterIdx = 0
        for i in range(instanceNum):
            if notAccess[i]:
                continue
            if self.isCoreObject(i):
                l = []
                self.dfs(i, notAccess, l)
                l = sorted(l)
                cluster[clusterIdx] = l
                clusterIdx += 1
        noiseArr = []
        for i in range(instanceNum):
            if ~notAccess[i]:
                noiseArr.append(i)   
        if len(noiseArr) > 0:
            cluster[clusterIdx] = noiseArr
        self.cluster_ = cluster
        return cluster
    
    def check_is_fitted(self, attributes, msg=None, all_or_any=all):
        if msg is None:
            msg = ("This %(name)s instance is not fitted yet. Call 'fit' with "
                   "appropriate arguments before using this method.")
        if not hasattr(self, 'fit'):
            raise TypeError("This is not an estimator instance")
        if not isinstance(attributes, (list, tuple)):
            attributes = [attributes]
        if not all_or_any([hasattr(self, attr) for attr in attributes]):
            raise TypeError(msg % {'name': type(self).__name__})


    def predict(self, testData):
        self.check_is_fitted('cluster_')
        if self.verbose:
            print("Predict the class of test data")
        n_samples = testData.shape[0]
        labels = np.full(n_samples, -1, np.int32)
        for i in range(n_samples):
            for core in list(self.cluster_.keys()):
                arr = self.cluster_[core]
                for a in arr:
                    if self.calDist(testData[i], a) <= self.epsilon:
                        labels[i] = core
        self.labels_ = labels
        return labels

3.2 python测试

1
2
3
4
5
6
7
from dbscan import DBSCAN

dbscan = DBSCAN(0.1, 5, filePath='./dbscan/dbscanTest2.txt')
cluster = dbscan.fit()

for key in cluster.keys():
    print(cluster[key])

输出为 pythonTest

4. sklearn库

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('./dbscan/dbscanTest2.txt', sep=' ', header=None)
data = data.values
scaler = MinMaxScaler()
data = scaler.fit_transform(data)

dbscan = DBSCAN(0.1, 5).fit(data)
print(dbscan.labels_)

输出为 skDbscanTest

可以看出三种实现方式运行结果一样,符合预期

5. Reference

DBSCAN密度聚类算法-刘建平
DBSCAN-cpp-github
《机器学习第九章-周志华》