Logistic regression takes the "logistic" in its name from the logistic (sigmoid) function, which maps a linear combination of the input features to a probability. Its output is a continuous value between 0 and 1, which explains the "regression" part of the name, but the model is usually used as a binary classifier: choose a threshold, assign inputs whose predicted probability exceeds the threshold to the positive class, and those below it to the negative class.

Core ideas of logistic regression

Classification with linear regression

Classification and regression are closely related; the difference lies in the kind of value they output:

  • Classification outputs discrete values (e.g. spam vs. non-spam email).
  • Regression outputs continuous values (e.g. the price of a house).

It might therefore seem that adding a threshold on top of linear regression is enough to turn it into a classifier. However, this approach has a serious drawback: the resulting classifier is unstable and very sensitive to outliers.

The core idea of logistic regression

Since "linear regression + threshold" is not enough to build a good classifier, we need a monotone, differentiable function (such as the sigmoid) that links the true label $y$ of the classification task to the real-valued prediction of the linear regression model.

A small example makes this concrete:
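The example that followed in the original post is not reproduced here; as a stand-in, the following toy sketch (my own, with made-up numbers and a made-up helper name lsq_threshold) shows the effect numerically: one extreme but correctly labelled positive sample drags the least-squares decision point past another positive sample, while the sigmoid saturates and is barely affected.

import numpy as np

# 1-D toy data: x <= 2 is labelled 0, x >= 3 is labelled 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def lsq_threshold(x, y):
    # fit least squares y = a*x + b and return the point where a*x + b = 0.5
    a, b = np.polyfit(x, y, 1)
    return (0.5 - b) / a

print(lsq_threshold(x, y))          # 2.5, nicely between the two classes

# one extreme (but correctly labelled) positive sample far to the right
x_out = np.append(x, 100.0)
y_out = np.append(y, 1.0)
print(lsq_threshold(x_out, y_out))  # ~3.8: the positive sample at x = 3 is now misclassified

# the sigmoid saturates, so an extreme input adds almost nothing
# once it is already confidently classified
sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(sigmoid(5.0), sigmoid(100.0)) # ~0.993 and ~1.0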

The three elements of logistic regression

Recall that a machine learning method has three elements:

  • Model: the hypothesis space, determined by the specific problem.
  • Strategy: the criterion for selecting the best model from that space (it usually takes the form of a loss function).
  • Algorithm: the procedure that optimizes the loss function and produces the final model.

The logistic regression model

It is the following conditional probability distribution:

$P(Y=1|x)=\frac{e^{w\cdot x+b}}{1+e^{w\cdot x+b}}$

$P(Y=0|x)=\frac{1}{1+e^{w\cdot x+b}}$

Here $x \in \mathbb{R}^n$ is the input, $Y \in \{0, 1\}$ is the output, $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are the parameters; $w$ is called the weight vector, $b$ the bias, and $w \cdot x$ is the inner product of $w$ and $x$.
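As a quick numerical check of these two formulas, here is a small sketch (the values of w, b and x are made up for illustration) confirming that the two probabilities are complementary:

import numpy as np

w = np.array([1.0, -2.0, 0.5])   # weight vector (arbitrary illustrative values)
b = 0.3                          # bias
x = np.array([0.2, 0.1, 1.0])    # one input sample

z = np.dot(w, x) + b             # linear part w·x + b
p1 = np.exp(z) / (1 + np.exp(z)) # P(Y=1|x)
p0 = 1 / (1 + np.exp(z))         # P(Y=0|x)
print(p1, p0, p1 + p0)           # ~0.69, ~0.31, and they sum to 1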

The logistic regression strategy

The binary cross-entropy loss is used as the objective function (a short numerical sketch follows the definitions below):

$L(w,b) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$

where:

  • $N$ is the number of training samples
  • $y_i$ is the true label of the $i$-th sample (0 or 1)
  • $\hat{y}_i = \frac{e^{w \cdot x_i + b}}{1 + e^{w \cdot x_i + b}}$ is the predicted probability that the $i$-th sample belongs to the positive class
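The numerical sketch mentioned above (my own; bce_loss is a made-up helper, not one of the post's classes) evaluates this loss for a small batch; the clipping only guards against log(0):

import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    # binary cross-entropy averaged over the N samples
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # predicted P(Y=1|x) for each sample
print(bce_loss(y_true, y_prob))          # ~0.30: small, since most predictions are good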

The logistic regression algorithm

Gradient descent is used.

The gradient of the loss $L(w, b)$ is:
$\nabla L(w, b) = \left( \frac{\partial L(w, b)}{\partial w}, \frac{\partial L(w, b)}{\partial b} \right)$
$\frac{\partial L(w, b)}{\partial w} = \frac{1}{N} \sum_{i=1}^N \left[ \hat{y}_i - y_i \right] x_i$
$\frac{\partial L(w, b)}{\partial b} = \frac{1}{N} \sum_{i=1}^N \left[ \hat{y}_i - y_i \right]$

In vectorized form, $\nabla L(w, b)=\frac{1}{N}X^{\prime T}(\hat{y}-y)$, where $X'$ is the input matrix with a column of ones appended (so that the bias $b$ is absorbed into $w$) and $\hat{y}$ is the vector of predicted probabilities.
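A quick sanity check (my own addition) that compares this vectorized gradient with a central finite-difference estimate of the loss; X_aug below plays the role of $X'$ and theta stacks $w$ and $b$:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def bce(theta, X_aug, y):
    p = sigmoid(X_aug @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(theta, X_aug, y):
    return X_aug.T @ (sigmoid(X_aug @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X_aug = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])  # [x | 1]
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

numeric = np.array([(bce(theta + 1e-6 * e, X_aug, y) - bce(theta - 1e-6 * e, X_aug, y)) / 2e-6
                    for e in np.eye(3)])
print(np.allclose(numeric, bce_grad(theta, X_aug, y), atol=1e-6))  # True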

The gradient descent algorithm that uses the binary cross-entropy loss as its objective is as follows (a compact NumPy sketch of the five steps is given right after them):
Input: objective function $f(w)$, gradient function $g(w) = \nabla f(w)$, tolerance $\epsilon$;
Output: a minimizer $w^*$ of $f(w)$.

  1. Choose an initial value $w^{(0)} \in \mathbb{R}^n$ and set $k = 0$;
  2. Compute the objective value $f(w^{(k)}) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$;
  3. Compute the gradient $g_k = g(w^{(k)}) = \frac{1}{N}X^{\prime T}(\hat{y}-y)$. If $\| g_k \| < \epsilon$, stop and set $w^* = w^{(k)}$; otherwise set $p_k = -g(w^{(k)})$ and find the step size $\lambda_k$ such that

$f(w^{(k)} + \lambda_k p_k) = \min_{\lambda \geqslant 0} f(w^{(k)} + \lambda p_k)$

  4. Set $w^{(k+1)} = w^{(k)} + \lambda_k p_k$ and compute $f(w^{(k+1)})$. If $\| f(w^{(k+1)}) - f(w^{(k)}) \| < \epsilon$ or $\| w^{(k+1)} - w^{(k)} \| < \epsilon$, stop and set $w^* = w^{(k+1)}$;
  5. Otherwise set $k = k + 1$ and go back to step 3.
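Before the longer object-oriented implementations that follow, here is a compact sketch of these five steps (my own condensation, not the post's code). It uses a fixed learning rate in place of the exact line search of step 3, which the classes below instead perform with scipy.optimize.fminbound.

import numpy as np

def gradient_descent_lr(X, y, lr=0.1, epsilon=1e-6, max_iter=10000):
    """Steps 1-5 above; X has shape (N, n), y has 0/1 entries."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    X_aug = np.hstack([X, np.ones((len(y), 1))])   # absorb the bias b into w
    w = np.zeros(X_aug.shape[1])                   # step 1: initial value, k = 0
    for k in range(max_iter):
        y_hat = sigmoid(X_aug @ w)                 # step 2: current predictions
        g = X_aug.T @ (y_hat - y) / len(y)         # step 3: gradient
        if np.linalg.norm(g) < epsilon:            # stop on a vanishing gradient
            break
        w_new = w - lr * g                         # step 4: move against the gradient
        if np.linalg.norm(w_new - w) < epsilon:    # stop on a tiny update
            w = w_new
            break
        w = w_new                                  # step 5: k <- k + 1 and repeat
    return w[:-1], w[-1]                           # learned weights and bias

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(gradient_descent_lr(X, y))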

Implementing logistic regression

Batch gradient descent implementation

Batch gradient descent uses the whole training set to compute the gradient and update the parameters at every iteration.

from scipy.optimize import fminbound
from pylab import mpl
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']

class LogisticRegression:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6):
        """
        逻辑斯谛回归
        :param max_iter: 最大迭代次数
        :param distance: 一维搜索的长度范围
        :param epsilon: 迭代停止阈值
        """
        self.max_iter = max_iter
        self.epsilon = epsilon
        self.w = None
        self.distance = distance
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X):
        """
        预处理数据:在原始X末尾加上一列,该列数值全部为1(对应偏置项)
        :param X: 输入特征矩阵
        :return: 增加偏置项后的特征矩阵
        """
        row = X.shape[0]
        y = np.ones(row).reshape(row, 1)
        return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        """
        Sigmoid函数
        :param x: 输入值
        :return: Sigmoid函数值
        """
        return 1 / (1 + np.exp(-x))

    def grad(self, w):
        """
        计算梯度
        :param w: 当前权重
        :return: 梯度向量
        """
        z = np.dot(self._X, w.T)
        grad = self._X * (self._y - self.sigmoid(z))
        grad = grad.sum(axis=0)
        return grad

    def like_func(self, w):
        """
        计算对数似然函数值
        :param w: 当前权重
        :return: 对数似然函数值
        """
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        return np.sum(f)

    def fit_bgd(self, data_x, data_y):
        """
        训练逻辑斯谛回归模型
        :param data_x: 输入特征矩阵
        :param data_y: 真实标签
        """
        self._X = self.preprocessing(data_x)
        self._y = data_y.T
        # (1)取初始化w
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        # (2)计算f(w)
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            # 计算梯度g(w)
            grad = self.grad(w)
            # (3)当梯度g(w)的模长小于精度时,停止迭代
            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            # 梯度方向的一维函数
            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            # (3)进行一维搜索,找到使得函数最大的lambda
            _lambda = fminbound(f, -self.distance, self.distance)

            # (4)设置w(k+1)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            # (4)当f(w(k+1))-f(w(k))的二范数小于精度,或w(k+1)-w(k)的二范数小于精度
            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            # (5) 设置k=k+1
            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def predict(self, x):
        """
        根据训练好的模型进行预测
        :param x: 输入特征矩阵
        :return: 预测结果
        """
        p = self.sigmoid(np.dot(self.preprocessing(x), self.w.T))
        p[np.where(p > 0.5)] = 1
        p[np.where(p < 0.5)] = 0
        return p

    def score(self, X, y):
        """
        计算模型准确率
        :param X: 输入特征矩阵
        :param y: 真实标签
        :return: 模型准确率
        """
        y_c = self.predict(X)
        # 计算准确率
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        """
        绘制三维散点图和决策边界
        :param X: 输入特征矩阵
        :param y: 真实标签
        """
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        # 绘制数据集散点图
        ax = plt.axes(projection='3d')
        x_1 = X_po[:, 0]
        y_1 = X_po[:, 1]
        z_1 = X_po[:, 2]
        x_2 = X_ne[:, 0]
        y_2 = X_ne[:, 1]
        z_2 = X_ne[:, 2]
        ax.scatter(x_1, y_1, z_1, c="r", label="正实例")
        ax.scatter(x_2, y_2, z_2, c="b", label="负实例")
        ax.legend(loc='best')
        x = np.linspace(-3, 3, 3)
        y = np.linspace(-3, 3, 3)
        x_3, y_3 = np.meshgrid(x, y)
        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()

# 训练数据集
X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]])
y_train = np.array([[1, 1, 1, 0, 0, 0]])
# 构建实例,进行训练
clf = LogisticRegression(epsilon=1e-6)
clf.fit_bgd(X_train, y_train)
clf.draw(X_train, y_train)
print("迭代次数:{}次".format(clf.n_iter_))
print("梯度:{}".format(clf.grad_))
print("权重:{}".format(clf.w[0]))
print("模型准确率:%0.2f%%" % (clf.score(X_train, y_train) * 100))

[figure: 3D scatter of the training samples with the fitted decision surface]

迭代次数:2095次
梯度:[ 7.33881397e-05  2.44073067e-05  2.52604176e-04 -5.13424350e-04]
权重:[  4.34496173   2.05340452   9.64074166 -22.85079478]
模型准确率:100.00%

Stochastic gradient descent implementation

Stochastic gradient descent uses a single randomly chosen sample to compute the gradient and update the parameters at every iteration.
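For contrast with the line-search variant implemented below, the bare SGD update rule fits in a few lines; this is a sketch with a fixed learning rate (sgd_epoch is a made-up helper, not part of the post's class):

import numpy as np

def sgd_epoch(w, X_aug, y, lr=0.1, seed=0):
    """One pass over the data: one single-sample gradient step at a time."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(y)):
        err = sigmoid(X_aug[i] @ w) - y[i]   # residual of sample i only
        w = w - lr * err * X_aug[i]          # cheap but noisy gradient step
    return w

X_aug = np.array([[3.0, 3.0, 1.0], [1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]])  # bias column appended
y = np.array([1.0, 0.0, 0.0])
w = np.zeros(3)
for epoch in range(200):
    w = sgd_epoch(w, X_aug, y, seed=epoch)
print(w)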

from scipy.optimize import fminbound
from pylab import mpl
import numpy as np
import matplotlib.pyplot as plt
import random
%matplotlib inline


mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']

class LogisticRegression:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6):
        """
        逻辑斯谛回归
        :param max_iter: 最大迭代次数
        :param distance: 一维搜索的长度范围
        :param epsilon: 迭代停止阈值
        """
        self.max_iter = max_iter
        self.epsilon = epsilon
        # 权重
        self.w = None
        self.distance = distance
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X):
        """
        预处理数据:在原始X末尾加上一列,该列数值全部为1(对应偏置项)
        :param X: 输入特征矩阵
        :return: 增加偏置项后的特征矩阵
        """
        row = X.shape[0]
        y = np.ones(row).reshape(row, 1)
        return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        """
        Sigmoid函数
        :param x: 输入值
        :return: Sigmoid函数值
        """
        return 1 / (1 + np.exp(-x))

    def grad(self, w, xi, yi):
        """
        计算梯度
        :param w: 当前权重
        :param xi: 单个样本的特征向量
        :param yi: 单个样本的标签
        :return: 梯度向量
        """
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        return grad

    def like_func(self, w):
        """
        计算对数似然函数值
        :param w: 当前权重
        :return: 对数似然函数值
        """
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        return np.sum(f)

    def fit_sgd(self, data_x, data_y):
        """
        训练逻辑斯谛回归模型,使用随机梯度下降
        :param data_x: 输入特征矩阵
        :param data_y: 真实标签
        """
        self._X = self.preprocessing(data_x)
        self._y = data_y.T.reshape(-1, 1)
        # (1)取初始化w
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        # (2)计算f(w)
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            # 随机选取一个样本
            idx = random.randint(0, self._X.shape[0] - 1)
            xi = self._X[idx, :].reshape(1, -1)
            yi = self._y[idx, 0]

            # 计算梯度g(w)
            grad = self.grad(w, xi, yi)

            # (3)当梯度g(w)的模长小于精度时,停止迭代
            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            # 梯度方向的一维函数
            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            # (3)进行一维搜索,找到使得函数最大的lambda
            _lambda = fminbound(f, -self.distance, self.distance)

            # (4)设置w(k+1)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            # (4)当f(w(k+1))-f(w(k))的范数小于精度,或w(k+1)-w(k)的范数小于精度
            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            # (5) 设置k=k+1
            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def predict(self, x):
        """
        根据训练好的模型进行预测
        :param x: 输入特征矩阵
        :return: 预测结果
        """
        p = self.sigmoid(np.dot(self.preprocessing(x), self.w.T))
        p[np.where(p > 0.5)] = 1
        p[np.where(p < 0.5)] = 0
        return p

    def score(self, X, y):
        """
        计算模型准确率
        :param X: 输入特征矩阵
        :param y: 真实标签
        :return: 模型准确率
        """
        y_c = self.predict(X)
        # 计算准确率
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        """
        绘制三维散点图和决策边界
        :param X: 输入特征矩阵
        :param y: 真实标签
        """
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        # 绘制数据集散点图
        ax = plt.axes(projection='3d')
        x_1 = X_po[:, 0]
        y_1 = X_po[:, 1]
        z_1 = X_po[:, 2]
        x_2 = X_ne[:, 0]
        y_2 = X_ne[:, 1]
        z_2 = X_ne[:, 2]
        ax.scatter(x_1, y_1, z_1, c="r", label="正实例")
        ax.scatter(x_2, y_2, z_2, c="b", label="负实例")
        ax.legend(loc='best')
        x = np.linspace(-3, 3, 3)
        y = np.linspace(-3, 3, 3)
        x_3, y_3 = np.meshgrid(x, y)
        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()
# 训练数据集
X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1],[3, -2, 1]])
y_train = np.array([[1, 1, 1, 0, 0, 0, 1]])

# 构建实例,进行训练
clf = LogisticRegression(epsilon=1e-6)
clf.fit_sgd(X_train, y_train)
clf.draw(X_train, y_train)
print("迭代次数:{}次".format(clf.n_iter_))
print("梯度:{}".format(clf.grad_))
print("权重:{}".format(clf.w[0]))
print("模型准确率:%0.2f%%" % (clf.score(X_train, y_train) * 100))


[figure: 3D scatter of the training samples with the fitted decision surface]

迭代次数:1次
梯度:[[-0.69839141 -0.69839141 -0.69839141 -0.69839141]]
权重:[0.20991178 0.20991178 0.20991178 0.20991178]
模型准确率:57.14%

Logistic regression with L2 regularization

L2 regularization prevents overfitting by adding the penalty $\alpha \sum_{i=1}^n w_i^2$, the sum of the squared weights, to the loss function (the implementation below actually uses $\frac{\alpha}{2}\sum_{i=1}^n w_i^2$, whose gradient is simply $\alpha w$).
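In gradient form the penalty just adds $\alpha w$ to the unregularized gradient, which is exactly the extra term used in the class below. A minimal sketch of one regularized update (my own illustration, fixed learning rate; note that, like the class below, it also penalizes the bias, which sits in the last column of w):

import numpy as np

def l2_update(w, X_aug, y, alpha=0.1, lr=0.1):
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    grad = X_aug.T @ (sigmoid(X_aug @ w) - y) / len(y)  # data term
    grad += alpha * w                                   # gradient of (alpha/2) * ||w||^2
    return w - lr * grad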

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fminbound
from pylab import mpl
import random

# 图像显示中文
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']

class LogisticRegressionL2:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6, alpha=1.0):
        """
        逻辑斯谛回归 + L2正则化
        :param max_iter: 最大迭代次数
        :param distance: 一维搜索的长度范围
        :param epsilon: 迭代停止阈值
        :param alpha: 正则化系数
        """
        self.max_iter = max_iter
        self.epsilon = epsilon
        self.alpha = alpha
        self.w = None
        self.distance = distance
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X):
        row = X.shape[0]
        y = np.ones(row).reshape(row, 1)
        return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = np.dot(self._X.T, (self.sigmoid(z) - self._y))
        grad = grad.reshape(1, -1) + self.alpha * w  # L2正则化
        return grad

    def grad2(self, w, xi, yi):
        """
        计算梯度
        :param w: 当前权重
        :param xi: 单个样本的特征向量
        :param yi: 单个样本的标签
        :return: 梯度向量
        """
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        grad = grad.reshape(1, -1) + self.alpha * w  # L2正则化
        return grad

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = 0.5 * self.alpha * np.sum(w**2)  # L2正则化
        return np.sum(f) - reg

    def fit_bgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T
        w = np.zeros((1, self._X.shape[1]))
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            grad = self.grad(w)
            if (np.linalg.norm(grad, axis=1, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=1, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def fit_sgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T.reshape(-1, 1)
        w = np.zeros((1, self._X.shape[1]))
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            idx = random.randint(0, self._X.shape[0] - 1)
            xi = self._X[idx, :].reshape(1, -1)
            yi = self._y[idx, 0]
            grad = self.grad2(w, xi, yi)

            if (np.linalg.norm(grad, axis=1, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=1, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def predict(self, x):
        p = self.sigmoid(np.dot(self.preprocessing(x), self.w.T))
        p = (p > 0.5).astype(int)
        return p

    def score(self, X, y):
        y_c = self.predict(X)
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        ax = plt.axes(projection='3d')
        x_1 = X_po[:, 0]
        y_1 = X_po[:, 1]
        z_1 = X_po[:, 2]
        x_2 = X_ne[:, 0]
        y_2 = X_ne[:, 1]
        z_2 = X_ne[:, 2]
        ax.scatter(x_1, y_1, z_1, c="r", label="正实例")
        ax.scatter(x_2, y_2, z_2, c="b", label="负实例")
        ax.legend(loc='best')
        x = np.linspace(-3, 3, 3)
        y = np.linspace(-3, 3, 3)
        x_3, y_3 = np.meshgrid(x, y)
        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()

# 训练数据集
X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]])
y_train = np.array([[1, 1, 1, 0, 0, 0]])

# 构建实例,进行训练
clf = LogisticRegressionL2(epsilon=1e-6, alpha=0.1)
clf.fit_bgd(X_train, y_train)
clf.draw(X_train, y_train)
print("批量梯度下降 - 迭代次数:{}次".format(clf.n_iter_))
print("批量梯度下降 - 梯度:{}".format(clf.grad_))
print("批量梯度下降 - 权重:{}".format(clf.w[0]))
print("批量梯度下降 - 模型准确率:%0.2f%%" % (clf.score(X_train, y_train) * 100))

# 使用随机梯度下降进行训练
clf_sgd = LogisticRegressionL2(epsilon=1e-6, alpha=0.1)
clf_sgd.fit_sgd(X_train, y_train)
clf_sgd.draw(X_train, y_train)
print("随机梯度下降 - 迭代次数:{}次".format(clf_sgd.n_iter_))
print("随机梯度下降 - 梯度:{}".format(clf_sgd.grad_))
print("随机梯度下降 - 权重:{}".format(clf_sgd.w[0]))
print("随机梯度下降 - 模型准确率:%0.2f%%" % (clf_sgd.score(X_train, y_train) * 100))


[figure: 3D scatter of the training samples with the fitted decision surface]

批量梯度下降 - 迭代次数:70次
批量梯度下降 - 梯度:[[-0.00092171 -0.00043103 -0.00189263  0.00019492]]
批量梯度下降 - 权重:[ 0.83360921  0.95912031  0.48172953 -2.59686518]
批量梯度下降 - 模型准确率:100.00%

[figure: 3D scatter of the training samples with the SGD-fitted decision surface]

随机梯度下降 - 迭代次数:19次
随机梯度下降 - 梯度:[[-0.12778056  0.37250462 -0.23284058 -0.26200624]]
随机梯度下降 - 权重:[ 1.33386189  1.11338873 -1.02257894 -1.31423674]
随机梯度下降 - 模型准确率:83.33%

Logistic regression with L1 regularization

L1 regularization prevents overfitting by adding the penalty $\alpha \sum_{i=1}^n |w_i|$, the sum of the absolute values of the weights, to the loss function.
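$|w|$ is not differentiable at zero, so the class below (like this sketch of mine) uses the subgradient $\alpha\,\mathrm{sign}(w)$. Dedicated L1 solvers use proximal/soft-thresholding steps to obtain exact zeros; a plain subgradient step only pushes weights towards zero.

import numpy as np

def l1_update(w, X_aug, y, alpha=0.1, lr=0.1):
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    grad = X_aug.T @ (sigmoid(X_aug @ w) - y) / len(y)  # data term
    grad += alpha * np.sign(w)                          # subgradient of alpha * sum|w_i|
    return w - lr * grad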

class LogisticRegressionL1:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6, alpha=1.0):
        """
        逻辑斯谛回归 + L1正则化
        :param max_iter: 最大迭代次数
        :param distance: 一维搜索的长度范围
        :param epsilon: 迭代停止阈值
        :param alpha: 正则化系数
        """
        self.max_iter = max_iter
        self.epsilon = epsilon
        self.alpha = alpha
        self.w = None
        self.distance = distance
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X):
        row = X.shape[0]
        y = np.ones(row).reshape(row, 1)
        return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = self._X * (self._y - self.sigmoid(z))
        grad = grad.sum(axis=0)
        grad = grad.reshape(1, -1)
        grad += self.alpha * np.sign(w)  # L1正则化
        return grad

    def grad2(self, w, xi, yi):
        """
        计算梯度
        :param w: 当前权重
        :param xi: 单个样本的特征向量
        :param yi: 单个样本的标签
        :return: 梯度向量
        """
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        grad += self.alpha * np.sign(w)
        return grad

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = self.alpha * np.sum(np.abs(w))  # L1正则化
        return np.sum(f) - reg

    def fit_bgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            grad = self.grad(w)
            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def fit_sgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T.reshape(-1, 1)
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            idx = random.randint(0, self._X.shape[0] - 1)
            xi = self._X[idx, :].reshape(1, -1)
            yi = self._y[idx, 0]
            grad = self.grad2(w, xi, yi)

            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def predict(self, x):
        p = self.sigmoid(np.dot(self.preprocessing(x), self.w.T))
        p[np.where(p > 0.5)] = 1
        p[np.where(p < 0.5)] = 0
        return p

    def score(self, X, y):
        y_c = self.predict(X)
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        ax = plt.axes(projection='3d')
        x_1 = X_po[:, 0]
        y_1 = X_po[:, 1]
        z_1 = X_po[:, 2]
        x_2 = X_ne[:, 0]
        y_2 = X_ne[:, 1]
        z_2 = X_ne[:, 2]
        ax.scatter(x_1, y_1, z_1, c="r", label="正实例")
        ax.scatter(x_2, y_2, z_2, c="b", label="负实例")
        ax.legend(loc='best')
        x = np.linspace(-3, 3, 3)
        y = np.linspace(-3, 3, 3)
        x_3, y_3 = np.meshgrid(x, y)
        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()

# 训练数据集
X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]])
y_train = np.array([[1, 1, 1, 0, 0, 0]])

# 构建实例,进行训练
clf_l1 = LogisticRegressionL1(epsilon=1e-6, alpha=0.1)
clf_l1.fit_bgd(X_train, y_train)
clf_l1.draw(X_train, y_train)
print("批量梯度下降 - 迭代次数:{}次".format(clf_l1.n_iter_))
print("批量梯度下降 - 梯度:{}".format(clf_l1.grad_))
print("批量梯度下降 - 权重:{}".format(clf_l1.w[0]))
print("批量梯度下降 - 模型准确率:%0.2f%%" % (clf_l1.score(X_train, y_train) * 100))

# 使用随机梯度下降进行训练
clf_l1_sgd = LogisticRegressionL1(epsilon=1e-6, alpha=0.1)
clf_l1_sgd.fit_sgd(X_train, y_train)
clf_l1_sgd.draw(X_train, y_train)
print("随机梯度下降 - 迭代次数:{}次".format(clf_l1_sgd.n_iter_))
print("随机梯度下降 - 梯度:{}".format(clf_l1_sgd.grad_))
print("随机梯度下降 - 权重:{}".format(clf_l1_sgd.w[0]))
print("随机梯度下降 - 模型准确率:%0.2f%%" % (clf_l1_sgd.score(X_train, y_train) * 100))


[figure: 3D scatter of the training samples with the fitted decision surface]

批量梯度下降 - 迭代次数:20次
批量梯度下降 - 梯度:[[ 0.15139036  0.0450223  -0.01778915 -0.26590759]]
批量梯度下降 - 权重:[ 2.09399897  1.63663209 -0.45160074 -3.78733898]
批量梯度下降 - 模型准确率:100.00%

[figure: 3D scatter of the training samples with the SGD-fitted decision surface]

随机梯度下降 - 迭代次数:27次
随机梯度下降 - 梯度:[[ 0.11711828  0.1        -0.11711828 -0.11711828]]
随机梯度下降 - 权重:[ 1.50984213  1.11760232 -0.91330033 -1.62427959]
随机梯度下降 - 模型准确率:83.33%

Logistic regression with elastic-net regularization

Elastic-net regularization combines the L1 and L2 penalties, with a ratio parameter $\rho$ (l1_ratio in the code) controlling their relative weight:

$\mathrm{Reg}(w)=\alpha\left(\rho\sum_{i=1}^n|w_i|+\frac{1-\rho}{2}\sum_{i=1}^n w_i^2\right)$
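Its (sub)gradient is the corresponding blend of the two penalty gradients, which is exactly the extra term added in the class below; as a one-function sketch (my own):

import numpy as np

def elastic_net_penalty_grad(w, alpha=0.1, l1_ratio=0.5):
    # d/dw of alpha * ( l1_ratio * sum|w_i| + (1 - l1_ratio)/2 * sum w_i^2 )
    return alpha * (l1_ratio * np.sign(w) + (1 - l1_ratio) * w)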

class LogisticRegressionElasticNet:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6, alpha=1.0, l1_ratio=0.5):
        """
        逻辑斯谛回归 + 弹性网络正则化
        :param max_iter: 最大迭代次数
        :param distance: 一维搜索的长度范围
        :param epsilon: 迭代停止阈值
        :param alpha: 正则化系数
        :param l1_ratio: L1正则化的比例
        """
        self.max_iter = max_iter
        self.epsilon = epsilon
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.w = None
        self.distance = distance
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X):
        row = X.shape[0]
        y = np.ones(row).reshape(row, 1)
        return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = self._X * (self._y - self.sigmoid(z))
        grad = grad.sum(axis=0)
        grad = grad.reshape(1, -1)  # Reshape grad to have the same shape as w
        grad += self.alpha * (self.l1_ratio * np.sign(w) + (1 - self.l1_ratio) * w)  # 弹性网络正则化
        return grad

    def grad2(self, w, xi, yi):
        """
        计算梯度
        :param w: 当前权重
        :param xi: 单个样本的特征向量
        :param yi: 单个样本的标签
        :return: 梯度向量
        """
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        grad += self.alpha * (self.l1_ratio * np.sign(w) + (1 - self.l1_ratio) * w)  # 弹性网络正则化
        return grad

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = self.alpha * (self.l1_ratio * np.sum(np.abs(w)) + 0.5 * (1 - self.l1_ratio) * np.sum(w**2))  # 弹性网络正则化
        return np.sum(f) - reg

    def fit_bgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            grad = self.grad(w)
            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def fit_sgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x)
        self._y = data_y.T.reshape(-1, 1)
        w = np.array([[0] * self._X.shape[1]], dtype=float)
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            idx = random.randint(0, self._X.shape[0] - 1)
            xi = self._X[idx, :].reshape(1, -1)
            yi = self._y[idx, 0]
            grad = self.grad2(w, xi, yi)

            if (np.linalg.norm(grad, axis=0, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=0, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1]

    def predict(self, x):
        p = self.sigmoid(np.dot(self.preprocessing(x), self.w.T))
        p[np.where(p > 0.5)] = 1
        p[np.where(p < 0.5)] = 0
        return p

    def score(self, X, y):
        y_c = self.predict(X)
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        ax = plt.axes(projection='3d')
        x_1 = X_po[:, 0]
        y_1 = X_po[:, 1]
        z_1 = X_po[:, 2]
        x_2 = X_ne[:, 0]
        y_2 = X_ne[:, 1]
        z_2 = X_ne[:, 2]
        ax.scatter(x_1, y_1, z_1, c="r", label="正实例")
        ax.scatter(x_2, y_2, z_2, c="b", label="负实例")
        ax.legend(loc='best')
        x = np.linspace(-3, 3, 3)
        y = np.linspace(-3, 3, 3)
        x_3, y_3 = np.meshgrid(x, y)
        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()

# 训练数据集
X_train = np.array([[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]])
y_train = np.array([[1, 1, 1, 0, 0, 0]])

# 构建实例,进行训练
clf_en = LogisticRegressionElasticNet(epsilon=1e-6, alpha=0.1, l1_ratio=0.5)
clf_en.fit_bgd(X_train, y_train)
clf_en.draw(X_train, y_train)
print("批量梯度下降 - 迭代次数:{}次".format(clf_en.n_iter_))
print("批量梯度下降 - 梯度:{}".format(clf_en.grad_))
print("批量梯度下降 - 权重:{}".format(clf_en.w[0]))
print("批量梯度下降 - 模型准确率:%0.2f%%" % (clf_en.score(X_train, y_train) * 100))

# 使用随机梯度下降进行训练
clf_en_sgd = LogisticRegressionElasticNet(epsilon=1e-6, alpha=0.1, l1_ratio=0.5)
clf_en_sgd.fit_sgd(X_train, y_train)
clf_en_sgd.draw(X_train, y_train)
print("随机梯度下降 - 迭代次数:{}次".format(clf_en_sgd.n_iter_))
print("随机梯度下降 - 梯度:{}".format(clf_en_sgd.grad_))
print("随机梯度下降 - 权重:{}".format(clf_en_sgd.w[0]))
print("随机梯度下降 - 模型准确率:%0.2f%%" % (clf_en_sgd.score(X_train, y_train) * 100))


[figure: 3D scatter of the training samples with the fitted decision surface]

批量梯度下降 - 迭代次数:12次
批量梯度下降 - 梯度:[[ 0.20841643  0.06660582  0.03625754 -0.39246071]]
批量梯度下降 - 权重:[ 1.34227249  1.44899049 -0.31003063 -2.63945671]
批量梯度下降 - 模型准确率:100.00%

[figure: 3D scatter of the training samples with the SGD-fitted decision surface]

随机梯度下降 - 迭代次数:6次
随机梯度下降 - 梯度:[[-0.57973579 -0.58354255 -0.58911514 -0.5929219 ]]
随机梯度下降 - 权重:[0.27972255 0.20358714 0.09213509 0.01599968]
随机梯度下降 - 模型准确率:66.67%

A feature-mapping class for polynomial transformation

import numpy as np

class PolynomialFeatures:
    def __init__(self, degree=2, include_bias=True):
        """
        初始化多项式特征类
        :param degree: 多项式的最高次数
        :param include_bias: 是否包含偏置(即常数项)
        """
        self.degree = degree
        self.include_bias = include_bias

    def fit_transform(self, X):
        """
        将输入特征转换为多项式特征
        :param X: 输入特征矩阵
        :return: 转换后的特征矩阵
        """
        n_samples, n_features = X.shape
        # 计算输出特征的数量
        n_output_features = self._num_output_features(n_features)

        # 初始化输出特征矩阵
        X_poly = np.empty((n_samples, n_output_features))

        # 填充输出特征矩阵(include_bias=False 时跳过 0 次常数项,与上面的特征计数保持一致)
        current_feature = 0
        start_degree = 0 if self.include_bias else 1
        for degree in range(start_degree, self.degree + 1):
            for items in self._combinations(n_features, degree):
                X_poly[:, current_feature] = np.prod(X[:, items], axis=1)
                current_feature += 1

        return X_poly

    def _num_output_features(self, n_features):
        """
        计算输出特征的数量
        :param n_features: 输入特征的数量
        :return: 输出特征的数量
        """
        from math import comb
        if self.include_bias:
            return sum(comb(n_features + i - 1, i) for i in range(self.degree + 1))
        else:
            return sum(comb(n_features + i - 1, i) for i in range(1, self.degree + 1))

    def _combinations(self, n_features, degree):
        """
        生成所有可能的特征组合
        :param n_features: 输入特征的数量
        :param degree: 多项式的次数
        :return: 特征组合的迭代器
        """
        from itertools import combinations_with_replacement
        return combinations_with_replacement(range(n_features), degree)
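A small usage example of this class (my own, with made-up inputs): for two features and degree 2 the generated columns are the bias, x1, x2, x1², x1·x2 and x2².

X_demo = np.array([[2.0, 3.0],
                   [1.0, 4.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_demo))
# expected:
# [[ 1.  2.  3.  4.  6.  9.]
#  [ 1.  1.  4.  1.  4. 16.]]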

A combined logistic regression class implementation

import numpy as np
import matplotlib.pyplot as plt
from pylab import mpl
from scipy.optimize import fminbound
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import random
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
%matplotlib inline

# 图像显示中文
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']

class LogisticRegressionBase:
    def __init__(self, max_iter=10000, distance=3, epsilon=1e-6, alpha=0.01, l1_ratio=0.2, fit_intercept=True):
        self.max_iter = max_iter
        self.epsilon = epsilon
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.distance = distance
        self.fit_intercept = fit_intercept
        self.w = None
        self._X = None
        self._y = None

    @staticmethod
    def preprocessing(X, fit_intercept):
        if fit_intercept:
            row = X.shape[0]
            y = np.ones(row).reshape(row, 1)
            return np.hstack((X, y))
        else:
            row = X.shape[0]
            y = np.zeros(row).reshape(row, 1)
            return np.hstack((X, y))

    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def fit_bgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x, self.fit_intercept)
        self._y = data_y.T
        w = np.zeros((1, self._X.shape[1]))
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            grad = self.grad(w)
            if (np.linalg.norm(grad, axis=1, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=1, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1] if self.fit_intercept else 0

    def fit_sgd(self, data_x, data_y):
        self._X = self.preprocessing(data_x, self.fit_intercept)
        self._y = data_y.T.reshape(-1, 1)
        w = np.zeros((1, self._X.shape[1]))
        k = 0
        fw = self.like_func(w)
        for _ in range(self.max_iter):
            idx = random.randint(0, self._X.shape[0] - 1)
            xi = self._X[idx, :].reshape(1, -1)
            yi = self._y[idx, 0]
            grad = self.grad_sample(w, xi, yi)

            if (np.linalg.norm(grad, axis=1, keepdims=True) < self.epsilon).all():
                self.w = w
                break

            def f(x):
                z = w - np.dot(x, grad)
                return -self.like_func(z)

            _lambda = fminbound(f, -self.distance, self.distance)
            w1 = w - np.dot(_lambda, grad)
            fw1 = self.like_func(w1)

            if np.linalg.norm(fw1 - fw) < self.epsilon or \
                    (np.linalg.norm((w1 - w), axis=1, keepdims=True) < self.epsilon).all():
                self.w = w1
                break

            k += 1
            w, fw = w1, fw1

        self.grad_ = grad
        self.n_iter_ = k
        self.coef_ = self.w[0][:-1]
        self.intercept_ = self.w[0][-1] if self.fit_intercept else 0

    def predict(self, x):
        p = self.sigmoid(np.dot(self.preprocessing(x, self.fit_intercept), self.w.T))
        return (p > 0.5).astype(int)

    def score(self, X, y):
        """
        计算模型准确率
        :param X: 输入特征矩阵
        :param y: 真实标签
        :return: 模型准确率
        """
        y_c = self.predict(X)
        # 计算准确率
        error_rate = np.sum(np.abs(y_c - y.T)) / y_c.shape[0]
        return 1 - error_rate

    def draw(self, X, y):
        y = y[0]
        X_po = X[np.where(y == 1)]
        X_ne = X[np.where(y == 0)]
        ax = plt.axes(projection='3d')
        ax.scatter(X_po[:, 0], X_po[:, 1], X_po[:, 2], c="r", label="正类样本")
        ax.scatter(X_ne[:, 0], X_ne[:, 1], X_ne[:, 2], c="b", label="负类样本")
        ax.legend(loc='best')

        x = np.linspace(min(X[:, 0]), max(X[:, 0]), 30)
        y = np.linspace(min(X[:, 1]), max(X[:, 1]), 30)
        x_3, y_3 = np.meshgrid(x, y)

        a, b, c, d = self.w[0]
        z_3 = -(a * x_3 + b * y_3 + d) / c
        ax.plot_surface(x_3, y_3, z_3, alpha=0.5)
        plt.show()

class LogisticRegression(LogisticRegressionBase):
    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = np.dot(self._X.T, (self.sigmoid(z) - self._y))
        return grad.reshape(1, -1)

    def grad_sample(self, w, xi, yi):
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        return grad.reshape(1, -1)

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        return np.sum(f)

class LogisticRegressionL2(LogisticRegressionBase):
    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = np.dot(self._X.T, (self.sigmoid(z) - self._y))
        return grad.reshape(1, -1) + self.alpha * w

    def grad_sample(self, w, xi, yi):
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        return grad.reshape(1, -1) + self.alpha * w

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = 0.5 * self.alpha * np.sum(w**2)
        return np.sum(f) - reg

class LogisticRegressionL1(LogisticRegressionBase):
    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = np.dot(self._X.T, (self.sigmoid(z) - self._y))
        return grad.reshape(1, -1) + self.alpha * np.sign(w)

    def grad_sample(self, w, xi, yi):
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        return grad.reshape(1, -1) + self.alpha * np.sign(w)

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = self.alpha * np.sum(np.abs(w))
        return np.sum(f) - reg

class LogisticRegressionElasticNet(LogisticRegressionBase):
    def grad(self, w):
        z = np.dot(self._X, w.T)
        grad = np.dot(self._X.T, (self.sigmoid(z) - self._y))
        return grad.reshape(1, -1) + self.alpha * (self.l1_ratio * np.sign(w) + (1 - self.l1_ratio) * w)

    def grad_sample(self, w, xi, yi):
        z = np.dot(xi, w.T)
        grad = xi * (yi - self.sigmoid(z))
        return grad.reshape(1, -1) + self.alpha * (self.l1_ratio * np.sign(w) + (1 - self.l1_ratio) * w)

    def like_func(self, w):
        z = np.dot(self._X, w.T)
        f = self._y * z - np.log(1 + np.exp(z))
        reg = self.alpha * (self.l1_ratio * np.sum(np.abs(w)) + 0.5 * (1 - self.l1_ratio) * np.sum(w**2))
        return np.sum(f) - reg

iris = load_iris()
X = iris.data[:, :3] # 获取前三个特征
y = iris.target

# 获取前两种花
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]
# y_train = y_train.reshape(1, -1)
# 训练数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=24)
y_train = y_train.reshape(1, -1)
y_test = y_test.reshape(1, -1)

models = {
'LogisticRegression': LogisticRegression,
'LogisticRegressionL2': LogisticRegressionL2,
'LogisticRegressionL1': LogisticRegressionL1,
'LogisticRegressionElasticNet': LogisticRegressionElasticNet
}

for name, model in models.items():
    print(f"\n{name}:")

    clf = model(epsilon=1e-6, alpha=0.1, l1_ratio=0.5, fit_intercept=True)
    clf.fit_bgd(X_train, y_train)
    clf.draw(X_train, y_train)
    print("测试集图像")
    clf.draw(X_test, y_test)
    print(f"批量梯度下降 - 迭代次数:{clf.n_iter_}次")
    print(f"批量梯度下降 - 梯度:{clf.grad_}")
    print(f"批量梯度下降 - 权重:{clf.w[0]}")
    print(f"批量梯度下降 - 模型准确率(训练集):{clf.score(X_train, y_train) * 100:.2f}%")
    print(f"批量梯度下降 - 模型准确率(测试集):{clf.score(X_test, y_test) * 100:.2f}%")

    clf_sgd = model(epsilon=1e-6, alpha=0.1, l1_ratio=0.2, fit_intercept=True)
    clf_sgd.fit_sgd(X_train, y_train)
    clf_sgd.draw(X_train, y_train)
    print("测试集图像")
    clf_sgd.draw(X_test, y_test)
    print(f"随机梯度下降 - 迭代次数:{clf_sgd.n_iter_}次")
    print(f"随机梯度下降 - 梯度:{clf_sgd.grad_}")
    print(f"随机梯度下降 - 权重:{clf_sgd.w[0]}")
    print(f"随机梯度下降 - 模型准确率(训练集):{clf_sgd.score(X_train, y_train) * 100:.2f}%")
    print(f"随机梯度下降 - 模型准确率(测试集):{clf_sgd.score(X_test, y_test) * 100:.2f}%")

    # 不使用截距
    clf_no_intercept = model(epsilon=1e-6, alpha=0.1, l1_ratio=0.2, fit_intercept=False)
    clf_no_intercept.fit_bgd(X_train, y_train)
    clf_no_intercept.draw(X_train, y_train)
    print("测试集图像")
    clf_no_intercept.draw(X_test, y_test)
    print(f"批量梯度下降(无截距) - 迭代次数:{clf_no_intercept.n_iter_}次")
    print(f"批量梯度下降(无截距) - 梯度:{clf_no_intercept.grad_}")
    print(f"批量梯度下降(无截距) - 权重:{clf_no_intercept.w[0]}")
    print(f"批量梯度下降(无截距) - 模型准确率(训练集):{clf_no_intercept.score(X_train, y_train) * 100:.2f}%")
    print(f"批量梯度下降(无截距) - 模型准确率(测试集):{clf_no_intercept.score(X_test, y_test) * 100:.2f}%")

    clf_sgd_no_intercept = model(epsilon=1e-6, alpha=0.1, l1_ratio=0.5, fit_intercept=False)
    clf_sgd_no_intercept.fit_sgd(X_train, y_train)
    clf_sgd_no_intercept.draw(X_train, y_train)
    print("测试集图像")
    clf_sgd_no_intercept.draw(X_test, y_test)
    print(f"随机梯度下降(无截距) - 迭代次数:{clf_sgd_no_intercept.n_iter_}次")
    print(f"随机梯度下降(无截距) - 梯度:{clf_sgd_no_intercept.grad_}")
    print(f"随机梯度下降(无截距) - 权重:{clf_sgd_no_intercept.w[0]}")
    print(f"随机梯度下降(无截距) - 模型准确率(训练集):{clf_sgd_no_intercept.score(X_train, y_train) * 100:.2f}%")
    print(f"随机梯度下降(无截距) - 模型准确率(测试集):{clf_sgd_no_intercept.score(X_test, y_test) * 100:.2f}%")

LogisticRegression:

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降 - 迭代次数:106次
批量梯度下降 - 梯度:[[0.00048871 0.00024979 0.00014119 0.0001086 ]]
批量梯度下降 - 权重:[ -2.75849815 -62.3631215  119.9161798   -9.17964491]
批量梯度下降 - 模型准确率(训练集):100.00%
批量梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降 - 迭代次数:8次
随机梯度下降 - 梯度:[[-2.17461649 -1.43326996 -0.69192343 -0.49423102]]
随机梯度下降 - 权重:[-0.00283963 -0.00764932  0.00885183 -0.00079823]
随机梯度下降 - 模型准确率(训练集):71.43%
随机梯度下降 - 模型准确率(测试集):56.67%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降(无截距) - 迭代次数:2次
批量梯度下降(无截距) - 梯度:[[ 4.82286476e-10  4.31432704e-10 -2.36562545e-09  0.00000000e+00]]
批量梯度下降(无截距) - 权重:[ -3.88601165 -16.37436628  26.99405085   0.        ]
批量梯度下降(无截距) - 模型准确率(训练集):100.00%
批量梯度下降(无截距) - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降(无截距) - 迭代次数:113次
随机梯度下降(无截距) - 梯度:[[-2.11332754 -1.53319841 -0.62156692 -0.        ]]
随机梯度下降(无截距) - 权重:[-0.0294974  -0.12914278  0.18824239  0.        ]
随机梯度下降(无截距) - 模型准确率(训练集):100.00%
随机梯度下降(无截距) - 模型准确率(测试集):100.00%

LogisticRegressionL2:

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降 - 迭代次数:28次
批量梯度下降 - 梯度:[[ 0.0007204  -0.00071139 -0.0002971   0.00019384]]
批量梯度下降 - 权重:[-0.58818051 -2.2492532   3.82116933 -0.40851439]
批量梯度下降 - 模型准确率(训练集):100.00%
批量梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降 - 迭代次数:56次
随机梯度下降 - 梯度:[[-2.34877326 -1.7549991  -0.72887583 -0.46125125]]
随机梯度下降 - 权重:[-0.0107058  -0.05730437  0.07658344 -0.00917391]
随机梯度下降 - 模型准确率(训练集):100.00%
随机梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降(无截距) - 迭代次数:31次
批量梯度下降(无截距) - 梯度:[[-0.00409333 -0.00335045 -0.00262166  0.        ]]
批量梯度下降(无截距) - 权重:[-0.65064485 -2.28344337  3.82764611  0.        ]
批量梯度下降(无截距) - 模型准确率(训练集):100.00%
批量梯度下降(无截距) - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降(无截距) - 迭代次数:52次
随机梯度下降(无截距) - 梯度:[[-2.26042911 -1.60474193 -0.7448975   0.        ]]
随机梯度下降(无截距) - 权重:[-0.01709277 -0.04815376  0.08009096  0.        ]
随机梯度下降(无截距) - 模型准确率(训练集):100.00%
随机梯度下降(无截距) - 模型准确率(测试集):100.00%

LogisticRegressionL1:

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降 - 迭代次数:26次
批量梯度下降 - 梯度:[[-0.03327036 -0.04600519  0.07788324 -0.08262092]]
批量梯度下降 - 权重:[-2.88141620e-01 -4.27794971e+00  5.43998197e+00  2.65521743e-07]
批量梯度下降 - 模型准确率(训练集):100.00%
批量梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降 - 迭代次数:91次
随机梯度下降 - 梯度:[[2.57028353 1.23514176 1.9775431  0.5172318 ]]
随机梯度下降 - 权重:[-0.00730316 -0.09051132  0.14672342  0.0104378 ]
随机梯度下降 - 模型准确率(训练集):100.00%
随机梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降(无截距) - 迭代次数:51次
批量梯度下降(无截距) - 梯度:[[-0.00699658 -0.0051444   0.01329163  0.        ]]
批量梯度下降(无截距) - 权重:[ 3.12660509e-09 -3.99041243e+00  4.46866252e+00  0.00000000e+00]
批量梯度下降(无截距) - 模型准确率(训练集):100.00%
批量梯度下降(无截距) - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降(无截距) - 迭代次数:35次
随机梯度下降(无截距) - 梯度:[[-2.46985736 -1.67990491 -0.59701687  0.        ]]
随机梯度下降(无截距) - 权重:[-0.01460514 -0.05164428  0.0725267   0.        ]
随机梯度下降(无截距) - 模型准确率(训练集):100.00%
随机梯度下降(无截距) - 模型准确率(测试集):100.00%

LogisticRegressionElasticNet:

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降 - 迭代次数:73次
批量梯度下降 - 梯度:[[ 0.00605949 -0.00189696  0.00670519  0.08816653]]
批量梯度下降 - 权重:[-6.09148595e-01 -2.58080507e+00  4.09358303e+00 -8.57268178e-08]
批量梯度下降 - 模型准确率(训练集):100.00%
批量梯度下降 - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降 - 迭代次数:6次
随机梯度下降 - 梯度:[[-2.25477651 -1.67253597 -0.65848368 -0.50584968]]
随机梯度下降 - 权重:[-0.00703677 -0.01450912  0.01867985 -0.00188871]
随机梯度下降 - 模型准确率(训练集):65.71%
随机梯度下降 - 模型准确率(测试集):50.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

批量梯度下降(无截距) - 迭代次数:108次
批量梯度下降(无截距) - 梯度:[[ 0.00303551 -0.00136426  0.0002686   0.        ]]
批量梯度下降(无截距) - 权重:[-0.63708095 -2.38098472  3.91382324  0.        ]
批量梯度下降(无截距) - 模型准确率(训练集):100.00%
批量梯度下降(无截距) - 模型准确率(测试集):100.00%

[figure: training samples with the fitted decision surface]

测试集图像

[figure: test samples with the fitted decision surface]

随机梯度下降(无截距) - 迭代次数:77次
随机梯度下降(无截距) - 梯度:[[2.23743624 1.06951146 1.86369981 0.        ]]
随机梯度下降(无截距) - 权重:[-0.00685798 -0.08619746  0.15116774  0.        ]
随机梯度下降(无截距) - 模型准确率(训练集):100.00%
随机梯度下降(无截距) - 模型准确率(测试集):100.00%

Algorithm performance

The experiments use the iris dataset (150 samples in total); only the first three features and two of the three classes are kept, leaving 100 samples.

LogisticRegression

  • Optimizer: batch gradient descent (BGD)
  • Regularization: none
  • Results:
    • Separates the two iris classes well and scores 100% on both the training and the test set; without regularization, however, the weights can be pulled around by noise in the training data.
    • Needs comparatively many iterations (106), since there is no regularization term pulling the weights towards zero to help convergence.

LogisticRegressionL2

  • Optimizer: batch gradient descent (BGD)
  • Regularization: L2
  • Results:
    • Matches the unregularized model on training and test accuracy while converging in far fewer iterations (28 vs. 106).
    • Every component of the weight vector is shrunk, which tends to improve generalization.

LogisticRegressionL1

  • Optimizer: batch gradient descent (BGD)
  • Regularization: L1
  • Results:
    • Encourages sparsity: some of the weights are pushed towards zero.
    • Also converges faster (26 iterations vs. 106).

LogisticRegressionElasticNet

  • Optimizer: batch gradient descent (BGD)
  • Regularization: elastic net (L1 + L2)
  • Results:
    • Fast convergence.
    • A stable model with good generalization.

Stochastic gradient descent (SGD) vs. batch gradient descent (BGD)

  • Results:
    • Each SGD update is far cheaper than a BGD update, but because only one sample is used per iteration the gradient estimate is noisy and the learning curve fluctuates.
    • The number of iterations can be larger or smaller than BGD's, and the final accuracy can be worse: with the stopping rule used here, a single noisy step that barely changes the objective can end training far from the optimum (the loss itself is convex, so genuine local minima are not the cause).

Judging from the results, every variant trains and tests well on this iris subset and reaches high accuracy. Each regularization scheme (L1, L2, elastic net) does influence generalization and the learning process, but because the two iris classes used here are linearly separable the effect is small; on a different, harder dataset the differences would be more visible.

References

  • 周志华. 机器学习. 北京: 清华大学出版社, 2016.
  • 李航. 统计学习方法. 北京: 清华大学出版社, 2012.
  • 谢文睿, 秦州, 贾彬彬. 机器学习公式详解(第2版). 人民邮电出版社.