Data requirement of the perceptron model: the training data set must admit a hyperplane that completely separates the positive instance points and the negative instance points onto the two sides of the hyperplane, i.e. the training data set must be linearly separable. Only when the training data set is linearly separable does the perceptron learning algorithm converge; if the training data set is not linearly separable, the perceptron learning algorithm does not converge and the iteration results oscillate. When the training data set is not linearly separable, a linear support vector machine can be used instead.
Learning process of the perceptron model: derive the perceptron model from the training data set, that is, find the model parameters $w$ and $b$.
Prediction process of the perceptron model: for a new input instance, compute the corresponding output class with the learned perceptron model.
Categories of the perceptron model:
- A supervised learning model for solving the binary classification problem
- Non-probabilistic model: the model takes the form of a decision function (rather than the conditional probability distribution form of a probabilistic model)
- Linear model: the model function is a linear function
- Parametric model: the dimension of the model parameters is fixed
- Discriminative model: the decision function $f(x)$ is learned directly
Main advantage of the perceptron model: the algorithm is simple and easy to implement.
Main disadvantage of the perceptron model: it requires the training data set to be linearly separable.
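For clarity, here is a minimal sketch (my own addition, not taken from the source repository) of the perceptron decision function $f(x) = \mathrm{sign}(w \cdot x + b)$; the name `perceptron_predict` and its arguments are hypothetical.

def perceptron_predict(w, b, x):
    """Predict the class (+1 or -1) of an instance x with a trained perceptron (w, b)."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) + b  # w·x + b
    return 1 if activation >= 0 else -1

For example, with $w=(1,1)^T$ and $b=-3$ (the model learned in the test further below), `perceptron_predict([1, 1], -3, (3, 3))` returns `1`, while `perceptron_predict([1, 1], -3, (1, 1))` returns `-1`.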
Lp norm: corresponds to the Minkowski distance. For an n-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its Lp norm is denoted $||x||_p$ and defined as
$$||x||_p = (|x_1|^p+|x_2|^p+\cdots+|x_n|^p)^{\frac{1}{p}}$$
A norm has the following properties:
- Positive definiteness: $||x|| \ge 0$, and $||x||=0 \Leftrightarrow x=0$;
- Absolute homogeneity: $||cx|| = |c| \, ||x||$;
- Subadditivity (triangle inequality): $||x+y|| \le ||x|| + ||y||$.
L0 norm
For an n-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L0 norm, denoted $||x||_0$, is defined as the number of non-zero elements of the vector.
L1 norm
For an n-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L1 norm, denoted $||x||_1$, is defined as $||x||_1 = |x_1|+|x_2|+\cdots+|x_n|$. The L1 norm of a vector is the sum of the absolute values of its elements, corresponding to the Manhattan distance.
L2 norm
For an n-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L2 norm, denoted $||x||_2$, is defined as $||x||_2 = (|x_1|^2+|x_2|^2+\cdots+|x_n|^2)^{\frac{1}{2}}$. The L2 norm of a vector is the square root of the sum of the squares of its elements, corresponding to the Euclidean distance.
L∞ norm (infinity norm)
For an n-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its infinity norm, denoted $||x||_\infty$, is defined as $||x||_\infty = \max(|x_1|,|x_2|,\cdots,|x_n|)$. The infinity norm of a vector is the maximum of the absolute values of its elements, corresponding to the Chebyshev distance.
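As a quick check of these definitions, the following sketch (my own illustration, not part of the source repository; `lp_norm` is a hypothetical helper) computes the L1, L2, and L∞ norms of a small vector.

def lp_norm(x, p):
    """Lp norm of a vector x for finite p >= 1: (|x_1|^p + ... + |x_n|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [3, -4]
print(lp_norm(x, 1))  # L1 norm: |3| + |-4| = 7.0
print(lp_norm(x, 2))  # L2 norm: (9 + 16)^(1/2) = 5.0
print(max(abs(xi) for xi in x))  # L-infinity norm: max(3, 4) = 4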
Given that $S$ is an (n-1)-dimensional hyperplane $w \cdot x + b = 0$ in n-dimensional Euclidean space, where $w$ and $x$ are both n-dimensional vectors, and $x_0 = (x_0^{(1)},x_0^{(2)},\cdots,x_0^{(n)})$ is a point, prove that the distance from the point $x_0$ to the hyperplane $S$ is $d = \frac{1}{||w||_2} |w \cdot x_0 + b|$, where $||w||_2$ is the L2 norm of $w$.
The proof is as follows:
From the definition of the hyperplane $S$, $w$ is the normal vector of $S$ and $b$ is the intercept of $S$.
Let the projection of the point $x_0$ onto $S$ be $x_1 = (x_1^{(1)},x_1^{(2)},\cdots,x_1^{(n)})$; then
$$w \cdot x_1 + b = 0 \tag{1}$$
The distance $d$ from the point $x_0$ to the hyperplane $S$ is the length of the vector $\vec{x_0 x_1}$.
Since $\vec{x_0 x_1}$ is parallel to the normal vector $w$ of the hyperplane $S$, the angle $\theta$ between them satisfies $|\cos \theta| = 1$, hence
$$\begin{aligned} |w \cdot \vec{x_0 x_1}| & = |w| \, |\vec{x_0 x_1}| \, |\cos \theta| \\ & = |w| \, |\vec{x_0 x_1}| \\ & = [(w^{(1)})^2 + (w^{(2)})^2 + \cdots + (w^{(n)})^2]^{\frac{1}{2}} \, d \\ & = ||w||_2 \, d \end{aligned} \tag{2}$$
In addition, by the distributive law of the dot product,
$$\begin{aligned} w \cdot \vec{x_0 x_1} & = w^{(1)} (x_1^{(1)} - x_0^{(1)}) + w^{(2)} (x_1^{(2)} - x_0^{(2)}) + \cdots + w^{(n)} (x_1^{(n)} - x_0^{(n)}) \\ & = (w^{(1)} x_1^{(1)} + w^{(2)} x_1^{(2)} + \cdots + w^{(n)} x_1^{(n)}) - (w^{(1)} x_0^{(1)} + w^{(2)} x_0^{(2)} + \cdots + w^{(n)} x_0^{(n)}) \\ & = w \cdot x_1 - w \cdot x_0 \end{aligned} \tag{3}$$
From equation (1), $w \cdot x_1 = -b$, so equation (3) can be written as
$$\begin{aligned} w \cdot \vec{x_0 x_1} & = w \cdot x_1 - w \cdot x_0 \\ & = -b - w \cdot x_0 \end{aligned} \tag{4}$$
Combining equations (2) and (4), we get
$$\begin{aligned} ||w||_2 \, d & = |-b - w \cdot x_0| = |w \cdot x_0 + b| \\ d & = \frac{1}{||w||_2} |w \cdot x_0 + b| \end{aligned}$$
which completes the proof.
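As a small numerical check of this result (my own addition; `point_to_hyperplane_distance` is a hypothetical helper), the sketch below evaluates the formula for a point and a hyperplane in the plane.

import math

def point_to_hyperplane_distance(w, b, x0):
    """Distance from the point x0 to the hyperplane w·x + b = 0, i.e. |w·x0 + b| / ||w||_2."""
    numerator = abs(sum(wi * xi for wi, xi in zip(w, x0)) + b)
    return numerator / math.sqrt(sum(wi ** 2 for wi in w))

# Distance from (3, 3) to the hyperplane x^(1) + x^(2) - 3 = 0: |3 + 3 - 3| / sqrt(2) ≈ 2.12
print(point_to_hyperplane_distance([1, 1], -3, (3, 3)))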
【Supplementary note】In the iterations of Example 2.1, the rule for choosing a misclassified point is to select the misclassified point with the smallest index.
For a given parameter combination $(w_0,b_0)$, since the set of misclassified points $M$ is fixed, the gradient $\nabla L(w_0,b_0)$ is also fixed. The $w$-component of the gradient, $\nabla_w L(w_0,b_0)$, is the partial derivative of the loss function with respect to $w$, and the $b$-component, $\nabla_b L(w_0,b_0)$, is the partial derivative of the loss function with respect to $b$. Hence the gradient is
$$\nabla_w L(w_0,b_0) = L'_w(w_0,b_0) = - \sum_{x_i \in M} y_i x_i$$
$$\nabla_b L(w_0,b_0) = L'_b(w_0,b_0) = - \sum_{x_i \in M} y_i$$
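The following sketch (my own illustration, not from the source repository; `batch_gradient` and its arguments are hypothetical names) computes these two gradient components over the currently misclassified points.

def batch_gradient(x, y, w, b):
    """Gradient of the perceptron loss over all currently misclassified points."""
    n_features = len(x[0])
    grad_w, grad_b = [0] * n_features, 0
    for xi, yi in zip(x, y):
        if yi * (sum(wj * xij for wj, xij in zip(w, xi)) + b) <= 0:  # point is misclassified
            for j in range(n_features):
                grad_w[j] -= yi * xi[j]  # accumulate -y_i x_i over M
            grad_b -= yi  # accumulate -y_i over M
    return grad_w, grad_b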
Because computing this gradient requires all the misclassified sample points, the time cost is high, so the faster stochastic gradient descent method is used here. In each iteration, instead of performing gradient descent using all misclassified points in $M$ at once, a single misclassified point is selected at random and its gradient is used for the descent. For a single misclassified point $(x_i,y_i)$, the gradient is
$$\nabla_w L(w_0,b_0) = L'_w(w_0,b_0) = - y_i x_i$$
$$\nabla_b L(w_0,b_0) = L'_b(w_0,b_0) = - y_i$$
Based on this, $w$ and $b$ are updated as follows:
$$w \leftarrow w + \eta(-\nabla_w L(w_0,b_0)) = w + \eta y_i x_i$$
$$b \leftarrow b + \eta(-\nabla_b L(w_0,b_0)) = b + \eta y_i$$
where $\eta$ is the step size, usually taking a value in the range $(0,1]$, also called the learning rate.
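As a concrete single update (my own worked example, using the same training data as the tests below): starting from $w=(0,0)^T$, $b=0$ with $\eta=1$, the point $(x_1,y_1)=((3,3)^T,+1)$ is misclassified because $y_1(w \cdot x_1 + b) = 0 \le 0$, so one update gives
$$w \leftarrow (0,0)^T + 1 \cdot 1 \cdot (3,3)^T = (3,3)^T, \qquad b \leftarrow 0 + 1 \cdot 1 = 1$$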
【Source code address】code.perceptron.original_form_of_perceptron
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_original_form.py
def original_form_of_perceptron(x, y, eta):
    """Original form of the perceptron learning algorithm

    :param x: input variables
    :param y: output variables
    :param eta: learning rate
    :return: w and b of the perceptron model
    """
    n_samples = len(x)  # number of sample points
    n_features = len(x[0])  # dimension of the feature vectors
    w0, b0 = [0] * n_features, 0  # choose the initial values w0, b0
    while True:  # iterate until there are no misclassified points
        for i in range(n_samples):
            xi, yi = x[i], y[i]
            if yi * (sum(w0[j] * xi[j] for j in range(n_features)) + b0) <= 0:
                w1 = [w0[j] + eta * yi * xi[j] for j in range(n_features)]
                b1 = b0 + eta * yi
                w0, b0 = w1, b1
                break
        else:
            return w0, b0
【Test】
>>> from code.perceptron import original_form_of_perceptron
>>> dataset = [[(3, 3), (4, 3), (1, 1)], [1, 1, -1]]
>>> original_form_of_perceptron(dataset[0], dataset[1], eta=1)
([1, 1], -3)
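As a quick follow-up check (my own addition, reusing `dataset` from the test above), the learned parameters can be verified against the training set: every point should satisfy $y_i(w \cdot x_i + b) > 0$.

>>> w, b = [1, 1], -3
>>> all(yi * (sum(wj * xij for wj, xij in zip(w, xi)) + b) > 0
...     for xi, yi in zip(dataset[0], dataset[1]))
True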
【Supplementary note】Augmented weight vector: the bias $b$ is appended to the weight vector $w$ to form $\hat{w} = (w^T,b)^T$. (The term "augmented weight vector" is used directly below.)
【Supplementary note】Number of misclassifications: that is, the number of iterations in which a misclassified instance exists.
【Supplementary note】Why can we take $||\hat{w}_{opt}||=1$?
Suppose the augmented weight vector of the separating hyperplane is
$$\hat{w}'_{opt}=({w'}_{opt}^T,b')^T = (w'^{(1)}_{opt},w'^{(2)}_{opt},\cdots,w'^{(n)}_{opt},b')^T$$
If $||\hat{w}'_{opt}|| \ne 1$, then there exists
$$\hat{w}_{opt} = \left( \frac{w'^{(1)}_{opt}}{||\hat{w}'_{opt}||}, \frac{w'^{(2)}_{opt}}{||\hat{w}'_{opt}||}, \cdots, \frac{w'^{(n)}_{opt}}{||\hat{w}'_{opt}||}, \frac{b'}{||\hat{w}'_{opt}||} \right)^T$$
Scaling $w$ and $b$ by the same positive constant does not change the hyperplane, and now $||\hat{w}_{opt}||=1$. This completes the argument.
For example, the augmented weight vector of the hyperplane $x^{(1)}+x^{(2)}-3=0$ is $\hat{w}' = (1,1,-3)^T$ with $||\hat{w}'||=\sqrt{11}$, so $\hat{w}=(\frac{1}{\sqrt{11}},\frac{1}{\sqrt{11}},\frac{-3}{\sqrt{11}})^T$ satisfies $||\hat{w}||=1$.
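A small numeric check of this normalization (my own addition):

import math

w_hat = (1, 1, -3)  # augmented weight vector of x^(1) + x^(2) - 3 = 0
norm = math.sqrt(sum(c ** 2 for c in w_hat))  # sqrt(11)
w_hat_unit = [c / norm for c in w_hat]  # rescaled augmented weight vector
print(math.sqrt(sum(c ** 2 for c in w_hat_unit)))  # ≈ 1.0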
The conclusion is that, under certain conditions, the dual form of the perceptron learning algorithm is computationally more efficient. This is discussed below.
First, consider the time complexity of the original form and the dual form of the perceptron learning algorithm. Each iteration of the original form has time complexity $O(S \times N)$, so its total time complexity is $O(S \times N \times K)$. Each iteration of the dual form has time complexity $O(S^2)$, plus the time complexity $O(S^2 \times N)$ of computing the Gram matrix, so its total time complexity is $O(S^2 \times N + S^2 \times K)$. Here $S$ is the number of sample points, $N$ is the dimension of the sample feature vectors, and $K$ is the number of iterations.
Because the iteration steps of the original form and the dual form correspond one to one, in general the original form is better suited to training data with low feature dimension and many samples, while the dual form is better suited to training data with high feature dimension and few samples.
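To make this comparison concrete, the back-of-the-envelope sketch below (my own illustration; it assumes the iteration count $K$ is the same for both forms, as the one-to-one correspondence above suggests) plugs sample values into the two total-complexity expressions.

def original_form_cost(s, n, k):
    """Rough operation count for the original form: O(S * N * K)."""
    return s * n * k

def dual_form_cost(s, n, k):
    """Rough operation count for the dual form: O(S^2 * N + S^2 * K)."""
    return s * s * n + s * s * k

# High feature dimension, few samples: the dual form does less work.
print(original_form_cost(s=100, n=10000, k=1000))  # 1,000,000,000
print(dual_form_cost(s=100, n=10000, k=1000))  # 110,000,000
# Low feature dimension, many samples: the original form does less work.
print(original_form_cost(s=10000, n=10, k=1000))  # 100,000,000
print(dual_form_cost(s=10000, n=10, k=1000))  # 101,000,000,000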
【Source code address】code.perceptron.count_gram
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_gram.py
def count_gram(x):
    """Compute the Gram matrix

    :param x: input variables
    :return: the Gram matrix of the input variables
    """
    n_samples = len(x)  # number of sample points
    n_features = len(x[0])  # dimension of the feature vectors
    gram = [[0] * n_samples for _ in range(n_samples)]  # initialize the Gram matrix
    # compute the Gram matrix
    for i in range(n_samples):
        for j in range(i, n_samples):
            gram[i][j] = gram[j][i] = sum(x[i][k] * x[j][k] for k in range(n_features))
    return gram
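A usage example for this helper (my own addition, assuming `count_gram` is exported by the `code.perceptron` package as the import in the dual-form source suggests), using the same training data as the tests; each entry is the inner product $x_i \cdot x_j$.

>>> from code.perceptron import count_gram
>>> count_gram([(3, 3), (4, 3), (1, 1)])
[[18, 21, 6], [21, 25, 7], [6, 7, 2]]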
【Source code address】code.perceptron.dual_form_perceptron
# https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_dual_form.py
from . import count_gram # code.perceptron.count_gram
def dual_form_perceptron(x, y, eta):
    """Dual form of the perceptron learning algorithm

    :param x: input variables
    :param y: output variables
    :param eta: learning rate
    :return: a (alpha) and b of the perceptron model
    """
    n_samples = len(x)  # number of sample points
    a0, b0 = [0] * n_samples, 0  # choose the initial values a0 (alpha), b0
    gram = count_gram(x)  # compute the Gram matrix
    while True:  # iterate until there are no misclassified points
        for i in range(n_samples):
            yi = y[i]
            val = 0
            for j in range(n_samples):
                yj = y[j]
                val += a0[j] * yj * gram[i][j]
            if yi * (val + b0) <= 0:
                a0[i] += eta
                b0 += eta * yi
                break
        else:
            return a0, b0
【Test】
>>> from code.perceptron import dual_form_perceptron
>>> dataset = [[(3, 3), (4, 3), (1, 1)], [1, 1, -1]]
>>> dual_form_perceptron(dataset[0], dataset[1], eta=1)
([2, 0, 5], -3)
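As a follow-up check (my own addition, not part of the source repository), the dual result can be converted back to the primal parameters via $w = \sum_i a_i y_i x_i$; it recovers the same $w = (1,1)^T$ and $b = -3$ as the original form above.

>>> a, b = [2, 0, 5], -3
>>> x, y = dataset
>>> [sum(ai * yi * xi[j] for ai, yi, xi in zip(a, y, x)) for j in range(2)]
[1, 1]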