# [Linux Advanced] LSTAT of the system call -Get the file attribute

2023-03-14   ES

Data requirements of the perceptron model: the training dataset must be linearly separable, that is, there must exist a hyperplane that places all positive examples of the dataset on one side and all negative examples on the other. The perceptron learning algorithm converges only when the training dataset is linearly separable; if it is not, the algorithm does not converge and the iteration results oscillate. When the training data is not linearly separable, a linear support vector machine can be used instead.

Learning process of the perceptron model: learn the perceptron model from the training dataset, that is, find the model parameters $w$ and $b$.

Prediction process of the perceptron model: use the learned perceptron model to compute the output class of a new input instance.

Categories of the perceptron model

• A supervised learning model for binary classification problems
• Non-probabilistic model: the model takes the form of a function (rather than the conditional probability distribution of a probabilistic model)
• Linear model: the model function is a linear function
• Parametric model: the dimension of the model parameters is fixed
• Discriminative model: the decision function $f(x)$ is learned directly

The main advantage of the perceptron model: the algorithm is simple and easy to implement.

The main disadvantage of the perceptron model: it requires the training dataset to be linearly separable.

Norm, corresponding to the Minkowski distance. For an $n$-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its Lp norm is denoted $||x||_p$ and defined as

$$||x||_p = (|x_1|^p+|x_2|^p+\cdots+|x_n|^p)^{\frac{1}{p}}$$

A norm has the following properties:

• Positive definiteness: $||x|| \ge 0$, and $||x||=0 \Leftrightarrow x=0$
• Positive homogeneity: $||cx|| = |c| \ ||x||$
• Subadditivity (triangle inequality): $||x+y|| \le ||x|| + ||y||$

L0 norm

For an $n$-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L0 norm is denoted $||x||_0$ and defined as the number of nonzero elements in the vector.

L1 norm

For an $n$-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L1 norm is denoted $||x||_1$ and defined as $||x||_1 = |x_1|+|x_2|+\cdots+|x_n|$. The L1 norm of a vector is the sum of the absolute values of its elements, corresponding to the Manhattan distance.

L2 norm

For an $n$-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its L2 norm is denoted $||x||_2$ and defined as $||x||_2 = (|x_1|^2+|x_2|^2+\cdots+|x_n|^2)^{\frac{1}{2}}$. The L2 norm of a vector is the square root of the sum of the squares of its elements, corresponding to the Euclidean distance.

Infinity norm

For an $n$-dimensional vector $x = (x_1,x_2,\cdots,x_n)^T$, its infinity norm is denoted $||x||_\infty$ and defined as $||x||_\infty = \max(|x_1|,|x_2|,\cdots,|x_n|)$. The infinity norm of a vector is the maximum of the absolute values of its elements, corresponding to the Chebyshev distance.
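The norm definitions above can be checked on a small vector. The following is a minimal sketch in plain Python; the function names and the sample vector are chosen here for illustration and do not come from the original code repository.

```python
import math


def lp_norm(x, p):
    """Lp norm: (sum of |x_i|^p) ** (1/p), intended for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1 / p)


def l0_norm(x):
    """L0 'norm': number of nonzero elements."""
    return sum(1 for v in x if v != 0)


def l1_norm(x):
    """L1 norm: sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """L2 norm: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))


def linf_norm(x):
    """Infinity norm: largest absolute value."""
    return max(abs(v) for v in x)


x = [3, -4, 0]  # illustrative vector
print(l0_norm(x))    # 2
print(l1_norm(x))    # 7
print(l2_norm(x))    # 5.0
print(linf_norm(x))  # 4
```

For `p = 1` and `p = 2`, `lp_norm` agrees with `l1_norm` and `l2_norm` up to floating-point rounding.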

Let $S$ be the $(n-1)$-dimensional hyperplane $w \cdot x + b = 0$ in $n$-dimensional Euclidean space, where $w$ and $x$ are both $n$-dimensional vectors, and let $P$ be the point $x_0 = (x_0^{(1)},x_0^{(2)},\cdots,x_0^{(n)})$. To prove: the distance from the point $P$ to the hyperplane $S$ is $d = \frac{1}{||w||_2} |w \cdot x_0+b|$, where $||w||_2$ is the 2-norm of $w$.

The proof is as follows:

From the definition of the hyperplane $S$, $w$ is the normal vector of $S$ and $b$ is the intercept of $S$.

Let the projection of the point $x_0$ onto $S$ be $x_1 = (x_1^{(1)},x_1^{(2)},\cdots,x_1^{(n)})$. Since $x_1$ lies on $S$,

$$w \cdot x_1 + b = 0 \tag{1}$$

The distance $d$ from $P$ to the hyperplane $S$ is the length of the vector $\vec{x_0 x_1}$.

Because $\vec{x_0 x_1}$ is parallel to the normal vector $w$ of the hyperplane $S$, the angle $\theta$ between them is either $0$ or $\pi$, so $|\cos \theta| = 1$. Hence

$$\begin{aligned} |w \cdot \vec{x_0 x_1}| & = |w| \ |\vec{x_0 x_1}| \ |\cos \theta| \\ & = |w| \ |\vec{x_0 x_1}| \\ & = [(w^{(1)})^2 + (w^{(2)})^2 + \cdots + (w^{(n)})^2]^\frac{1}{2} \ d \\ & = ||w||_2 \, d \end{aligned} \tag{2}$$

Also, by the distributive law of the dot product,

$$\begin{aligned} w \cdot \vec{x_0 x_1} & = w^{(1)} (x_1^{(1)} - x_0^{(1)}) + w^{(2)} (x_1^{(2)} - x_0^{(2)}) + \cdots + w^{(n)} (x_1^{(n)} - x_0^{(n)}) \\ & = (w^{(1)} x_1^{(1)} + w^{(2)} x_1^{(2)} + \cdots + w^{(n)} x_1^{(n)}) - (w^{(1)} x_0^{(1)} + w^{(2)} x_0^{(2)} + \cdots + w^{(n)} x_0^{(n)}) \\ & = w \cdot x_1 - w \cdot x_0 \end{aligned} \tag{3}$$

From formula (1), $w \cdot x_1 = -b$, so formula (3) can be written as

$$\begin{aligned} w \cdot \vec{x_0 x_1} & = w \cdot x_1 - w \cdot x_0 \\ & = -b - w \cdot x_0 \end{aligned} \tag{4}$$

From formula (2) and formula (4), we get

$$\begin{aligned} ||w||_2 \, d & = |-b - w \cdot x_0| \\ d & = \frac{1}{||w||_2} |w \cdot x_0+b| \end{aligned}$$
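The result can be checked numerically. The sketch below computes $d$ from the formula and compares it with the length of $\vec{x_0 x_1}$. The closed form used for the projection, $x_1 = x_0 - \frac{w \cdot x_0 + b}{||w||_2^2} w$, is an assumption introduced here for the check and is not part of the proof above; the function names and the sample hyperplane are likewise illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(w, b, x0):
    """d = |w . x0 + b| / ||w||_2"""
    return abs(dot(w, x0) + b) / math.sqrt(dot(w, w))


def project_onto_hyperplane(w, b, x0):
    """Assumed closed form of the projection: x1 = x0 - ((w . x0 + b) / ||w||^2) * w"""
    t = (dot(w, x0) + b) / dot(w, w)
    return [xj - t * wj for wj, xj in zip(w, x0)]


# Illustrative hyperplane x(1) + x(2) - 3 = 0 and point (3, 3)
w, b = [1, 1], -3
x0 = [3, 3]

d = distance_to_hyperplane(w, b, x0)
x1 = project_onto_hyperplane(w, b, x0)

print(d)                  # distance from the formula
print(math.dist(x0, x1))  # length of the vector from x0 to its projection
print(dot(w, x1) + b)     # should be (numerically) zero: x1 lies on S
```

Both printed distances should agree up to rounding, and the last value confirms that the projected point satisfies formula (1).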

[Supplementary note] In the iterations of Example 2.1, the rule for choosing a misclassified point is to select the misclassified point with the smallest index.

For a parameter pair $(w_0,b_0)$, the set $M$ of misclassified points is fixed, so the gradient $\nabla L(w_0,b_0)$ is also fixed. The component of the gradient in $w$, $\nabla_w L(w_0,b_0)$, is the partial derivative of the loss function with respect to $w$; the component in $b$, $\nabla_b L(w_0,b_0)$, is the partial derivative of the loss function with respect to $b$. The gradient is therefore

$$\nabla_w L(w_0,b_0) = L'_w(w_0,b_0) = - \sum_{x_i \in M} y_i x_i$$

$$\nabla_b L(w_0,b_0) = L'_b(w_0,b_0) = - \sum_{x_i \in M} y_i$$
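These gradient formulas can be sanity-checked with finite differences. The sketch below assumes the loss has the standard perceptron form $L(w,b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$, which this text does not write out explicitly but which is the function whose partial derivatives match the formulas above. The set `M` and all names are hypothetical and chosen only for illustration.

```python
def loss(w, b, misclassified):
    """Assumed perceptron loss: -sum of y_i * (w . x_i + b) over the fixed set M."""
    return -sum(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        for x, y in misclassified
    )


def gradient(misclassified, n_features):
    """Closed-form gradient from the formulas above, with M held fixed."""
    grad_w = [-sum(y * x[j] for x, y in misclassified) for j in range(n_features)]
    grad_b = -sum(y for _, y in misclassified)
    return grad_w, grad_b


# Hypothetical fixed set M of (x_i, y_i) pairs; both are misclassified at w = 0, b = 0
M = [((3, 3), 1), ((1, 1), -1)]
w0, b0 = [0.0, 0.0], 0.0
grad_w, grad_b = gradient(M, n_features=2)

# Finite-difference approximation of each partial derivative
h = 1e-6
numeric_w = [
    (loss([w0[k] + (h if k == j else 0.0) for k in range(2)], b0, M) - loss(w0, b0, M)) / h
    for j in range(2)
]
numeric_b = (loss(w0, b0 + h, M) - loss(w0, b0, M)) / h

print(grad_w, grad_b)        # closed-form gradient
print(numeric_w, numeric_b)  # numerical approximation, should be close
```

Because the assumed loss is linear in $w$ and $b$ once $M$ is fixed, the two results should agree up to floating-point error.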

Computing this gradient requires summing over every misclassified point, which is expensive, so the faster stochastic gradient descent method is used here. In each iteration, instead of descending along the gradient of all misclassified points in $M$ at once, one misclassified point is selected at random and gradient descent is performed on that point alone. For a single misclassified point $(x_i,y_i)$, the gradient is

$$\nabla_w L(w_0,b_0) = L'_w(w_0,b_0) = - y_i x_i$$

$$\nabla_b L(w_0,b_0) = L'_b(w_0,b_0) = - y_i$$

$w$ and $b$ are updated accordingly:

$$w \leftarrow w + \eta(-\nabla_w L(w_0,b_0)) = w + \eta y_i x_i$$

$$b \leftarrow b + \eta(-\nabla_b L(w_0,b_0)) = b + \eta y_i$$

Here $\eta$ is the step size, also called the learning rate; its value is usually taken in the range $(0,1]$.

    # https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_original_form.py

    def original_form_of_perceptron(x, y, eta):
        """Original form of the perceptron learning algorithm

        :param x: input variables
        :param y: output variables
        :param eta: learning rate
        :return: w and b of the perceptron model
        """
        n_samples = len(x)  # number of sample points
        n_features = len(x[0])  # dimension of the feature vectors
        w0, b0 = [0] * n_features, 0  # choose the initial values w0, b0

        while True:  # iterate until there are no misclassified points
            for i in range(n_samples):
                xi, yi = x[i], y[i]
                # a point is misclassified when y_i * (w . x_i + b) <= 0
                if yi * (sum(w0[j] * xi[j] for j in range(n_features)) + b0) <= 0:
                    w1 = [w0[j] + eta * yi * xi[j] for j in range(n_features)]
                    b1 = b0 + eta * yi
                    w0, b0 = w1, b1
                    break  # restart the scan from the first sample
            else:
                # the for loop finished without a break: no misclassified points remain
                return w0, b0


    >>> from code.perceptron import original_form_of_perceptron
    >>> dataset = [[(3, 3), (4, 3), (1, 1)], [1, 1, -1]]
    >>> original_form_of_perceptron(dataset[0], dataset[1], eta=1)
    ([1, 1], -3)
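The prediction process can be sketched with the parameters returned above. The `predict` function below is not part of the original repository; it assumes the decision function is $f(x) = \mathrm{sign}(w \cdot x + b)$ with the convention $\mathrm{sign}(0) = +1$.

```python
def predict(w, b, x):
    """Return +1 or -1 according to sign(w . x + b); sign(0) = +1 is an assumed convention."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if activation >= 0 else -1


# Parameters learned in the example above
w, b = [1, 1], -3

# Classify the three training points
print([predict(w, b, x) for x in [(3, 3), (4, 3), (1, 1)]])  # [1, 1, -1]
```

The predictions reproduce the training labels `[1, 1, -1]`, as expected once the algorithm has terminated with no misclassified points.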


[Supplementary note] Augmented weight vector: the vector obtained by appending the bias to the weight vector, $\hat{w} = (w^T,b)^T$. (The name "augmented weight vector" is used directly below.)

[Supplementary note] Number of misclassifications: the number of iterations in which a misclassified instance is found.

#### $||\hat{w}_{opt}||=1$?

Without loss of generality, suppose the augmented weight vector of a separating hyperplane is

$$\hat{w}'_{opt}=({w'}_{opt}^T,b')^T = (w'^{(1)}_{opt},w'^{(2)}_{opt},\cdots,w'^{(n)}_{opt},b')^T$$

with $||\hat{w}'_{opt}|| \ne 1$. Dividing every component by $||\hat{w}'_{opt}||$ gives

$$\hat{w}_{opt} = \left(\frac{w'^{(1)}_{opt}}{||\hat{w}'_{opt}||},\frac{w'^{(2)}_{opt}}{||\hat{w}'_{opt}||},\cdots,\frac{w'^{(n)}_{opt}}{||\hat{w}'_{opt}||},\frac{b'}{||\hat{w}'_{opt}||}\right)^T$$

Scaling by a positive constant does not change the hyperplane, and now $||\hat{w}_{opt}||=1$. This completes the proof.

For example, the hyperplane $x^{(1)}+x^{(2)}-3 = 0$ has the augmented weight vector $\hat{w'} = (1,1,-3)^T$ with $||\hat{w'}||=\sqrt{11}$, so taking $\hat{w}=(\frac{1}{\sqrt{11}},\frac{1}{\sqrt{11}},\frac{-3}{\sqrt{11}})^T$ gives $||\hat{w}||=1$.
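The normalization in this example can be reproduced in a few lines. This is a minimal sketch; `normalize_augmented` is a helper name introduced here and does not appear in the original repository.

```python
import math


def normalize_augmented(w, b):
    """Build the augmented weight vector (w, b) and scale it to unit L2 norm."""
    w_hat = list(w) + [b]
    norm = math.sqrt(sum(v * v for v in w_hat))
    return [v / norm for v in w_hat]


w_hat = normalize_augmented([1, 1], -3)
print(w_hat)                                 # components 1/sqrt(11), 1/sqrt(11), -3/sqrt(11)
print(math.sqrt(sum(v * v for v in w_hat)))  # approximately 1.0
```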

Conclusion: under certain conditions, the dual form of the perceptron learning algorithm is more efficient than the original form. This is discussed below.

First, consider the time complexity of the original form and the dual form of the perceptron learning algorithm. Each iteration of the original form has time complexity $O(S \times N)$, so its total time complexity is $O(S \times N \times K)$. Each iteration of the dual form has time complexity $O(S^2)$, and computing the Gram matrix additionally costs $O(S^2 \times N)$, so its total time complexity is $O(S^2 \times N+S^2 \times K)$. Here $S$ is the number of sample points, $N$ is the dimension of the sample feature vectors, and $K$ is the number of iterations.

Because the iteration steps of the original form and the dual form correspond to each other one to one, in general the original form is better suited to training data with few dimensions and many samples, while the dual form is better suited to training data with many dimensions and few samples.
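To make the trade-off concrete, the two complexity expressions can be evaluated for different data shapes. The sketch below only plugs numbers into the formulas above, ignoring constant factors; the sample sizes are made up for illustration and are not measurements.

```python
def original_form_cost(s, n, k):
    """Operation-count estimate S * N * K for the original form."""
    return s * n * k


def dual_form_cost(s, n, k):
    """Operation-count estimate S^2 * N + S^2 * K for the dual form (Gram matrix plus iterations)."""
    return s * s * n + s * s * k


# (S, N, K): illustrative values only
for s, n, k in [(10_000, 10, 100), (50, 10_000, 100)]:
    cheaper = "original" if original_form_cost(s, n, k) < dual_form_cost(s, n, k) else "dual"
    print(s, n, k, original_form_cost(s, n, k), dual_form_cost(s, n, k), cheaper)
```

With many samples and few dimensions the original form has the smaller estimate; with few samples and many dimensions the dual form does, matching the rule of thumb stated above.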

    # https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_gram.py

    def count_gram(x):
        """Compute the Gram matrix

        :param x: input variables
        :return: Gram matrix of the input variables
        """
        n_samples = len(x)  # number of sample points
        n_features = len(x[0])  # dimension of the feature vectors
        gram = [[0] * n_samples for _ in range(n_samples)]  # initialize the Gram matrix

        # Compute the Gram matrix; it is symmetric, so only j >= i is computed
        for i in range(n_samples):
            for j in range(i, n_samples):
                gram[i][j] = gram[j][i] = sum(x[i][k] * x[j][k] for k in range(n_features))

        return gram
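As a quick illustration, the Gram matrix of the three sample points used in the examples of this article can be computed directly. The sketch below is a compact, self-contained equivalent of the function above (the name `gram_matrix` is introduced here), not a replacement for it.

```python
def gram_matrix(x):
    """Gram matrix G[i][j] = x_i . x_j, computed with nested comprehensions."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in x] for xi in x]


points = [(3, 3), (4, 3), (1, 1)]
for row in gram_matrix(points):
    print(row)
# [18, 21, 6]
# [21, 25, 7]
# [6, 7, 2]
```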


    # https://github.com/ChangxingJiang/Data-Mining-HandBook/blob/master/code/perceptron/_dual_form.py

    from . import count_gram  # code.perceptron.count_gram

    def dual_form_perceptron(x, y, eta):
        """Dual form of the perceptron learning algorithm

        :param x: input variables
        :param y: output variables
        :param eta: learning rate
        :return: a (alpha) and b of the perceptron model
        """
        n_samples = len(x)  # number of sample points
        a0, b0 = [0] * n_samples, 0  # choose the initial values a0 (alpha), b0
        gram = count_gram(x)  # compute the Gram matrix

        while True:  # iterate until there are no misclassified points
            for i in range(n_samples):
                yi = y[i]

                # val = sum over j of alpha_j * y_j * (x_j . x_i)
                val = 0
                for j in range(n_samples):
                    yj = y[j]
                    val += a0[j] * yj * gram[i][j]

                if (yi * (val + b0)) <= 0:
                    a0[i] += eta
                    b0 += eta * yi
                    break  # restart the scan from the first sample
            else:
                # the for loop finished without a break: no misclassified points remain
                return a0, b0


    >>> from code.perceptron import dual_form_perceptron
    >>> dataset = [[(3, 3), (4, 3), (1, 1)], [1, 1, -1]]
    >>> dual_form_perceptron(dataset[0], dataset[1], eta=1)
    ([2, 0, 5], -3)
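The dual solution can be converted back to a weight vector. The sketch below assumes the usual relation $w = \sum_i \alpha_i y_i x_i$, which is the quantity the dual-form code accumulates through the Gram matrix; `weights_from_dual` is a helper name introduced here.

```python
def weights_from_dual(x, y, alpha):
    """Recover w = sum_i alpha_i * y_i * x_i from the dual coefficients."""
    n_features = len(x[0])
    return [
        sum(alpha[i] * y[i] * x[i][j] for i in range(len(x)))
        for j in range(n_features)
    ]


x = [(3, 3), (4, 3), (1, 1)]
y = [1, 1, -1]
alpha, b = [2, 0, 5], -3  # result of the dual form shown above

print(weights_from_dual(x, y, alpha), b)  # [1, 1] -3
```

The recovered parameters match the `([1, 1], -3)` returned by the original form on the same dataset.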

