Handling missing values in data analysis (methods + code)
As the saying goes, data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit.
No matter how good the model is, without good data and features the results will not improve. Data quality is crucial to data analysis, and at times it matters even more than the choice of model or algorithm.
First of all, we should ask: why is the data missing? Missing data is something we cannot avoid, and there are many possible reasons. They can be summarized into the following three categories:
Omitted by accident: for example, staff negligence (forgetting to record a value), or faults in the data collection equipment, such as when a system has strict real-time requirements and the machine cannot record a value in time;
Intentionally missing: some data sets stipulate in the feature description that a missing value is itself a valid feature value; in that case the missing value can be treated as a special feature value;
Does not exist: some attribute values simply do not exist, for example the spouse's name of an unmarried person, or the income of a child;
In short, we need to clarify the cause of missing values: was the value lost through negligence, omitted intentionally, or does it simply not exist? Only by knowing the source can we prescribe the right remedy and act accordingly.
Before processing missing data, it is necessary to understand the mechanism and pattern of the missingness. A variable in the data set that contains no missing values is called a complete variable; a variable that contains missing values is called an incomplete variable. By the distribution of the missingness, missing data can be divided into missing completely at random, missing at random, and missing not at random.
Missing completely at random (Missing Completely At Random, MCAR): the missingness is entirely random and does not depend on any incomplete or complete variable, so it does not affect the unbiasedness of the sample; for example, a missing home address;
Missing at random (Missing At Random, MAR): the missingness is not completely random, that is, it depends on other complete variables; for example, missing financial data being related to the size of the enterprise;
Missing not at random (Missing Not At Random, MNAR): the missingness depends on the value of the incomplete variable itself; for example, high-income people being unwilling to disclose their family income;
For data missing at random or not at random, it is not appropriate to delete the records directly, for the reasons given above. Values missing at random can be estimated from the known variables; for values missing not at random, there is still no good general solution.
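Before choosing any strategy, the natural first step is to quantify how much each variable is missing. A minimal sketch (the DataFrame `df` and its column names here are illustrative, not from the original data):

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries
df = pd.DataFrame({
    'age': [25.0, np.nan, 31.0, np.nan],
    'income': [3000.0, 4500.0, np.nan, 5200.0],
})

# Count of missing values per column
missing_count = df.isnull().sum()
# Missing ratio per column (mean of the boolean missing mask)
missing_ratio = df.isnull().mean()
print(missing_count)
print(missing_ratio)
```

The missing ratio computed this way is what the removal rules below (e.g. "drop variables more than 80% missing") are applied to.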
Directly removing records with missing values. This approach is simple and crude. It is suitable when the data volume is large (many records) and the missing proportion is small, so that removal has little effect on the whole. It is generally not recommended, because it easily causes loss of information and bias in the data.
```python
# Signature: df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

# 1. Drop the 'age' column
df.drop('age', axis=1, inplace=True)
# 2. Drop any row that contains a missing value
df.dropna()
# 3. Drop only the rows that have missing values in columns 'a' or 'b'
df.dropna(axis=0, subset=['a', 'b'], inplace=True)
```
Directly removing variables with missing values. From the first step we already know the missing ratio of each variable. If a variable's missing ratio is too high, it has basically lost its predictive value, and we can try removing it directly.
```python
# Drop variables whose missing ratio exceeds 80%
# (keep only columns with at least 20% non-missing values)
data = data.dropna(thresh=len(data) * 0.2, axis=1)
```
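As a quick check of the `thresh` logic (toy data; the column names are made up), a column that is 90% missing falls below the 20% non-missing threshold and is dropped, while a fully populated column survives:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'mostly_full': [float(i) for i in range(10)],
    'mostly_missing': [np.nan] * 9 + [9.0],  # 90% missing
})

# thresh = minimum number of non-NA values a column must have to be kept.
# len(data) * 0.2 = 2 here; 'mostly_missing' has only 1 non-NA value.
kept = data.dropna(thresh=len(data) * 0.2, axis=1)
print(kept.columns.tolist())
```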
Before filling missing values, we should first understand the missing variable itself: its meaning, how it was collected, and the logic by which it was computed, so that we know why it is missing and what a missing value represents. For example, a missing 'Age' is most likely missing at random, while a missing 'LoanNum' (number of loans) may simply mean the person has no loans, which is a meaningful missing value.

Filling with a global constant: 0, the mean, the median, or the mode can all be used. The mean suits data that is approximately normally distributed, where observations cluster evenly around the mean; the median suits skewed distributions or data with outliers; the mode suits categorical variables, which have no magnitude or order.
```python
# Mean filling
data['col'] = data['col'].fillna(data['col'].mean())
# Median filling
data['col'] = data['col'].fillna(data['col'].median())
# Mode filling (mode() can return several values; take the first)
data['col'] = data['col'].fillna(data['col'].mode()[0])
```
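The three fills above can be exercised on a toy column to see how they differ (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 2.0, 5.0])

mean_filled = s.fillna(s.mean())      # mean of [1, 2, 2, 5] is 2.5
median_filled = s.fillna(s.median())  # median is 2.0
mode_filled = s.fillna(s.mode()[0])   # most frequent value is 2.0
print(mean_filled.tolist())
```

Note how a single large value (5.0) pulls the mean above the median; this is why the median is preferred for skewed data.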
You can also use sklearn's Imputer class:
```python
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputed_data = pd.DataFrame(imr.fit_transform(df.values), columns=df.columns)
imputed_data
```
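Note that `Imputer` was deprecated and then removed from scikit-learn (as of version 0.22); in current versions the equivalent is `SimpleImputer` from `sklearn.impute`. A sketch of the same mean imputation with the current API (the toy DataFrame is invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Replace NaN in each column with that column's mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = pd.DataFrame(imr.fit_transform(df), columns=df.columns)
print(imputed_data)
```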
Fill missing values with some form of interpolation.
```python
# interpolate(): linear interpolation, filling a missing value from the
# values before and after it; a leading missing value is not interpolated.
df['c'] = df['c'].interpolate()
# Forward fill: replace a missing value with the previous value.
# If the first row is missing, there is no previous value and it stays missing.
df.fillna(method='pad')
# Backward fill: replace a missing value with the next value.
# If the last row is missing, there is no following value and it stays missing.
df.fillna(method='backfill')
```
The following two methods require some data preparation first:
```python
# Fill column 'a' by interpolation first, so it can be used as a training feature
df['a'] = df['a'].interpolate()
# Split the data into rows where the target is present and rows where it is missing
df_notnull = df[df.is_fill == 0]  # rows without a missing target
df_null = df[df.is_fill == 1]     # rows with a missing target
x_train = df_notnull[['b', 'a']]  # training features: columns 'a' and 'b'
y_train = df_notnull['c']         # training target: column 'c'
test = df_null[['b', 'a']]        # features of the rows to predict
```
Filling with the KNN algorithm treats the column to be filled as the target: fit a KNN model on the rows without missing values, then predict the target for the rows where it is missing. (For a continuous target the prediction is generally a distance-weighted average; for a discrete target it is generally a weighted vote.)
```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

def knn_filled_func(x_train, y_train, test, k=3, dispersed=True):
    # x_train: feature rows whose target is not missing (target column excluded)
    # y_train: the target column, without missing values
    # test:    feature rows whose target is missing (target column excluded)
    if dispersed:
        knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    else:
        knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(x_train, y_train.astype('int'))
    return test.index, knn.predict(test)

index, predict = knn_filled_func(x_train, y_train, test, 3, True)
```
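The same idea, shown end to end on self-contained synthetic data (all column names and values below are invented for illustration; a continuous target, so a distance-weighted `KNeighborsRegressor` is used):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Toy frame: column 'c' follows the pattern c = 3 * a; two entries are missing
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    'c': [3.0, 6.0, np.nan, 12.0, np.nan, 18.0],
})

notnull = df[df['c'].notnull()]  # rows with an observed target
null = df[df['c'].isnull()]      # rows to be filled

knn = KNeighborsRegressor(n_neighbors=2, weights='distance')
knn.fit(notnull[['a', 'b']], notnull['c'])

# Write the predictions back into the missing slots
df.loc[null.index, 'c'] = knn.predict(null[['a', 'b']])
print(df['c'].tolist())
```

With `weights='distance'` and the two equidistant neighbors here, each missing entry is filled with the average of its neighbors' targets (9.0 and 15.0), recovering the underlying pattern.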
The idea of filling with a random forest is the same as with KNN: fit a model on the existing data, then use it to predict the missing values of the variable.
```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def RandomForest_filled_func(x_train, y_train, test, dispersed=True):
    # x_train: feature rows whose target is not missing (target column excluded)
    # y_train: the target column, without missing values
    # test:    feature rows whose target is missing (target column excluded)
    if dispersed:
        rf = RandomForestClassifier()  # discrete target
    else:
        rf = RandomForestRegressor()   # continuous target
    rf.fit(x_train, y_train.astype('int'))
    return test.index, rf.predict(test)

index, predict = RandomForest_filled_func(x_train, y_train, test, True)
```
Writing the predictions back after filling:
```python
# Fill in the predicted values
df_null['c'] = predict
# Write them back into the original data
df['c'] = df['c'].fillna(df_null['c'])
df.info()
```
(Figures omitted: red points are the filled values, green points are the original data. The first figure shows random-forest filling; the second shows interpolation filling.)