Dice Loss for Data-imbalanced NLP Tasks
Xiaofei Sun*, Xiaoya Li*, Yuxian Meng, Junjun Liang, Fei Wu, Jiwei Li
Today we introduce the second ACL 2020 paper from Shannon.AI: Dice Loss for Data-imbalanced NLP Tasks.
In this paper, we propose using Dice Loss to alleviate the data imbalance found in many NLP tasks and thereby improve F1-based performance.
Dice Loss is simple in form and effective. Replacing Cross Entropy Loss with Dice Loss yields results close to or better than the current best on the part-of-speech tagging datasets CTB5, CTB6 and UD1.4, the named entity recognition datasets CoNLL2003, OntoNotes5.0, MSRA and OntoNotes4.0, and the question answering datasets SQuAD and Quoref.
“Imbalanced” datasets in natural language processing
Data imbalance is a very common problem across natural language processing tasks, especially in sequence labeling. For example, tagging tasks generally use the BIEOS scheme. If we treat O as the negative class and all other labels as positive, the ratio of negative to positive examples is quite large.
This imbalance causes two problems:
Training-test mismatch. The negatives, which make up the vast majority, dominate training and bias the model toward predicting negative, whereas the F1 metric used at test time requires every class to be predicted accurately.
Too many easy negatives. A large majority of negatives also means a large number of easy samples. These easy samples contribute almost nothing to learning the hard samples; on the contrary, under cross-entropy they push the model to forget what it has learned about the hard samples.
In short, under cross-entropy a large number of easy negatives pushes the model to neglect the hard samples. Because sequence labeling is usually evaluated by F1, poor predictions on the positive classes lead directly to a low F1 score.
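The mismatch between an accuracy-like training signal and the F1 metric can be seen in a small sketch. The 5:95 class ratio and the always-negative "model" below are hypothetical and chosen only for illustration.

```python
def accuracy(gold, pred):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def f1_score(gold, pred):
    """F1 of the positive class (label 1); returns 0.0 when it is undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced data: 5 positives, 95 negatives.
gold = [1] * 5 + [0] * 95
# A degenerate model that always predicts the majority (negative) class.
always_negative = [0] * 100

acc = accuracy(gold, always_negative)   # 0.95
f1 = f1_score(gold, always_negative)    # 0.0
```

The degenerate predictor looks strong by accuracy yet is useless by F1, which is the gap the text describes.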
In this paper, we argue that this problem stems from a property of cross-entropy itself: cross-entropy treats every sample “equally”, pushing each one toward 1 (positive) or 0 (negative) regardless of its class. In practice, however, classifying a sample as negative only requires its probability to be below 0.5; there is no need to push it all the way to 0.
Based on this observation, we build on the existing Dice Loss and propose a self-adjusting, Dice-based loss, DSC, which makes the model focus more on hard samples during training and reduces the learning from easy negatives, thereby improving F1 overall.
We ran experiments on multiple tasks, including part-of-speech tagging, named entity recognition, question answering and paraphrase identification.
Part-of-speech tagging: we reach an F1 of 97.92 on CTB5, 96.57 on CTB6, 96.98 on UD1.4, 99.38 on WSJ and 92.58 on Tweets, significantly surpassing the baseline models.
Named entity recognition: we achieve an F1 of 93.33 on CoNLL2003, 92.07 on OntoNotes5, 96.72 on MSRA and 84.47 on OntoNotes4, close to or exceeding the current best.
Question answering: we surpass the baseline models by about 1 F1 point on SQuAD1/2 and Quoref.
Paraphrase identification: our method also significantly improves the results.
From Cross Entropy to Dice Losses
Cross Entropy Loss (CE)
We walk through the losses from cross-entropy to Dice Loss in logical order. We use binary classification for illustration: for an input x_i, the output is a binary probability distribution p_i = [p_{i0}, p_{i1}], and the binary ground truth is y_i = [y_{i0}, y_{i1}].
First, the traditional cross-entropy loss is:
$$\mathrm{CE} = -\frac{1}{N} \sum_{i} \sum_{j \in \{0,1\}} y_{ij} \log p_{ij}$$
Clearly, CE treats every sample equally, regardless of whether the sample is easy or hard. When there are many easy samples, they take over training, making it difficult for the model to learn from the hard samples.
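A minimal sketch of this effect, assuming the standard binary cross-entropy on the positive-class probability. The sample counts and probabilities are hypothetical.

```python
import math


def binary_cross_entropy(p1, y1):
    """Cross-entropy of one sample; p1 is the predicted positive-class probability."""
    return -(y1 * math.log(p1) + (1 - y1) * math.log(1 - p1))


def ce_slope_wrt_p1(p1, y1):
    """Analytic derivative of the per-sample cross-entropy with respect to p1."""
    return -(y1 / p1) + (1 - y1) / (1 - p1)


# A negative sample at p1 = 0.4 is already classified correctly (p1 < 0.5),
# yet the loss still increases with p1, so training keeps pushing p1 toward 0.
slope_correct_negative = ce_slope_wrt_p1(0.4, 0)

# Hypothetical batch: 2000 easy negatives (p1 = 0.01) and 10 hard positives (p1 = 0.3).
easy_total = 2000 * binary_cross_entropy(0.01, 0)
hard_total = 10 * binary_cross_entropy(0.3, 1)
```

Each easy negative contributes very little on its own, but in this hypothetical batch their sum exceeds the total loss of the hard positives.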
A simple improvement is therefore to reduce the rate at which the model learns from easy samples, which gives the following weighted cross-entropy loss:
$$\mathrm{WCE} = -\frac{1}{N} \sum_{i} \alpha_i \sum_{j \in \{0,1\}} y_{ij} \log p_{ij}$$
By setting a different weight \alpha_i for each sample, we can control how much the model learns from it. The difficulty then shifts to choosing the weights.
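As a sketch of the weighted variant, assuming one weight per sample multiplied into the standard binary cross-entropy. The weights below are hypothetical and hand-picked, which is exactly the difficulty the text points out.

```python
import math


def weighted_cross_entropy(probs, labels, weights):
    """Mean of per-sample binary cross-entropy, each scaled by its own weight."""
    total = 0.0
    for p1, y1, w in zip(probs, labels, weights):
        total += -w * (y1 * math.log(p1) + (1 - y1) * math.log(1 - p1))
    return total / len(probs)


# Hypothetical batch: one hard positive followed by three easy negatives.
probs = [0.3, 0.1, 0.2, 0.1]
labels = [1, 0, 0, 0]

uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0, 1.0])
# Up-weighting the positive sample makes it count three times as much.
upweighted = weighted_cross_entropy(probs, labels, [3.0, 1.0, 1.0, 1.0])
```

With uniform weights the function reduces to plain cross-entropy; any other choice has to be tuned for the dataset at hand.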
Because our goal is to alleviate dataset imbalance and improve the F1 metric, we would like a loss function that acts on F1 directly.
Sørensen-Dice coefficient (DSC)
Fortunately, we can use an existing measure, the Sørensen-Dice coefficient (DSC), to express F1. DSC measures the similarity between two sets:
$$\mathrm{DSC}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}$$
If we let A be the set of all samples the model predicts as positive and B the set of all samples that are actually positive, DSC can be rewritten as:
$$\mathrm{DSC}(D, f) = \frac{2\,\mathrm{TP}(D, f)}{2\,\mathrm{TP}(D, f) + \mathrm{FN}(D, f) + \mathrm{FP}(D, f)} = F_1(D, f)$$
Here TP is True Positive, FN is False Negative, FP is False Positive, D is the dataset and f is the classification model. In this sense, DSC is equivalent to F1.
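The equivalence can be checked numerically. The sketch below uses hypothetical sample IDs for the predicted-positive set A and the actually-positive set B.

```python
def dice_coefficient(a, b):
    """Sørensen-Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def f1_from_counts(tp, fp, fn):
    """F1 written in terms of confusion counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical sample IDs.
predicted_positive = {1, 2, 3, 4}        # A: predicted positive by the model
actual_positive = {3, 4, 5, 6, 7, 8}     # B: actually positive

tp = len(predicted_positive & actual_positive)   # in both sets
fp = len(predicted_positive - actual_positive)   # predicted but not actual
fn = len(actual_positive - predicted_positive)   # actual but not predicted

dsc = dice_coefficient(predicted_positive, actual_positive)   # 0.4
f1 = f1_from_counts(tp, fp, fn)                               # 0.4
```

Since |A| = TP + FP and |B| = TP + FN, the two expressions always agree, not only for this example.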
We would therefore like to optimize DSC directly, but the expression above is discrete. To do so, we need to turn it into a continuous version, a kind of soft F1.
For a single sample x_i with predicted positive-class probability p_{i1} and ground-truth label y_{i1}, we directly define its DSC as:
$$\mathrm{DSC}(x_i) = \frac{2 p_{i1} y_{i1}}{p_{i1} + y_{i1}}$$
Note that this is consistent with the set-based definition of DSC at the beginning. If x_i is a negative sample (y_{i1} = 0), its DSC is 0 and it contributes nothing to training. To let negative samples contribute as well, we add a smoothing term \gamma:
$$\mathrm{DSC}(x_i) = \frac{2 p_{i1} y_{i1} + \gamma}{p_{i1} + y_{i1} + \gamma}$$
With this form, however, the smoothing term has to be tuned by hand for each dataset. Moreover, when there are many easy negatives, they still dominate training even with the smoothing term. We therefore use a “self-adjusting” DSC:
$$\mathrm{DSC}(x_i) = \frac{2 (1 - p_{i1}) p_{i1} \cdot y_{i1} + \gamma}{(1 - p_{i1}) p_{i1} + y_{i1} + \gamma}$$
Comparing the two DSCs, we can see that (1 - p_{i1}) acts as a scaling factor: for easy samples (p_{i1} close to 1 or 0), (1 - p_{i1}) p_{i1} makes the model pay less attention to them.
From the perspective of the derivative, once the model classifies the current sample correctly (p_{i1} has just passed 0.5), DSC makes the model pay less attention to it instead of encouraging the prediction toward the endpoints 0 or 1, which effectively prevents training from being dominated by the large number of easy samples.
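The behaviour of the three per-sample DSC forms can be sketched as follows. The formulas in the code are the forms as we read them from the description above, and gamma = 1.0 is an arbitrary choice for illustration.

```python
def dsc_plain(p1, y1):
    """Per-sample DSC without smoothing (undefined when p1 and y1 are both 0)."""
    return 2 * p1 * y1 / (p1 + y1)


def dsc_smoothed(p1, y1, gamma=1.0):
    """Per-sample DSC with a smoothing term gamma in numerator and denominator."""
    return (2 * p1 * y1 + gamma) / (p1 + y1 + gamma)


def dsc_self_adjusting(p1, y1, gamma=1.0):
    """Self-adjusting DSC: p1 is replaced by the scaled term (1 - p1) * p1."""
    scaled = (1 - p1) * p1
    return (2 * scaled * y1 + gamma) / (scaled + y1 + gamma)


def slope(fn, p1, y1, eps=1e-6):
    """Central-difference estimate of d fn / d p1."""
    return (fn(p1 + eps, y1) - fn(p1 - eps, y1)) / (2 * eps)


# Without smoothing, a negative sample always scores 0 and gives no signal.
negative_plain = dsc_plain(0.3, 0)

# The scaled term is small for confident predictions and largest at p1 = 0.5.
scale_easy = (1 - 0.99) * 0.99
scale_borderline = (1 - 0.5) * 0.5

# For a positive sample at p1 = 0.5, the smoothed DSC still has a clear slope,
# while the slope of the self-adjusting DSC vanishes.
slope_smoothed = slope(dsc_smoothed, 0.5, 1)
slope_adjusted = slope(dsc_self_adjusting, 0.5, 1)
```

In this sketch the self-adjusting form stops pulling on a positive sample once its probability reaches 0.5, whereas the smoothed form keeps pushing it toward 1.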
This is in fact similar to Focal Loss (FL), which also reduces the learning weight of samples that are already well classified:
$$\mathrm{FL} = -\frac{1}{N} \sum_{i} \sum_{j \in \{0,1\}} (1 - p_{ij})^{\gamma}\, y_{ij} \log p_{ij}$$
However, although FL reduces the learning weight of easy samples, it still fundamentally encourages them to move toward 0 or 1, which is the essential difference from DSC.
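A short sketch of this point, assuming the standard per-sample focal loss without a class weight. The focusing parameter gamma = 2.0 is a commonly used value and is chosen here only for illustration.

```python
import math


def cross_entropy(p1, y1):
    """Per-sample binary cross-entropy on the positive-class probability p1."""
    p_true = p1 if y1 == 1 else 1 - p1
    return -math.log(p_true)


def focal_loss(p1, y1, gamma=2.0):
    """Per-sample focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    p_true = p1 if y1 == 1 else 1 - p1
    return -((1 - p_true) ** gamma) * math.log(p_true)


# An easy negative (p1 = 0.01) is down-weighted heavily compared with cross-entropy.
ce_easy = cross_entropy(0.01, 0)
fl_easy = focal_loss(0.01, 0)

# But for a negative sample the focal loss still shrinks as p1 moves toward 0,
# so even an already-correct prediction (p1 = 0.4) keeps being pushed to the endpoint.
fl_at_04 = focal_loss(0.4, 0)
fl_at_02 = focal_loss(0.2, 0)
```

The focal loss changes how much each sample counts, not where its optimum lies: the minimum for a negative sample remains at p1 = 0.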
We can therefore say that DSC “balances” the learning of easy and hard samples, which raises the final F1 score (because F1 requires good results on every class).
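Dice Loss (DL) and Tversky Loss (TL)
In addition to the DSC above, we also compare two variants of DSC, Dice Loss (DL) and Tversky Loss (TL):
$$\mathrm{DL}(x_i) = 1 - \frac{2 p_{i1} y_{i1} + \gamma}{p_{i1}^2 + y_{i1}^2 + \gamma}$$
$$\mathrm{TL}(x_i) = 1 - \frac{p_{i1} y_{i1} + \gamma}{p_{i1} y_{i1} + \alpha\, p_{i1} y_{i0} + \beta\, p_{i0} y_{i1} + \gamma}$$
In particular, TL degenerates to DSC when \alpha = \beta = 0.5.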
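The degenerate case can be verified with a small sketch. The per-sample forms of DL and TL below are the ones we assume from the description, with gamma as the smoothing term; the probability 0.7 is a hypothetical value.

```python
def dice_loss(p1, y1, gamma=1.0):
    """Dice Loss (DL): squared terms in the denominator."""
    return 1 - (2 * p1 * y1 + gamma) / (p1 ** 2 + y1 ** 2 + gamma)


def tversky_loss(p1, y1, alpha=0.5, beta=0.5, gamma=1.0):
    """Tversky Loss (TL): alpha and beta weight the two kinds of error mass."""
    p0, y0 = 1 - p1, 1 - y1
    overlap = p1 * y1
    return 1 - (overlap + gamma) / (overlap + alpha * p1 * y0 + beta * p0 * y1 + gamma)


def dsc_loss(p1, y1, gamma=0.0):
    """1 - DSC with an optional smoothing term."""
    return 1 - (2 * p1 * y1 + gamma) / (p1 + y1 + gamma)


# With alpha = beta = 0.5 and no smoothing, TL equals 1 - DSC.
tl_value = tversky_loss(0.7, 1, alpha=0.5, beta=0.5, gamma=0.0)
dsc_value = dsc_loss(0.7, 1, gamma=0.0)

# DL of a perfect positive prediction is 0.
dl_perfect = dice_loss(1.0, 1)
```

With a nonzero smoothing term the match still holds up to a rescaling of gamma: TL with alpha = beta = 0.5 and smoothing gamma equals 1 - DSC with smoothing 2 * gamma.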
Summary of losses
Finally, let’s summarize each loss:
We collectively refer to the latter three losses (DL, TL and DSC) as Dice Loss.
Experiments
Part-of-speech tagging
We first ran experiments on part-of-speech tagging. The datasets include the Chinese CTB5/6 and UD1.4 and the English WSJ and Tweets. The baseline models include Joint-POS, Lattice-LSTM and BERT. The following tables show the results for Chinese and English:
DSC achieves the best result on every dataset, whereas the improvements from the other methods are inconsistent.
Named entity recognition
Next we ran experiments on named entity recognition. The datasets include the Chinese OntoNotes4 and MSRA and the English CoNLL2003 and OntoNotes5. The baseline models include ELMo, CVT, BERT-Tagger and BERT-MRC. The following table shows the results:
As with part-of-speech tagging, DSC delivers consistent improvements.
Question answering
Next we ran experiments on question answering with SQuAD1/2 and Quoref. The baseline models include QANet, BERT and XLNet. The following table shows the results:
DSC brings significant improvements for both BERT and XLNet.
Paraphrase identification
Paraphrase identification is a classification task that requires deciding whether two given sentences have the same meaning. Compared with the labeling tasks, this task is much less imbalanced. The following table shows the results:
Although the improvement is not as large as on the sequence labeling tasks, it is still close to one point.
The influence of the degree of imbalance
Since Dice Loss is proposed to alleviate imbalanced data distributions, it is natural to ask how the degree of imbalance affects the improvement. We use the paraphrase identification dataset QQP for this experiment. The original QQP data contains 37% positive and 63% negative examples. We change the data distribution in the following ways:
+positive: add positive examples through synonym replacement so that the data distribution is balanced (50:50).
+negative: add negative examples through synonym replacement so that the data distribution becomes more imbalanced (21:79).
-negative: randomly delete negative examples so that the data distribution is balanced (50:50).
+positive&+negative: add both positive and negative examples so that the data distribution is balanced (50:50).
The datasets obtained by the above methods end up with the same size. The following table shows the results:
First, data balance has a very large impact on the final result, even for the baseline BERT model. Generally speaking, the more imbalanced the data, the worse the final result. Of course, this is also affected by the overall amount of data.
Second, on the balanced datasets (+positive, +positive&+negative), the improvement from DSC is slightly smaller than on the imbalanced datasets (original, +negative); the size of the improvement is also related to the amount of data.
The impact on accuracy-oriented tasks
The experiments above show that Dice Loss helps improve F1. What about tasks that use accuracy as the metric? We ran experiments on SST2 and SST5. The following table shows the results:
Using Dice Loss actually reduces accuracy. This is because Dice Loss in effect accounts for the “balance” among classes rather than treating all the data as a whole.
Summary
This paper builds on the existing Dice Loss and proposes DSC, a new self-adjusting loss for NLP tasks with imbalanced data distributions, to alleviate the mismatch between training with cross-entropy and testing with F1.
Experiments show that this loss significantly increases F1 on labeling and classification tasks, and that the F1 improvement is closely related to the degree of data imbalance and the amount of data.