![](https://crypto4nerd.com/wp-content/uploads/2023/06/1Ft-v40f06x7-994eLqmt9g.png)
Self-Supervised Contrastive Learning: ⟶ No labels are required!
• Positive Sample: Data Augmentation
• Negative Sample: Random (In-batch Negatives) ⟶ see the InfoNCE sketch after this list
• Four Challenges of Self-Supervised Contrastive Learning:
1. Non-trivial Data Augmentation
2. Risk of “Sampling Bias” (False Negatives)
3. Hard Negative Mining
4. Large Batch Size
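As a reference point for the rest of the notes, here is a minimal sketch of the standard in-batch-negatives objective (InfoNCE); the normalization and the temperature t are assumed choices, not prescribed by the post:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, t=0.1):
    # z1, z2: (B, d) embeddings of two augmented views of the same batch.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / t                 # (B, B) similarity matrix
    # Diagonal entries are positives; off-diagonals are in-batch negatives.
    targets = torch.arange(len(z1))
    return F.cross_entropy(logits, targets)
```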
1. Data Augmentation for Text
- Text Space:
∘ Lexical Editing (token-level)
∘ Back-Translation (sentence-level)
- Embedding Space:
∘ Dropout
∘ Cutoff
∘ Mixup
- Manual

Lexical editing operations:
- Synonym Replacement
- Random Insertion
- Random Swap
- Random Deletion
- Word Replacement with c-BERT (Wu et al.,2018)
Incorporate contextual information to do word replacement.
Conditional BERT, or c-BERT for short, performs contextual augmentation: given label information (label embeddings), c-BERT provides replacement tokens without changing the label of the sentence (e.g., its sentiment).
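As a rough sketch of the underlying mechanism, here is plain BERT masked-LM replacement via Hugging Face (the checkpoint is an assumed choice; c-BERT additionally conditions on the label, which this sketch does not reproduce):

```python
from transformers import pipeline

# Plain masked-LM word replacement; c-BERT further conditions on the
# label embedding so replacements preserve the sentence's label.
fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("The movie was [MASK] and I loved it.", top_k=3):
    print(cand["token_str"], round(cand["score"], 3))
```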
• Back-Translation
Create paraphrases of a sentence using back-translation.
• Positive: back-translations of the same sentence.
• Negative: back-translations of different sentences.
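A minimal back-translation sketch, assuming MarianMT checkpoints (the pivot language and model names are illustrative choices, not prescribed by the post):

```python
from transformers import pipeline

# English -> French -> English round trip yields a paraphrase (a positive).
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text):
    french = en_fr(text)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]

positive = back_translate("The plot was gripping from start to finish.")
```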
• Dropout
Dropout is a technique normally used to prevent overfitting. Here, we apply dropout in the embedding space to create contrastive examples.
• Positive: two different dropout masks create two different embeddings of the same sentence.
• Negative: in-batch negatives.
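A minimal sketch of dropout-based positive pairs; a toy encoder stands in for a real sentence encoder here:

```python
import torch
import torch.nn as nn

# Toy encoder with dropout; a transformer encoder behaves the same way.
encoder = nn.Sequential(nn.Linear(32, 32), nn.Dropout(p=0.1))
encoder.train()                # keep dropout active while "augmenting"

x = torch.randn(4, 32)         # a batch of 4 sentence representations
z1 = encoder(x)                # first dropout mask
z2 = encoder(x)                # second, different dropout mask
# (z1[i], z2[i]) is a positive pair; z2[j], j != i, act as in-batch negatives.
```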
• Cutoff
A structured version of dropout.
Denote 𝐿 as the number of tokens and 𝑑 as the dimension of the word embeddings, so a sentence becomes an 𝐿×𝑑 embedding matrix. Now, we can do:
(a.) Token cutoff: remove some rows, corresponding to some tokens.
(b.) Feature cutoff: remove some columns, corresponding to some features.
(c.) Span cutoff: remove a contiguous span of rows, corresponding to a run of consecutive tokens.
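A sketch of the three cutoff variants on an 𝐿×𝑑 embedding matrix (zeroing out is used here in place of literal removal):

```python
import torch

L, d = 10, 8
E = torch.randn(L, d)          # token-embedding matrix: L tokens, d features

def token_cutoff(E, i):        # zero out row i (one token)
    E = E.clone(); E[i, :] = 0; return E

def feature_cutoff(E, j):      # zero out column j (one feature)
    E = E.clone(); E[:, j] = 0; return E

def span_cutoff(E, i, k):      # zero out rows i..i+k-1 (k consecutive tokens)
    E = E.clone(); E[i:i + k, :] = 0; return E
```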
• Mixup
Linear interpolation over a pair of samples to create a new sample.
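A one-line sketch, with the interpolation weight λ drawn from a Beta distribution (α is an assumed hyperparameter):

```python
import torch

def mixup(x_i, x_j, alpha=0.2):
    # lam ~ Beta(alpha, alpha); interpolate a pair of embeddings.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x_i + (1 - lam) * x_j
```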
• NL-Augmenter: Manual Data Augmentation (Dhole et al.,2021)
Crowdsources the “wisdom of researchers”: a large collection of manually designed text transformations contributed by the community.
2. Risk of “Sampling Bias” (False Negatives)
Since we don’t know the labels, we may accidentally create false negatives by sampling examples that belong to the same class as the anchor.
• Debiased Contrastive Learning (Chuang et al.,2020)
Assume prior probabilities for positives (𝜏⁺) and negatives (𝜏⁻), then approximate the distribution of negative examples to debias the loss.
We know 𝑝(𝑥′) and 𝑝ₓ⁺(𝑥′) because we can create them ourselves, but we don’t know 𝑝ₓ⁻(𝑥′). So, we draw 𝑁 samples from 𝑝(𝑥′), which contains both positives and negatives, and 𝑀 samples from 𝑝ₓ⁺(𝑥′).
After this, we can rearrange the terms and substitute the result for 𝑝ₓ⁻(𝑥′) in the contrastive learning objective (the 𝑁𝑔(·) term). This way, we can estimate the distribution 𝑝ₓ⁻(𝑥′).
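A reconstruction of the two steps in LaTeX, following Chuang et al. (2020), since the post's original figures are not reproduced here (𝑓 is the encoder, u_i ∼ 𝑝(𝑥′), v_i ∼ 𝑝ₓ⁺(𝑥′), and t is the temperature):

```latex
% Decompose the data distribution and solve for the unknown negatives:
\[
p(x') = \tau^+ p_x^+(x') + \tau^- p_x^-(x')
\quad\Rightarrow\quad
p_x^-(x') = \frac{p(x') - \tau^+ \, p_x^+(x')}{\tau^-}
\]
% Plug-in estimator that replaces the N g(.) term in the objective:
\[
g = \max\!\left\{
  \frac{1}{\tau^-}\!\left(
    \frac{1}{N}\sum_{i=1}^{N} e^{f(x)^\top f(u_i)}
    - \frac{\tau^+}{M}\sum_{i=1}^{M} e^{f(x)^\top f(v_i)}
  \right),\; e^{-1/t}
\right\}
\]
```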
3. Hard Negative Mining
Some negative samples have labels different from the anchor point, but their embeddings lie so close to it that they are hard to optimize against.
On top of the traditional contrastive learning objective, weights are applied by comparing the magnitudes of the dot products between the anchor point and its negative pairs.
This weighting scheme assigns greater weight to negative pairs with high similarity to the anchor, and smaller weight to pairs with low similarity.
This approach guides the encoder to optimize more accurately (see the sketch below).
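One possible implementation of such a weighting, as a sketch: cosine similarities, a softmax-style weight, and the hyperparameters β and t are all assumptions, not details given in the post:

```python
import torch
import torch.nn.functional as F

def weighted_nce(anchor, positive, negatives, beta=1.0, t=0.1):
    # anchor, positive: (d,) embeddings; negatives: (K, d).
    pos = F.cosine_similarity(anchor, positive, dim=0) / t
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / t  # (K,)
    # Harder (more similar) negatives get larger weights; mean weight is 1.
    w = torch.softmax(beta * neg, dim=0) * len(neg)
    neg_term = (w * neg.exp()).mean()
    return -torch.log(pos.exp() / (pos.exp() + neg_term))
```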
• Hard Positive Mining by Adversarial Examples (Kim et al.,2020)
Create adversarial examples that are positive but confuse the model, in order to increase robustness.
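A rough FGSM-style sketch of the idea; this is one possible instantiation, and the step size eps is an assumed hyperparameter (the exact procedure in Kim et al. (2020) differs in detail):

```python
import torch

def adversarial_positive(encoder, x, contrastive_loss, eps=0.01):
    # Perturb x in the direction that increases the contrastive loss;
    # the result still depicts the same content, so it serves as a
    # hard positive for the anchor.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = contrastive_loss(encoder(x_adv))
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()
```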
4. Large Batch Size
A larger batch size covers more diverse negative samples and provides more meaningful learning signals, but requires a large amount of computational resources.
• Memory Bank to Reduce Computation (Wu et al.,2018)
Memory Bank: Compute and store the representations in advance, instead of computing embeddings for all examples in a batch.
⟶ Much like Dynamic Programming!
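A minimal sketch of a memory bank; the sizes and the blended update rule are illustrative assumptions:

```python
import torch

num_examples, dim = 10_000, 128
bank = torch.nn.functional.normalize(torch.randn(num_examples, dim), dim=1)

def sample_negatives(k):
    # Negatives come from stored representations: no re-encoding needed.
    return bank[torch.randint(0, num_examples, (k,))]

def update_bank(indices, new_embeddings, momentum=0.5):
    # Blend old and new entries so stored features drift slowly.
    bank[indices] = momentum * bank[indices] + (1 - momentum) * new_embeddings
```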
• Momentum Contrast (MoCo) (He et al.,2020)
Contrastive learning is essentially learning to match a query to a key.
- Traditional contrastive learning framework:
Use an encoder for the query & an encoder for the key.
Backpropagate through both sides.
The number of negative samples is restricted to the size of the mini-batch. ⟶ Need a very large batch size.
- Momentum Encoder:
Use a momentum encoder for the key, which maintains a queue of keys; the key encoder is updated with a momentum rule instead of backpropagation.
This way, we can scale up the number of negative samples: the momentum-updated queue lets us incorporate many keys at the same time.
This also makes contrastive training more stable, because we are maintaining a large and consistent dictionary (see the sketch below).
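A sketch of the two MoCo ingredients: the momentum update θₖ ← m·θₖ + (1−m)·θ_q and the FIFO key queue (the queue size and feature dimension are illustrative):

```python
import torch

m = 0.999  # momentum coefficient used in the MoCo paper

@torch.no_grad()
def momentum_update(query_encoder, key_encoder):
    # theta_k <- m * theta_k + (1 - m) * theta_q  (no backprop into keys).
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_((1 - m) * q.data)

def enqueue_dequeue(queue, new_keys):
    # Newest keys in, oldest keys out; the queue size stays fixed.
    return torch.cat([new_keys, queue], dim=0)[: queue.shape[0]]
```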
Supervised Contrastive Learning: ⟶ Labels are required! (see the loss sketch below)
- Positive samples: Same class as the anchor
- Negative samples: Different class
- Pros:
∘ No need for Data Augmentation
∘ No risk of “Sampling Bias”
∘ No need for a Large Batch Size
- Cons:
∘ Need Labels
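A sketch of a supervised contrastive loss in this spirit, using the common "average over same-class positives" formulation (the temperature t is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def sup_con_loss(z, labels, t=0.1):
    # z: (B, d) embeddings; labels: (B,) integer class labels.
    z = F.normalize(z, dim=1)
    sim = z @ z.T / t
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability over all same-class positives per anchor.
    return -(log_prob.masked_fill(~pos, 0).sum(1) / pos.sum(1).clamp(min=1)).mean()
```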