In the world of text classification, being able to generalize across various domains and aspects without additional training is a game-changer. This is exactly what the Label Agnostic Pre-training for Zero-shot Text Classification approach aims to achieve. Developed by Christopher Clarke, Yuzhao Heng, Yiping Kang, Krisztian Flautner, Lingjia Tang, and Jason Mars, this novel approach introduces two new training strategies: Implicit training and Explicit pre-training. With these strategies, language models gain aspect-level understanding during training, enabling them to handle unseen data more effectively.
One key component of this approach is the Universal Text Classification Dataset (UTCD). Consisting of 18 classification datasets, UTCD covers three categories: Sentiment, Intent/Dialogue, and Topic classification. The datasets are diverse, spanning domains such as Banking, Finance, Legal, and more, and contain a variety of sequence lengths. These datasets are labeled in natural language, enabling the development of techniques that leverage the descriptive class names for zero-shot support.
To make it easy for developers to get started, the UTCD dataset and trained models are available on HuggingFace. Simply follow the instructions provided there to access and use the dataset and models.
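For quick experimentation, the dataset can be pulled down with the `datasets` library. The snippet below is only a minimal sketch: the repository identifier and configuration name are placeholders I'm assuming for illustration, so check the HuggingFace page for the exact names.

```python
# Minimal sketch of loading a UTCD sub-dataset with the HuggingFace datasets library.
# The identifier and config name below are placeholders -- use the exact names
# listed on the project's HuggingFace page.
from datasets import load_dataset

utcd = load_dataset("claritylab/UTCD", "go_emotion")  # hypothetical identifier/config
print(utcd["train"][0])        # each example pairs a text with its natural-language label
print(utcd["train"].features)  # inspect the descriptive class names used for zero-shot
```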
If you prefer to set up the environment locally, here are the steps:
Setup Environment
- OS: UNIX
- Python version: 3.8.10
- CUDA version: 11.6
- Create a conda environment:

  ```bash
  conda create -n zs-cls python=3.8.10 pip
  ```

- Move to the project root directory and install the required Python packages:

  ```bash
  pip3 install -r requirements.txt
  ```

- Add the current directory to the Python path to ensure our local package is found:

  ```bash
  export PYTHONPATH=$PYTHONPATH:`pwd`
  ```
BERT Sequence Classifier
The BERT Sequence Classifier serves as a strong supervised baseline in this framework. Here are the steps to train and evaluate the model:
Training
- Train solely on in-domain dataset go_emotion:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain in --dataset go_emotion
  ```

- Train solely on out-of-domain dataset consumer_finance:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain out --dataset consumer_finance
  ```

- Train on all in-domain datasets:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain in --dataset all
  ```
Evaluation
- Evaluate a local model on out-of-domain dataset multi_eurlex:

  ```bash
  python zeroshot_classifier/models/bert.py test --domain out --dataset multi_eurlex --model_name_or_path models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained
  ```
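After training, you can run inference on new text. The snippet below is a minimal sketch using the transformers text-classification pipeline, assuming the checkpoint directory shown above contains a standard sequence-classification model and its tokenizer.

```python
# Minimal inference sketch for a trained BERT sequence classifier.
# Assumes the checkpoint directory holds a standard HuggingFace
# sequence-classification model and tokenizer.
from transformers import pipeline

model_dir = "models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained"  # path from the example above
classifier = pipeline("text-classification", model=model_dir, tokenizer=model_dir)

result = classifier("The new regulation applies to all credit institutions in the EU.")
print(result)  # [{'label': ..., 'score': ...}] -- labels come from the dataset's class names
```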
Binary & Dual Encoding Zero-shot Classification
Another approach to zero-shot classification is the Binary & Dual Encoding technique: a binary (cross-encoder) model scores each text–label pair jointly, while a dual-encoder (Bi-Encoder) embeds texts and labels separately and compares them. Here's how you can train and evaluate models using this strategy:
Training
- Vanilla training on Binary BERT:

  ```bash
  python zeroshot_classifier/models/binary_bert.py train --mode vanilla --batch_size 32 --epochs 8 --learning_rate 2e-5 --output_dir '{a=2e-5}'
  ```

- Explicit training on Bi-Encoder:

  ```bash
  python zeroshot_classifier/models/bi-encoder.py train --mode explicit --model_init '2022-11-21_18-58-54_Aspect-Pretrain-Binary-BERT_{md=exp, na=T}_{a=3e-05}/trained'
  ```
Evaluation
- Evaluate an implicitly-trained model on all in-domain datasets:

  ```bash
  python zeroshot_classifier/models/binary_bert.py test --mode implicit-on-text-encode-sep --domain in --model_dir_nm 2022-10-12_01-21-08_Binary-BERT-implicit-on-text-encode-sep-rand-aspect-norm
  ```
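To make the binary-encoding idea concrete, here is a conceptual sketch of zero-shot inference: the model scores each (text, candidate label) pair and the highest-scoring label wins. The checkpoint path is a placeholder, and the exact input formatting used by the repo's Binary BERT (e.g. aspect tokens) may differ, so treat this as an illustration rather than the project's inference code.

```python
# Conceptual sketch of binary zero-shot classification: score each
# (text, candidate label) pair with a sequence-pair classifier and pick the best.
# The checkpoint path is a placeholder; the repo's exact input format may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/<trained-binary-bert>"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

text = "I can't log into my online banking account."
candidate_labels = ["card lost", "login issue", "transfer failed"]

with torch.no_grad():
    # Encode the text with each candidate label as a sentence pair.
    enc = tokenizer([text] * len(candidate_labels), candidate_labels,
                    padding=True, truncation=True, return_tensors="pt")
    # Probability of the "match" class (index 1) for each pair.
    scores = model(**enc).logits.softmax(dim=-1)[:, 1]

print(candidate_labels[scores.argmax().item()])  # predicted label
```

The dual-encoder (Bi-Encoder) variant instead embeds texts and labels separately and ranks labels by cosine similarity, which trades a little accuracy for much cheaper inference over large label sets.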
Generative Classification
The Generative Classification approach uses generative language models such as GPT-2, framing classification as generating the label text itself. Here's how you can train and evaluate models using this strategy:
Training
- Implicit training on GPT-2 with DDP (Distributed Data Parallel):

  ```bash
  torchrun --nproc_per_node=4 zeroshot_classifier/models/gpt2.py train --mode implicit
  ```

- Explicit training on GPT-2:

  ```bash
  python zeroshot_classifier/models/gpt2.py train --mode explicit --model_init '2022-11-27_17-39-06_Aspect-Pretrain-NVIDIA-GPT2_{md=exp, na=T}_{a=2e-05}'
  ```
Evaluation
- Evaluate a model with vanilla training on all out-of-domain datasets:

  ```bash
  python zeroshot_classifier/models/gpt2.py test --mode vanilla --model_dir_nm '2022-11-29_19-37-13_NVIDIA-GPT2_{md=van, na=T}_{a=3e-05}'
  ```
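The idea behind generative classification is that the language model is prompted with the input text and its candidate labels, and the generated continuation is matched back to the label set. Below is a conceptual sketch with an off-the-shelf GPT-2; the prompt template is illustrative only, and the repo's trained models use their own input formatting.

```python
# Conceptual sketch of generative classification: prompt the model with the text
# and candidate labels, then match the generated continuation against the labels.
# The prompt template is illustrative, not the repo's exact format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "gpt2"  # substitute a trained checkpoint directory for real use
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

text = "The movie was a complete waste of time."
labels = ["positive", "negative", "neutral"]
prompt = f"Text: {text}\nPossible labels: {', '.join(labels)}\nLabel:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=5,
                            pad_token_id=tokenizer.eos_token_id)
generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
print(generated.strip())  # compare against the candidate labels
```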
By leveraging the Label Agnostic Pre-training for Zero-shot Text Classification approach and the UTCD dataset, you can train and evaluate powerful models capable of zero-shot text classification. Whether you choose the BERT Sequence Classifier, Binary & Dual Encoding, or Generative Classification models, you can unlock the potential of language models to generalize across domains and aspects.
Feel free to explore more details and examples in the repository. If you have any questions or want to share your experiences, please leave a comment below. Happy coding!
References
- UTCD dataset: GitHub Repository
- Trained models: HuggingFace Model Hub
- Clarke, C., et al. (2023). Label Agnostic Pre-training for Zero-shot Text Classification. Findings of the Association for Computational Linguistics: ACL 2023.