In the world of text classification, being able to generalize across various domains and aspects without additional training is a game-changer. This is exactly what the Label Agnostic Pre-training for Zero-shot Text Classification approach aims to achieve. Developed by Christopher Clarke, Yuzhao Heng, Yiping Kang, Krisztian Flautner, Lingjia Tang, and Jason Mars, this novel approach introduces two new training strategies: Implicit training and Explicit pre-training. With these strategies, language models gain aspect-level understanding during training, enabling them to handle unseen data more effectively.
One key component of this approach is the Universal Text Classification Dataset (UTCD). Consisting of 18 classification datasets, UTCD covers three categories: Sentiment, Intent/Dialogue, and Topic classification. The datasets are diverse, spanning domains such as Banking, Finance, Legal, and more, and contain a variety of sequence lengths. These datasets are labeled in natural language, enabling the development of techniques that leverage the descriptive class names for zero-shot support.
To make it easy for developers to get started, the UTCD dataset and trained models are available on HuggingFace. Simply follow the instructions provided there to access and use the dataset and models.
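For quick experimentation, the dataset can be pulled down with the `datasets` library. The snippet below is only a minimal sketch: the repository identifier and configuration name are placeholders I'm assuming for illustration, so check the HuggingFace page for the exact names.

```python
# Minimal sketch of loading a UTCD sub-dataset with the HuggingFace datasets library.
# The identifier and config name below are placeholders -- use the exact names
# listed on the project's HuggingFace page.
from datasets import load_dataset

utcd = load_dataset("claritylab/UTCD", "go_emotion")  # hypothetical identifier/config
print(utcd["train"][0])        # each example pairs a text with its natural-language label
print(utcd["train"].features)  # inspect the descriptive class names used for zero-shot
```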
If you prefer to set up the environment locally, here are the steps:
Setup Environment
- OS: UNIX
- Python version: 3.8.10
- CUDA version: 11.6
- Create a conda environment:

  ```bash
  conda create -n zs-cls python=3.8.10 pip
  ```

- Move to the project root directory and install the required Python packages:

  ```bash
  pip3 install -r requirements.txt
  ```

- Add the current directory to the Python path to ensure our local package is found:

  ```bash
  export PYTHONPATH=$PYTHONPATH:`pwd`
  ```
BERT Sequence Classifier
The BERT Sequence Classifier serves as a strong supervised baseline in this framework. Here are the steps to train and evaluate the model:
Training
- Train solely on in-domain dataset go_emotion:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain in --dataset go_emotion
  ```

- Train solely on out-of-domain dataset consumer_finance:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain out --dataset consumer_finance
  ```

- Train on all in-domain datasets:

  ```bash
  python zeroshot_classifier/models/bert.py train --domain in --dataset all
  ```
Evaluation
- Evaluate a local model on out-of-domain dataset multi_eurlex:

  ```bash
  python zeroshot_classifier/models/bert.py test --domain out --dataset multi_eurlex --model_name_or_path models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained
  ```
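After training, you can run inference on new text. The snippet below is a minimal sketch using the transformers text-classification pipeline, assuming the checkpoint directory shown above contains a standard sequence-classification model and its tokenizer.

```python
# Minimal inference sketch for a trained BERT sequence classifier.
# Assumes the checkpoint directory holds a standard HuggingFace
# sequence-classification model and tokenizer.
from transformers import pipeline

model_dir = "models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained"  # path from the example above
classifier = pipeline("text-classification", model=model_dir, tokenizer=model_dir)

result = classifier("The new regulation applies to all credit institutions in the EU.")
print(result)  # [{'label': ..., 'score': ...}] -- labels come from the dataset's class names
```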
Binary & Dual Encoding Zero-shot Classification
Another approach to zero-shot classification is the Binary & Dual Encoding technique: a binary (cross-encoder) model scores each text–label pair jointly, while a dual-encoder (Bi-Encoder) embeds texts and labels separately and compares them. Here's how you can train and evaluate models using this strategy:
Training
- Vanilla training on Binary BERT:

  ```bash
  python zeroshot_classifier/models/binary_bert.py train --mode vanilla --batch_size 32 --epochs 8 --learning_rate 2e-5 --output_dir '{a=2e-5}'
  ```

- Explicit training on Bi-Encoder:

  ```bash
  python zeroshot_classifier/models/bi-encoder.py train --mode explicit --model_init '2022-11-21_18-58-54_Aspect-Pretrain-Binary-BERT_{md=exp, na=T}_{a=3e-05}/trained'
  ```
Evaluation
- Evaluate an implicitly-trained model on all in-domain datasets:

  ```bash
  python zeroshot_classifier/models/binary_bert.py test --mode implicit-on-text-encode-sep --domain in --model_dir_nm 2022-10-12_01-21-08_Binary-BERT-implicit-on-text-encode-sep-rand-aspect-norm
  ```
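To make the binary-encoding idea concrete, here is a conceptual sketch of zero-shot inference: the model scores each (text, candidate label) pair and the highest-scoring label wins. The checkpoint path is a placeholder, and the exact input formatting used by the repo's Binary BERT (e.g. aspect tokens) may differ, so treat this as an illustration rather than the project's inference code.

```python
# Conceptual sketch of binary zero-shot classification: score each
# (text, candidate label) pair with a sequence-pair classifier and pick the best.
# The checkpoint path is a placeholder; the repo's exact input format may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/<trained-binary-bert>"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

text = "I can't log into my online banking account."
candidate_labels = ["card lost", "login issue", "transfer failed"]

with torch.no_grad():
    # Encode the text with each candidate label as a sentence pair.
    enc = tokenizer([text] * len(candidate_labels), candidate_labels,
                    padding=True, truncation=True, return_tensors="pt")
    # Probability of the "match" class (index 1) for each pair.
    scores = model(**enc).logits.softmax(dim=-1)[:, 1]

print(candidate_labels[scores.argmax().item()])  # predicted label
```

The dual-encoder (Bi-Encoder) variant instead embeds texts and labels separately and ranks labels by cosine similarity, which trades a little accuracy for much cheaper inference over large label sets.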
Generative Classification
The Generative Classification approach uses generative language models such as GPT-2, framing classification as generating the label text itself. Here's how you can train and evaluate models using this strategy:
Training
- Implicit training on GPT-2 with DDP (Distributed Data Parallel):

  ```bash
  torchrun --nproc_per_node=4 zeroshot_classifier/models/gpt2.py train --mode implicit
  ```

- Explicit training on GPT-2:

  ```bash
  python zeroshot_classifier/models/gpt2.py train --mode explicit --model_init '2022-11-27_17-39-06_Aspect-Pretrain-NVIDIA-GPT2_{md=exp, na=T}_{a=2e-05}'
  ```
Evaluation
- Evaluate a model with vanilla training on all out-of-domain datasets:

  ```bash
  python zeroshot_classifier/models/gpt2.py test --mode vanilla --model_dir_nm '2022-11-29_19-37-13_NVIDIA-GPT2_{md=van, na=T}_{a=3e-05}'
  ```
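The idea behind generative classification is that the language model is prompted with the input text and its candidate labels, and the generated continuation is matched back to the label set. Below is a conceptual sketch with an off-the-shelf GPT-2; the prompt template is illustrative only, and the repo's trained models use their own input formatting.

```python
# Conceptual sketch of generative classification: prompt the model with the text
# and candidate labels, then match the generated continuation against the labels.
# The prompt template is illustrative, not the repo's exact format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "gpt2"  # substitute a trained checkpoint directory for real use
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

text = "The movie was a complete waste of time."
labels = ["positive", "negative", "neutral"]
prompt = f"Text: {text}\nPossible labels: {', '.join(labels)}\nLabel:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=5,
                            pad_token_id=tokenizer.eos_token_id)
generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
print(generated.strip())  # compare against the candidate labels
```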
By leveraging the Label Agnostic Pre-training for Zero-shot Text Classification approach and the UTCD dataset, you can train and evaluate powerful models capable of zero-shot text classification. Whether you choose the BERT Sequence Classifier, Binary & Dual Encoding, or Generative Classification models, you can unlock the potential of language models to generalize across domains and aspects.
Feel free to explore more details and examples in the repository. If you have any questions or want to share your experiences, please leave a comment below. Happy coding!
References
- UTCD dataset: GitHub Repository
- Trained models: HuggingFace Model Hub
- Clarke, C., et al. (2023). Label Agnostic Pre-training for Zero-shot Text Classification. Findings of the Association for Computational Linguistics: ACL 2023.