Tibetan Natural Language Processing: A Quick Guide

December 21, 2023

Tibetan Natural Language Processing (NLP) has traditionally been difficult, in large part because written Tibetan does not put spaces between words: syllables are separated by the tsheg mark (་), and word boundaries have to be inferred. PYBO, a Python library designed specifically for Tibetan NLP, addresses this problem directly. This article introduces PYBO and walks through its main capabilities.

PYBO: Empowering Tibetan NLP

PYBO is a Python library that provides several functions for Tibetan NLP. Its primary feature is tokenization, the process of segmenting Tibetan text into individual words. PYBO can tokenize single strings as well as entire directories of text files, which makes it useful as a preprocessing step for text analysis, machine learning, and language modeling.
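To illustrate why word tokenization needs a dedicated tool, here is a minimal sketch of the naive alternative: splitting on Tibetan punctuation. This is not PYBO's algorithm; the function name `split_syllables` is hypothetical, and the result is a list of syllables rather than words.

```python
import re

# Tibetan punctuation used below (Unicode code points).
TSHEG = "\u0f0b"     # syllable delimiter
NB_TSHEG = "\u0f0c"  # non-breaking variant of the tsheg
SHAD = "\u0f0d"      # clause/sentence delimiter

# Split on any run of tsheg, shad, or whitespace.
_SPLIT = re.compile(f"[{TSHEG}{NB_TSHEG}{SHAD}\\s]+")


def split_syllables(text: str) -> list[str]:
    """Naively split Tibetan text into syllables (NOT words)."""
    return [s for s in _SPLIT.split(text) if s]


text = "བོད་སྐད་དུ།"
syllables = split_syllables(text)
print(len(syllables), syllables)
```

The sample phrase splits into three syllables, whereas the tokenizer output shown later in this article keeps བོད་སྐད་ together as a single word. Deciding which syllables belong together is the part that a simple split cannot do.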

Getting Started with PYBO

Getting started with PYBO is straightforward. All you need is Python 3 installed on your system. Once Python 3 is installed, you can use the following command to install PYBO:

python3 -m pip install pybo

Tokenizing Tibetan Text

Tokenizing Tibetan text with PYBO takes a single command. To tokenize a string, use the bo tok-string command. For example, to tokenize the following Tibetan text:

༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། ། སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་ སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །

You can run the following command:

bo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། ། སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་ སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"

The output will be:

༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་ ར །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་ སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་ ལ འང་ གུས་པ ར་ ཕྱག་ འཚལ་ ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།
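If you want to use this output from Python, the sketch below shows one possible way to call the command and split its output into a list of tokens. It rests on two assumptions read off the sample output above rather than from PYBO's documentation: tokens are separated by single spaces, and "_" stands for a space that was part of the original text. The helper names `run_tok_string` and `parse_tok_output` are hypothetical, and the command may print more than the token line, so treat the runner as a starting point only.

```python
import shutil
import subprocess


def parse_tok_output(line: str) -> list[str]:
    """Split one line of tokenizer output into tokens.

    Assumption: tokens are separated by spaces, and "_" marks a space
    that belonged to the original text.
    """
    return [tok.replace("_", " ") for tok in line.split(" ") if tok]


def run_tok_string(text: str) -> str:
    """Run `bo tok-string` on text and return its raw standard output."""
    if shutil.which("bo") is None:
        raise RuntimeError("the `bo` command is not on PATH; install pybo first")
    result = subprocess.run(
        ["bo", "tok-string", text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Parse a short fragment taken from the sample output above.
sample = "བོད་སྐད་ དུ །_"
tokens = parse_tok_output(sample)
print(tokens)
```

Keeping the parsing separate from the subprocess call means the parsing step can be tested without PYBO installed.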

Sorting Tibetan Words

PYBO also provides the functionality to sort Tibetan words. This can be achieved by using the bo kakha command. For example, to sort a list of Tibetan words stored in a file called to-sort.txt, you can run the following command:

bo kakha to-sort.txt
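A short illustration of why a dedicated sort is useful: Python's built-in sorted() compares strings by Unicode code point, which does not match Tibetan dictionary order, where a word is filed under its root letter rather than its first written letter. The two-word list below is an illustrative example chosen for this sketch, and the snippet shows only the built-in behavior, not the output of bo kakha.

```python
# Two words: ག (the letter ga) and བཀའ (root letter ཀ, written with the prefix བ).
words = ["ག", "བཀའ"]

# sorted() compares code point by code point, so བཀའ is placed by its first
# written letter བ (U+0F56), which comes after ག (U+0F42).
by_code_point = sorted(words)
print(by_code_point)

# A Tibetan dictionary files བཀའ under its root letter ཀ, that is, before ག,
# so code-point order and dictionary order disagree for this pair.
print([hex(ord(word[0])) for word in by_code_point])
```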

Find and Replace with Regular Expressions

PYBO’s Find and Replace (FNR) feature allows you to perform batch operations on text files using regular expressions. To utilize this feature, you need to provide PYBO with a directory containing the input files, a file containing the regular expressions, and optionally specify an output directory and a tag. The regular expressions should be in the following format:

<find-pattern><tab>-<tab><replace-pattern>

The command is then invoked as follows:

bo fnr <in-dir> <regex-file> -o <out-dir> -t <tag>
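To make the rule-file format concrete, the following sketch parses lines of the form find-pattern, tab, hyphen, tab, replace-pattern and applies them with Python's re module. It is an illustration of the format only, not PYBO's implementation; the function names `load_rules` and `apply_rules` are hypothetical, and PYBO's regular expression dialect may differ from Python's.

```python
import re


def load_rules(rule_text: str) -> list[tuple[re.Pattern, str]]:
    """Parse '<find>\\t-\\t<replace>' lines into (compiled pattern, replacement) pairs."""
    rules = []
    for line in rule_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        find, sep, replace = line.partition("\t-\t")
        if not sep:
            raise ValueError(f"malformed rule: {line!r}")
        rules.append((re.compile(find), replace))
    return rules


def apply_rules(text: str, rules: list[tuple[re.Pattern, str]]) -> str:
    """Apply every rule in order to text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# One rule: collapse a run of tsheg marks (U+0F0B) into a single tsheg.
rule_text = "\u0f0b+\t-\t\u0f0b\n"
rules = load_rules(rule_text)
print(apply_rules("བོད\u0f0b\u0f0bསྐད", rules))
```

Rules are applied in file order, so a later rule sees the result of the earlier ones.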

Contributing to PYBO

PYBO is an open-source project, and contributions from the community are highly appreciated. To contribute to PYBO, you can start by cloning the repository and setting up a virtual environment. After activating the virtual environment, install the dependencies by running the following commands:

pip install -e .
pip install -r requirements-dev.txt

The project follows the Angular commit message format. It also uses the pre-commit tool for code formatting and linting. Set up the pre-commit Git hook by running:

pre-commit install
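For the commit message format mentioned above, the sketch below checks whether the first line of a message has the general Angular shape, type(scope): subject, with the scope being optional. The list of types is the commonly used Angular set and is an assumption here; the project may accept a different set, and the scope and subject in the usage lines are made-up examples.

```python
import re

# Commonly used Angular commit types (assumption: the project may differ).
TYPES = ("build", "chore", "ci", "docs", "feat", "fix",
         "perf", "refactor", "style", "test")

# type, optional (scope), colon, space, non-empty subject.
HEADER = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def is_angular_header(first_line: str) -> bool:
    """Return True if first_line looks like an Angular-style commit header."""
    return HEADER.match(first_line) is not None


print(is_angular_header("fix(tokenizer): handle empty input"))
print(is_angular_header("Fixed stuff"))
```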

Conclusion

PYBO is a powerful Python library that opens up a world of possibilities for Tibetan Natural Language Processing. By providing functionalities such as tokenization, sorting, and FNR operations using regular expressions, PYBO significantly simplifies the complexities of working with Tibetan text. Through the contributions and support of organizations like Khyentse Foundation, Barom/Esukhia canon project, and BDRC, PYBO continues to evolve and empower the Tibetan language processing ecosystem. Join the PYBO community, explore its capabilities, and contribute to its future growth.

References

Contributors: Drupchen, Élie Roux, Ngawang Trinley, Tenzin, Joyce Mackzenzie

License: Apache 2, Source: GitHub Repository
