
Exploring Word Frequencies in Multiple Languages with wordfreq


If you’ve ever wondered how frequently a word is used in different languages, wordfreq can provide you with the answers. Developed by Robyn Speer, wordfreq is a Python library that allows you to look up the frequencies of words in over 40 languages, based on various sources of data. In this article, we’ll explore how you can use wordfreq to estimate the frequency with which a word is used in different languages.

Installation and Usage

To get started with wordfreq, make sure you have Python 3 installed and then install the library and its dependencies using pip:

```shell
pip3 install wordfreq
```

Alternatively, you can clone the repository and install the library for development using poetry:

```shell
poetry install
```

Once you have wordfreq installed, you can begin using its functions to look up word frequencies. The most straightforward function to use is word_frequency. Here’s an example:

```python
from wordfreq import word_frequency

frequency = word_frequency('hello', 'en')
print(frequency)  # Output: 0.004607361951540951
```

You simply pass in the word and the language code, and the function returns the frequency of the word in the given language as a decimal between 0 and 1.

If you’re interested in a logarithmic scale of word frequency, you can use the zipf_frequency function:

```python
from wordfreq import zipf_frequency

zipf = zipf_frequency('hello', 'en')
print(zipf)  # Output: 6.48
```

This function returns the word frequency on the Zipf scale, which compresses the huge range of word frequencies into small, human-readable values, roughly 0 to 8, where common words score higher.
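A Zipf value is the base-10 logarithm of a word's frequency per billion words. The conversion from a raw frequency can be sketched as follows (the helper name is ours, not part of wordfreq's API):

```python
import math

def zipf_from_frequency(freq):
    """Convert a raw word frequency (a proportion between 0 and 1)
    to the Zipf scale: log10 of the frequency per billion words,
    rounded to two decimal places as wordfreq does."""
    return round(math.log10(freq * 1e9), 2)

# A word occurring once per million words has a Zipf value of 3.0
print(zipf_from_frequency(1e-6))  # 3.0
```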

Supported Languages and Data Sources

wordfreq supports over 40 languages, including Arabic, Chinese, English, French, German, Japanese, Korean, Spanish, and many more. It uses multiple data sources, such as Wikipedia, subtitles, news articles, books, web text, Twitter, and Reddit, to provide accurate word frequencies.

For each language, wordfreq provides both ‘small’ and ‘large’ wordlists based on the frequency of words. The ‘small’ lists cover words that appear at least once per million words, while the ‘large’ lists cover words that appear at least once per 100 million words. The default list is ‘best’, which uses ‘large’ if available for the language and ‘small’ otherwise.

Tokenization and Handling Numbers

wordfreq tokenizes text with the regex library, following Unicode word-segmentation rules so that tokens are split and counted consistently across languages. It also handles numbers specially: digit sequences of the same "shape" (the same number of digits) share aggregated frequency bins, which lets wordfreq estimate a frequency for any multi-digit token instead of storing every number separately.

Licensing and Data Attribution

wordfreq is freely redistributable under the MIT license and includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license. The library contains data extracted from Google Books Ngrams, Google Books Syntactic Ngrams, the Leeds Internet Corpus, Wikipedia, ParaCrawl, OpenSubtitles, and various SUBTLEX word lists.

For a complete list of citations and licensing information, please refer to the README file in the wordfreq repository.

Conclusion

wordfreq is a powerful Python library that enables you to explore word frequencies in multiple languages. By providing access to estimates of word frequencies based on diverse sources of data, wordfreq allows you to gain insights into the usage of words across different languages. Whether you’re conducting linguistic research or building NLP applications, wordfreq can be a valuable tool in your toolkit.

Have you ever used wordfreq in your projects? What interesting insights have you discovered? Share your thoughts and experiences in the comments below!

References:
wordfreq Repository

Author:
Blake Bradford
