Simplifying Chinese Word Matching

Blake Bradford

FuzzyChinese: Simplifying Chinese Word Matching

Accurately matching Chinese words, particularly proper names and addresses, can be difficult. FuzzyChinese is a tool built for fuzzy matching of Chinese words. This article walks through its key features and shows how to use them.

FuzzyChinese can be installed with pip:

pip install fuzzychinese

Once installed, you train a model on the dictionary of words you want to match against. Calling the FuzzyChineseMatch.transform(raw_words, n) method then returns the top n most similar dictionary words for each input word. FuzzyChinese offers three analysis options for training, selected with the analyzer parameter: stroke, radical, and character. The ngram_range parameter can be adjusted to tune the model's performance.

Here is an example:

```python
from fuzzychinese import FuzzyChineseMatch
import pandas as pd

# Target dictionary to match against, and the raw input words to look up.
test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
assert '汩罗市' != '汨罗市'  # They are not the same!

# Train on the dictionary using stroke-level 3-grams.
fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
fcm.fit(test_dict)

# Find the two most similar dictionary words for each input word.
top2_similar = fcm.transform(raw_word, n=2)
res = pd.concat([
    raw_word,
    pd.DataFrame(top2_similar, columns=['top1', 'top2']),
    pd.DataFrame(fcm.get_similarity_score(), columns=['top1_score', 'top2_score']),
    pd.DataFrame(fcm.get_index(), columns=['top1_index', 'top2_index'])
], axis=1)

print(res)
```

This example finds, for each input word, the two most similar words in the dictionary. The resulting table lists the matched words, their similarity scores, and the indices of the matches.
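To build intuition for what ngram_range and the top n ranking do, here is a minimal standard-library sketch of n-gram matching with cosine similarity. It is not FuzzyChinese's implementation: the helper names (ngrams, cosine, top_n_matches) are hypothetical, and it works on whole characters only, whereas FuzzyChinese can also work at the stroke or radical level.

```python
from collections import Counter
from math import sqrt


def ngrams(seq, ngram_range=(1, 2)):
    """Count every n-gram of seq for n between min_n and max_n (inclusive)."""
    min_n, max_n = ngram_range
    grams = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(seq) - n + 1):
            grams[tuple(seq[i:i + n])] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(word, dictionary, n=2, ngram_range=(1, 2)):
    """Return the n dictionary entries most similar to word, best first."""
    query = ngrams(word, ngram_range)
    scored = [(cosine(query, ngrams(cand, ngram_range)), cand) for cand in dictionary]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep dictionary order
    return [(cand, score) for score, cand in scored[:n]]


test_dict = ['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市']
matches = top_n_matches('长阳县', test_dict, n=2)
for cand, score in matches:
    print(cand, round(score, 3))
```

At the character level, 汩 and 汨 are simply two different symbols, so '汩罗市' and '汨罗市' only overlap on the remaining characters. Decomposing characters into strokes or radicals before building n-grams is what lets visually similar characters contribute to the score.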

FuzzyChinese offers other useful functionality as well. You can use the Stroke and Radical classes directly to decompose Chinese characters into strokes or radicals. Here is an example:

```python
from fuzzychinese import Stroke, Radical

stroke = Stroke()
radical = Radical()
print("像", stroke.get_stroke("像"))
print("像", radical.get_radical("像"))
```

FuzzyChinese also provides FuzzyChineseMatch.compare_two_columns(X, Y), which compares the pair of words in each row of X and Y and returns a similarity score for each pair.
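The row-wise shape of such a comparison can be illustrated with a small standard-library sketch. It swaps in a different, simpler measure (the Dice coefficient over character bigrams) and does not call FuzzyChinese. The names char_bigrams, dice_similarity, and compare_rows are hypothetical, and the scores it produces are not the ones the library would return.

```python
def char_bigrams(word):
    """Set of adjacent character pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(x, y):
    """Dice coefficient over character bigrams, in the range [0, 1]."""
    a, b = char_bigrams(x), char_bigrams(y)
    if not a and not b:
        # Words too short to form bigrams: fall back to exact equality.
        return 1.0 if x == y else 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def compare_rows(xs, ys):
    """Score the pair of words on each row; both columns must align."""
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same number of rows")
    return [dice_similarity(x, y) for x, y in zip(xs, ys)]


left = ['汨罗市', '长阳县']
right = ['汨罗市', '长阳土家族自治县']
scores = compare_rows(left, right)
print(scores)
```

Unlike the top n lookup, nothing is ranked here: each row yields exactly one score, which suits checking two already aligned columns, such as an address field from two data sources.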

To explore more about FuzzyChinese and its advanced features, refer to the official documentation.

We would like to express our gratitude to the contributors of the Chinese radical data from 開放詞典網.

In conclusion, FuzzyChinese is a useful tool for fuzzy matching of Chinese words, with applications that include proper name matching and address matching. The examples in this article cover training a model, retrieving the top matches with their scores and indices, decomposing characters into strokes or radicals, and comparing two columns of words row by row.

References:
FuzzyChinese GitHub Repository
Chinese Radical Data Repository
開放詞典網
