
Simplifying Korean Language Preprocessing

January 27, 2024

Korean language preprocessing can be a stumbling block in software projects. hangul-utils is an integrated library for Korean language processing that handles text normalization, tokenization, and character manipulation. In this article, we explore what hangul-utils offers and how it can simplify your Korean language processing workflow.

Text Normalization

Text normalization is crucial for reducing noise in texts collected online or transcribed from spoken language. The hangul-utils library relies on the popular open-korean-text library for this step. It focuses on fixing typing errors rather than linguistic errors. The normalization process removes repeating jamos, shortens jamos and characters that repeat excessively, normalizes frequently shortened words, normalizes tense jamos, and normalizes spacing. Note that hangul-utils does not handle errors or variations beyond these cases.

To use text normalization in hangul-utils, you can either call the normalize method of the Preprocessor class or use the standalone normalize function. Here’s an example:

#python
from hangul_utils import Preprocessor, normalize

p = Preprocessor()
p.normalize("부들부들부들부들 내가 작간데 화가낰ㅋㅋㅋㅋ")
# Output: "부들부들 내가 작가인데 화가나ㅋㅋㅋ"

normalize("부들부들부들부들 내가 작간데 화가낰ㅋㅋㅋㅋ")
# Output: "부들부들 내가 작가인데 화가나ㅋㅋㅋ"
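To give a feel for one of these steps, here is a minimal sketch of shortening excessive repetition with regular expressions. It is not how open-korean-text or hangul-utils implements normalization: the name collapse_repeats is hypothetical, and the limits (three for a single character, two for a two-character unit) are assumptions inferred from the example output above. It does not attempt the typo fixes the library performs, such as 작간데 or 화가낰.

```python
import re

def collapse_repeats(text):
    """Hypothetical helper: shorten excessive repetition (not the library's code)."""
    # A single character repeated 4+ times in a row is cut down to 3 (assumed limit).
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # A two-character unit repeated 3+ times in a row is cut down to 2 (assumed limit).
    text = re.sub(r"(..)\1{2,}", r"\1\1", text)
    return text

print(collapse_repeats("부들부들부들부들 화가나ㅋㅋㅋㅋ"))
# Output: 부들부들 화가나ㅋㅋㅋ
```

The single-character rule runs first so that a long run such as ㅋㅋㅋㅋㅋㅋ is shortened before the two-character rule can see it.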

Tokenization

hangul-utils provides sentence, word, and morpheme tokenization using the mecab-ko library as the backend. Compared to other taggers, such as Twitter’s Korean text library, mecab-ko is described as more robust and faster, with more accurate sentence tokenization and part-of-speech analysis.

To perform tokenization using hangul-utils, you can use the following functions:

sent_tokenize: Sentence tokenization
word_tokenize: Word tokenization
morph_tokenize: Morpheme tokenization
sent_word_tokenize: Simultaneous sentence and word tokenization
sent_morph_tokenize: Simultaneous sentence and morpheme tokenization

Here are a few examples:

#python
from hangul_utils import sent_tokenize, word_tokenize, morph_tokenize, sent_word_tokenize, sent_morph_tokenize

sent_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: ['그러나 베네수엘라는 독일 보다 한 단계 위였다.', '현 시점에서 눈에 띄는 선수가 몇몇 있다.']

word_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다.")
# Output: ['그러나', '베네수엘라는', '독일', '보다', '한', '단계', '위였다', '.']

morph_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다.")
# Output: ['그러나', '베네수엘라', '는', '독일', '보다', '한', '단계', '위', '였', '다', '.']

sent_word_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: [['그러나', '베네수엘라는', '독일', '보다', '한', '단계', '위였다', '.'], ['현', '시점에서', '눈에', '띄는', '선수가', '몇몇', '있다', '.']]

sent_morph_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: [['그러나', '베네수엘라', '는', '독일', '보다', '한', '단계', '위', '였', '다', '.'], ['현', '시점', '에서', '눈', '에', '띄', '는', '선수', '가', '몇몇', '있', '다', '.']]
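The nested outputs above suggest that the combined functions split the text into sentences and then tokenize each sentence. The sketch below illustrates that composition only; it is an assumption about the behavior, not the library's implementation. The regex-based naive_sent_tokenize and naive_word_tokenize are hypothetical stand-ins so the example runs without mecab-ko, and they are far less robust than the real backend.

```python
import re

def naive_sent_tokenize(text):
    # Stand-in splitter: break after '.', '!' or '?' when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def naive_word_tokenize(sentence):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def sent_then_tokenize(text, sent_fn, token_fn):
    # Split into sentences first, then tokenize each sentence separately.
    return [token_fn(sentence) for sentence in sent_fn(text)]

text = "그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다."
nested = sent_then_tokenize(text, naive_sent_tokenize, naive_word_tokenize)
print(nested)  # one list of tokens per sentence
```

Passing the tokenizers in as arguments keeps the composition independent of the backend, so the same helper could wrap sent_tokenize with either word_tokenize or morph_tokenize.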

Manipulating Korean Characters

hangul-utils also provides functions for manipulating Korean characters. The Korean script, Hangul, is composed of basic letters called “jamo,” which combine into syllable blocks. The split_syllable_char function splits a single syllable into a tuple of its jamos, the split_syllables function converts a string of syllables into a string of jamos, and the join_jamos function does the opposite.

Here’s an example of using these functions:

#python
from hangul_utils import split_syllable_char, split_syllables, join_jamos

split_syllable_char(u"안")
# Output: ('ㅇ', 'ㅏ', 'ㄴ')

split_syllables(u"안녕하세요")
# Output: ㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛ

sentence = u"앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라."
s = split_syllables(sentence)
print(s)
# Output: ㅇㅏㅍ ㅈㅣㅂ ㅍㅏㅌㅈㅜㄱㅇㅡㄴ ㅂㅜㄺㅇㅡㄴ ㅍㅏㅌ ㅍㅜㅅㅍㅏㅌㅈㅜㄱㅇㅣㄱㅗ, ㄷㅟㅅㅈㅣㅂ ㅋㅗㅇㅈㅜㄱㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ.ㅇㅜㄹㅣ ㅈㅣㅂ ㄲㅐㅈㅜㄱㅇㅡㄴ ㄱㅓㅁㅇㅡㄴ ㄲㅐ ㄲㅐㅈㅜㄱㅇㅣㄴㄷㅔ ㅅㅏㄹㅏㅁㄷㅡㄹㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ ㄲㅐㅈㅜㄱ ㅈㅜㄱㅁㅓㄱㄱㅣㄹㅡㄹ ㅅㅣㅀㅇㅓㅎㅏㄷㅓㄹㅏ.

sentence2 = join_jamos(s)
print(sentence2)
# Output: 앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.

print(sentence == sentence2)
# Output: True
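For readers curious about what splitting a syllable involves, the following is a minimal sketch based on the Unicode layout of precomposed Hangul syllables, which start at U+AC00 and are ordered by lead consonant, vowel, and tail consonant. It is not the hangul-utils implementation; the names decompose and compose are hypothetical, and it handles one syllable at a time rather than whole strings.

```python
# Jamo tables in the order Unicode uses for precomposed syllables.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 lead consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 tails, "" = none
BASE = 0xAC00  # code point of the first syllable, 가

def decompose(syllable):
    """Hypothetical helper: split one Hangul syllable into (lead, vowel, tail)."""
    index = ord(syllable) - BASE
    if not 0 <= index < len(LEADS) * len(VOWELS) * len(TAILS):
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]

def compose(lead, vowel, tail=""):
    """Hypothetical helper: build one syllable from its jamos."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(BASE + index + TAILS.index(tail))

print(decompose("안"))           # ('ㅇ', 'ㅏ', 'ㄴ')
print(compose("ㅇ", "ㅏ", "ㄴ"))  # 안
```

Joining a flat string of jamos back into syllables is harder than this per-syllable arithmetic, because a consonant between two vowels may belong to either syllable; that is the part a function like join_jamos has to resolve.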

With hangul-utils, you have a comprehensive set of tools for Korean language preprocessing. Whether you need to normalize texts, tokenize sentences and words, or manipulate Korean characters, this library has you covered. Give it a try in your next Korean language project and experience the simplicity and power of hangul-utils!

Remember, if you have any questions or need assistance, don’t hesitate to ask. Happy coding!

References

hangul-utils GitHub repository: kaniblu/hangul-utils
Open Korean Text: twitter/twitter-korean-text
Mecab-ko: eunjeon/mecab-ko

Note: This article is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
