Are you struggling with Korean language preprocessing in your software projects? Look no further! With hangul-utils, an integrated library for Korean language processing, you can easily perform text normalization, tokenization, and character manipulation tasks. In this article, we will explore the capabilities of hangul-utils and how it can simplify your Korean language processing workflow.
Text Normalization
Text normalization is crucial for reducing noise in texts collected online or transcribed from spoken language. The hangul-utils library uses the popular open-korean-text library for text normalization. It focuses on fixing typing errors and deals less with linguistic errors. The normalization process includes procedures such as removing repeating jamos, shortening jamos and characters that repeat excessively, normalizing frequently shortened words, normalizing tense jamos, and space normalization. However, note that hangul-utils does not handle errors or variations beyond these cases.
To use text normalization in hangul-utils, you can either call the normalize method on a Preprocessor instance or use the standalone normalize function. Here’s an example:
#python
from hangul_utils import Preprocessor, normalize
p = Preprocessor()
p.normalize("부들부들부들부들 내가 작간데 화가낰ㅋㅋㅋㅋ")
# Output: "부들부들 내가 작가인데 화가나ㅋㅋㅋ"
normalize("부들부들부들부들 내가 작간데 화가낰ㅋㅋㅋㅋ")
# Output: "부들부들 내가 작가인데 화가나ㅋㅋㅋ"
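To make one of the steps listed above more concrete, here is a minimal sketch of shortening characters that repeat excessively. It is not how open-korean-text or hangul-utils implements normalization; the function name collapse_repeats and the max_run parameter are made up for illustration, and the sketch only handles runs of a single character (it would not shorten "부들부들부들부들").

```python
import re


def collapse_repeats(text, max_run=3):
    """Shorten any run of one repeated character to at most max_run copies.

    Hypothetical helper for illustration only; not part of hangul-utils.
    """
    # (.) captures one character; \1{max_run,} requires at least max_run
    # further copies, so the whole match is a run longer than max_run.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


print(collapse_repeats("화가나ㅋㅋㅋㅋㅋㅋ"))
```

Runs that are already max_run characters or shorter are left untouched, so applying the function twice gives the same result as applying it once.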
Tokenization
hangul-utils provides sentence and word tokenization methods using the mecab-ko library as the backend. Compared to other taggers, such as Twitter’s Korean text library, mecab-ko offers more robustness and faster performance. It correctly tokenizes sentences and handles part-of-speech analysis with better accuracy.
To perform tokenization using hangul-utils, you can use the following methods:
- sent_tokenize: Sentence tokenization
- word_tokenize: Word tokenization
- morph_tokenize: Morpheme tokenization
- sent_word_tokenize: Simultaneous sentence and word tokenization
- sent_morph_tokenize: Simultaneous sentence and morpheme tokenization
Here are a few examples:
#python
from hangul_utils import sent_tokenize, word_tokenize, morph_tokenize, sent_word_tokenize, sent_morph_tokenize
sent_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: ['그러나 베네수엘라는 독일 보다 한 단계 위였다.', '현 시점에서 눈에 띄는 선수가 몇몇 있다.']
word_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다.")
# Output: ['그러나', '베네수엘라는', '독일', '보다', '한', '단계', '위였다', '.']
morph_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다.")
# Output: ['그러나', '베네수엘라', '는', '독일', '보다', '한', '단계', '위', '였', '다', '.']
sent_word_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: [['그러나', '베네수엘라는', '독일', '보다', '한', '단계', '위였다', '.'], ['현', '시점에서', '눈에', '띄는', '선수가', '몇몇', '있다', '.']]
sent_morph_tokenize("그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다.")
# Output: [['그러나', '베네수엘라', '는', '독일', '보다', '한', '단계', '위', '였', '다', '.'], ['현', '시점', '에서', '눈', '에', '띄', '는', '선수', '가', '몇몇', '있', '다', '.']]
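The nested output of the combined functions can be read as "tokenize into sentences, then tokenize each sentence". The sketch below shows that shape using naive regular-expression splitting. It does not use mecab-ko and is not how hangul-utils works internally; the naive_* function names are hypothetical, and a regex approach like this cannot do morpheme analysis.

```python
import re


def naive_sent_tokenize(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Return runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def naive_sent_word_tokenize(text):
    """One list of word tokens per sentence, mirroring the nested output shape."""
    return [naive_word_tokenize(s) for s in naive_sent_tokenize(text)]


text = "그러나 베네수엘라는 독일 보다 한 단계 위였다. 현 시점에서 눈에 띄는 선수가 몇몇 있다."
for tokens in naive_sent_word_tokenize(text):
    print(tokens)
```

This kind of baseline breaks on abbreviations, decimal numbers, and sentences without spaces after the period, which is one reason to rely on a real tagger for anything beyond a quick check.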
Manipulating Korean Characters
hangul-utils also provides functions for manipulating Korean characters. The Korean script, Hangul, is composed of basic letters called “jamo.” The split_syllable_char function splits a single syllable into a tuple of its jamos, the split_syllables function converts a string of syllables into a string of jamos, and the join_jamos function does the opposite.
Here’s an example of using these functions:
#python
from hangul_utils import split_syllable_char, split_syllables, join_jamos
split_syllable_char(u"안")
# Output: ('ㅇ', 'ㅏ', 'ㄴ')
split_syllables(u"안녕하세요")
# Output: "ㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛ"
sentence = u"앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라."
s = split_syllables(sentence)
print(s)
# Output: ㅇㅏㅍ ㅈㅣㅂ ㅍㅏㅌㅈㅜㄱㅇㅡㄴ ㅂㅜㄺㅇㅡㄴ ㅍㅏㅌ ㅍㅜㅅㅍㅏㅌㅈㅜㄱㅇㅣㄱㅗ, ㄷㅟㅅㅈㅣㅂ ㅋㅗㅇㅈㅜㄱㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ.ㅇㅜㄹㅣ ㅈㅣㅂ ㄲㅐㅈㅜㄱㅇㅡㄴ ㄱㅓㅁㅇㅡㄴ ㄲㅐ ㄲㅐㅈㅜㄱㅇㅣㄴㄷㅔ ㅅㅏㄹㅏㅁㄷㅡㄹㅇㅡㄴ ㅎㅐㅅㅋㅗㅇ ㄷㅏㄴㅋㅗㅇ ㅋㅗㅇㅈㅜㄱ ㄲㅐㅈㅜㄱ ㅈㅜㄱㅁㅓㄱㄱㅣㄹㅡㄹ ㅅㅣㅀㅇㅓㅎㅏㄷㅓㄹㅏ.
sentence2 = join_jamos(s)
print(sentence2)
# Output: 앞 집 팥죽은 붉은 팥 풋팥죽이고, 뒷집 콩죽은 햇콩 단콩 콩죽.우리 집 깨죽은 검은 깨 깨죽인데 사람들은 햇콩 단콩 콩죽 깨죽 죽먹기를 싫어하더라.
print(sentence == sentence2)
# Output: True
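For readers curious how a split like this can work, here is a minimal sketch based on the Unicode layout of precomposed Hangul syllables, which start at U+AC00 and are ordered by initial consonant (19), vowel (21), and optional final consonant (27, plus "none"). This is an illustration of the arithmetic, not the hangul-utils source code; the decompose and compose names are hypothetical.

```python
# Compatibility jamo in the order Unicode uses for precomposed syllables.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                     # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"                # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # "" means no final

SYLLABLE_BASE = 0xAC00  # code point of "가", the first precomposed syllable


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - SYLLABLE_BASE
    if not 0 <= index < len(CHOSEONG) * len(JUNGSEONG) * len(JONGSEONG):
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    lead, rest = divmod(index, len(JUNGSEONG) * len(JONGSEONG))
    vowel, tail = divmod(rest, len(JONGSEONG))
    return CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]


def compose(lead, vowel, tail=""):
    """Inverse of decompose: build a syllable from its jamos."""
    index = (CHOSEONG.index(lead) * len(JUNGSEONG)
             + JUNGSEONG.index(vowel)) * len(JONGSEONG) + JONGSEONG.index(tail)
    return chr(SYLLABLE_BASE + index)


print(decompose("안"))            # ('ㅇ', 'ㅏ', 'ㄴ')
print(compose("ㅇ", "ㅏ", "ㄴ"))   # 안
```

Splitting a single syllable is pure arithmetic. Joining a flat string of jamos back into syllables is harder, because the code has to decide whether a consonant closes the current syllable or opens the next one.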
With hangul-utils, you have a comprehensive set of tools for Korean language preprocessing. Whether you need to normalize texts, tokenize sentences and words, or manipulate Korean characters, this library has you covered. Give it a try in your next Korean language project and experience the simplicity and power of hangul-utils!
Remember, if you have any questions or need assistance, don’t hesitate to ask. Happy coding!
References
- hangul-utils GitHub repository: kaniblu/hangul-utils
- Open Korean Text: twitter/twitter-korean-text
- Mecab-ko: eunjeon/mecab-ko
Note: This article is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.