Loading custom jieba dictionaries from a list, DataFrame, or Series

2018-04-29

Text mining

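pandas makes tabular data very convenient to work with. I wanted to bulk-add the values of a pandas Series to jieba's custom dictionary, but jieba provides no method for that, so I went looking for a good way to insert them.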

How jieba loads custom dictionaries

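jieba is an excellent Chinese word segmentation tool. It provides two methods for loading custom dictionary entries:

The add_word() method
The load_userdict() method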

add_word()

This method adds a word to the dictionary dynamically at run time, one word per call. Its source:

def add_word(self, word, freq=None, tag=None):
    """
    Add a word to dictionary.
    freq and tag can be omitted, freq defaults to be a calculated value
    that ensures the word can be cut out.
    """
    self.check_initialized()
    word = strdecode(word)
    freq = int(freq) if freq is not None else self.suggest_freq(word, False)
    self.FREQ[word] = freq
    self.total += freq
    if tag:
        self.user_word_tag_tab[word] = tag
    for ch in xrange(len(word)):
        wfrag = word[:ch + 1]
        if wfrag not in self.FREQ:
            self.FREQ[wfrag] = 0
    if freq == 0:
        finalseg.add_force_split(word)

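As you can see, jieba stores the word in the self.FREQ dictionary, and that is how a custom word gets added. The for ch in xrange(len(word)) loop also inserts every prefix of the word: add_word("清华大学") touches "清", "清华", "清华大", and "清华大学". Note that a prefix is only inserted when it is not already in self.FREQ, and it is stored with a frequency of 0, so the prefixes are placeholders rather than new words in their own right.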
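To make the prefix behavior concrete, here is a minimal sketch that mimics only the bookkeeping on the frequency dictionary. The function name register_word and the plain dict standing in for self.FREQ are made up for illustration; this is not jieba's code and it skips the tag table, the total count, and the forced-split handling.

```python
def register_word(freq_table, word, freq):
    """Mimic how add_word updates its frequency dict (illustration only)."""
    freq_table[word] = freq
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        # Existing entries are left alone; unseen prefixes get frequency 0.
        freq_table.setdefault(prefix, 0)
    return freq_table


table = register_word({}, "清华大学", 5)
print(table)
# {'清华大学': 5, '清': 0, '清华': 0, '清华大': 0}
```

Only the full word carries a non-zero frequency; the three shorter prefixes are recorded with 0.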

load_userdict()

This method reads words from a file and adds each one to the custom dictionary through add_word. Its source:

def load_userdict(self, f):
    '''
    Load personalized dict to improve detect rate.
    Parameter:
        - f : A plain text file contains words and their ocurrences.
              Can be a file-like object, or the path of the dictionary file,
              whose encoding must be utf-8.
    Structure of dict file:
    word1 freq1 word_type1
    word2 freq2 word_type2
    ...
    Word type may be ignored
    '''
    self.check_initialized()
    if isinstance(f, string_types):
        f_name = f
        f = open(f, 'rb')
    else:
        f_name = resolve_filename(f)
    for lineno, ln in enumerate(f, 1):
        line = ln.strip()
        if not isinstance(line, text_type):
            try:
                line = line.decode('utf-8').lstrip('\ufeff')
            except UnicodeDecodeError:
                raise ValueError('dictionary file %s must be utf-8' % f_name)
        if not line:
            continue
        # match won't be None because there's at least one character
        word, freq, tag = re_userdict.match(line).groups()
        if freq is not None:
            freq = freq.strip()
        if tag is not None:
            tag = tag.strip()
        self.add_word(word, freq, tag)

As you can see, load_userdict calls add_word for every line of the file. It is really nothing more than a loop that adds the file's words to the custom dictionary.
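The per-line step can be sketched on its own. The listing does not show how re_userdict is defined, so the pattern below is an assumption modeled on the "word freq word_type" layout described in the docstring, and parse_userdict_line is a hypothetical helper name, not part of jieba.

```python
import re

# Assumed pattern: a word, then an optional numeric frequency, then an
# optional lowercase tag, separated by single spaces.
LINE_PATTERN = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$", re.U)


def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); None for blank lines."""
    line = line.strip()
    if not line:
        return None
    word, freq, tag = LINE_PATTERN.match(line).groups()
    # freq and tag keep their leading space in the match, so strip them.
    freq = freq.strip() if freq is not None else None
    tag = tag.strip() if tag is not None else None
    return word, freq, tag


print(parse_userdict_line("清华大学 5 nt"))  # ('清华大学', '5', 'nt')
print(parse_userdict_line("云计算"))         # ('云计算', None, None)
```

The frequency stays a string here, as it does in the listing, where add_word is the one that converts it with int().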

Loading data from a list, DataFrame, or Series into the custom dictionary

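Judging by how load_userdict loads its words, the only way to load data from a list, DataFrame, or Series into the custom dictionary is to do exactly what load_userdict does: call add_word in a loop. Hilariously simple.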
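A minimal sketch of such a loop follows. The helper name add_words is made up, and so is its convention that a DataFrame holds word, freq, and tag in its first three columns. It assumes missing cells are None or float NaN. The add_word parameter exists only so the helper can be tried with a stand-in function; left as None, it falls back to jieba.add_word.

```python
def _is_missing(value):
    """True for None and for float NaN (NaN is the only value unequal to itself)."""
    return value is None or value != value


def add_words(data, add_word=None):
    """Call add_word once per entry of `data` and return how many were added.

    `data` may be a list or pandas Series of words, or a pandas DataFrame
    whose first three columns are word, freq and tag (freq and tag optional).
    """
    if add_word is None:
        import jieba  # imported lazily so a stand-in can be passed instead
        add_word = jieba.add_word

    if hasattr(data, "itertuples"):
        # DataFrame: one plain tuple per row, index left out.
        rows = data.itertuples(index=False, name=None)
    else:
        # list or Series: iterating yields the bare words.
        rows = ((word,) for word in data)

    count = 0
    for row in rows:
        # Pad short rows so word-only and word+freq layouts also work.
        word, freq, tag = (tuple(row) + (None, None))[:3]
        if _is_missing(word):
            continue
        add_word(
            str(word),
            freq=None if _is_missing(freq) else int(freq),
            tag=None if _is_missing(tag) else str(tag),
        )
        count += 1
    return count
```

With pandas, a call could look like add_words(df["word"]) for a Series or add_words(df[["word", "freq", "tag"]]) for a DataFrame, where the column names are placeholders for whatever the table actually uses.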