A jieba tokenizer singleton, and customizing tmp_dir when permissions are insufficient on Linux

On Linux, when you do not have root privileges, you may sometimes run into the following problem:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Dump cache file failed.
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/py27/lib/python2.7/site-packages/jieba/__init__.py", line 153, in initialize
    _replace_file(fpath, cache_file)
OSError: [Errno 1] Operation not permitted

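This happens because jieba stores its cache file under /tmp by default, and a non-root user may not have permission to replace that file (typically when /tmp/jieba.cache was created by another user). The fix is to change the default cache directory so that the cache file lives under the user's own directory. The jieba documentation mentions that tmp_dir and cache_file can be changed, so we looked at the source code.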
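To confirm that permissions are really the cause, a small check along the following lines can help. This is a minimal sketch, not part of jieba: can_replace is a hypothetical helper, and it assumes the usual Unix rule that in a sticky directory (as /tmp normally is) only root, the file's owner, or the directory's owner may replace an existing file.

```python
import os
import stat


def can_replace(path):
    """Best-effort guess at whether the current user may create or replace `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Creating or renaming a file needs write and search permission on the directory.
    if not os.access(directory, os.W_OK | os.X_OK):
        return False
    # A file that does not exist yet can be created; skip the owner check
    # on platforms without POSIX user ids.
    if not os.path.exists(path) or not hasattr(os, "geteuid"):
        return True
    # Sticky directory: only root, the file's owner, or the directory's owner
    # may replace an existing file.
    dir_stat = os.stat(directory)
    if dir_stat.st_mode & stat.S_ISVTX:
        euid = os.geteuid()
        return euid in (0, os.stat(path).st_uid, dir_stat.st_uid)
    return True


if __name__ == "__main__":
    # The path is the one from the log above.
    print(can_replace("/tmp/jieba.cache"))
```

If this prints False for /tmp/jieba.cache, moving the cache to a directory you own is the simplest way out.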

Lines 52-66 of /home/work/anaconda3/envs/py27/lib/python2.7/site-packages/jieba/__init__.py read as follows:
class Tokenizer(object):

    def __init__(self, dictionary=DEFAULT_DICT):
        self.lock = threading.RLock()
        if dictionary == DEFAULT_DICT:
            self.dictionary = dictionary
        else:
            self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.user_word_tag_tab = {}
        self.initialized = False
        self.tmp_dir = None
        # self.tmp_dir = '/'
        self.cache_file = None

One option is to edit the source: a custom cache path can be set at self.tmp_dir on line 64.
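Why a value of None ends up in /tmp can be illustrated with a short sketch. It assumes, based on the log above and my reading of the jieba source, that the tokenizer falls back to the system temporary directory when tmp_dir is unset and to the name jieba.cache when cache_file is unset and the default dictionary is used. resolve_cache_path is a hypothetical function written for illustration, not a jieba API.

```python
import os
import tempfile


def resolve_cache_path(tmp_dir=None, cache_file=None):
    """Assumed lookup order: explicit values win, otherwise fall back to
    the system temp dir and the default file name 'jieba.cache'."""
    return os.path.join(tmp_dir or tempfile.gettempdir(),
                        cache_file or "jieba.cache")


if __name__ == "__main__":
    # Unset: the cache lands in the system temp dir (/tmp on most Linux systems).
    print(resolve_cache_path())
    # With a directory set, the cache moves there.
    print(resolve_cache_path(tmp_dir="/home/work/cache"))
```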

The other option is to set it in your own code. Below is a demo of a jieba singleton:

from __future__ import print_function

import os

import jieba

# Placeholder paths (not from the original post); replace them with your own.
dirpath = os.path.expanduser('~/jieba_cache')  # writable cache directory
user_dict_path = 'userdict.txt'                # custom dictionary file


class Singleton(object):
    """
    Base class: every instantiation returns the same object
    """
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            # object.__new__ takes no extra arguments, so pass only cls
            cls._instance = super(Singleton, cls).__new__(cls)
        return cls._instance


class JiebaUtil(Singleton):
    """
    jieba utility class
    """
    _jieba_instance = None

    def get_instance(self):
        """
        get the global jieba instance
        """
        if self._jieba_instance is not None:
            return self._jieba_instance
        print('initialize...')
        # make sure the cache directory exists before jieba writes to it
        if not os.path.isdir(dirpath):
            os.makedirs(dirpath)
        obj = jieba.Tokenizer()
        obj.tmp_dir = dirpath  # custom cache directory
        obj.load_userdict(user_dict_path)
        obj.initialize()
        self._jieba_instance = obj
        return obj


if __name__ == '__main__':

    one = JiebaUtil()
    two = JiebaUtil()

    print(one == two)

    tkn = one.get_instance()
    tkn2 = one.get_instance()
    print(tkn == tkn2)

    print(id(one), id(two))

    print(id(tkn), id(tkn2))

The custom tmp_dir cache path is set in the line obj.tmp_dir = dirpath.
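Code that calls module-level functions such as jieba.cut does not create its own Tokenizer; it uses the library's default one, jieba.dt. A minimal sketch of pointing any tokenizer at a private cache directory follows. use_private_cache is a hypothetical helper, not a jieba function, and it only relies on the tmp_dir attribute shown in the source excerpt above.

```python
import os


def use_private_cache(tokenizer, cache_dir):
    """Point a tokenizer-like object at a user-writable cache directory.

    `tokenizer` is any object with a `tmp_dir` attribute, for example a
    jieba.Tokenizer. The directory is created if it does not exist yet.
    """
    cache_dir = os.path.abspath(os.path.expanduser(cache_dir))
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    tokenizer.tmp_dir = cache_dir
    return cache_dir
```

With jieba installed, this would be called as use_private_cache(jieba.dt, '~/jieba_cache') before the first call to jieba.cut, because the cache path is read when the tokenizer initializes; the directory name here is only an example.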

References:

http://funhacks.net/2017/01/17/singleton/

https://blog.csdn.net/sijiaqi11/article/details/78601258

Original article: https://www.cnblogs.com/shizhh/p/10599931.html