For sentences or articles -- phrase-aware tokenization in original word order

Sometimes the requirement is: tokenize the words of a sentence, but drive the result with a hand-built dictionary, so that every phrase in the dictionary survives as a single token.

(Anyone familiar with NLP will know that jieba's user-defined dictionary can already meet this need.) Here is a small wheel I reinvented myself, shared below:
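For reference, a minimal sketch of that jieba route, using jieba.add_word at runtime (jieba.load_userdict loads a whole dictionary file instead; the path below is illustrative). jieba targets Chinese text, so the custom term here is Chinese; the English sentence in this post is handled by the hand-rolled code instead.

import jieba

jieba.add_word("切词组化")                # register a custom term at runtime
# jieba.load_userdict("user_dict.txt")   # or load a dictionary file (illustrative path)
print(jieba.lcut("关于切词组化的正序排布"))
# expected to keep "切词组化" as one token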

Requirement:

a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
require_list = ["want to", "so much", "one chance"]

## Expected result
Finish Result: ['i', 'love', 'you', 'so much', 'and', 'want to', 'do', 'something', 'for', 'you,', 'can', 'you', 'give', 'me', 'one chance', 'or', 'want to', 'marry', 'you.']
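For contrast, a plain str.split() cannot meet the requirement, because it tears every dictionary phrase apart:

a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
print(a.split())
# [..., 'so', 'much', ..., 'want', 'to', ...] -- every phrase comes back as separate tokens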


The idea: locate each phrase, then recursively handle the text in front of it (backward phrase checking plus recursion):
# Phrase-aware tokenization, preserving the original word order
def tokenize(text, rule_phrases, tokens):
    """Split `text` into tokens, keeping every phrase in `rule_phrases`
    together as a single token. The text in front of each match is
    handled recursively, so output order follows the original sentence."""
    remaining = text
    for phrase in rule_phrases:
        search_from = 0
        while True:
            index = remaining.find(phrase, search_from)
            if index == -1:
                break
            end = index + len(phrase)
            # A hit only counts on whitespace boundaries (or the string
            # edges); "want to" must not match inside "i want tomorrow".
            left_ok = index == 0 or remaining[index - 1] == " "
            right_ok = end == len(remaining) or remaining[end] == " "
            if not (left_ok and right_ok):
                search_from = index + 1   # false hit: keep scanning
                continue
            before = remaining[:index]
            if before.strip():
                # Recurse on the prefix first: it may still contain
                # phrases that come later in the dictionary.
                tokenize(before, rule_phrases, tokens)
            tokens.append(phrase)
            remaining = remaining[end:]
            search_from = 0
    # Whatever is left contains no dictionary phrase: plain split.
    tokens.extend(remaining.split())
    return tokens

def tokenize_foo(a, list1):
    # The fixed tokenize() always returns a list, so the wrapper only
    # needs to seed the token accumulator.
    return tokenize(a, list1, [])
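A quick sanity check on the end-of-string case (a phrase with no trailing space), which the boundary test above has to accept:

print(tokenize("i love you so much", ["so much"], []))
# ['i', 'love', 'you', 'so much']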

  

Main program:
if __name__ == '__main__':

    a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."

    list1 = ["want to", "one chance", "so much"]

    new_list_order = tokenize_foo(a, list1)
    print(new_list_order)

Result:

['i', 'love', 'you', 'so much', 'and', 'want to', 'do', 'something', 'for', 'you,', 'can', 'you', 'give', 'me', 'one chance', 'or', 'want to', 'marry', 'you.']
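As an aside, the same requirement can also be met without recursion by splitting on a regex alternation of the dictionary phrases. A minimal sketch (tokenize_re is not the method above, just an alternative), with longest phrases first so overlapping entries match greedily:

import re

def tokenize_re(text, phrases):
    # Alternation of the phrases, longest first; matches only count on
    # whitespace boundaries or the string edges.
    alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
    parts = re.split(r"(?:(?<=\s)|^)(" + alternation + r")(?=\s|$)", text)
    tokens = []
    for i, part in enumerate(parts):
        if i % 2:                      # odd index: a captured phrase
            tokens.append(part)
        else:                          # plain text between phrases
            tokens.extend(part.split())
    return tokens

a = "i love you so much and want to do something for you, can you give me one chance or want to marry you."
print(tokenize_re(a, ["want to", "one chance", "so much"]))
# same list as tokenize_foo produces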
Little by little, day by day: small efforts, big dreams...
Original post: https://www.cnblogs.com/harp-yestar/p/tokenize_pharase.html