[ex]递归实现简易tokenizer


问题

用 Python 实现函数 count_words(),该函数输入字符串 s 和数字 n,返回 s 中 n 个出现频率最高的单词。返回值是一个元组列表,包含出现次数最高的 n 个单词及其次数,即 [(<单词1>, <次数1>), (<单词2>, <次数2>), ... ],按出现次数降序排列。


1
"""Count words.""" 2 def get_pos(k_new, v_new, p_new, ksorted,vsorted,plist): 3 if vsorted: 4 i = 0 5 while i < len(vsorted): 6 if v_new > vsorted[i]: 7 return get_pos(k_new, v_new, plist[i], ksorted[:i-1], vsorted[:i-1], plist[:i-1]) 8 elif v_new < vsorted[i]: 9 return get_pos(k_new, v_new, plist[i]+1, ksorted[i+1:], vsorted[i+1:], plist[i+1:]) 10 elif k_new < ksorted[i]: 11 return get_pos(k_new, v_new, plist[i], ksorted[:i-1], vsorted[:i-1], plist[:i-1]) 12 else: 13 return get_pos(k_new, v_new, plist[i]+1, ksorted[i+1:], vsorted[i+1:], plist[i+1:]) 14 else: 15 return p_new 16 17 def count_words(s, n): 18 """Return the n most frequently occuring words in s.""" 19 20 # Count the number of occurences of each word in s 21 wordlist = s.split(' ') 22 wdict = {} 23 for word in wordlist: 24 if word in wdict.keys(): 25 wdict[word] += 1 26 else: 27 wdict[word] = 1 28 # Sort the occurences in descending order (alphabetically in case of ties) 29 vs = [] 30 ks = [] 31 ps = [] 32 w_sorted = [] 33 for key, value in wdict.items(): 34 p_new = get_pos(key,value,0,ks,vs,ps) 35 w_sorted.insert(p_new, (key,value)) 36 ks.insert(p_new, key) 37 vs.insert(p_new, value) 38 ps = range(len(vs)) 39 40 # Return the top n most frequent words. 41 top_n = w_sorted[:n] 42 return top_n 43 44 45 def test_run(): 46 """Test count_words() with some inputs.""" 47 print(count_words("cat bat mat cat bat cat", 3)) 48 print(count_words("betty bought a bit of butter but the butter was bitter", 3)) 49 50 51 if __name__ == '__main__': 52 test_run()
原文地址:https://www.cnblogs.com/expttt/p/9357987.html