Composite data types and English word frequency counting

Assignment blog requirements:

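Written answers should be concise and explained clearly in your own words.
Coding answers should include the code with proper comments, plus a screenshot of the output.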

1. How to add, delete, modify, look up, and iterate over elements in lists, tuples, dictionaries, and sets.

(1) Lists

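items = ['a', 'b', 'hello', 1]

# append() adds an element at the end of the list; insert() places one at the given index
# (the variable is named items rather than list so that the built-in list type is not shadowed)
items.append(2)
items.insert(0, '0')
print(items)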

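A list is created with []. Elements are read and modified through their index. append() adds one element to the end of the list, insert() inserts one element at a given position, and extend() takes a sequence as its argument and adds every element of that sequence to the current list. pop() deletes the element at a given index and returns it. A list is usually traversed with a for loop, for example for s in stus: print(s).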
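The list operations described above can be sketched in one short script. The list name stus is taken from the loop example in the text; its contents are made-up sample data.

```python
# Hypothetical sample data, used only for illustration.
stus = ['a', 'b', 'hello', 1]

stus.append(2)              # add: one element at the end
stus.insert(0, '0')         # add: '0' at index 0
stus.extend(['x', 'y'])     # add: every element of another sequence
stus[1] = 'A'               # modify: assign through the index
removed = stus.pop(2)       # delete: remove the element at index 2 and return it
first = stus[0]             # look up: by index
has_hello = 'hello' in stus # look up: membership test

# iterate: visit the elements in order
for s in stus:
    print(s)
```

Calling pop() with no argument removes the last element, and remove(value) deletes the first element equal to value instead of deleting by index.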

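(2) Tuples

A tuple is created with (). Reading and traversing a tuple works essentially the same way as for a list, but a tuple is an immutable sequence: its elements cannot be reassigned, and nothing can be added or deleted after it is created.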
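A minimal sketch of tuple access and immutability follows. The name point and its values are hypothetical examples.

```python
# Hypothetical sample data, used only for illustration.
point = (3, 4, 3)

x = point[0]               # look up by index, same as a list
n_threes = point.count(3)  # count how many times a value occurs
pos = point.index(4)       # index of the first occurrence of a value

# Assigning to an element raises TypeError because tuples are immutable.
try:
    point[0] = 10
except TypeError as err:
    error_message = str(err)

# To "change" a tuple, build a new one from the old one.
moved = (10,) + point[1:]

# iterate: same for loop as with a list
for value in point:
    print(value)
```

A one-element tuple needs a trailing comma, as in (10,); without the comma, (10) is just the integer 10.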

(3) Dictionaries

A dictionary is created with {}. Each element is a key-value pair; keys cannot repeat, but values can.
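The add, delete, modify, look-up, and iterate operations for a dictionary can be sketched as follows. The name scores and its entries are hypothetical examples.

```python
# Hypothetical sample data, used only for illustration.
scores = {'tom': 90, 'amy': 85}

scores['bob'] = 70           # add: assigning to a new key creates the pair
scores['tom'] = 95           # modify: assigning to an existing key replaces the value
removed = scores.pop('amy')  # delete: remove the key and return its value
tom = scores['tom']          # look up: by key (raises KeyError if the key is missing)
missing = scores.get('zoe', 0)  # look up with a default instead of an error

# iterate: items() yields (key, value) pairs
for name, score in scores.items():
    print(name, score)
```

keys() and values() iterate over only the keys or only the values, and looping over the dictionary itself yields its keys.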

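(4) Sets

A set is created with non-empty braces such as {1, 2} or with the set() function. An empty {} creates a dictionary, so an empty set must be written set(). A set behaves like a dictionary that holds only keys and no values, and its elements never repeat. set() also converts a sequence or a dictionary into a set; for a dictionary, only the keys are kept.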
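A short sketch of the set operations follows. The name letters and the sample values are hypothetical examples.

```python
# Hypothetical sample data, used only for illustration.
letters = set('hello')    # duplicates collapse: {'h', 'e', 'l', 'o'}

letters.add('w')          # add one element
letters.discard('h')      # delete; does nothing if the element is absent
letters.remove('e')       # delete; raises KeyError if the element is absent
has_l = 'l' in letters    # look up: membership test (sets have no index)

# Converting a dictionary keeps only its keys.
keys_only = set({'a': 1, 'b': 2})

# An empty pair of braces is a dictionary, not a set.
empty_braces_type = type({})

# iterate: element order is not guaranteed, so sort for stable output
for ch in sorted(letters):
    print(ch)
```

An element of a set cannot be modified in place; to change one, remove the old value and add the new one.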

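2. Summarize the similarities and differences between lists, tuples, dictionaries, and sets. Consider the following aspects:

Brackets
Ordered or unordered
Mutable or immutable
Duplicates allowed or not
How elements are stored and looked up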

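(1) A list is written with []. It is ordered, mutable, and allows duplicates. Elements are stored as plain values and looked up by index, e.g. mylist[1].

(2) A tuple is written with (). It is ordered, immutable, and allows duplicates. Elements are stored as plain values and looked up by index, e.g. mytuple[0].

(3) A dictionary is written with {}. It has no positional index (since Python 3.7 it does preserve insertion order) and is mutable. Keys cannot repeat, but values can. Elements are stored as key-value pairs and are normally looked up by key, e.g. mydict['key'].

(4) A set is written with {}. It is unordered, mutable, and does not allow duplicates. Elements are stored as plain values with no index, so look-up is a membership test such as x in myset. set() converts a sequence or a dictionary into a set.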
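The differences in ordering and duplicates can be seen by building all four types from the same data. This is a minimal sketch; the name data and its values are hypothetical.

```python
# Hypothetical sample data with one duplicate, used only for illustration.
data = ['b', 'a', 'b']

as_list = list(data)            # keeps order and duplicates
as_tuple = tuple(data)          # keeps order and duplicates, but is immutable
as_set = set(data)              # drops duplicates, no guaranteed order
as_dict = dict.fromkeys(data, 0)  # keys are unique; every key maps to 0

print(as_list, as_tuple, as_set, as_dict)
print(len(as_list), len(as_tuple), len(as_set), len(as_dict))
```

The list and tuple both have length 3, while the set and dictionary have length 2 because the repeated 'b' is stored only once.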

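3. Word frequency counting

1. Download a full-length novel and save it as a UTF-8 encoded text file: file
2. Read the file into a string: str

3. Preprocess the text

4. Split the text and extract the words: list

5. Count the words in a dictionary: set, dict

6. Sort by frequency: list.sort(key=lambda), tuple

7. Exclude grammatical words that carry no meaning, such as pronouns, articles, and conjunctions
- use a custom stop-word list
- or use stops.txt

8. Output the TOP(20)

9. Visualization: word cloud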

import pandas as pd


def getText():
    # Read the novel; adjust this path to where the UTF-8 text file is stored
    with open(r"E:KINGPyCharmKINGKINGig.txt", "rt", encoding="utf-8") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces; str.replace returns a new string, so reassign it
    for ch in '''’!@#$%^&*()_+=-';":.,<>/?|''':
        txt = txt.replace(ch, " ")
    wordlist = txt.split()
    return wordlist


# Word frequency count
wordlist = getText()
# Filter out grammatical words (pronouns, articles, conjunctions, etc.)
mum = {'it', 'if', 'the', 'at', 'for', 'on', 'and', 'in', 'to', 'of', 'a', 'was', 'be', 'were', 'in', 'about',
       'from', 'with', 'without', 'an', 'one', 'another', 'others', 'that', 'they', 'himself', 'itself',
       'themselves', 'if', 'when', 'before', 'though', 'although',
       'while', 'as', 'as long as', 'i', 'he', 'him', 'she', 'out', 'is', 's', 'no', 'not', 'you', 'me', 'his',
       'but'}
wordset = set(wordlist) - mum

# Build the word -> count dictionary
worddict = {}
for w in wordset:
    worddict[w] = wordlist.count(w)

# Sort by frequency, highest first (done once, after the dictionary is complete)
wordsort = list(worddict.items())
wordsort.sort(key=lambda x: x[1], reverse=True)
# Print the top 20; slicing avoids an IndexError if there are fewer than 20 words
for item in wordsort[:20]:
    print(item)
pd.DataFrame(data=wordsort).to_csv(r'E:\KING\大三（二）\big.csv', encoding='utf-8')
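Calling wordlist.count(w) once per distinct word rescans the whole list each time, which gets slow on a full novel. As an alternative, the counting and sorting steps can be sketched with collections.Counter from the standard library. This is not the approach used in the post: the function name top_words, the regular expression, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, stop_words, n=20):
    """Return the n most frequent words in text, skipping stop_words.

    Hypothetical helper: words are taken to be runs of the letters a-z,
    so "don't" is split into "don" and "t".
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)  # list of (word, count), highest count first


# Made-up sample text standing in for the novel.
sample = "The cat and the dog. The dog saw a cat; the cat ran!"
stops = {'the', 'and', 'a'}
result = top_words(sample, stops, n=2)
print(result)
```

Counter tallies every word in a single pass over the text, and most_common(n) replaces the manual sort and slice.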

Top 20:

Generating the word cloud online:

Save the sorted word list word as a csv file

import pandas as pd
pd.DataFrame(data=word).to_csv('big.csv',encoding='utf-8')

Online tool used to generate the word cloud:
https://wordart.com/create