NLTK reading notes and a record of problems encountered in practice

Python version 3.4.2:

1. The example in the book is:

from nltk.corpus import wordnet as wn

wn.synset('car.n.01').lemma_names # get the lemma names in the synset

wn.synset('car.n.01').definition # get the definition

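Running this under 3.4.2 produces the output:

<bound method Synset.lemma_names of Synset('car.n.01')> and the corresponding bound-method repr for definition.

This is a version difference: in NLTK 3, lemma_names and definition are methods rather than attributes, so they have to be called. Adding () to the commands above fixes it: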

wn.synset('car.n.01').lemma_names()

wn.synset('car.n.01').definition()
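The same behavior can be reproduced without NLTK. The sketch below uses the built-in str.upper method as a stand-in (an assumption made only so the example runs without the WordNet data): leaving off the parentheses yields the method object, adding them yields the value.

```python
# Accessing a method without parentheses returns the method object itself;
# only calling it with () returns the value.
text = "car"

method_object = text.upper   # no call: a bound method object
result = text.upper()        # call: the actual string

print(method_object)         # prints a "<built-in method upper of str object ...>" repr
print(result)
```

If a printed value starts with "<bound method" or "<built-in method", the call parentheses are most likely missing.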

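2. The book uses from urllib import urlopen, but this raises: ImportError: cannot import name 'urlopen'. The cause is that Python 3 organizes the library differently from Python 2; here it should be changed to:

from urllib.request import urlopen. While on the subject, a note on the difference between from ... import ... and import: with import, you must write the fully qualified name (module.name) every time you use something from the module, whereas with from ... import you can use the imported name directly (the idea being that the imported name comes "from" that module).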
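A minimal sketch of the two import styles, using only the standard library. Both names end up referring to the same function; the only difference is how it is spelled at the call site.

```python
# Style 1: plain import, access through the full dotted path.
import urllib.request

# Style 2: from ... import, the name is available directly.
from urllib.request import urlopen

qualified = urllib.request.urlopen   # full path required
direct = urlopen                     # bare name is enough

# Both refer to the very same function object.
print(qualified is direct)
```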

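3. In Python IDLE, Backspace seems to delete only half a character at a time and leaves a white box behind. Holding Alt while deleting removes one character at a time; holding Ctrl removes one word at a time.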

4. Also because of Python 3, urlopen(url).read() returns bytes rather than str. Converting between str and bytes in Python is simple: for bytes -> str use a.decode(encoding="utf-8"); for str -> bytes use a.encode(encoding="utf-8").
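A small sketch of the round trip. The bytes value here is built locally as a stand-in for what urlopen(url).read() would return, so the example runs without network access; UTF-8 is assumed as the encoding.

```python
# Stand-in for the bytes that urlopen(url).read() would return.
raw = "罪与罚 Crime and Punishment".encode("utf-8")

# bytes -> str
text = raw.decode("utf-8")

# str -> bytes
round_trip = text.encode("utf-8")

print(type(raw).__name__, type(text).__name__)
```

For a real download, the correct encoding depends on the server and the file; UTF-8 is only a common default.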

5. For natural language processing, the text must first be tokenized, separating punctuation from words, before any further processing.
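To illustrate what separating punctuation from words means, here is a rough regular-expression sketch. It is not NLTK's tokenizer (NLTK provides nltk.word_tokenize, which needs extra data downloaded); simple_tokenize is a hypothetical helper used only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A run of word characters becomes one token; any other
    non-whitespace character becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, world!")
print(tokens)
```

This naive pattern also splits contractions and hyphenated words at the punctuation, which a real tokenizer handles more carefully.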

6. The URL for "Crime and Punishment" has changed to: http://www.gutenberg.org/cache/epub/2554/pg2554.txt

7. Calling nltk.clean_html(htmltext) raises: builtins.NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function. It turns out NLTK no longer provides the clean_html and clean_url functions. The Beautiful Soup project can be used to process HTML instead.
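A minimal sketch of the replacement suggested by the error message, assuming Beautiful Soup 4 is installed. strip_html is a hypothetical helper name, and "html.parser" selects Python's built-in parser so no additional parser library is needed.

```python
from bs4 import BeautifulSoup

def strip_html(html_text):
    """Return the text content of an HTML string with the markup removed."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.get_text()

plain = strip_html("<p>Hello <b>world</b></p>")
print(plain)
```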

8. Installing Beautiful Soup:

From the command line, run easy_install packageName (easy_install is a command-line tool, not something to import), or:

curl http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/beautifulsoup4-4.1.2.tar.gz > beautifulsoup4-4.1.2.tar.gz

tar zxvf beautifulsoup4-4.1.2.tar.gz

cd beautifulsoup4-4.1.2

python setup.py install

9. Starting with Beautiful Soup 4, the package to import is bs4: what used to be import BeautifulSoup is now import bs4. For usage details, see:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

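10. Because the start and end of the actual content cannot be detected reliably, you need to inspect the file by hand before extracting content from the raw text, and find the specific strings that mark where the content begins and ends (using find and rfind, the latter searching backwards from the end).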
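A sketch of that trimming step. The sample string and both marker strings are made up for illustration; for a real file, the markers have to be found by inspecting it by hand. extract_body is a hypothetical helper.

```python
def extract_body(raw, start_marker, end_marker):
    """Return the slice of raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)   # first occurrence, searching forward
    end = raw.rfind(end_marker)      # last occurrence, searching backward
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]

# Made-up sample text standing in for a downloaded e-book.
sample = "header junk\nPART I\nchapter text\nEnd of Project Gutenberg\nlicense"
body = extract_body(sample, "PART I", "End of Project Gutenberg")
print(body)
```

Both find and rfind return -1 when the marker is missing, so checking for -1 avoids silently slicing from the wrong position.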