Python web crawler

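urllib.request.urlretrieve(url, local_path): downloads the content at url and saves it to a local file.

urllib.request.urlcleanup(): clears the temporary files and cache left behind by urlretrieve.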
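
A minimal sketch of the two calls above. To keep it runnable without a network connection, it assumes a temporary local file addressed through a file:// URL as a stand-in for a real page; the file names are placeholders.

```python
import os
import pathlib
import tempfile
import urllib.request

work_dir = tempfile.mkdtemp()

# Stand-in for a remote page: a local file addressed by a file:// URL.
source = pathlib.Path(work_dir) / "page.html"
source.write_text("<html>hello</html>", encoding="utf-8")

# urlretrieve returns the local path it wrote to and the response headers.
target = os.path.join(work_dir, "saved.html")
saved_path, headers = urllib.request.urlretrieve(source.as_uri(), target)

with open(saved_path, encoding="utf-8") as f:
    saved_text = f.read()
print(saved_path, len(saved_text))

# Remove temporary files left by urlretrieve calls made without a filename.
urllib.request.urlcleanup()
```

With a real site, source.as_uri() would simply be replaced by an http:// or https:// URL.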

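.info(): returns the meta-information of the response, i.e. its headers

.getcode(): returns the HTTP status code of the response

.geturl(): returns the URL that was actually retrieved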
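
The three methods can be tried out as follows. This is only a sketch: it assumes a tiny local HTTP server (HelloHandler, a made-up name) in place of a real website, and it uses an opener with an empty ProxyHandler so that proxy environment variables do not interfere.

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Tiny stand-in for a real website (an assumption for this demo)."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler makes the opener ignore proxy environment variables.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
response = opener.open(url, timeout=5)
status = response.getcode()                         # HTTP status code
final_url = response.geturl()                       # URL actually retrieved
content_type = response.info().get("Content-Type")  # one of the headers
response.close()
server.shutdown()
server.server_close()
print(status, final_url, content_type)
```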

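The difference between decode and encode: decode turns already-encoded content (bytes) back into text (str), while encode turns text that has not been encoded yet into bytes. Unencoded text cannot be written to a file opened in binary mode; in text mode, open() performs the encoding itself.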
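
A short round trip makes the distinction concrete. The file name demo.txt and the temporary directory are placeholders chosen for this sketch.

```python
import os
import tempfile

text = "爬虫"                    # str: not encoded yet
raw = text.encode("utf-8")       # bytes: the encoded form
restored = raw.decode("utf-8")   # back to str

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:      # binary mode accepts bytes only
    f.write(raw)
with open(path, "w", encoding="utf-8") as f:   # text mode encodes the str itself
    f.write(text)

print(type(raw).__name__, len(raw), restored == text)
```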

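Timeout setting, timeout=: file = urllib.request.urlopen("http://www.baidu.com", timeout=10). The keyword is timeout (in seconds), and the URL must include the scheme.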
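
A possible way to wrap the timeout in a helper is sketched below. The helper name fetch is made up, and the demonstration assumes a local socket that accepts connections but never answers, so that the timeout can be observed without an external site.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as e:
        print("request failed:", e)
        return None


# A listening socket that never replies: the connection is accepted by the
# operating system, but no response arrives, so reading must time out.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
port = silent.getsockname()[1]

result = fetch("http://127.0.0.1:%d/" % port, timeout=0.5)
silent.close()
print(result)
```

A timeout while waiting for the response can surface as socket.timeout rather than URLError, which is why the sketch catches both.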

Exception handling:

try:

    code block

except Exception as variable:

    code block

import urllib.request
for i in range(0,10):
    try:
        file = urllib.request.urlopen('http://qq.ssjzw.com')
        data = file.read()
        print(data)
    except Exception as e:
        print('Exception occurred: ' + str(e))
# Output: Exception occurred: HTTP Error 403: Forbidden

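Simulating HTTP requests automatically: wrap the URL in a urllib.request.Request object to turn it into a request.

When the requested value contains Chinese characters, leaving it unprocessed raises the error 'ascii' codec can't encode characters, so it has to be passed through:

urllib.request.quote(variable): percent-encodes the Chinese characters in variable (the same function is documented as urllib.parse.quote)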
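
A small illustration of the percent-encoding step. The host example.com and the q parameter are placeholders, not a real search API.

```python
import urllib.parse

keyword = "爬虫"
encoded = urllib.parse.quote(keyword)   # UTF-8 bytes written as %XX escapes

# example.com and "q" are placeholders for a real search URL.
url = "http://example.com/search?q=" + encoded
print(url)
print(urllib.parse.unquote(encoded))    # reverses the encoding
```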

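POST requests: the form fields have to be encoded first, which requires importing urllib.parse (urllib.parse.urlencode).

urllib.request.Request(url, data): builds a request that carries data (as bytes) to url; passing it to urlopen sends it as a POST.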
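
The POST preparation might look like the sketch below. The login URL and the field names name and pass are hypothetical, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical login form: the URL and the field names are placeholders.
url = "http://example.com/login"
form = {"name": "alice", "pass": "secret"}

# urlencode produces a str; the request body must be bytes.
data = urllib.parse.urlencode(form).encode("utf-8")
req = urllib.request.Request(url, data)

# Supplying data switches the request method from GET to POST.
print(req.get_method(), req.data)

# Sending it would be: urllib.request.urlopen(req, timeout=10).read()
```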

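Exception handling: URLError and HTTPError

URLError carries no status code, whereas HTTPError carries the status code returned by the server.

URLError is raised when: 1. the server cannot be reached; 2. the URL does not exist; 3. the local network is down; 4. its subclass HTTPError is triggered.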

import urllib.error
import urllib.request
try:
    urllib.request.urlopen("https://ac.zzidc.com/cas/login?service=https%3A%2F%2Fwww.zzidc.com%2F")
except urllib.error.URLError as e:
    if hasattr(e,'code'):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
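
As an alternative to the hasattr checks, the two exception types can be told apart with isinstance. The following is a sketch only: the helper name describe is made up, and the exceptions are constructed by hand so that it runs without a network connection.

```python
import email.message
import io
import urllib.error


def describe(error):
    """Summarize a urllib exception; only HTTPError carries a status code."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(error, urllib.error.HTTPError):
        return "HTTPError code=%s reason=%s" % (error.code, error.reason)
    if isinstance(error, urllib.error.URLError):
        return "URLError reason=%s" % error.reason
    return "other: %r" % (error,)


# Exceptions built by hand for the demonstration (no request is made).
http_error = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found",
    email.message.Message(), io.BytesIO(b""))
url_error = urllib.error.URLError("connection refused")

print(describe(http_error))
print(describe(url_error))
http_error.close()
```

The same ordering applies to except clauses: an except urllib.error.HTTPError branch has to come before except urllib.error.URLError, otherwise it is never reached.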

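Disguising the crawler as a browser

Create the header as a tuple: ('User-Agent', 'value')

To add headers, call urllib.request.build_opener() and assign the result to a variable, for example opener.

Then set the addheaders attribute on it (it is an attribute, not a method): opener.addheaders = [header]

opener.open(url).read().decode("encoding")

If the page cannot be decoded as utf-8, pass "ignore" as the second argument of decode.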
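
The steps above can be sketched as two small helpers. The names make_opener and decode_page are made up for this illustration, the User-Agent value is only an example, and no request is sent.

```python
import urllib.request

# Example User-Agent value; any current browser string could be used instead.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"


def make_opener(user_agent=USER_AGENT):
    """Return an opener that sends the given User-Agent with every request."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]   # attribute, not a method
    return opener


def decode_page(raw, encoding="utf-8"):
    """Decode page bytes, dropping byte sequences that are invalid."""
    return raw.decode(encoding, "ignore")


opener = make_opener()
print(opener.addheaders)

# b"\xff" is not valid UTF-8; "ignore" drops it instead of raising an error.
print(decode_page(b"ok\xffay"))

# Fetching a page would be: decode_page(opener.open(url).read())
```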

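Regular expressions: the pattern does not have to be very strict. If the filtered results still contain similar items that you do not want, append more of the literal text that follows the wanted items to the pattern, and the unwanted ones drop out.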
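
To illustrate the narrowing idea, the sketch below runs a loose and a stricter pattern over a hypothetical HTML fragment. The fragment is invented, and [^"]* is used in the stricter pattern (a choice made here, not taken from the notes) so that a match cannot run across a closing quote.

```python
import re

# Hypothetical page fragment with one unwanted link (a stylesheet).
html = ('<a href="http://news.sina.com.cn/style.css">css</a>'
        '<a href="http://news.sina.com.cn/a/1.html">one</a>'
        '<a href="http://news.sina.com.cn/a/2.html">two</a>')

# Loose pattern: every link under the site matches, including the stylesheet.
loose = re.findall(r'href="(http://news\.sina\.com\.cn/.*?)"', html)

# Stricter pattern: appending the literal ".html" filters out the stylesheet.
strict = re.findall(r'href="(http://news\.sina\.com\.cn/[^"]*\.html)"', html)

print(len(loose), len(strict))
print(strict)
```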

Crawling a part-time job site:

import urllib.request
import re
url = "http://qq.ssjzw.com/"
headers = ('User-Agent',"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read().decode("GBK")
tj = '<li>类型:(.*?)</li><li>(.*)</li>'
p = re.compile(tj).findall(str(data))
file = open("兼职群.txt" ,"w",encoding="utf-8")
for i in range(len(p)):
    dawa = str(p[i])
    file.write(dawa + "\n")

print("------Writing complete------")
file.close()
# Key point: build the User-Agent header correctly
# Difficulty: writing the matched tuples to the file
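
On the difficulty noted above: when a pattern has two capture groups, findall returns a list of tuples, and each tuple can be unpacked and written as one line instead of being passed through str(). A sketch with a hypothetical page fragment, using io.StringIO in place of a real file:

```python
import io
import re

# Hypothetical page fragment; the real site's markup may differ.
html = ("<li>Type:group</li><li>12345</li>"
        "<li>Type:forum</li><li>67890</li>")

# Two capture groups, so findall returns a list of 2-tuples.
rows = re.findall(r"<li>Type:(.*?)</li><li>(.*?)</li>", html)

out = io.StringIO()   # stands in for open("result.txt", "w", encoding="utf-8")
for kind, value in rows:
    out.write(kind + "\t" + value + "\n")   # one tab-separated line per match
print(out.getvalue())
```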


Crawling Sina News:

import urllib.request
import re
url = "http://news.sina.com.cn/"
headers = ('User-Agent',"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read().decode("utf-8","ignore")
tj = 'href="(http://news.sina.com.cn/.*?)"'
p = re.compile(tj).findall(data)
for i in range(len(p)):
    try:
        data_url = p[i]
        file = "F:/bing/a/" + str(i) + ".html"
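        print("Generating item %s" % i)
        urllib.request.urlretrieve(data_url,file)
        print("Item %s generated" % i)
    except Exception as e :
        print(str(e))
# Difficulty: writing the regular expression correctly; the filter must be strict, otherwise the generated pages come back as 404
# Key points: exception handling, the "ignore" argument when decoding, and building the file name by concatenation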


Crawling the CSDN home page:

import re
import urllib.request
url = "https://www.csdn.net/"
headers = ('User-Agent',"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read().decode("utf-8","ignore")
tj = 'href="(https://blog.csdn.net.*?)"'
p = re.compile(tj).findall(data)
try:
    for i in range(len(p)):
        thisurl = p[i]
        file = "F:/bing/a/" + str(i) + ".html"
        print("Saving item %s" % i)
        # Download the matched link (thisurl), not the home page url
        urllib.request.urlretrieve(thisurl,file)
        print("Item %s saved" % i)
except Exception as e:
    print("An error occurred: " + str(e))
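
In the listing above, the try wraps the whole loop, so the first failed download ends the run. A possible variant keeps the try inside the loop, as the Sina example does. In this sketch save_all and fake_retrieve are made-up names, and fake_retrieve replaces urlretrieve so that the demonstration runs offline.

```python
import os
import tempfile
import urllib.request


def save_all(urls, out_dir, retrieve=urllib.request.urlretrieve):
    """Save each URL to out_dir/<index>.html; return the indexes that failed."""
    os.makedirs(out_dir, exist_ok=True)   # create the target folder if missing
    failed = []
    for i, page_url in enumerate(urls):
        target = os.path.join(out_dir, str(i) + ".html")
        try:
            retrieve(page_url, target)
        except Exception as e:            # one bad link should not stop the rest
            print("item %s failed: %s" % (i, e))
            failed.append(i)
    return failed


def fake_retrieve(page_url, target):
    """Offline stand-in for urlretrieve, used only for this demonstration."""
    if "bad" in page_url:
        raise OSError("simulated download error")
    with open(target, "w", encoding="utf-8") as f:
        f.write("<html>" + page_url + "</html>")


out_dir = os.path.join(tempfile.mkdtemp(), "pages")
failed = save_all(["http://example.com/a", "http://example.com/bad",
                   "http://example.com/c"], out_dir, retrieve=fake_retrieve)
print(failed, sorted(os.listdir(out_dir)))
```

For a real crawl, the retrieve argument would be left at its default and the list of URLs would come from findall.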
