Crawler Basics, Part 1: Fundamentals and requests

title: Crawler Basics, Part 1: Fundamentals and requests
date: 2020-03-05 14:43:00
categories: python
tags: crawler

An overview of web crawlers and the fundamentals.
Learning the requests library.

1. requests

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 open-source license.
http://docs.python-requests.org/en/latest/

1.1

import requests
r = requests.get("http://www.whu.edu.cn/")   # returns a Response object
print(r.status_code)
A status code of 200 means the request succeeded.

Enter r.text to get the page content.

HTTP status codes

200 OK (success)
404 Not Found
503 Service Unavailable
…
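The codes above can be looked up in Python's standard library. This is a minimal sketch using http.HTTPStatus; the helper name is_success is made up for illustration and is not part of requests.

```python
from http import HTTPStatus

# Print the standard reason phrase for each code listed above.
for code in (200, 404, 503):
    print(code, HTTPStatus(code).phrase)


def is_success(code: int) -> bool:
    """Hypothetical helper: True for 2xx codes, the range that signals success."""
    return 200 <= code < 300
```

With requests, the same check is usually done through r.status_code or r.raise_for_status().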

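1.2 HTTP header

https://www.jianshu.com/p/6f29fcf1a6b3
HTTP (HyperText Transfer Protocol) is the standard protocol for transferring web pages today. It uses a request/response model: a browser or other client sends a request, and the server returns a response. Every HTTP message consists of two parts, the message-header and the message-body.

Following the way Wikipedia organizes HTTP header fields, they fall broadly into two groups: Request and Response.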

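The header may carry a charset (the character set, i.e. the encoding).
r.encoding is the response encoding guessed from the HTTP header; if the header declares no charset, requests falls back to 'ISO-8859-1' for text content, which cannot represent Chinese characters.
r.apparent_encoding is the encoding requests infers by analyzing the page content itself.

Enter "r.encoding": for this page it reports 'ISO-8859-1'.
Enter "r.apparent_encoding": it reports 'utf-8'.
Enter "r.encoding = r.apparent_encoding".
Then enter "r.text" again; the page content is now readable.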
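Why the text becomes readable can be shown without any network access. The sketch below does not call requests; it only mimics the situation by decoding the same UTF-8 bytes with the fallback encoding and then with the correct one. The sample string is an arbitrary choice.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = "武汉大学".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but produces mojibake:
# every byte is mapped to one unrelated Latin-1 character.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding that matches the content restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

Setting r.encoding = r.apparent_encoding has the same effect on r.text: the bytes in r.content stay the same, only the decoding changes.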

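1.3 Exceptions

On a network problem (e.g. DNS lookup failure or a refused connection), Requests raises a ConnectionError.
On the rare invalid HTTP response, Requests raises an HTTPError; r.raise_for_status() also raises HTTPError when the status code is 4xx or 5xx.
If the request times out, a Timeout exception is raised.
If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
All exceptions that Requests raises explicitly inherit from requests.exceptions.RequestException.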
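The exception classes above can be caught individually, from most specific to most general. This is a sketch only: the function name fetch_status and the returned labels are made up for illustration, and the 5-second timeout is an arbitrary assumption.

```python
import requests


def fetch_status(url, timeout=5):
    """Hypothetical helper: return the status code, or a label naming the failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()          # turns 4xx/5xx into HTTPError
        return r.status_code
    except requests.exceptions.Timeout:
        return "timeout"
    except requests.exceptions.TooManyRedirects:
        return "too many redirects"
    except requests.exceptions.ConnectionError:
        return "connection error"
    except requests.exceptions.HTTPError:
        return "http error"
    except requests.exceptions.RequestException:
        # Base class: catches anything else Requests raises, e.g. a malformed URL.
        return "other requests error"
```

Timeout is listed before ConnectionError because a connect timeout derives from both classes; the first matching except clause wins.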

1.4 General framework

Note:
try
except
r.raise_for_status()

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError if the status code is 4xx or 5xx
        # print("%d\n%s" % (r.status_code, r.text))
        print("%s %s" % (r.encoding, r.apparent_encoding))
        r.encoding = r.apparent_encoding
        print("%s %s" % (r.encoding, r.apparent_encoding))
        # html = r.content  # bytes
        # html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
        # print(html_doc)
        return r.text
    except requests.RequestException:
        return "An exception occurred"

1.5 requests methods (HTTP operations)

Note the difference between a method and a function.

request

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)  # returns a Response; note the timeout parameter
        r.raise_for_status()  # raises HTTPError if the status code is 4xx or 5xx
        # print("%d\n%s" % (r.status_code, r.text))
        print("%s %s" % (r.encoding, r.apparent_encoding))
        r.encoding = r.apparent_encoding
        print("%s %s" % (r.encoding, r.apparent_encoding))
        # html = r.content  # bytes
        # html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
        # print(html_doc)
        print(r.text)
        return r.text
    except requests.RequestException:
        return "An exception occurred"

def head(url):
    r = requests.head(url)
    print(r.headers)    # note: the function is head, the attribute is headers
    print(r.text)       # empty: a HEAD response carries no body

def post(url="http://httpbin.org/post"):  # POST: append data
    r = requests.get(url)   # GET on the POST endpoint, for comparison
    print(r.text)
    payload = {'name': 'your_name', 'ID': 'your_student number'}
    r = requests.post(url, data=payload)   # note the data parameter
    print(r.text)

def put(url="http://httpbin.org/put"):   # PUT: overwrite
    r = requests.get(url)   # GET on the PUT endpoint, for comparison
    print(r.text)
    payload = {'name': 'your_name', 'ID': '123456'}
    r = requests.put(url, data=payload)
    print(r.text)

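1.6 Request control parameters: requests.request(method, url, **kwargs)

Standard form: requests.request(method, url, **kwargs)

**kwargs: parameters that control the request; all optional, 13 in total
params:          dict or byte sequence, appended to the URL as query parameters
data:            dict, byte sequence, or file object, sent as the request body
json:            JSON-serializable data, sent as the request body
headers:         dict of custom HTTP headers; can be used to imitate a browser when sending a request
           hd = {'user-agent': 'Chrome/56.0'}
           r = requests.request('post', 'https://www.amazon.com/', headers=hd)
cookies:         dict or CookieJar, the cookies sent with the request
auth:            tuple, enables HTTP authentication
files:           dict, for uploading files
timeout:         timeout, in seconds
proxies:         dict, sets the proxy servers for the request; can include login credentials
allow_redirects: True/False, default True; redirect switch
stream:          True/False, default False; when False the response body is downloaded immediately
verify:          True/False, default True; SSL certificate verification switch
cert:            path to a local SSL certificate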
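How a few of these parameters shape the outgoing request can be inspected without sending anything, by building a prepared request. This is a sketch under assumptions: the httpbin.org URLs and the parameter values are placeholders, and only params, headers, and data are shown.

```python
import requests

# Build the requests without sending them, then look at what would go out.
get_req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"wd": "python"},                # appended to the URL as a query string
    headers={"user-agent": "Chrome/56.0"},  # replaces the default python-requests UA
).prepare()

post_req = requests.Request(
    "POST",
    "http://httpbin.org/post",
    data={"name": "your_name"},             # form-encoded into the request body
).prepare()

print(get_req.url)
print(get_req.headers["user-agent"])
print(post_req.body)
print(post_req.headers["Content-Type"])
```

Parameters such as timeout, proxies, verify, and allow_redirects only take effect when the request is actually sent, so they do not appear on the prepared request.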

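1.7 Crawler scale

Single pages: requests
A whole website: scrapy
The whole web: search engines

1.8 The robots protocol

The Robots protocol (also called the crawler protocol or robot protocol) is formally the Robots Exclusion Protocol. A website uses it to tell search engines which pages may be crawled and which may not.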

https://www.jd.com/robots.txt

User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /

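* matches everything and / is the root directory, so the pair
User-agent: *
Disallow: /
would bar every crawler from the whole site.
JD treats the four crawlers named in its robots.txt (EtaoSpider, HuihuiSpider, GwdangSpider, WochachaSpider) as malicious and denies them all access.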
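Rules like these can be checked programmatically with the standard library's urllib.robotparser. The sketch below parses a simplified rule set written inline (not JD's real file): as far as I know the standard-library parser matches paths by prefix and does not interpret * inside a path, so the wildcard rules are replaced with a plain prefix. The crawler name MyCrawler is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Simplified rules modeled on the listing above (an assumption, not the real file).
RULES = """\
User-agent: *
Disallow: /pop/

User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# A generic crawler may fetch ordinary pages but not anything under /pop/.
print(parser.can_fetch("MyCrawler", "https://www.jd.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://www.jd.com/pop/item.html"))

# EtaoSpider is barred from the whole site.
print(parser.can_fetch("EtaoSpider", "https://www.jd.com/index.html"))
```

For a live site, set_url() followed by read() fetches the real robots.txt instead of parsing an inline string.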

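1.9 Viewing the User-Agent in Chrome

Press F12, open the Network tab, and click a request in the Name column; the User-Agent appears among the request headers.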

2. requests examples

import requests
import os

def amazon():
    #url="https://www.amazon.cn"
    # r=requests.get(url)
    # print(r.status_code)
    #url="https://www.amazon.com"
    # In principle Python can fetch this directly. The request honestly tells the site it was sent by Python,
    # and the site uses the headers to decide the visit came from a crawler, not a browser. amazon answers 503; with a browser User-Agent it works.
    # What actually happened here was an immediate error 10060.
    #url = "https://www.amazon.co.jp"
    # try:
    #      r=requests.get(url)
    #      #r = requests.get(url,timeout=5)
    #       print(r.request.headers)  # request headers
    #      #print(r.request.url)
    #      #r.raise_for_status()
    #      print(r.status_code)
    # except:
    #      print("except %s"% r.status_code)
    # print(r.request.headers)  # note: request. The site sees from the headers that Python (a crawler) sent it and refuses
    #hd = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    #r = requests.request('post', url=url, headers=hd)
    #r = requests.get(url, headers=hd)
    #print("final %s"% r.status_code)

    # The failures above came from a network problem that blocked amazon; I spent a long time assuming it was the code. The version below is all that is needed.
    url = "https://www.amazon.com"
    r = requests.get(url)
    print("%s %s" % (r.status_code, r.request.headers))  # note: r.request.headers, not requests
    #503 {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    hd = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    #r = requests.request('post', url=url, headers=hd)  # with POST the status code is 405: the server does not allow POST
    r = requests.get(url, headers=hd)
    print("%s %s" % (r.status_code, r.request.headers))  #200


def searchengine():
    keyword = "知乎"
    try:
        kv = {'wd': keyword}
        r = requests.get("http://www.baidu.com/s", params=kv)
        print(r.request.url)
        r.raise_for_status()
        print(r.text[:1000])  # the result is long; print only the first 1000 characters
    except requests.RequestException:
        print("Crawl failed")
    # Searching Baidu directly for Wuhan University (武汉大学) or HUST (华中科技大学) gives
    # https://www.baidu.com/s?wd=武汉大学&rsv_spt=1……
    # https://www.baidu.com/s?wd=华中科技大学&rsv_spt=1……
    # so a search only requires replacing the wd parameter

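def images():
    # A loop can download many images in bulk; a regular expression works too
    url = "https://meowdancing.com/images/timg.jpg"
    root = "F://Pictures//"
    path = root + url.split('/')[-1]  # split on "/" and take the last piece, i.e. timg.jpg
    try:
        if not os.path.exists(root):
            os.mkdir(root)  # create the target directory if it is missing
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()
            with open(path, 'wb') as f:  # the with block closes the file automatically
                f.write(r.content)
            print("File saved")
        else:  # mind the indentation when writing this
            print("File already exists")
    except (requests.RequestException, OSError):
        print("Crawl failed")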

def ipaddress():
    url = "http://www.ip138.com/ips138.asp?ip="
    ip = "101.24.190.228"
    url = url + ip
    # appending "&action=2" is optional
    hd = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    print(url)
    try:
        r = requests.get(url, headers=hd)   # seems to fail without the User-Agent header
        print(r.status_code)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-2000:])  # print the last 2000 characters
    except requests.RequestException:
        print("Crawl failed")

    # Open http://www.ip138.com/ to look up the location of an IP address.
    # After entering an IP address, check the link in the browser:
    # http://www.ip138.com/ips138.asp?ip=202.114.66.96&action=2
    # So the query link has the form
    # http://www.ip138.com/ips138.asp?ip="your IP address"
    #
    # This example shows that many interactive operations are really carried out by submitting an HTTP link.
    # Once a simple analysis reveals how the link maps to the submitted information, Python can fetch the resources we need.


if __name__ == "__main__":
    #amazon()
    #searchengine()
    #images()
    ipaddress()
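The search and IP lookup examples both come down to building a query URL, and the image example to taking a file name from a URL. Both steps can be sketched with the standard library alone; the helper names build_query_url and filename_from_url are made up for illustration.

```python
import posixpath
from urllib.parse import urlencode, urlsplit


def build_query_url(base, **params):
    """Hypothetical helper: append params to base as a percent-encoded query string."""
    return base + "?" + urlencode(params)


def filename_from_url(url):
    """Hypothetical helper: last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


print(build_query_url("http://www.ip138.com/ips138.asp", ip="101.24.190.228"))
print(build_query_url("http://www.baidu.com/s", wd="知乎"))
print(filename_from_url("https://meowdancing.com/images/timg.jpg"))
```

Passing params=kv to requests.get, as searchengine() does, performs the same encoding internally, so hand-built URLs are mainly useful for logging or debugging.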
