Basic Crawler Libraries: Using the requests Library

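Using requests

With urllib, handling web authentication and Cookies means writing Openers and Handlers. The requests library makes these operations more convenient: it supports Cookies, login authentication, proxy settings, and more.

A simple example with requests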

import requests
r = requests.get('http://www.baidu.com/')
print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')

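GET requests

requests.get() sends a GET request and returns the corresponding response:

requests.get(url, params)
# url is the link of the page to fetch; params holds extra query parameters for the URL (a dict or bytes)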
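To see what params does without touching the network, one option is to build the request and inspect the URL that requests would send. This is only a minimal sketch: the Request(...).prepare() route is used here purely for inspection, and the query values are illustrative.

```python
from requests import Request

# Illustrative query parameters (assumed values, not from a real API)
params = {'name': 'LiYihua', 'age': '21'}

# prepare() builds the final request object without sending anything
prepared = Request('GET', 'http://httpbin.org/get', params=params).prepare()

# The dict has been URL-encoded and appended as the query string
final_url = prepared.url
print(final_url)
```

The printed URL should carry the dict as a query string, which is the same thing requests.get(url, params=...) sends over the wire.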

Example 1:

import requests
r = requests.get('http://httpbin.org/get')
print(r.text)
# Output:
{
   "args": {}, 
   "headers": {
     "Accept": "*/*", 
     "Accept-Encoding": "gzip, deflate", 
     "Host": "httpbin.org", 
     "User-Agent": "python-requests/2.21.0"
   }, 
   "origin": "120.85.108.192, 120.85.108.192", 
   "url": "https://httpbin.org/get"
 }

Example 2:

import requests
data = {
    'name': 'LiYihua',
    'age': '21'
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)
# Output:
{
  "args": {
    "age": "21", 
    "name": "LiYihua"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "origin": "120.85.108.92, 120.85.108.92", 
  "url": "https://httpbin.org/get?name=LiYihua&age=21"
}

Example 3:

import requests
r = requests.get('http://httpbin.org/get')
print(type(r.text), r.json(), type(r.json()), sep='\n\n')
# Output:
<class 'str'>

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}

<class 'dict'>

Example 4:

Downloading an image

import requests
r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)
# After the script finishes, an icon file named favicon.ico is created

POST requests

POST is another common type of HTTP request. Example:

import requests

data = {
    'name': 'LiYihua',
    'age': 21
}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)


# Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "21", 
    "name": "LiYihua"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "19", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0"
  }, 
  "json": null, 
  "origin": "120.85.108.90, 120.85.108.90", 
  "url": "https://httpbin.org/post"
}

# The POST request succeeded; the form field of the result holds the submitted data
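The example above sends form data with data=. requests also accepts json= for a JSON body. The sketch below compares the two by preparing the requests offline instead of sending them; the payload is the illustrative one from this article.

```python
import json

from requests import Request

url = 'http://httpbin.org/post'
payload = {'name': 'LiYihua', 'age': 21}

# data= produces a form-encoded body
form_req = Request('POST', url, data=payload).prepare()
# json= serializes the dict to JSON and sets the matching Content-Type
json_req = Request('POST', url, json=payload).prepare()

form_type = form_req.headers['Content-Type']
form_body = form_req.body
json_type = json_req.headers['Content-Type']
json_body = json.loads(json_req.body)   # decode the body back into a dict

print(form_type, form_body)
print(json_type, json_body)
```

With data= the values arrive in the "form" field of the httpbin result; with json= they would be expected in the "json" field instead.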

Response

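  1. The text and content attributes give the response body (as str and bytes respectively)
  2. The status_code attribute gives the status code
  3. The headers attribute gives the response headers
  4. The cookies attribute gives the Cookies
  5. The url attribute gives the URL
  6. The history attribute gives the request history

Example: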

import requests

r = requests.get('https://www.cnblogs.com/liyihua/')

print(type(r.status_code), r.status_code,
      type(r.headers), r.headers,
      type(r.cookies), r.cookies,
      type(r.url), r.url,
      type(r.history), r.history,
      sep='\n\n')


# Output:
<class 'int'>

200

<class 'requests.structures.CaseInsensitiveDict'>

{'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}

<class 'requests.cookies.RequestsCookieJar'>

<RequestsCookieJar[]>

<class 'str'>

https://www.cnblogs.com/liyihua/

<class 'list'>

[]
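The Response attributes can also be explored without an external site. The sketch below is an assumption-laden demo rather than part of the original example: it starts a throwaway local server (the hypothetical EchoHandler class) from the standard library, requests one page from it, and reads the attributes listed above.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small JSON body."""

    def do_GET(self):
        body = json.dumps({'path': self.path}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port
server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

r = requests.get('http://127.0.0.1:%d/hello' % server.server_port)
server.shutdown()

status = r.status_code                      # int
content_type = r.headers['content-type']    # header lookup is case-insensitive
raw = r.content                             # body as bytes
text = r.text                               # body decoded to str
parsed = r.json()                           # body parsed as JSON
print(status, content_type, raw, parsed, r.history, sep='\n')
```

Note that headers is a case-insensitive dict, so 'content-type' and 'Content-Type' refer to the same entry.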

Advanced usage of requests

  1. File upload

    import requests
    
    # Open the file in binary mode; the with block closes it once the upload is done
    with open('favicon.ico', 'rb') as f:
        r = requests.post('http://httpbin.org/post', files={'file': f})
    print(r.text)
    
    
    # Output:
    {
      "args": {}, 
      "data": "", 
      "files": {
        "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...
      }, 
      "form": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Content-Length": "6665", 
        "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "json": null, 
      "origin": "120.85.108.71, 120.85.108.71", 
      "url": "https://httpbin.org/post"
    }
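The files argument also accepts a tuple of (filename, content, content type), which makes it possible to look at the multipart body without a real file or a network call. In this sketch the bytes in fake_icon are a made-up stand-in, not a real icon.

```python
from requests import Request

# Hypothetical in-memory stand-in for the file contents
fake_icon = b'\x00\x00\x01\x00demo'

# Tuple form: (filename, content, content type)
files = {'file': ('favicon.ico', fake_icon, 'image/x-icon')}

# Build the request without sending it
prepared = Request('POST', 'http://httpbin.org/post', files=files).prepare()

content_type = prepared.headers['Content-Type']
body = prepared.body   # the encoded multipart body, as bytes

print(content_type)
print(len(body), 'bytes')
```

The Content-Type header carries a generated boundary string, which is why it differs on every run in the output shown above.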
    
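  2. Session persistence

    1. A Session object makes it easy to maintain a session. In the first example below, two separate requests.get() calls do not share Cookies; in the second, the Session keeps them.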

      import requests
      
      requests.get('http://httpbin.org/cookies/set/number/123456789')
      r = requests.get('http://httpbin.org/cookies')
      print(r.text)
      
      
      # Output:
      {
        "cookies": {}
      }
      
      
      import requests
      
      s = requests.Session()
      s.get('http://httpbin.org/cookies/set/number/123456789')
      r = s.get('http://httpbin.org/cookies')
      print(r.text)
      
      
      # Output:
      {
        "cookies": {
          "number": "123456789"
        }
      }
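The difference between the two runs above can be reproduced offline by comparing a bare prepared request with one prepared by a Session. This is a sketch under one assumption: the cookie is placed into the session's jar by hand, standing in for the Set-Cookie response that httpbin would send.

```python
import requests
from requests import Request

url = 'http://httpbin.org/cookies'

# A request prepared on its own carries no cookies
bare = Request('GET', url).prepare()

# A session merges its cookie jar into every request it prepares
s = requests.Session()
s.cookies.set('number', '123456789')   # stands in for the server's Set-Cookie
with_session = s.prepare_request(Request('GET', url))

bare_cookie = bare.headers.get('Cookie')
session_cookie = with_session.headers.get('Cookie')
print(bare_cookie, session_cookie)
```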
      
    2. SSL certificate verification

      import requests
      
      r = requests.get('https://www.12306.cn')
      print(r.status_code)
      
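      # Prints 200 if no error occurs
      # If the HTTPS site's certificate fails verification, requests raises an SSLError.
      
      
      # To avoid the error, modify the example slightly: verify=False skips certificate
      # verification and disable_warnings() silences the resulting warning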
      import requests
      from requests.packages import urllib3
      
      urllib3.disable_warnings()
      r = requests.get('https://www.12306.cn', verify=False)
      print(r.status_code)
      
    3. Proxy settings

      import requests
      
      proxies = {
          'http': 'socks5://user:password@10.10.1.10:3128',
          'https': 'socks5://user:password@10.10.1.10:1080'
      }
      
      requests.get('https://www.taobao.com', proxies=proxies)
      
      
      # Uses a SOCKS proxy (SOCKS support needs the extra dependency: pip install requests[socks])
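Besides per-scheme keys, the proxies dict can hold scheme://hostname keys for a single host. The sketch below uses select_proxy, a helper from requests.utils, to show which entry would be picked for a given URL; nothing is sent, and all proxy addresses are placeholders.

```python
from requests.utils import select_proxy

# Placeholder proxy addresses, for illustration only
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
    # A host-specific entry takes priority over the scheme-wide one
    'https://www.taobao.com': 'socks5://user:password@10.10.1.10:1080',
}

taobao = select_proxy('https://www.taobao.com/', proxies)
other_https = select_proxy('https://example.com/', proxies)
plain_http = select_proxy('http://example.com/', proxies)

print(taobao, other_https, plain_http, sep='\n')
```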
      
    4. Timeout settings

      import requests
      
      r = requests.get('https://taobao.com', timeout=(0.1, 1))
      print(r.status_code)
      
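      # Output: 200 (the tuple means a 0.1 s connect timeout and a 1 s read timeout; exceeding either one raises a Timeout exception)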
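When a limit is exceeded, requests raises ConnectTimeout or ReadTimeout, both subclasses of requests.exceptions.Timeout. The sketch below shows one way to handle them. To stay reproducible it talks to a deliberately slow local server (the hypothetical SlowHandler class) instead of a real site; the delay and timeout values are arbitrary choices for the demo.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that waits half a second before answering."""

    def do_GET(self):
        time.sleep(0.5)   # simulate a slow server
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        except OSError:
            pass          # the client may already have given up

    def log_message(self, *args):
        pass              # keep the demo output quiet


def fetch_status(url, timeout):
    """Return the status code, or a short label if a timeout is hit."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.ConnectTimeout:
        return 'connect timeout'
    except requests.exceptions.ReadTimeout:
        return 'read timeout'


server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

too_short = fetch_status(url, timeout=(1, 0.1))   # read limit below the delay
long_enough = fetch_status(url, timeout=(1, 5))   # read limit above the delay
server.shutdown()

print(too_short, long_enough)
```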
      
    5. Authentication

      import requests
      from requests.auth import HTTPBasicAuth
      
      r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))
      print(r.status_code)
      
      
      # Output: 200
      
      
      # OAuth1 authentication is also possible, via the requests_oauthlib package
      import requests
      from requests_oauthlib import OAuth1
      
      url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
      auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
                    'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
      requests.get(url, auth=auth)
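For Basic authentication, a plain (username, password) tuple can be passed as auth instead of an HTTPBasicAuth object. The sketch below prepares such a request offline, reusing the credentials from the example above, and shows the Authorization header it would carry.

```python
import base64

from requests import Request

# A (username, password) tuple is shorthand for HTTPBasicAuth
prepared = Request('GET', 'http://localhost',
                   auth=('liyihua', 'woshiyihua134')).prepare()

auth_header = prepared.headers['Authorization']

# Basic auth is just "username:password" encoded with base64
expected = 'Basic ' + base64.b64encode(b'liyihua:woshiyihua134').decode('ascii')

print(auth_header)
print(auth_header == expected)
```

Since base64 is an encoding and not encryption, Basic authentication is best combined with HTTPS.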
      
    6. Prepared Request

      To get a Prepared Request that carries the session's state, use Session.prepare_request()
      
      from requests import Request, Session
      
      url = 'http://httpbin.org/post'
      data = {
          'name': 'LiYihua'
      }           # form data to submit
      header = {
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36'
      }           # pose as a browser
      s = Session()                       # keeps session state
      req = Request('POST', url, data=data, headers=header)
      
      prepped = s.prepare_request(req)            # prepare_request() turns req into a Prepared Request object
      r = s.send(prepped)                 # send() sends the request
      print(r.text)
      
      
      # Output:
      {
        "args": {}, 
        "data": "", 
        "files": {}, 
        "form": {
          "name": "LiYihua"
        }, 
        "headers": {
          "Accept": "*/*", 
          "Accept-Encoding": "gzip, deflate", 
          "Content-Length": "12", 
          "Content-Type": "application/x-www-form-urlencoded", 
          "Host": "httpbin.org", 
          "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36"
        }, 
        "json": null, 
        "origin": "120.85.108.184, 120.85.108.184", 
        "url": "https://httpbin.org/post"
      }
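What "carrying state" means can be checked offline by comparing Request.prepare() with Session.prepare_request(). In this sketch the X-Demo and X-Trace headers are hypothetical names used only for illustration.

```python
from requests import Request, Session

req = Request('POST', 'http://httpbin.org/post', data={'name': 'LiYihua'})

# prepare() on its own applies no session state
stateless = req.prepare()

s = Session()
s.headers.update({'X-Demo': 'yes'})   # hypothetical session-wide header
# prepare_request() merges the session's headers and cookies into the request
stateful = s.prepare_request(req)

# A prepared request can still be adjusted before s.send(stateful)
stateful.headers['X-Trace'] = 'abc'   # hypothetical per-request header

print(stateless.body, stateless.headers)
print(stateful.method, stateful.url, stateful.headers)
```

The session-prepared request picks up the session defaults, such as the python-requests User-Agent, while the stateless one contains only what was passed to Request.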
      

This article comes from cnblogs (博客园). Author: LeeHua. When reposting, please cite the original link: https://www.cnblogs.com/liyihua/p/11050374.html
