Web Scraping — requests

Naming Conventions

  • module_name, module
  • package_name, package
  • ClassName, class
  • method_name, method
  • ExceptionName, exception
  • function_name, function
  • GLOBAL_VAR_NAME, global variable
  • instance_var_name, instance variable
  • function_parameter_name, function parameter
  • local_var_name, local variable
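The conventions above can be illustrated in one short snippet (all names here are hypothetical, chosen only to mirror each style):

```python
GLOBAL_VAR_NAME = 3  # global variable: UPPER_CASE


class ClassName:  # class: CapWords
    def method_name(self, function_parameter_name):  # method & parameter: lower_case
        local_var_name = function_parameter_name * 2  # local variable: lower_case
        self.instance_var_name = local_var_name       # instance variable: lower_case
        return self.instance_var_name


class ExceptionName(Exception):  # exception: CapWords, like a class
    pass


def function_name(x):  # function: lower_case
    return x + GLOBAL_VAR_NAME
```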

Scraping Images

Simply send a GET request to the image URL and write the response bytes to a file.

import requests

photo_url = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-685513.jpg'
response_get = requests.get(photo_url)
with open('wallhaven-685513.jpg', 'wb') as f:  # 'wb': the image is binary data
    f.write(response_get.content)

Baidu Translate

Baidu's interface expects a fixed form field named kw. Send the request headers and the kw word to the Baidu Translate endpoint with a POST request; the response encoding is utf-8.

import json
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
}

kw = {
    'kw': 'wolf'
}

response_post = requests.post('http://fanyi.baidu.com/sug', headers=headers, data=kw)
response_post.encoding = 'utf-8'
# print(response_post.text)
info = json.loads(response_post.text)
print(info)
# print(info['data'][0]['v'])
for i in info['data'][0]['v'].split('; '):  # translations are '; '-separated
    print(i)

Scraping Behind a Login

To scrape a page that requires login, copy the Set-Cookie / Cookie value from a logged-in session into the request headers. Note that the site may rate-limit your requests.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
    # Alternative mobile User-Agent:
    # 'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36',
    'Cookie': 'session_id_places=True; session_data_places=""'
}

r = requests.get('http://example.webscraping.com', headers=headers)
print(r.text)
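On the rate-limiting point above, one common mitigation is to back off and retry when the server signals throttling. This is only a sketch; the retry counts, delays, and the choice of status codes 429/503 are assumptions, not part of the original post:

```python
import time

import requests


def backoff_delays(max_retries=3, base_delay=1.0):
    """Delay schedule in seconds: 1, 2, 4, ... (doubles on each retry)."""
    return [base_delay * 2 ** i for i in range(max_retries)]


def fetch_with_retry(url, headers=None, max_retries=3, base_delay=1.0):
    """GET a URL, sleeping and retrying when the server throttles us."""
    resp = None
    for delay in backoff_delays(max_retries, base_delay):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code not in (429, 503):  # not throttled: done
            return resp
        time.sleep(delay)                       # throttled: wait, then retry
    return resp
```

Here 429 (Too Many Requests) and 503 are treated as throttling signals; real sites may throttle differently (slow responses, CAPTCHAs), so adjust the check accordingly.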

Proxy Servers

To scrape the Baidu page through a proxy server, pass a proxies dict and the request headers to requests.get. The proxy entry must specify the http scheme and the port number.

import requests

proxies = {'http': 'http://ip:port'}  # scheme and port required, e.g. 'http://10.10.1.10:3128'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
}
r = requests.get('http://www.baidu.com', proxies=proxies, headers=headers)
# print(r.status_code)          # HTTP status code
# print(r.text)                 # response body as text
# print(r.content)              # response body as bytes; use when text has encoding issues
# print(r.headers)              # response headers
# print(r.url)                  # final request URL
# print(r.cookies)              # cookies set by the server
Original source: https://www.cnblogs.com/siplips/p/9682879.html