爬虫基础

环境: python3、windows

模块：requests、BeautifulSoup

安装模块：

pip3 install BeautifulSoup4
pip3 install requests

一、以汽车之家为例子，来一段简单的爬虫代码。

rt requests
from bs4 import BeautifulSoup

# 找到所有新闻
# 标题，简介，url，图片
#get方式向汽车之家新闻页面发送请求，获取返回的页面信息
response = requests.get('http://www.autohome.com.cn/news/')
#get请求默认编码是utf8,而国内网站许多如汽车之家则需改成gbk
response.encoding = 'gbk'

#以python标准库解析html文档
soup = BeautifulSoup(response.text,'html.parser')
#查找id=xx的标签，以此基础查找所有li标签
li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')

#通过f12查看新闻板块下此标签的li包含我们需要的信息。再将每一个需要的标签通过BeautifulSoup方法解析出来。
for li in li_list:
    title = li.find('h3')
    #h3标签中会有None,可能是广告，直接跳过
    if not title:
        continue
    #简介
    summary = li.find('p').text
    
    #详细页url,找到a标签，a标签的所有属性都在attrs的字典里，可以attrs取值，也可以直接get方法取值
    # url = li.find('a').attrs['href']
    url = li.find('a').get('href')

    #同理先拿到图片url，再通过url向服务器发送请求，写入本地
    img_url = li.find('img').get('src')
    img = requests.get(img_url)

    #这里是伪代码，实际运行过程，文章标题会有许多的特殊字符，不可作为图片名称。可用其它名称，
    #或者通过正则替换掉特殊字符。
    file_name = title.text
    with open(file_name+'.jpg','wb') as f:
        f.write(img.content)

二、通过代码进行登录验证：

1.登录github:

首先我们进入github登录页面，输入错误的用户名以及密码，通过f12 NetWork一栏查看htttp请求状态

点击session，在Headers一栏，可以看到接收我们登录信息的URL是哪一个

此时，再查找服务端需要的Data信息，再最下方找到了Form Data

根据这个格式，我们向github服务端发送post请求：

import requests
from bs4 import BeautifulSoup

#获取token
r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text,'html.parser')
#同样是通过f12查看源码搜索token,找到了作为CSRF禁止跨站请求的token的标签，通过解析取得它的值
token = s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')
#有的网站会在第一次get请求时给客户端发送一组cookies,当客户端带着此cookies来进行验证才会通过，所以这里先获取未登录的cookies
r1_cookie_dict= r1.cookies.get_dict()

#将用户名密码token发送到服务端
r2 = requests.post('https://github.com/session',
                   data={
                       'utf8':'✓',
                       'authenticity_token':token,
                       'login':'Mitsui1993',
                       'password':'假装有密码',
                       'commit':'Sign in'
                   },
                   cookies = r1_cookie_dict
                   )

#获取登陆后拿到的cookies,并整合到一个dict里
r2_cookie_dict = r2.cookies.get_dict()
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)

#带着cookies验证是否登录成功，查看登录后可见的页面
r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)

#text里包含我的用户名，由此判定已经登录成功。
print(r3.text)

2.通过requests对抽屉网进行点赞

import requests

#取得未登录第一次get请求的cookies
r1 = requests.get('http://dig.chouti.com')
r1_cookies = r1.cookies.get_dict()

#由于点赞前需要先登录，所以这里跟github一样，我们通过解析http请求知道需要发送的目标url以及所需数据
r2 = requests.post('http://dig.chouti.com/login',
                   data={
                       'phone':'8615xxxxx',
                       'password':'woshiniba',
                       'oneMonth':1
                   },
                   cookies = r1_cookies)

#获取登录后的cookies
r2_cookies = r2.cookies.get_dict()

#整合cookies
r_cookies = {}
r_cookies.update(r1_cookies)
r_cookies.update(r2_cookies)

#真正的点赞功能需要的是第一次get时的cookies里的gpsd，这也是为什么我们主张将登陆前后的cookies合并一起发送的原因，
#这将大大提高我们请求的容错率。
# r_cookies = {'gpsd':r1_cookies['gpsd']}

#点赞格式url格式linksId=后面为文章id
r3 = requests.post('http://dig.chouti.com/link/vote?linksId=13921736',
              cookies = r_cookies)

#获得正确的状态码及返回信息，则正面已经成功。
print(r3.text)

三。requests模块与模块的其它方法：

 1 def request(method, url, **kwargs):
 2     """Constructs and sends a :class:`Request <Request>`.
 3 
 4     :param method: method for the new :class:`Request` object.
 5     :param url: URL for the new :class:`Request` object.
 6     :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
 7     :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
 8     :param json: (optional) json data to send in the body of the :class:`Request`.
 9     :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
10     :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
11     :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
12         ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
13         or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
14         defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
15         to add for the file.
16     :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
17     :param timeout: (optional) How long to wait for the server to send data
18         before giving up, as a float, or a :ref:`(connect timeout, read
19         timeout) <timeouts>` tuple.
20     :type timeout: float or tuple
21     :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
22     :type allow_redirects: bool
23     :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
24     :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
25     :param stream: (optional) if ``False``, the response content will be immediately downloaded.
26     :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
27     :return: :class:`Response <Response>` object
28     :rtype: requests.Response
29 
30     Usage::
31 
32       >>> import requests
33       >>> req = requests.request('GET', 'http://httpbin.org/get')
34       <Response [200]>
35     """
36 复制代码

Requests