004 Python Web Crawling and Information Extraction: Requests Library Crawling in Practice

[A] Crawling a JD.com product page

  Example code:

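import requests

url = 'https://item.jd.com/70076567438.html'
try:
    r = requests.get(url)
    r.raise_for_status()    # raises HTTPError for a 4xx or 5xx status code
    r.encoding = r.apparent_encoding    # use the encoding guessed from the page content
    text = r.text
    print(text)
except requests.RequestException:
    print('Crawl failed')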
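  The try/except pattern used in this example can be wrapped in a reusable function. The following is only a minimal sketch: the name get_html_text, the 10-second timeout, and returning None on failure are illustrative choices, not part of the original example.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    The function name, the timeout value, and the None return value are
    assumptions made for this sketch.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()               # 4xx/5xx responses raise HTTPError
        r.encoding = r.apparent_encoding   # encoding guessed from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, malformed URLs, and HTTP errors
        return None


# Example call (performs a real network request):
# text = get_html_text('https://item.jd.com/70076567438.html')
```

  Catching requests.RequestException rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being silently swallowed.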

[B] Crawling an Amazon product page

  Example code:

import requests

r = requests.get('https://www.amazon.cn/dp/B0785D5L1H/ref=sr_1_1?__mk_zh_CN=亚马逊网站'
                 '&dchild=1&keywords=极简&qid=1605500387&sr=8-1')
print(r.status_code)    # returns 503, which means this crawl failed
print(r.request.headers)    # the headers contain 'User-Agent': 'python-requests/2.24.0'

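  Analysis:

    1. The returned status code (503) tells us that this crawl failed.

    2. Inspecting the request headers of this HTTP request (r.request.headers) shows that the User-Agent value is python-requests/2.24.0.

      In other words, our crawler honestly told the server: "I am a crawler program."

    3. Changing the User-Agent value is enough to get past the server's check and retrieve the resource.

      Example code: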

import requests

kv = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.amazon.cn/dp/B0785D5L1H/ref=sr_1_1?__mk_zh_CN=亚马逊网站'
                 '&dchild=1&keywords=极简&qid=1605500387&sr=8-1', headers=kv)
print(r.status_code)    # returns 200, which means this crawl succeeded
print(r.request.headers)    # the headers contain 'User-Agent': 'Mozilla/5.0'
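  The effect of the headers argument can also be checked without contacting any server. The sketch below prepares a request through a Session and reads back the User-Agent that would be sent; the helper name effective_user_agent is made up for this illustration, and the URL is only a placeholder because nothing is transmitted.

```python
import requests


def effective_user_agent(headers=None):
    """Return the User-Agent requests would send, without any network I/O."""
    session = requests.Session()
    # prepare_request merges the session's default headers with the
    # per-request headers, exactly as a real requests.get() call would
    prepared = session.prepare_request(
        requests.Request('GET', 'https://www.amazon.cn/', headers=headers)
    )
    return prepared.headers['User-Agent']


default_ua = effective_user_agent()
custom_ua = effective_user_agent({'User-Agent': 'Mozilla/5.0'})
print(default_ua)   # begins with 'python-requests/' followed by the installed version
print(custom_ua)    # Mozilla/5.0
```

  The version number after python-requests/ depends on the installed library, so it may differ from the 2.24.0 shown in the example.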

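[C] Submitting search keywords to Baidu and 360

    Search engines generally expose an interface for submitting keywords:

      1. Baidu keyword search interface: http://www.baidu.com/s?wd=keyword

      2. 360 keyword search interface: http://www.so.com/s?q=keyword

    Example code: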

import requests

keyword = 'Python'
url = 'http://www.baidu.com/s'
try:
    kv = {'wd': keyword}
    r = requests.get(url, params=kv)    # params are appended to the URL as a query string
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print('Sorry, the crawl failed')
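    To see how the params argument turns a keyword into a search URL, the request can be prepared without being sent. This is a minimal sketch: build_search_url is a hypothetical helper, and it assumes that 360 Search takes its keyword in the q query parameter.

```python
import requests


def build_search_url(base_url, param_name, keyword):
    """Return the URL requests would fetch, without sending anything."""
    prepared = requests.Request(
        'GET', base_url, params={param_name: keyword}
    ).prepare()
    return prepared.url


baidu_url = build_search_url('http://www.baidu.com/s', 'wd', 'Python')
so_url = build_search_url('http://www.so.com/s', 'q', 'Python')
print(baidu_url)   # http://www.baidu.com/s?wd=Python
print(so_url)      # http://www.so.com/s?q=Python
```

    Non-ASCII keywords are percent-encoded automatically, which is one reason to prefer params over concatenating the query string by hand.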

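[D] Crawling and saving images from the web

    1. Recognizing image links

        Image links on the web generally look like: http://www.example.com/picture.jpg

          That is, a link ending in xxx.jpg, xxx.png, and so on is an image.

    2. Given the URL of an image, we can crawl it and save it to disk

      Example code: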

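import requests

path = 'C:/Users/Carrey/Desktop/abc.jpg'
# Adjacent string literals inside parentheses are joined into a single URL
url = ('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'
       '&sec=1605513179028&di=9221230b2ef023a1e92f50105c1afad8&imgtype=0'
       '&src=http%3A%2F%2Fpic1.win4000.com%2Fmobile%2Ff%2F53b4c394c966a.jpg')
r = requests.get(url)
print(r.status_code)
# r.content holds the raw response bytes, so the file is opened in binary mode
with open(path, 'wb') as f:
    f.write(r.content)
# No f.close() is needed: the with block closes the file on exit
print('Image crawled and saved')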
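    The recognition rule from point 1 can be written as a one-line check. This is a rough sketch: the helper name is_image_url and the list of extensions beyond .jpg and .png are assumptions added for illustration.

```python
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')  # assumed list, extend as needed


def is_image_url(url):
    """Heuristic from the text: a link that ends in an image extension."""
    # str.endswith accepts a tuple and returns True if any suffix matches
    return url.lower().endswith(IMAGE_EXTENSIONS)


print(is_image_url('http://www.example.com/picture.jpg'))   # True
print(is_image_url('http://www.baidu.com/s?wd=Python'))     # False
```

    This is only a heuristic. A server may serve an image from a URL with no extension at all, in which case the Content-Type response header is the more reliable signal.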

      Full code for image crawling:

# Full code for image crawling
import os

import requests

url = ('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'
       '&sec=1605513179028&di=9221230b2ef023a1e92f50105c1afad8&imgtype=0'
       '&src=http%3A%2F%2Fpic1.win4000.com%2Fmobile%2Ff%2F53b4c394c966a.jpg')
root = 'C:/Users/Carrey/Desktop/'
path = root + url[-10:]    # use the last 10 characters of the URL as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()    # do not save an error page as an image
        with open(path, 'wb') as f:
            f.write(r.content)
        print('Congratulations, master, the file was saved!')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print('So sorry, failed to crawl the file')
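    Slicing a fixed number of characters from the URL works for this one link but cuts the name at an arbitrary point. One alternative, sketched below, is to percent-decode the URL and keep whatever follows the last slash. The helper name filename_from_url is hypothetical, and the approach assumes the file name is the final slash-separated segment of the decoded URL, which is true for the URL used here but not for links that carry extra query parameters after the name.

```python
from urllib.parse import unquote


def filename_from_url(url):
    """Return the text after the last '/' in the percent-decoded URL."""
    decoded = unquote(url)              # turns %2F back into '/'
    return decoded.rsplit('/', 1)[-1]


url = ('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'
       '&sec=1605513179028&di=9221230b2ef023a1e92f50105c1afad8&imgtype=0'
       '&src=http%3A%2F%2Fpic1.win4000.com%2Fmobile%2Ff%2F53b4c394c966a.jpg')
print(filename_from_url(url))   # 53b4c394c966a.jpg
```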

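[E] Automatically looking up the location of an IP address

    We can look up the location of an IP address by hand on a web page, or have a crawler fetch it automatically.

    Example code: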

import requests

IP = '202.204.80.112'
url = 'http://www.cip.cc/' + IP
kv = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=kv, timeout=3)
print(r.status_code)
print(r.request.headers)
print(r.text[-500:])

    Full code using the crawler framework:

# Full code for the IP address lookup
import requests

IP = '202.204.80.112'
url = 'http://www.cip.cc/' + IP    # the IP to look up is appended to the base URL
try:
    kv = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv, timeout=3)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-300:])
except requests.RequestException:
    print('So sorry, your humble crawler failed')
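    Because the lookup URL is built by string concatenation, a malformed address would still produce a request. The sketch below validates the address with the standard library first; the helper name build_lookup_url and the decision to raise ValueError are illustrative assumptions.

```python
import ipaddress

BASE_URL = 'http://www.cip.cc/'


def build_lookup_url(ip):
    """Return the lookup URL for ip, raising ValueError if ip is malformed."""
    ipaddress.ip_address(ip)    # raises ValueError for strings like '999.1.1.1'
    return BASE_URL + ip


print(build_lookup_url('202.204.80.112'))   # http://www.cip.cc/202.204.80.112
```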

  [Tip]: To discover a website's API, watch how the URL changes as you use the site.
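  One way to apply this tip is to copy the URL before and after an action on the site and compare the query parameters. The following is a small sketch using only the standard library; the helper name changed_query_params and the ie=utf-8 parameter in the sample URLs are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def changed_query_params(url_a, url_b):
    """Return {name: (values_in_a, values_in_b)} for parameters that differ."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    names = sorted(set(params_a) | set(params_b))
    return {
        name: (params_a.get(name), params_b.get(name))
        for name in names
        if params_a.get(name) != params_b.get(name)
    }


diff = changed_query_params(
    'http://www.baidu.com/s?wd=Python&ie=utf-8',
    'http://www.baidu.com/s?wd=Java&ie=utf-8',
)
print(diff)   # {'wd': (['Python'], ['Java'])}
```

  The parameter that changes along with the search term (wd here) is the one to pass through params when crawling.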

Original post: https://www.cnblogs.com/carreyBlog/p/13984475.html