day12-20180426 notes

Notes: the Python modules hashlib, io, json, and requests

I. The hashlib module (message digests)

The following applies to Python 2 (in Python 3, update() only accepts bytes, so a str must be encoded first):


import hashlib
# m = hashlib.md5()
# src = "123456"
# m.update(src)
# print(m.hexdigest())

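Introduction to digest algorithms

Python's hashlib provides common digest algorithms such as MD5 and SHA1.

What is a digest algorithm? Also called a hash algorithm, it uses a function to convert data of any length into a fixed-length string (usually written as a hexadecimal string).

For example, suppose you wrote an article whose content is the string 'how to use python hashlib - by Michael', and you published it together with its digest '2d73d4f15c0db7f5ecb321b6a65e5d6d'. If someone tampers with the article and republishes it as 'how to use python hashlib - by Bob', you can point out the tampering at once, because the digest computed from 'how to use python hashlib - by Bob' differs from the digest of the original article.

In short, a digest algorithm uses a digest function f() to compute a fixed-length digest from data of any length, so that any tampering with the original data can be detected.

A digest algorithm can reveal tampering because the digest function is one-way: computing f(data) is easy, but recovering data from the digest is very hard. Moreover, changing even a single bit of the original data produces a completely different digest.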
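To make the tamper-detection idea concrete, here is a minimal sketch that hashes the two article strings from the example and compares the results. The helper name md5_hex is hypothetical, introduced only for illustration.

```python
import hashlib

original = 'how to use python hashlib - by Michael'
tampered = 'how to use python hashlib - by Bob'

def md5_hex(text):
    """Return the MD5 digest of a str as a 32-character hex string."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()

digest_original = md5_hex(original)
digest_tampered = md5_hex(tampered)

# Count how many of the 32 hex positions differ between the two digests
differing = sum(a != b for a, b in zip(digest_original, digest_tampered))

print(digest_original)
print(digest_tampered)
print('digests match:', digest_original == digest_tampered)
print('hex positions that differ:', differing)
```

Only the author name changed, yet the two digests have nothing visibly in common, which is what makes the comparison useful for spotting edits.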

Taking the common digest algorithm MD5 as an example, we compute the MD5 value of a string:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/4/27 14:30
# @Author : yangyuanqiang
# @File : demon1.py


import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

Output of the example above:

d26a53750bc40b38b65a520292f69306

Online MD5 reverse lookup: http://www.cmd5.com/

If the data is large, you can call update() several times, one chunk at a time; the final result is the same:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/4/27 14:30
# @Author : yangyuanqiang
# @File : demon1.py


import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in '.encode('utf-8'))
md5.update('python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

Output of the example above:

d26a53750bc40b38b65a520292f69306
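The same chunked pattern extends naturally to data read from a file. The sketch below is one possible way to do it; the helper md5_of_stream and the chunk sizes are assumptions for illustration, and a BytesIO object stands in for a real file opened in 'rb' mode.

```python
import hashlib
from io import BytesIO

def md5_of_stream(stream, chunk_size=8192):
    """Feed a binary file-like object to MD5 in fixed-size chunks."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # b'' signals the end of the stream
            break
        md5.update(chunk)
    return md5.hexdigest()

data = 'how to use md5 in python hashlib?'.encode('utf-8')

# A deliberately tiny chunk size forces many update() calls
chunked = md5_of_stream(BytesIO(data), chunk_size=4)
one_shot = hashlib.md5(data).hexdigest()

print(chunked)
print(chunked == one_shot)
```

Because only one chunk is held in memory at a time, this approach works for inputs far larger than available RAM.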

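MD5 is one of the most common digest algorithms. It is fast and always produces a fixed 128-bit (16-byte) result, usually written as a 32-character hexadecimal string.

Digest algorithms are widely used. Note that a digest algorithm is not an encryption algorithm and cannot be used for encryption (the plaintext cannot be recovered from the digest); it can only be used to detect tampering. Its one-way nature does, however, make it possible to verify a user's password without storing the password in plaintext.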
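As a rough sketch of password verification without storing the plaintext: the notes only describe the general idea, so the details here are assumptions. Instead of a bare MD5 digest, this example uses hashlib.pbkdf2_hmac with a random salt, a technique swapped in because fast unsalted digests are easy to look up. The function names hash_password and verify_password and the iteration count are hypothetical.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); only these are stored, never the plaintext."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each password
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'), salt, iterations)
    return salt, digest

def verify_password(password, salt, expected_digest, iterations=100_000):
    """Recompute the digest from the candidate password and compare."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password('123456')
print(verify_password('123456', salt, stored))
print(verify_password('654321', salt, stored))
```

At login time the server recomputes the digest from whatever the user typed and compares it with the stored one, so the original password never needs to be kept.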

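II. The io module

StringIO

Data is not always read from or written to a file; it can also be read and written in memory.

StringIO, as the name suggests, reads and writes str in memory.

To write str to a StringIO, first create a StringIO object and then write to it as you would to a file: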

>>> from io import StringIO
>>> f = StringIO()
>>> f.write('hello')
5
>>> f.write(' ')
1
>>> f.write('world!')
6
>>> print(f.getvalue())
hello world!

getvalue() returns the str that has been written so far.

To read from a StringIO, initialize it with a str and then read it as you would read a file:

from io import StringIO

f = StringIO("Hello!\nHi!\nGoodbye!")
while True:
    s = f.readline()
    if s == '':
        break
    print(s.strip())

Output of the example above:

Hello!
Hi!
Goodbye!

from io import StringIO

stringIO = StringIO()
stringIO.write("hello world!")
stringIO.write("lalalalla, wo shi mai bao de xiao hang jia")
print(stringIO.getvalue())
stringIO.truncate(0)
print(stringIO.getvalue())

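Output of the example above (the second print() produces an empty line because truncate(0) cleared the buffer):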

hello world!lalalalla, wo shi mai bao de xiao hang jia
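One detail worth knowing when reusing a buffer: the io documentation states that truncate() resizes the stream but leaves the current position unchanged, so a write after truncate(0) alone would not start at offset 0. A minimal sketch of clearing a StringIO and writing again, assuming the old content should be discarded, rewinds first:

```python
from io import StringIO

buf = StringIO()
buf.write('first message')

# truncate() does not move the stream position, so rewind first
buf.seek(0)
buf.truncate()   # cut the buffer at the current position, i.e. offset 0

buf.write('second')
print(buf.getvalue())
```

If the buffer is not needed afterwards, simply creating a new StringIO() is an equally reasonable choice.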

BytesIO

StringIO can only handle str. To work with binary data, use BytesIO.

BytesIO reads and writes bytes in memory. We create a BytesIO and write some bytes to it:

from io import BytesIO

f = BytesIO()
print(f.write('中文'.encode('utf-8')))
print(f.getvalue())

Output of the example above:

6
b'\xe4\xb8\xad\xe6\x96\x87'

Note that what is written is not str but bytes encoded as UTF-8.

Like StringIO, a BytesIO can be initialized with bytes and then read like a file:

from io import BytesIO

f = BytesIO(b'\xe4\xb8\xad\xe6\x96\x87')
print(f.read())

Output of the example above:

b'\xe4\xb8\xad\xe6\x96\x87'

StringIO and BytesIO operate on str and bytes in memory and expose the same interface as reading and writing files.
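The practical benefit of the shared interface is that code written for files also works on in-memory objects. In the sketch below, count_lines is a hypothetical helper that only assumes its argument can be iterated line by line, as a file can.

```python
from io import BytesIO, StringIO

def count_lines(stream):
    """Count lines in any object that can be iterated like a file."""
    return sum(1 for _ in stream)

text_stream = StringIO('Hello!\nHi!\nGoodbye!')
byte_stream = BytesIO('中文\n中文\n'.encode('utf-8'))

print(count_lines(text_stream))   # text stream: lines are str
print(count_lines(byte_stream))   # binary stream: lines are bytes
```

The same function would accept an object returned by open(), which makes StringIO and BytesIO convenient stand-ins for real files in tests.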

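III. The json module

JSON (JavaScript Object Notation) is a lightweight data-interchange format based on a subset of ECMAScript.

In Python 3, the json module encodes and decodes JSON data. Its two core functions are:

json.dumps(): encodes a Python object into a JSON string.
json.loads(): decodes a JSON string into a Python object.

During encoding and decoding, Python's native types and JSON types are converted to each other as follows:

Python-to-JSON conversion table: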

Python	JSON
dict	object
list, tuple	array
str	string
int, float, int- & float-derived Enums	number
True	true
False	false
None	null

JSON-to-Python conversion table:

JSON	Python
object	dict
array	list
string	str
number (int)	int
number (real)	float
true	True
false	False
null	None
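The two tables are not exact mirrors of each other: a tuple is encoded as an array, but an array is always decoded as a list. A small round trip makes this visible; the sample dictionary is made up for illustration.

```python
import json

original = {
    'point': (1, 2),     # tuple -> JSON array
    'ratio': 0.5,        # float -> JSON number (real)
    'count': 3,          # int   -> JSON number (int)
    'active': True,      # True  -> true
    'missing': None,     # None  -> null
}

encoded = json.dumps(original)
decoded = json.loads(encoded)

print(encoded)
print(decoded)
print(type(decoded['point']).__name__)
```

After the round trip, decoded['point'] is a list, so the decoded dictionary is no longer equal to the original one even though every other value survives unchanged.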

json.dumps and json.loads examples

The following example converts a Python data structure to JSON:

#!/usr/bin/env python

import json

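# Convert a Python dict to a JSON object
data = {
    'no' : 1,
    'name' : 'Runoob',
    'url' : 'http://www.runoob.com'
}

json_str = json.dumps(data)
print("Python original data:", repr(data))
print("JSON object:", json_str)

Output of the example above (key order depends on the Python version; 3.7 and later keep insertion order):

Python original data: {'url': 'http://www.runoob.com', 'no': 1, 'name': 'Runoob'}
JSON object: {"url": "http://www.runoob.com", "no": 1, "name": "Runoob"}

As the output shows, simple types look very similar after encoding to their original repr() output.

Continuing from the example above, we can convert a JSON-encoded string back into a Python data structure: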

#!/usr/bin/env python

import json

# Convert a Python dict to a JSON object
data1 = {
    'no' : 1,
    'name' : 'Runoob',
    'url' : 'http://www.runoob.com'
}

json_str = json.dumps(data1)
print("Python original data:", repr(data1))
print("JSON object:", json_str)

# Convert the JSON object back to a Python dict
data2 = json.loads(json_str)
print("data2['name']: ", data2['name'])
print("data2['url']: ", data2['url'])

Output of the example above (key order depends on the Python version):

Python original data: {'name': 'Runoob', 'no': 1, 'url': 'http://www.runoob.com'}
JSON object: {"name": "Runoob", "no": 1, "url": "http://www.runoob.com"}
data2['name']:  Runoob
data2['url']:  http://www.runoob.com

If you are working with files rather than strings, use json.dump() and json.load() to encode and decode JSON data. For example:

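import json

data = {'no': 1, 'name': 'Runoob', 'url': 'http://www.runoob.com'}

# Write JSON data to a file
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)

# Read the data back from the file
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)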

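IV. The requests module

requests is a widely used HTTP library written in Python. It makes fetching web pages easy and is a good HTTP library for learning to write Python crawlers.

The seven main functions of the requests library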

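Method	Description
requests.request()	Builds and sends a request; the functions below are built on it
requests.get()	Fetches a resource such as an HTML page (HTTP GET)
requests.head()	Fetches only the response headers (HTTP HEAD)
requests.post()	Submits a POST request to a URL
requests.put()	Submits a PUT request to a URL
requests.patch()	Submits a partial-modification (PATCH) request to a URL
requests.delete()	Submits a DELETE request to a URL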

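The response object r has the following attributes:

Attribute	Description
r.status_code	HTTP status of the response; 200 means the request succeeded.
r.text	Response body as a string, i.e. the returned page content
r.encoding	Encoding of the response body, guessed from the HTTP headers
r.apparent_encoding	Encoding of the response body inferred from the content itself (fallback encoding)
r.content	Response body in binary form (bytes)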

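Exceptions in the requests library
requests can raise exceptions such as network connection errors, HTTP error exceptions, redirect exceptions, and URL timeout exceptions. A request that completes can still fail at the HTTP level, so r.status_code also has to be checked. How do we handle both cases in one place?

We can call r.raise_for_status(), which checks r.status_code internally and raises requests.HTTPError when the status is an error code (4xx or 5xx).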
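To check which status codes actually make raise_for_status() raise, the sketch below builds Response objects by hand instead of contacting a server. Constructing requests.Response() directly and setting status_code is done here only for demonstration; normal code receives the object from requests.get() and similar calls. The helper raises_for is hypothetical.

```python
import requests

def raises_for(status_code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()      # built by hand, no network access involved
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 301, 404, 500):
    print(code, raises_for(code))
```

In this experiment only the 4xx and 5xx codes raise; a 3xx status passes silently, so a separate check on r.status_code is still needed if exactly 200 is required.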

This gives us a general-purpose framework for fetching a web page:

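import requests

def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)  # give up after 30 seconds
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"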

requests.head()

>>> r = requests.head("http://httpbin.org/get")
>>> r.headers
{'Connection': 'keep-alive', 'Server': 'meinheld/0.6.1', 'Date': 'Mon, 20 Nov 2017 08:08:46 GMT', 'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'X-Powered-By': 'Flask', 'X-Processed-Time': '0.000658988952637', 'Content-Length': '268', 'Via': '1.1 vegur'}
>>> r.text
''

requests.post()

1. POST a dict to the URL (it is automatically encoded as form data):

>>> payload={"key1":"value1","key2":"value2"}
>>> r=requests.post("http://httpbin.org/post",data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "218.197.153.150", 
  "url": "http://httpbin.org/post"
}

2. POST a string to the URL (it is sent as the raw request body and shows up under "data"):

>>> r = requests.post("http://httpbin.org/post", data='helloworld')
>>> print(r.text)
{
  "args": {}, 
  "data": "helloworld", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "10", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "218.197.153.150", 
  "url": "http://httpbin.org/post"
}

requests.put()

>>> payload={"key1":"value1","key2":"value2"}
>>> r=requests.put("http://httpbin.org/put",data=payload)
>>> print(r.text)
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "218.197.153.150", 
  "url": "http://httpbin.org/put"
}

requests.patch()

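requests.patch() is similar to requests.put().
The difference is this:
with PATCH, you only submit the fields that need to change;
with PUT, you must submit every field of the resource (say it has 20 fields) to the URL, and any field you leave out is deleted.
The advantage of PATCH is that it saves network bandwidth.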
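The difference can be modelled with plain dictionaries. This is only a sketch of how a server might apply the two kinds of request, not requests code; the functions apply_put and apply_patch and the sample user record are hypothetical.

```python
def apply_put(resource, submitted):
    """PUT semantics: the submitted fields replace the whole resource."""
    return dict(submitted)

def apply_patch(resource, submitted):
    """PATCH semantics: only the submitted fields are changed."""
    updated = dict(resource)
    updated.update(submitted)
    return updated

user = {'name': 'alice', 'email': 'alice@example.com', 'city': 'Wuhan'}
change = {'name': 'bob'}

print(apply_put(user, change))     # email and city are gone
print(apply_patch(user, change))   # email and city are kept
```

With PUT the client would have to resend email and city to keep them, whereas PATCH transmits only the one changed field.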

requests.request()

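requests.request() is the general-purpose function that supports all of the methods above.
requests.request(method, url, **kwargs)

method: "GET", "HEAD", "POST", "PUT", "PATCH", and so on
url: the URL to request
**kwargs: parameters that control the request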

Examples of using the requests module

1. Fetching JD.com product information

The page can be fetched without changing any request headers:

import requests

url = 'http://item.jd.com/2967929.html'
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # first 1000 characters only
except requests.RequestException:
    print("Failed")

2. Fetching Amazon product information

This site restricts crawlers, so we need to disguise the request as one sent by a browser.

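import requests

url = 'http://www.amazon.cn/gp/product/B01M8L5Z3Y'
try:
    kv = {'User-Agent': 'Mozilla/5.0'}  # the header name is User-Agent, with a hyphen
    r = requests.get(url, headers=kv)  # override the default request headers
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])  # a 1000-character slice of the page
except requests.RequestException:
    print("Failed")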

3. Submitting a search keyword to Baidu

Baidu's keyword interface:
https://www.baidu.com/s?wd=keyword

import requests

keyword = 'python'
try:
    kv = {'wd': keyword}
    r = requests.get('https://www.baidu.com/s', params=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(len(r.text))
except requests.RequestException:
    print("Failed")
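To see how the params dictionary ends up in the URL without sending anything over the network, a request can be prepared and inspected. This is a small sketch using requests.Request and its prepare() method; the variable names are arbitrary.

```python
import requests

# Build the request object but do not send it
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': 'python'})
prepared = req.prepare()

# The params dict has been encoded into the query string
print(prepared.url)
```

The printed URL has the same shape as the keyword interface shown above, with the keyword placed after wd=.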

4. Downloading an image from the web

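import os
import requests

url = "http://baishi.baidu.com/watch/02167966440907275567.html"  # image URL
root = "E:/pic/"
path = root + url.split("/")[-1]  # use the last URL segment as the file name
try:
    if not os.path.exists(root):  # create the directory if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):  # download only if the file does not exist yet
        r = requests.get(url)
        with open(path, "wb") as f:  # write the raw bytes of the response
            f.write(r.content)
        print("File downloaded successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Download failed")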

Summary:

Learn how to use each module's methods, practice regularly, and train your logical thinking.