Downloading and Scraping Web Page Content

Three common ways to download a file

import urllib.request
import requests

# note: urllib2 exists only in Python 2; in Python 3 it was merged into urllib.request

Method 1

print('downloading with urllib')

url = ""

# urllib.urlretrieve(url, "lanzous") fails in Python 3 with
# "module 'urllib' has no attribute 'urlretrieve'": the function moved to
# urllib.request, so import urllib.request and call it from there
urllib.request.urlretrieve(url, "lanzous")
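A small runnable sketch of the corrected call: `urlretrieve` also accepts `data:` URLs, which makes it easy to try without a real server (the file name `out.txt` is just for illustration, not from the original post).

```python
import urllib.request

# urlretrieve fetches a URL straight into a local file and returns
# (filename, headers); a data: URL lets us exercise it offline
filename, headers = urllib.request.urlretrieve("data:text/plain,hello", "out.txt")

with open(filename, "rb") as f:
    print(f.read())  # b'hello'
```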

Method 2 (the usual approach in Python)

r = requests.get(url)
with open("lanzo" + ".pdf", "wb") as f:
    f.write(r.content)  # r.content is bytes, so the file must be opened in
                        # binary mode ("wb"); opening it with "w" raises
                        # "write() argument must be str, not bytes"
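The bytes-vs-str error above comes from the file mode, not from the download itself. A stdlib-only sketch of the same pattern, verifiable offline with a `data:` URL (the helper name `download_to_file` is illustrative, not from the original):

```python
import urllib.request

def download_to_file(url, path):
    # the response body is bytes, so the output file must be opened in
    # binary mode ("wb"); text mode ("w") would raise
    # "write() argument must be str, not bytes"
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

print(download_to_file("data:application/pdf,dummy", "lanzo.pdf"))  # 5
```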

Method 3

print('downloading with urllib.request')

url = ''

# urllib2 exists only in Python 2; in Python 3 its functionality lives in
# urllib.request, so import urllib.request and use urlopen from there
f = urllib.request.urlopen(url)
data = f.read()
with open("demo2.zip", "wb") as code:
    code.write(data)

Scraping web page content

Method 1

import urllib.request

url = "http://www.xxx.com"

html = urllib.request.urlopen(url).read() # the most basic fetch; returns raw bytes

print(html)
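Since `read()` returns raw bytes, printing them shows a `b'...'` literal; to get text you decode with the charset the server declares. A sketch of that step, using a `data:` URL so it runs offline:

```python
import urllib.request

# urlopen(...).read() returns bytes; the charset declared by the server
# is available on the response headers and tells us how to decode them
with urllib.request.urlopen(
        "data:text/html;charset=utf-8,%3Chtml%3Eok%3C%2Fhtml%3E") as resp:
    charset = resp.headers.get_content_charset() or "utf-8"
    html = resp.read().decode(charset)

print(html)  # <html>ok</html>
```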

Method 2
import requests
url = "http://www.xxx.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip',
           'Connection': 'close',
           'Referer': None  # if the site still blocks the fetch, set this to the target site's host
           }
html = requests.get(url, headers=headers)
html.encoding = html.apparent_encoding  # guess the real encoding so html.text decodes correctly (the original note says this method failed to run)
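When requests is unavailable, the same browser-imitating headers can be attached with the standard library by wrapping the URL in a `Request` object. A sketch; the `data:` URL only makes it runnable offline, and against a real site you would substitute the actual address:

```python
import urllib.request

# custom headers (notably User-Agent) are what keep many sites from
# rejecting the fetch; Request carries them for urlopen
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 "
                  "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "text/html;q=0.9,*/*;q=0.8",
}
req = urllib.request.Request("data:text/plain;charset=utf-8,ok", headers=headers)
with urllib.request.urlopen(req) as resp:
    text = resp.read().decode(resp.headers.get_content_charset() or "utf-8")

print(text)  # ok
```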

Work hard; don't be afraid, don't over-plan, don't drift. Just keep walking the road, even when you seem to be standing still.
Original post: https://www.cnblogs.com/wkhzwmr/p/13962948.html