Web Crawlers

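The five steps of a crawler

1. Requirements analysis (done by a person: us)

2. Find the URL (done by a person: us)

3. Download the content the website returns (done by the program, with requests)

4. Locate the data to scrape in the returned content (done by the program, with regular expressions via the re package, or XPath via the lxml package)

5. Store the extracted data (done by the program, in MySQL)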
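Steps 4 and 5 can be sketched roughly as follows. This is only an illustration under several assumptions: the HTML string is made up and stands in for the content downloaded in step 3, the regular expression only fits that made-up markup, and sqlite3 (which ships with Python) is swapped in for MySQL so the sketch runs without a database server.

```python
import re
import sqlite3

# Hypothetical HTML standing in for the content downloaded in step 3.
html = '''
<ul>
  <li><a href="/news/1">First headline</a></li>
  <li><a href="/news/2">Second headline</a></li>
</ul>
'''

# Step 4: extract (href, title) pairs with a regular expression.
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
rows = pattern.findall(html)

# Step 5: store the extracted rows.
# An in-memory sqlite3 database is used here in place of MySQL.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (href TEXT, title TEXT)')
conn.executemany('INSERT INTO links VALUES (?, ?)', rows)
conn.commit()

# Read the rows back to confirm they were stored.
stored = conn.execute('SELECT href, title FROM links ORDER BY href').fetchall()
print(stored)
```

For real pages a regular expression like this tends to be fragile, which is one reason the notes also mention XPath with lxml.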

requests is a Python package. It fetches pages much as a browser does: give it a URL as input and it returns the page's HTML.

Crawling with the Scrapy framework

import requests

url = 'https://www.baidu.com'

response = requests.get(url)

The page is fetched by its URL, so the argument passed in is url; the returned response contains the HTML.

print(response.text)

When a URL is requested and the server returns the HTML normally, the response's status code (response.status_code) is 200.
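A minimal sketch of checking the status code before using the HTML. The helper name html_if_ok is hypothetical, and the two SimpleNamespace objects are stand-ins for real responses so that the sketch runs without network access.

```python
from types import SimpleNamespace


def html_if_ok(response):
    """Return response.text when the status code is 200, otherwise None."""
    if response.status_code == 200:
        return response.text
    return None


# Stand-ins for real responses (assumption: only status_code and text are needed).
ok = SimpleNamespace(status_code=200, text='<html>hello</html>')
missing = SimpleNamespace(status_code=404, text='Not Found')

print(html_if_ok(ok))
print(html_if_ok(missing))

# Typical use with a real request (needs network access and the requests package):
#   import requests
#   response = requests.get('https://www.baidu.com', timeout=10)
#   html = html_if_ok(response)
```

requests also offers response.raise_for_status(), which raises an exception on 4xx and 5xx codes instead of returning None; which style fits better depends on how the crawler should handle failures.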

with open('html', 'wb') as f:
    f.write(response.content)

# response.text is a str

# response.content is bytes

# response.text and response.content can be converted into each other

# for a UTF-8 page, response.text corresponds to response.content.decode('utf-8')
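The conversion between the two types can be tried without making a request. In this sketch the bytes are made up and only stand in for response.content; the last lines show what happens when the wrong encoding is assumed.

```python
# Bytes standing in for response.content (an assumption; no request is made).
content = '<title>百度一下</title>'.encode('utf-8')

text = content.decode('utf-8')      # bytes -> str, like response.text
round_trip = text.encode('utf-8')   # str -> bytes, like response.content

print(type(content).__name__, type(text).__name__)
print(round_trip == content)

# Decoding with the wrong encoding raises no error here,
# but the Chinese characters come out garbled.
garbled = content.decode('latin-1')
print(garbled == text)
```

In requests, response.text is decoded using response.encoding, so if the printed text looks garbled, setting response.encoding = 'utf-8' before reading response.text is a common fix.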

Original article: https://www.cnblogs.com/simpledu/p/14370052.html