Penetration Testing Toolkit - Kali Tools (Chapter 4-5): Getting Started with Web Crawlers

In this article:

Exchange mechanism
Web page parsing
Modules and libraries a crawler needs
Directory scanner: principle and practice

Getting started with Python crawlers [spider]

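1, Exchange mechanism:

    How the server and the local machine exchange data:

        HTTP protocol: the way a client and a server hold a conversation.

        Client -------[request]-------> Server

        Client <------[response]------- Server

    HTTP request:

        The client asks the server for a resource by sending a request. HTTP defines many request methods; the two used most often are [GET, POST].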
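
As a small illustration of GET versus POST, the sketch below builds one request of each kind with requests but does not send them, so it runs offline. The example.com URLs and the field names are placeholders assumed for illustration only.

```python
import requests

# Placeholder URLs and fields, assumed for illustration only.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "kali"}
).prepare()
post_req = requests.Request(
    "POST", "https://example.com/login", data={"user": "demo"}
).prepare()

# GET carries its parameters in the URL query string.
print(get_req.method, get_req.url)

# POST carries its data in the request body.
print(post_req.method, post_req.url, post_req.body)
```

Preparing a request without sending it is a convenient way to inspect exactly what would go over the wire.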

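    HTTP response:

        After we send a request, the server returns a response [the web page we asked for].

        Response:

            status_code: [status code] 200: the request succeeded and the body carries the page's elements

            status_code: [status code] 403/404: access forbidden / page not found

        To watch this exchange, open Google Chrome, load a site, then choose Inspect >>> Network >>> refresh the page.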
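
The request/response exchange and its status codes can also be observed from code. The sketch below is an assumption-laden demo rather than part of the original notes: it uses only the standard library (http.server and http.client) instead of requests so that it runs without network access, and the names DemoHandler, fetch_status and run_demo are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server: "/" exists, every other path does not."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><title>demo</title></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(host, port, path):
    """Send one GET request and return (status code, body bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def run_demo():
    """Start the local server, request one good and one bad path."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        return [fetch_status(host, port, p)[0] for p in ("/", "/missing")]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The existing path answers with 200 and the missing one with 404, which is the same distinction the browser's Network panel shows.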

2, Web page parsing:

    1. Parsing a web page requires [bs4]

        from bs4 import BeautifulSoup
        import requests

        # Fetch the page and parse its content
        url = "https://www.baidu.com"
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        print(soup)

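    2. Describe where the element to crawl is located:

        e.g. a title [find where it sits in the page] >>> right-click and copy the selector

            titles = soup.select('#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a')
            print(titles)

            Explanation:

                #sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info: the position of the tag, i.e. the selector

                a: look for a tags

        Move up one level and use the class name of the [parent tag] instead:

            titles = soup.select('div.syl_info > a')
            print(titles)

            Explanation:

                div.syl_info: the class name of the parent tag

                a: look for a tags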
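
To see how the two selectors differ, here is a minimal offline sketch. The HTML snippet is hypothetical and only imitates the structure the selectors expect; it uses the built-in html.parser so that lxml is not required, and link_info is a helper name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure targeted by the selectors.
SAMPLE_HTML = """
<div id="sy_load">
  <h2>Latest</h2>
  <ul>
    <li><div class="syl_info"><a href="/post/1">First post</a></div></li>
    <li><div class="syl_info"><a href="/post/2">Second post</a></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Full path copied from the browser: matches only the link in the first list item.
first = soup.select("#sy_load > ul:nth-child(2) > li:nth-child(1) > div.syl_info > a")

# Shorter class-based selector: matches every link directly under div.syl_info.
all_links = soup.select("div.syl_info > a")


def link_info(tags):
    """Return (text, href) pairs for a list of <a> tags."""
    return [(t.get_text(strip=True), t["href"]) for t in tags]


print(link_info(first))
print(link_info(all_links))
```

The copied full path is precise but brittle; the class-based selector usually survives small layout changes and returns the whole list in one call.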

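    3. BeautifulSoup lives in the bs4 package; to install and use it:

        1. Install: pip install beautifulsoup4

        2. Optionally install a parser:

            pip install lxml [installing this one is usually enough]

            pip install html5lib

        3. Usage:

            from bs4 import BeautifulSoup
            import requests

            req_obj = requests.get('https://www.baidu.com')
            soup = BeautifulSoup(req_obj.text, 'lxml')

            Without BeautifulSoup, printing req_obj only shows the status code, e.g. <Response [200]>.

            With BeautifulSoup, printing soup shows the site's HTML code.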
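
Because the parser is an optional install, a script may run on a machine where lxml is missing. One possible way to cope, sketched here under that assumption, is to fall back to Python's built-in html.parser. The helper name make_soup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not available.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><head><title>demo</title></head></html>")
print(soup.title.string)
```

The two parsers can build slightly different trees for malformed HTML, so it is worth testing selectors against the parser actually in use.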

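        4. Frequently used methods:

            from bs4 import BeautifulSoup
            import requests

            a = requests.get('https://www.baidu.com')
            soup = BeautifulSoup(a.text, 'lxml')

            print(soup.title)              # access a tag by name; returns only the first match
            print(soup.find('title'))      # same result: find() returns only the first match
            print(soup.find_all('div'))    # find every div tag

            c = soup.div                   # the first div tag
            print(c['id'])                 # the tag's id attribute (KeyError if the div has no id)
            print(c.attrs)                 # all of the tag's attributes, as a dict

            d = soup.title                 # the title tag
            print(d.string)                # the string inside the tag

            e = soup.head                  # the head tag
            print(e.title)                 # get a tag, then one of its child tags

            f = soup.body                  # the body tag
            print(f.contents)              # the tag's child nodes, returned as a list

            g = soup.title                 # the title tag
            print(g.parent)                # the parent tag

            print(soup.find_all(id='link2'))    # every tag whose id is link2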
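
The same calls can be tried without any network access. The sketch below runs them against a small hypothetical HTML string (the ids main, link1 and link2 are made up for the demo) and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this demo.
HTML = (
    "<html><head><title>Demo page</title></head>"
    '<body><div id="main" class="box">'
    '<a id="link1" href="/a">A</a><a id="link2" href="/b">B</a>'
    "</div></body></html>"
)

soup = BeautifulSoup(HTML, "html.parser")

title_text = soup.title.string                # text inside <title>
div = soup.div                                # first <div> in the document
div_id = div["id"]                            # a single attribute
div_attrs = div.attrs                         # all attributes as a dict
child_names = [c.name for c in div.contents]  # direct children of the div
parent_name = soup.title.parent.name          # tag that encloses <title>
link2 = soup.find_all(id="link2")             # list of tags with id="link2"

print(title_text, div_id, div_attrs, child_names, parent_name, link2)
```

Note that class is a multi-valued attribute, so BeautifulSoup returns it as a list even when only one class is present.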

3, Modules and libraries a crawler needs:

    Libraries: requests, bs4

    Class: BeautifulSoup [imported from bs4]

    1. Fetch: requests

    2. Analyze: BeautifulSoup

    3. Store:

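4, Directory scanner: principle and practice:

    import requests
    import sys

    url = sys.argv[1]    # base URL of the target
    dic = sys.argv[2]    # path to the wordlist file

    with open(dic, 'r') as f:
        for i in f.readlines():    # read the wordlist line by line
            i = i.strip()          # strip surrounding whitespace and the newline
            r = requests.get(url + i)
            if r.status_code == 200:
                print('url:' + r.url)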
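
A slightly more defensive variant of the scanner is sketched below. It is one possible arrangement, not the original author's tool: the function names load_words and scan, the 5-second timeout and the choice not to follow redirects are all assumptions made for illustration. Only scan targets you are authorized to test.

```python
import sys
from urllib.parse import urljoin

import requests


def load_words(path):
    """Read one candidate path per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def scan(base_url, words, session=None, timeout=5.0):
    """Return the URLs that answered with status code 200."""
    if session is None:
        session = requests.Session()  # reuse one connection pool for all requests
    base = base_url.rstrip("/") + "/"
    found = []
    for word in words:
        target = urljoin(base, word.lstrip("/"))
        try:
            # Do not follow redirects, so a redirect to a login or
            # error page is not reported as a 200 hit.
            resp = session.get(target, timeout=timeout, allow_redirects=False)
        except requests.RequestException:
            continue  # unreachable or timed out: skip this candidate
        if resp.status_code == 200:
            found.append(resp.url)
    return found


if __name__ == "__main__" and len(sys.argv) == 3:
    for hit in scan(sys.argv[1], load_words(sys.argv[2])):
        print("url:" + hit)
```

Run it the same way as the script above, with the base URL as the first argument and the wordlist file as the second. The timeout keeps one slow path from stalling the whole scan, and normalizing the slash between base URL and word avoids malformed targets when the wordlist entries lack a leading "/".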