上交学长 (SJTU Senior) - 02

7 Web Scraping: HTTP Requests and Chrome

Visiting a web page

http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013

URL: protocol + domain name/IP + port + path (route) + parameters
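To make the URL anatomy concrete, here is a minimal sketch that splits the example URL above into its parts with Python 3's standard library. The variable names are only illustrative, and the fallback to port 80 is an assumption based on the HTTP default, since this URL does not name a port.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://kaoshi.edu.sina.com.cn/college/scorelist"
       "?tab=batch&wl=1&local=2&batch=&syear=2013")

parts = urlsplit(url)
# keep_blank_values=True keeps the empty "batch" parameter
params = parse_qs(parts.query, keep_blank_values=True)

# HTTP falls back to port 80 when the URL does not name a port
port = parts.port or 80

print(parts.scheme)    # protocol
print(parts.hostname)  # domain name
print(port)            # port
print(parts.path)      # path (route)
print(params)          # parameters
```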

ping

What can you get from a URL?

Open it in a browser

I strongly recommend the Chrome browser

Its rendering and its debugging tools are both excellent

http://www.google.cn/intl/zh-CN/chrome/browser/desktop/index.html

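Developer tools

View Page Source, Inspect

Elements: the page structure after rendering; edit anything and see the result immediately;
Console: printed debug output;
Sources: the files the page uses;
Network: every network request.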

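HTTP requests

HTTP is currently the most widely used transfer protocol on the web

GET: parameters are included in the URL;
POST: parameters are carried in the request body and are not visible in the URL.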
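The difference between GET and POST can be sketched without sending anything over the network. The endpoint and parameters below are placeholders chosen for illustration; the code only builds the two request objects so you can see where the parameters end up.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"tab": "batch", "syear": "2013"}  # illustrative parameters
base = "http://example.com/scorelist"       # placeholder endpoint

# GET: parameters are appended to the URL after "?"
get_request = Request(base + "?" + urlencode(params))

# POST: parameters are encoded into the request body; the URL stays clean
post_request = Request(base, data=urlencode(params).encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

In Chrome's Network panel the same split shows up: GET parameters appear in the request URL, POST parameters in the request payload.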

http://shuju.wdzj.com/plat-info-59.html

Types of URL

HTML: returns an HTML page, which the browser renders and shows to the user;
API: Application Programming Interface; a request performs some function, for example returning data.

http://kaoshi.edu.sina.com.cn/?p=college&s=api2015&a=getAllCollege

8 Web Scraping: Fetching Data with urllib2

urllib2 in Python

https://docs.python.org/2/library/urllib2.html

My Python version: 2.7

Sending a GET request

http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013

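import urllib2

url = 'http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013'
headers = {'User-Agent': 'Mozilla/5.0'}  # example value: pretend to be a browser

request = urllib2.Request(url=url, headers=headers)

response = urllib2.urlopen(request, timeout=20)  # give up after 20 seconds

result = response.read()  # page content as a byte string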
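The notes use Python 2.7. For readers on Python 3, where urllib2 has been merged into urllib.request, a rough equivalent might look like the sketch below. The `fetch` helper and its return-None-on-failure behaviour are my own assumptions, not part of the original course code.

```python
from urllib.request import Request, urlopen

def fetch(url, headers=None, timeout=20):
    """Return the response body as bytes, or None if the request fails."""
    request = Request(url, headers=headers or {})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print("request failed:", error)
        return None
```

Usage would be along the lines of `result = fetch(url, headers={'User-Agent': 'Mozilla/5.0'})`, followed by `result.decode(...)` with whatever encoding the page declares.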

Sending a POST request

http://shuju.wdzj.com/plat-info-59.html

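import urllib, urllib2

headers = {'User-Agent': 'Mozilla/5.0'}  # example value

# x and platId must be set beforehand; their values are not given in these notes
data = urllib.urlencode({'type1': x, 'type2': 0, 'status': 0, 'wdzjPlatId': int(platId)})

# headers must be passed by keyword: the second positional argument of Request is data
request = urllib2.Request('http://shuju.wdzj.com/depth-data.html', headers=headers)

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())  # keeps cookies across requests

response = opener.open(request, data)  # supplying data makes this a POST

result = response.read()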
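A Python 3 version of the same POST flow could be sketched as follows. The `prepare_post` helper is hypothetical, and the values `type1=1` and `wdzjPlatId=59` are placeholders I picked so the example runs; the notes do not say what `x` and `platId` actually were. Nothing is sent until the last line is uncommented.

```python
import http.cookiejar
from urllib.parse import urlencode
from urllib.request import Request, build_opener, HTTPCookieProcessor

def prepare_post(url, params, headers=None):
    """Build (opener, request) for a form-encoded POST; nothing is sent yet."""
    body = urlencode(params).encode("ascii")   # in Python 3 the body must be bytes
    request = Request(url, data=body, headers=headers or {})
    jar = http.cookiejar.CookieJar()           # remembers cookies between calls
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, request

opener, request = prepare_post(
    "http://shuju.wdzj.com/depth-data.html",
    # type1=1 and wdzjPlatId=59 are placeholder values
    {"type1": 1, "type2": 0, "status": 0, "wdzjPlatId": 59},
    headers={"User-Agent": "Mozilla/5.0"},
)
print(request.get_method(), request.data)
# response = opener.open(request, timeout=20)  # uncomment to actually send
```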

Processing the response

HTML: BeautifulSoup; requires some CSS basics

API: JSON

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
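The two kinds of response call for two kinds of parsing. The sketch below uses only the standard library so that it runs anywhere: `json` for an API response, and `html.parser` as a stand-in for BeautifulSoup (which the course actually uses and which is far more convenient). Both sample responses are made up for illustration.

```python
import json
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside every <h1> tag."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.titles.append(data.strip())

# Made-up sample responses, not real output from any site
api_result = '{"colleges": [{"id": 1, "name": "College A"}]}'
html_result = "<html><body><h1>Score list</h1><p>2013</p></body></html>"

data = json.loads(api_result)   # API response -> dict/list
names = [college["name"] for college in data["colleges"]]

parser = TitleParser()
parser.feed(html_result)        # HTML response -> extracted text

print(names)
print(parser.titles)
```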

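9 Hands-On: Scraping Douban Movie Data

A bit of chat

Three big targets: Lianjia, Douban, Dianping

March crawlers

Spear and shield: browser spoofing, IP limits, login, CAPTCHA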
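Two of the simpler countermeasures can be sketched in a few lines: sending browser-like headers (Python's default User-Agent identifies itself as Python-urllib) and spacing requests out so a single IP does not trip a rate limit. The header values, the `browser_like_request` helper, and the `Throttle` class are all illustrative assumptions; login and CAPTCHA handling are not covered here.

```python
import time
from urllib.request import Request

BROWSER_HEADERS = {
    # Example User-Agent string; copy a current one from Chrome's Network panel
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

def browser_like_request(url):
    """Attach browser-style headers so the request does not look like a script."""
    return Request(url, headers=BROWSER_HEADERS)

class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive requests."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last = None

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()
```

A crawl loop would call `throttle.wait()` before each request and build the request with `browser_like_request(url)`.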

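General approach

One summary (listing) page

Many detail pages

Find the links

Drill down step by step from the summary page to the detail pages

Find the fields

Decide which fields you need from the detail pages

Get hands-on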
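The summary-page-to-detail-page approach can be sketched as three small functions. Everything here is an illustrative assumption: the function names, the regular expressions (a real scraper would more likely use BeautifulSoup), and the fake `PAGES` dictionary that stands in for a real site so the sketch runs offline.

```python
import re
from urllib.parse import urljoin

def find_detail_links(list_url, list_html, pattern):
    """Return absolute detail-page URLs from the summary page, deduplicated in order."""
    links = []
    for href in re.findall(r'href="([^"]+)"', list_html):
        if re.search(pattern, href):
            full = urljoin(list_url, href)
            if full not in links:
                links.append(full)
    return links

def extract_fields(detail_html, field_patterns):
    """Apply one regex per field; fields that are not found come back as None."""
    record = {}
    for name, pattern in field_patterns.items():
        match = re.search(pattern, detail_html)
        record[name] = match.group(1).strip() if match else None
    return record

def crawl(list_url, fetch, link_pattern, field_patterns):
    """Summary page -> detail links -> one record per detail page."""
    list_html = fetch(list_url)
    return [extract_fields(fetch(link), field_patterns)
            for link in find_detail_links(list_url, list_html, link_pattern)]

# Fake pages standing in for a real site
PAGES = {
    "http://example.com/top":
        '<a href="/movie/1">A</a> <a href="/movie/2">B</a> <a href="/about">x</a>',
    "http://example.com/movie/1": '<h1>First Film</h1><span class="score">9.1</span>',
    "http://example.com/movie/2": "<h1>Second Film</h1>",
}

records = crawl(
    "http://example.com/top",
    PAGES.__getitem__,  # swap in a real download function here
    r"^/movie/\d+$",
    {"title": r"<h1>(.*?)</h1>", "score": r'class="score">([\d.]+)<'},
)
print(records)
```

Passing `fetch` in as a parameter keeps the drill-down logic separate from the downloading, so the same code can be tried on saved pages before it touches the live site.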

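10 Databases: Setting Up a Web Environment with MAMP and WAMP

Web environment

Web server: Apache, Nginx; handles web requests

Database: MySQL; stores and manages data

Backend: PHP

Once the web service is running, the site projects in the root directory can be opened in a browser

MAMP: Mac, Apache, MySQL, PHP, https://www.mamp.info/en/

WAMP: Windows, Apache, MySQL, PHP, http://www.wampserver.com/en/

Preferences

Port settings: Apache and MySQL. A port is just a numeric suffix on the address; different services use different ports, so they do not conflict with each other

Root directory: the directory served when you visit http://localhost:port/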

Hello World

Using HTML

Using PHP