Getting Started with Web Scraping

The content I wanted to scrape was not in the page's HTML source; it is delivered by a separate JSON response. So the approach is to find that JSON request, fetch it, and parse it. The result is shown below.
Code:
#coding=utf-8
import json
import urllib.request


def getPage(url):    # fetch the raw JSON text from the endpoint
    response = urllib.request.urlopen(url).read()
    z_response = response.decode('UTF-8')    # decode the bytes so the Chinese text displays correctly
    return z_response


# URL of the JSON endpoint found in the browser's network panel (placeholder; the original post does not give it)
url = 'http://example.com/shareholders.json'
names = json.loads(getPage(url))
# Sample JSON response:
#{"state":"ok","message":"","special":"","data":{"total":4,"result":[{"amount":3528.5705,"id":2277807374,"capitalActl":[],"type":2,"capital":[{"amomon":"3,528.5705万元","percent":"54.29%"}],"name":"马化腾"},{"amount":1485.7115,"id":1925786094,"capitalActl":[],"type":2,"capital":[{"amomon":"1,485.7115万元","percent":"22.86%"}],"name":"张志东"},{"amount":742.859,"id":2246944474,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"陈一丹"},{"amount":742.859,"id":2171369795,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"许晨晔"}]}}
for i in range(0, names['data']['total']):
    print(names['data']['result'][i]['name'])
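Running this against the sample response above prints the four shareholder names: 马化腾, 张志东, 陈一丹, 许晨晔.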
 
Dealing with being blocked for making too many requests in a short time:
Method 1: a small number of sites have weak defenses, and simply spoofing the client IP by setting the X-Forwarded-For header is enough to get around them, as in the sketch below.
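A minimal sketch of this idea, assuming the target site naively trusts the X-Forwarded-For header (the fetchWithFakeIp helper and the randomly generated IP are illustrative, not from the original post):

# Sketch: send a request with a spoofed X-Forwarded-For header.
# Only works against the minority of sites that trust this header blindly.
import random
import urllib.request

def fetchWithFakeIp(url):
    fake_ip = '.'.join(str(random.randint(1, 254)) for _ in range(4))    # random-looking source IP
    req = urllib.request.Request(url, headers={'X-Forwarded-For': fake_ip})
    return urllib.request.urlopen(req).read().decode('UTF-8')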

For most sites, though, frequent scraping really requires multiple IPs. My preferred setup is an overseas VPS with several IPs, switching between them by changing the default gateway; this is far more efficient than HTTP proxies and probably more efficient than ADSL redialing in most cases.
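Switching the default gateway happens at the OS level, outside Python. For illustration only, here is a minimal sketch of the simpler (and, as noted above, less efficient) HTTP-proxy variant of the same multi-IP idea; the proxy addresses are placeholders:

# Sketch: rotate requests across a pool of HTTP proxies (addresses are placeholders).
import random
import urllib.request

proxies = ['http://1.2.3.4:8080', 'http://5.6.7.8:8080']    # placeholder proxy pool

def fetchViaRandomProxy(url):
    handler = urllib.request.ProxyHandler({'http': random.choice(proxies)})
    opener = urllib.request.build_opener(handler)
    return opener.open(url).read().decode('UTF-8')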

Method 2: mimic real user behavior as much as possible: 1. rotate the User-Agent regularly; 2. use longer, randomized intervals between requests; 3. visit pages in a random order. A sketch combining these follows.
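A minimal sketch of the three points above (the User-Agent strings and page list are placeholders):

# Sketch: rotate User-Agent, wait a random interval, and visit pages in random order.
import random
import time
import urllib.request

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)',
]    # placeholder User-Agent pool
pages = ['http://example.com/page1', 'http://example.com/page2']    # placeholder page list

random.shuffle(pages)    # 3. random visit order
for page in pages:
    req = urllib.request.Request(page, headers={'User-Agent': random.choice(user_agents)})    # 1. rotate the User-Agent
    print(urllib.request.urlopen(req).read().decode('UTF-8')[:100])
    time.sleep(random.uniform(2, 10))    # 2. longer, randomized interval between requests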
Original article: https://www.cnblogs.com/to-creat/p/6743985.html