Examples

Crawl the page data of the Sogou home page

import requests
# Specify the target URL
url = 'https://www.sogou.com/'
# Send the request: requests.get() returns a response object
response = requests.get(url=url)
# Get the response data: the text attribute returns it as a string
page_text = response.text
print(page_text)
# Persist it to disk
with open('./sogou.html','w',encoding='utf-8') as f:
    f.write(page_text)

Crawl the page data returned by a Sogou keyword search

import requests
wd = input('enter a word:')
# The Sogou search interface (the keyword is assumed to go in the query parameter)
url = 'https://www.sogou.com/web'
param = {
    'query':wd
}
# Headers carrying a browser User-Agent for UA spoofing
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# Send the request with UA spoofing applied
response = requests.get(url=url,params=param,headers=headers)
# Set the encoding of the response data
response.encoding = 'utf-8'

page_text = response.text
print(page_text)

name = wd+'.html'
with open(name,'w',encoding='utf-8') as fp:
    fp.write(page_text)
    print(name,'crawled successfully!')

Crawl Baidu Translate; the page may contain dynamically loaded data

import requests
wd = input('enter an English word:')
url = 'https://fanyi.baidu.com/sug'
data = {
    'kw':wd
}
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
response = requests.post(url=url,data=data,headers=headers)

obj_json = response.json()

print(obj_json)
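The JSON returned by the sug interface is typically a dict with an errno field and a data list of k/v suggestion pairs; that shape is an assumption based on commonly observed responses and may change. A minimal offline sketch of pulling the suggestions out:

```python
# Hedged sketch of parsing a sug-style response; 'sample' is a
# hand-written stand-in, NOT a captured live response.
sample = {
    'errno': 0,
    'data': [
        {'k': 'dog', 'v': 'n. 狗; 犬'},
        {'k': 'dogged', 'v': 'adj. 顽强的'},
    ],
}
# .get('data', []) keeps the code from crashing if the field is absent
suggestions = [(item['k'], item['v']) for item in sample.get('data', [])]
for word, meaning in suggestions:
    print(word, '->', meaning)
```

With a live response, replace `sample` with `response.json()` and keep the same extraction.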

Crawl movie detail data from Douban Movies

Note: in some cases the page contains dynamically loaded data

import requests

url = 'https://movie.douban.com/j/chart/top_list'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
param = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": "200",
}

obj_json = requests.get(url=url,headers=headers,params=param).json()

print(len(obj_json))
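The top_list interface returns a JSON array of movie dicts. Assuming each entry carries keys such as 'title' and 'score' (an assumption — inspect a real response to confirm), the result can be post-processed offline like this:

```python
# Post-processing sketch for the Douban top_list JSON; the entries below
# are made-up placeholders mirroring the assumed structure.
obj_json = [
    {'title': 'Movie A', 'score': '9.1'},
    {'title': 'Movie B', 'score': '8.7'},
]
# Pull out just the titles for a quick summary
titles = [movie['title'] for movie in obj_json]
print(len(obj_json), titles)
```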

Crawl cosmetics production license data from the National Medical Products Administration: http://125.35.6.84:81/xk/

  • The company list on the home page is dynamically loaded
  • The company list is fetched via an ajax request, which yields each company's ID
  • The detail data on each company's detail page is also dynamically loaded
  • The detail data is fetched via an ajax request (the URL is the same for every company; only the ID parameter differs)
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
IDs = []
all_data = [] # stores the detail data of every company
for page in range(1,5):
    data = {
        "on": "true",
        "page": str(page),
        "pageSize": "15",
        "productName":"" ,
        "conditionType":"1",
        "applyname": "",
        "applysn": "",
    }
    # Response JSON of the home-page ajax request (the IDs are parsed out of it)
    json_obj = requests.post(url=post_url,headers=headers,data=data).json()
    
    for dic in json_obj['list']:
        ID = dic['ID']
        IDs.append(ID)

for id in IDs:
    detail_post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    data = {
        'id':id
    }
    detail_dic = requests.post(url=detail_post_url,headers=headers,data=data).json()
    all_data.append(detail_dic)
    
print(all_data[0])
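Printing only the first record throws the rest away; a common follow-up is to persist the whole of all_data as a JSON file. A sketch, with a placeholder record standing in for the real ajax responses (the field names here are illustrative, not the confirmed API schema):

```python
import json

# Placeholder detail records standing in for the real ajax responses.
all_data = [{'epsName': 'Example Cosmetics Co.', 'certStr': 'XK-0001'}]

with open('./all_data.json', 'w', encoding='utf-8') as fp:
    # ensure_ascii=False keeps any Chinese text readable in the output file
    json.dump(all_data, fp, ensure_ascii=False, indent=2)
```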
Original article: https://www.cnblogs.com/wanglan/p/10788656.html