爬取某招聘网站特殊字段的信息

举例:爬取带有“数据分析”职位的信息

需要材料:Python、

pip install requests
pip install  lxml 
pip install  pandas 

第一步:引用

import time
import requests
from lxml import etree
import pandas as pd
from pandas import DataFrame
from pandas import Series

第二步:找到可打开的正确的url

url = 'http://search.51job.com/list/080200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=’

第三步:向这个URL发出请求,文本编码修改为中文字符

head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
s = requests.Session()
res = s.get(url,headers = head)
res.encoding = 'gbk'
root = etree.HTML(res.text)

第四步:在原网址查看所需字段的位置

读取所需数据

position = root.xpath('//div[@class="el"]/p/span/a/@title')
company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
job = DataFrame([position,company,place,salary,date]).T
job.columns = ['position','company','place','salary','date']
job['page'] = 1
job.head()

测试运行

封装函数【要找到自增规律】

def Crawler_51job(df_origin,n):
    head = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
    s = requests.Session()
    for i in range(1,n+1):
        url = '

'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=%E4%BF%A1%E7%94%A8%E7%AE%A1%E7%90%86&keywordtype=2&curr_page=' + str(i) + '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'

'
        res = s.get(url,headers = head)
        res.encoding = 'gbk'
        root = etree.HTML(res.text)
        position = root.xpath('//div[@class="el"]/p/span/a/@title')
        company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
        place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
        salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
        date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
        df = DataFrame([position,company,place,salary,date]).T
        df.columns = ['position','company','place','salary','date']
        df = DataFrame([position,company,place,salary,date]).T
        df.columns = ['position','company','place','salary','date']
        df['page'] = i
        time.sleep(2)
        df_origin = pd.concat([df_origin,df])
    return(df_origin)

函数调用与数据输出

job = Crawler_51job(job,17)

job = Crawler_51job(job,17)
job.to_csv('51job.csv',index = 0)

结果(excel坏了,用记事本查看一下)

 这样就完成了一个爬虫小程序

原文地址:https://www.cnblogs.com/keepgoingon/p/7110756.html