Python Study Notes 25

Python: Simple Data Scraping, Part 2 (Biquge, Qiushibaike, Uppsd, Anjuke)

Learning goals:

Python: routine practice in data scraping


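Learning content:

1. Scrape the novel summaries on the Biquge (笔趣阁) home page
2. Use starts-with to scrape posts from Qiushibaike (糗事百科)
3. Fetch images from Uppsd (优图网): a leading // matches img tags at any depth, and @data-original yields each image URL
4. Scrape the non-image content from Anjuke (安居客)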


1. Scrape the novel summaries on the Biquge (笔趣阁) home page

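import requests
from lxml import etree

# Assumed request header; the original post does not show how headers was defined.
headers = {'User-Agent': 'Mozilla/5.0'}

source = requests.get('http://www.xbiquge.la', headers=headers).text
# Each <li> under #newscontent is one recently updated novel.
base = etree.HTML(source).xpath('//*[@id="newscontent"]/div[1]/ul/li')
for i in base:
    genre = i.xpath('span[1]/text()')  # renamed from type, which shadows the builtin
    books = i.xpath('span[2]/a/text()')
    chapter = i.xpath('span[3]/a/text()')
    author = i.xpath('span[4]/text()')

    print(genre, books, chapter, author)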
Output:
['[都市小说]'] ['摆个摊就能成神豪'] ['第160章 完全没有用武之地'] ['小老叔']
['[其他小说]'] ['荒野的黑客'] ['第四十三章 楞憨哼,揍我'] ['云外一声鸡']
['[其他小说]'] ['冷宫皇后皆寂寞'] ['第99章:母子之间的较量'] ['非也大人']
['[修真小说]'] ['西游之开局拒绝大闹天宫'] ['第二百八十九章 最弱的圣人'] ['我气化三清']
['[都市小说]'] ['人在末世也种田'] ['35、你老公和一个女人在一起呐'] ['小风猴猴']
.........
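
Each xpath() call returns a list, which is why every field in the output above is wrapped in brackets. A minimal sketch of flattening those lists into plain strings, assuming each field holds at most one value. The helper name first is hypothetical and is not part of lxml.

```python
def first(values, default=''):
    """Return the first stripped string from an xpath() result list, or default."""
    return values[0].strip() if values else default


# One row copied from the output above, as the four lists xpath() produced.
row = (['[都市小说]'], ['摆个摊就能成神豪'], ['第160章 完全没有用武之地'], ['小老叔'])
genre, book, chapter, author = (first(field) for field in row)
print(genre, book, chapter, author)
```

An empty result (a missing span, for example) becomes the default string instead of raising an IndexError.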

2. Use starts-with to scrape posts from Qiushibaike (糗事百科)

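XPaths copied from the browser for several posts. The ids differ only in the number after the qiushi_tag_ prefix, so starts-with can match all of them at once:

//*[@id="qiushi_tag_123983036"]/div[1]/a[2]/h2
//*[@id="qiushi_tag_123884600"]/a[1]/div/span
//*[@id="qiushi_tag_124000602"]
//*[@id="qiushi_tag_124000602"]/div[1]/a[2]/h2
//*[@id="qiushi_tag_124002094"]/a[1]/div/span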

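import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; not shown in the original post

source = requests.get('https://www.qiushibaike.com/text/', headers=headers).text
# Match every element whose id begins with "qiushi_tag_" (one element per post).
base = etree.HTML(source).xpath('//*[starts-with(@id, "qiushi_tag_")]')
for i in base:
    title = i.xpath('div[1]/a[2]/h2/text()')
    content = " ".join(i.xpath('a[1]/div/span/text()'))
    print(title)
    print(content)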
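
To see the starts-with match without hitting the live site, here is a minimal sketch on inline markup. The SAMPLE string, its ids, and its text are made up for illustration and only imitate the id pattern seen on the page.

```python
from lxml import etree

# Hypothetical markup imitating the page's id pattern; not real page content.
SAMPLE = '''
<div id="content">
  <div id="qiushi_tag_1"><span>first post</span></div>
  <div id="qiushi_tag_2"><span>second post</span></div>
  <div id="sidebar_1"><span>not a post</span></div>
</div>
'''

tree = etree.HTML(SAMPLE)
# starts-with(@id, prefix) keeps only elements whose id begins with the prefix.
posts = tree.xpath('//*[starts-with(@id, "qiushi_tag_")]')
texts = [post.xpath('span/text()')[0] for post in posts]
print(texts)
```

The sidebar element is skipped because its id does not begin with the prefix, and elements with no id at all are skipped as well.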

3. Fetch images from Uppsd (优图网): a leading // matches img tags at any depth, and @data-original yields each image URL

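import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; not shown in the original post

for page in range(1, 2):  # page 1 only; widen the range to fetch more pages
    source = requests.get('http://www.uppsd.com/search-0-20-0-0-1-p' + str(page), headers=headers).text
    base = etree.HTML(source).xpath('//img[@class="lazy"]/@data-original')
    for url in base:  # renamed so the inner loop no longer overwrites the page counter
        pic = requests.get(url).content
        print(url, len(pic))  # print the size rather than the raw image bytes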
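
The loop above only holds each image in memory. A minimal sketch of writing the bytes to disk, assuming the last path segment of the URL is usable as a file name. The names filename_from_url and save_image, the fallback name, and the images directory are all hypothetical.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, fallback='image.jpg'):
    """Use the last path segment of the URL as the file name."""
    name = os.path.basename(urlparse(url).path)
    return name or fallback


def save_image(data, url, directory='images'):
    """Write raw image bytes (e.g. requests.get(url).content) into directory."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename_from_url(url))
    with open(path, 'wb') as f:
        f.write(data)
    return path
```

Inside the download loop, save_image would be called with the downloaded bytes and the URL they came from. Two images that share a file name would overwrite each other in this sketch.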

4. Scrape the non-image content from Anjuke (安居客)

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; not shown in the original post

source = requests.get('https://tianjin.anjuke.com/sale/?from=navigation', headers=headers).text
# Each <div> matched here is one listing.
base = etree.HTML(source).xpath('//*[@id="__layout"]/div/section/section[3]/section[1]/section[2]/div')
for i in base:
    title = i.xpath('a/div[2]/div[1]/div[1]/h3/text()')
    print(title)
    txt = i.xpath('a/div[2]/div[1]/section/div[1]/p/span/text()')  # text inside the <span> children
    print(txt)
    content = i.xpath('a/div[2]/div[1]/section/div[1]/p/text()')  # text directly inside each <p>
    print(content)
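
The last two queries differ only in p/span/text() versus p/text(). A minimal sketch of that difference on inline markup, assuming a paragraph that mixes span children with bare text. The SAMPLE string is made up, and the real Anjuke markup may differ.

```python
from lxml import etree

# Hypothetical listing markup; not copied from the real page.
SAMPLE = '<div><p><span>3</span>rooms <span>2</span>halls</p></div>'

p = etree.HTML(SAMPLE).xpath('//p')[0]
in_spans = p.xpath('span/text()')  # text inside the <span> children of <p>
direct = p.xpath('text()')         # text nodes that are direct children of <p>
print(in_spans, direct)
```

Neither query alone returns the whole paragraph, which is why the snippet above runs both.
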
Original post: https://www.cnblogs.com/tangmf/p/14331238.html