Using WebMagic

Setup: add the following Maven dependencies.

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>

● PageProcessor

1. Example: crawl the a node under the 5th li node, under the ul, under div/div, under the node whose id is nav

package com.tenpower.demo;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.RedisScheduler;

/**
 * PageProcessor implementation that does the actual crawling.
 */
public class MyProcessor implements PageProcessor {

    public void process(Page page) {
        // Select the target a node with XPath and print it (note the escaped quotes in the Java string)
        System.out.println( page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString() );
    }

    public Site getSite() {
        // Site.me() returns a Site with default crawl settings; returning null would cause an NPE
        return Site.me();
    }

    public static void main(String[] args) {
        // Start a crawl from the CSDN blog home page (same pattern as the later examples)
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net/")
                .run();
    }
}

Output:

<a href="/nav/ai">人工智能</a>

2. Add target URLs: add every link on the current page as a new target, so the crawl spreads from the seed page to more pages

page.addTargetRequests( page.getHtml().links().all()  );

3. Match target URLs against a regex. Example: crawl only the blog's article detail pages and extract the title

page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
System.out.println(page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1"));

How to copy an XPath:

1) Press F12 in the browser and select the node in the Elements panel

2) Right-click the node and choose Copy XPath

4. Ignore the HTML tags and extract only the text

String nickname = page.getHtml().xpath("//*[@id=\"uid\"]/text()").get();

5. Extract only the image URL

String image = page.getHtml().xpath("//*[@id=\"asideProfile\"]/div[1]/div[1]/a").css("img", "src").toString();

● Pipeline

1. FilePipeline: save the results as files

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new FilePipeline("e:/data"))
                .run();
    }

2. Custom Pipeline

package com.tenpower.demo;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class MyPipeline implements Pipeline {
    public void process(ResultItems resultItems, Task task) {
        // Read the "title" field that the PageProcessor stored with page.putField()
        String title = resultItems.get("title");
        System.out.println("Custom pipeline, title: " + title);
    }
}
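
For resultItems.get("title") to return anything, the PageProcessor has to store that field first. A minimal sketch, reusing the article-title XPath from step 3 of the PageProcessor section, with /text() appended so only the heading text is stored:

// Inside MyProcessor.process(): store the extracted title under the key "title"
// so that MyPipeline (and any other registered Pipeline) can read it from ResultItems.
page.putField("title",
        page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1/text()").get());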

Register the custom Pipeline in main:

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new FilePipeline("e:/data"))
                .addPipeline(new MyPipeline())
                .run();
    }

● Scheduler

1. In-memory queue

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .setScheduler(new QueueScheduler())
                .run();
    }

2. File-based queue. After the program is shut down and restarted, it resumes crawling from the previously recorded URLs

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .setScheduler(new FileCacheQueueScheduler("E:\\scheduler"))
                .run();
    }

3. Redis queue. Lets multiple machines crawl at the same time

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .setScheduler(new RedisScheduler("127.0.0.1"))
                .run();
    }

Appendix 1: Commonly used Spider APIs
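
A quick sketch assembled from the calls already used in this article, plus thread() and start(), which Spider's fluent API also provides:

Spider.create(new MyProcessor())               // build a Spider from a PageProcessor
        .addUrl("https://blog.csdn.net/")      // add one or more seed URLs
        .addPipeline(new ConsolePipeline())    // register result Pipelines
        .setScheduler(new QueueScheduler())    // choose the URL queue implementation
        .thread(5)                             // number of crawler threads
        .run();                                // run synchronously; start() runs in the background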

Appendix 2: Commonly used Page APIs
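
A brief sketch of the Page methods used most often inside process(); the //h1 XPath here is just a placeholder:

Html html = page.getHtml();                               // page content as a selectable Html object
String url = page.getUrl().toString();                    // URL of the page being processed
page.putField("title", html.xpath("//h1/text()").get());  // hand a result to the Pipelines
page.addTargetRequests(html.links().all());               // queue follow-up URLs to crawl
page.setSkip(true);                                       // skip the Pipelines for this page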

Appendix 3: XPath syntax (one example of each category follows the list below)

1. Selecting nodes

2. Predicates. Find a specific node, or a node that contains a specific value (predicates must be enclosed in square brackets)

3. Wildcards. Select unknown XML elements by matching the specified nodes generically

4. Selecting multiple paths

5. XPath axes. Define a node set relative to the current node

6. XPath operators

7. Common functions, for fuzzier matching
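
One standard XPath expression per category above; note that WebMagic's built-in XPath engine (Xsoup) supports only a subset of full XPath, so some of these may need adjusting:

//div                             1. node selection: every div element in the document
//ul/li[1]                        2. predicate: the first li under each ul
//*[@id='nav']                    3. wildcard: any element whose id attribute is nav
//title | //h1                    4. multiple paths: all title nodes plus all h1 nodes
//li[5]/ancestor::ul              5. axis: the ul ancestors of the fifth li
//li[position()<3]                6. operator: li nodes at positions 1 and 2
//a[contains(@href,'article')]    7. function: a elements whose href contains "article"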

Original (Chinese) source: https://www.cnblogs.com/naixin007/p/14505450.html