Learning to Build a Java Crawler with the JSoup Library: Small Steps, Fast Progress

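For a recent requirement I needed to scrape some data from web pages in Java, so I spent a little time looking at JSoup. Here is a brief introduction.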

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. (Quoted from the official site, Java HTML Parser.)

In short, the jsoup library lets you locate the data you want by HTML tags and elements. Let's go straight to learning how to use it.

I. Importing the required jar

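This article uses the Maven dependency below. If you need to download the jar directly, reference 2 at the end of the article has a download link.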

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.12.1</version>
</dependency>

II. Testing in main

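1. Loading a document from a URL (this article only tests this approach; for the other approaches see reference 3). To keep it simple, we fetch the Baidu home page.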

        try {
            // First, connect to the URL through the Jsoup utility class and fetch the page
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            // Read the title from the parsed document
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            e.printStackTrace();
        }

Output:

百度一下，你就知道
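
The same parsing step also works without a network connection. The sketch below assumes the HTML is already in memory as a string and uses `Jsoup.parse(String)`; the class name `ParseFromString`, the helper `titleOf`, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromString {
    // Parses an in-memory HTML string and returns the text of its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from any real page.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, jsoup.</p></body></html>";
        System.out.println(titleOf(html)); // prints "Sample page"
    }
}
```

Parsing from a string is handy for trying out selectors, because the result does not change when the live page does.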

2. Getting the URL and text of <a> tags

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            /* Select every <a> tag that has an href attribute */
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println("link : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

text : 百度首页
link : javascript:;
text : 设置
link : https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
text : 登录
link : http://news.baidu.com
text : 新闻
link : https://www.hao123.com
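
Pages often contain relative or protocol-relative links, which are of little use to a crawler as they stand. A possible way to get absolute URLs is `Element.absUrl`, sketched below on an in-memory document. The class `AbsoluteLinks`, the helper `absoluteLinks`, and the example.com markup are assumptions for illustration. When a page is fetched with `Jsoup.connect(...).get()`, the base URI is set from the fetched URL, so no second argument is needed there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinks {
    // Returns the absolute form of every href in the given HTML.
    // baseUri is the address that relative links are resolved against.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String abs = link.absUrl("href");
            // absUrl returns an empty string when no absolute URL can be built
            if (!abs.isEmpty()) {
                result.add(abs);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with relative, protocol-relative and absolute links.
        String html = "<a href=\"/more\">More</a>"
                + "<a href=\"//news.example.com/top\">News</a>"
                + "<a href=\"https://www.example.com/x\">X</a>";
        for (String url : absoluteLinks(html, "https://www.example.com/")) {
            System.out.println(url);
        }
    }
}
```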

3. Getting the page's meta information

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            /* Get the page's meta information */
            // Take the first <meta> tag whose name attribute is "referrer" and read its content attribute
            String keywords = doc.select("meta[name=referrer]").first().attr("content");
            System.out.println("Meta keyword : " + keywords);
            String description = doc.select("meta[name=theme-color]").get(0).attr("content");
            System.out.println("Meta description : " + description);

        } catch (IOException e) {
            e.printStackTrace();
        }

Output:

Meta keyword : always
Meta description : #2932e1
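
One caveat with the snippet above: `first()` returns null when nothing matches, and `get(0)` throws on an empty result, so a page without the expected meta tag makes the program crash. Below is a minimal null-safe sketch using `selectFirst`, which is available in the jsoup version used in this article as far as I know. The class `SafeMeta`, the helper `metaContent`, and the fallback value are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeMeta {
    // Returns the content attribute of <meta name="...">, or the fallback if the tag is absent.
    static String metaContent(Document doc, String name, String fallback) {
        Element meta = doc.selectFirst("meta[name=" + name + "]");
        return meta != null ? meta.attr("content") : fallback;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with one of the two meta tags missing.
        Document doc = Jsoup.parse(
                "<html><head><meta name=\"referrer\" content=\"always\"></head><body></body></html>");
        System.out.println(metaContent(doc, "referrer", "n/a"));    // prints "always"
        System.out.println(metaContent(doc, "theme-color", "n/a")); // prints "n/a"
    }
}
```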

4. Getting the page's image information

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
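            /* Select <img> tags whose src contains an image file extension (case-insensitive) */
            // The backslash is doubled because "\." is not a valid escape in a Java string literal
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");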
            for (Element image : images) {
                System.out.println("src : " + image.attr("src"));
                System.out.println("height : " + image.attr("height"));
                System.out.println("width : " + image.attr("width"));
                System.out.println("alt : " + image.attr("alt"));
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

src : //www.baidu.com/img/baidu_jgylogo3.gif
height :
width :
alt : 到百度首页
src : //www.baidu.com/img/baidu_resultlogo@2.png
height :
width :
alt : 到百度首页
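
To see what the regular expression in the `[src~=...]` selector accepts, it can be tried on its own with `java.util.regex`. This sketch assumes jsoup applies the pattern with "find" semantics (a match anywhere in the attribute value), which is my understanding of the selector rather than something verified here. The class `ImageSrcPattern` and the helper `looksLikeImage` are made up for illustration.

```java
import java.util.regex.Pattern;

public class ImageSrcPattern {
    // Same pattern as in the selector. (?i) makes the match case-insensitive.
    static final Pattern IMAGE_SRC = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    // True if the value contains ".png", ".jpg", ".jpeg" or ".gif" anywhere.
    static boolean looksLikeImage(String src) {
        return IMAGE_SRC.matcher(src).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage("//www.baidu.com/img/baidu_jgylogo3.gif")); // prints "true"
        System.out.println(looksLikeImage("//www.baidu.com/img/LOGO.PNG"));           // prints "true"
        System.out.println(looksLikeImage("//www.baidu.com/index.html"));             // prints "false"
    }
}
```

The pattern is not anchored at the end, so a value such as `a.gif.html` would also match. Appending `$` to the pattern would tighten it, at the cost of rejecting URLs that carry a query string.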

5. Getting form parameters

        try {
            Document doc = Jsoup.connect("https://www.baidu.com/").get();
            /* Get the form parameters */
            // First locate the target element by its ID
            Element loginform = doc.getElementById("form");
            // Looking up by tag name returns a collection, because several <input> tags can match
            Elements inputElements = loginform.getElementsByTag("input");
            // Iterate over the collection and read each <input> tag's attributes (the same approach locates any data you need)
            for (Element inputElement : inputElements) {
                String key = inputElement.attr("name");
                String value = inputElement.attr("value");
                System.out.println("Param name: " + key + " \nParam value: " + value);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Output (partial):

Param name: rsv_pq
Param value: 935e81bd0003a4fa
Param name: rsv_t
Param value: 995c4PHOYhjruVrvWzHXHuwlKcndZzriFTV+H6ELp2VaJNhvTjAP9/aule8
Param name: rqlang
Param value: cn

A real-world test

Fetching the Henan gaokao (college entrance exam) score cutoffs for the last ten years from a website: http://www.gaokao.com/henan/fsx/

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Created with CosmosRay
 *
 * @author CosmosRay
 * @date 2019/6/24
 * Function:
 */
public class MyJSoup {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.gaokao.com/henan/fsx/").get();
            // Take the first <table> on the page and all of its rows
            Element table = doc.getElementsByTag("table").first();
            Elements rows = table.getElementsByTag("tr");
            // The first row holds the <th> header cells; every later row holds <td> data cells
            boolean headerPrinted = false;
            for (Element row : rows) {
                if (!headerPrinted) {
                    Elements headers = row.getElementsByTag("th");
                    for (Element header : headers) {
                        System.out.print(header.text() + "  ");
                    }
                    System.out.println();
                    headerPrinted = true;
                } else {
                    Elements cells = row.getElementsByTag("td");
                    for (Element cell : cells) {
                        System.out.print(cell.text() + "  ");
                    }
                    System.out.println();
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

2018 2017 2016 2015 2014 2013 2012 2011 2010 2009
一本 547 516 517 513 536 519 557 562 532 552
二本 436 389 458 455 483 465 509 515 489 510
专科 200 180 183 180 200 - 360 393 397 417
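
Printing is fine for a quick check, but for further processing it may be easier to return the table as a list of rows. The sketch below selects header and data cells together with the `th, td` selector, so the header flag is no longer needed, and it returns an empty list when the page has no table. The class `TableRows`, the helper `rows`, and the sample table (a fragment modeled on the output above, with a translated row label) are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TableRows {
    // Returns the cell texts of the first <table>, one inner list per <tr>.
    static List<List<String>> rows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.selectFirst("table");
        if (table == null) {
            return rows; // no table on the page
        }
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // "th, td" matches header cells and data cells alike
            for (Element cell : tr.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample table, not the real page markup.
        String html = "<table>"
                + "<tr><th></th><th>2018</th><th>2017</th></tr>"
                + "<tr><td>Tier 1</td><td>547</td><td>516</td></tr>"
                + "</table>";
        for (List<String> row : rows(html)) {
            System.out.println(String.join("  ", row));
        }
    }
}
```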

References: 1. Java HTML Parser 2. jsoup Cookbook (Chinese edition) 3. Yiibai tutorial (易百教程)