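Comprehensive crawler case study

Comprehensive crawler case study (JD crawler)

With HttpClient and Jsoup covered, we know how to fetch data and how to parse it. Next we put the two together in a project case study: crawling the mobile phone listings from JD (jd.com).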

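I. Requirements analysis

Requirements:

Crawl the mobile phone listings on the JD mall. For each product we collect the product name, the price, the product id, the product image, and the URL of the product detail page.

Inspecting the search results page with the browser developer tools (F12) shows where this data lives. As the selectors in the parsing code later in this article show, each product is an li element under #J_goodsList that carries data-sku and data-spu attributes, with the image under .p-img, the title under .p-name and the price under .p-price.

For the product detail page, analysis shows that its URL is built by concatenating the product id, in the form https://item.jd.com/<id>.html (the code below uses the value of the data-sku attribute).

1. The difference between SPU and SKU

• SPU = Standard Product Unit

An SPU is the smallest unit of aggregated product information: a reusable, easily searchable set of standardized information that describes the characteristics of a product. Put simply, goods that share the same attribute values and characteristics can be treated as one SPU.

For example, "iPhone X" identifies a product, so it is one SPU.

• SKU = Stock Keeping Unit

An SKU is the unit used to count stock going in and out; it can be a piece, a box, a pallet, and so on. It is the smallest stock unit that cannot be physically divided further. How it is applied depends on the type of business and the management model, and it is used most widely for clothing and footwear.

For example, "iPhone X 64G silver" is one SKU.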
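To make the two ids concrete, here is a minimal sketch in plain Java. The class ProductIds and its two methods are hypothetical helpers, not part of the original project; the fallback rule and the URL pattern simply mirror what the parsing code later in this article does, and the id used in main is a made-up value.

```java
public class ProductIds {

    // Hypothetical helper: returns the spu as a number, falling back to the
    // sku when the spu attribute is missing or empty.
    public static long resolveSpu(String spuValue, String skuValue) {
        if (spuValue == null || spuValue.isEmpty()) {
            spuValue = skuValue;
        }
        return Long.parseLong(spuValue);
    }

    // Hypothetical helper: builds the detail page URL from a sku id.
    public static String detailUrl(long sku) {
        return "https://item.jd.com/" + sku + ".html";
    }

    public static void main(String[] args) {
        // The id below is a made-up value for illustration only.
        System.out.println(resolveSpu("", "100012345"));
        System.out.println(detailUrl(100012345L));
    }
}
```

Several SKUs (for example different colours or storage sizes) can share one SPU, which is why the table in the next section stores both values.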

II. Project preparation

1. Preparing the table structure

Based on the requirements analysis, we create the following table:

CREATE DATABASE `day04_jdspider`;
USE `day04_jdspider`;

CREATE TABLE `jd_item` (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `spu` bigint(15) DEFAULT NULL COMMENT 'product group id (SPU)',
  `sku` bigint(15) DEFAULT NULL COMMENT 'smallest product unit id (SKU)',
  `title` varchar(1000) DEFAULT NULL COMMENT 'product title',
  -- two decimal places, so that a fractional price is not rounded to a whole number
  `price` double(10,2) DEFAULT NULL COMMENT 'product price',
  `pic` varchar(200) DEFAULT NULL COMMENT 'product image',
  `url` varchar(1500) DEFAULT NULL COMMENT 'product detail URL',
  `created` varchar(100) DEFAULT NULL COMMENT 'creation time',
  `updated` varchar(100) DEFAULT NULL COMMENT 'update time',
  PRIMARY KEY (`id`),
  KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1116 DEFAULT CHARSET=utf8 COMMENT='JD products';

 

2. Project setup

1) Create the project module

2) Add the pom dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <dependency>
        <groupId>com.mchange</groupId>
        <artifactId>c3p0</artifactId>
        <version>0.9.5.2</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.8</version>
        <scope>provided</scope>
    </dependency>

</dependencies>
<build>
    <plugins>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>

    </plugins>

</build>

3) Add the C3P0 configuration file: c3p0-config.xml

<c3p0-config>
    <!-- Default configuration used to create the connection pool -->
    <default-config>

        <!-- Connection parameters -->
        <property name="driverClass">com.mysql.jdbc.Driver</property>
        <property name="jdbcUrl">jdbc:mysql://localhost:3306/day04_jdspider</property>
        <property name="user">root</property>
        <property name="password">123456</property>

        <!-- Connection pool parameters -->
        <property name="initialPoolSize">5</property>
        <property name="maxPoolSize">10</property>
        <property name="checkoutTimeout">3000</property>
    </default-config>
</c3p0-config>

 

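4) Add the utility class

import com.mchange.v2.c3p0.ComboPooledDataSource;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class C3P0Utils {

    // One shared pool, configured from c3p0-config.xml on the classpath
    private static final ComboPooledDataSource dataSource = new ComboPooledDataSource();

    private C3P0Utils() {
    }

    // Borrows a connection from the pool; returns null if none could be obtained
    public static Connection getConnection() {
        Connection connection = null;
        try {
            connection = dataSource.getConnection();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return connection;
    }

    // Closes whichever of the three resources are non-null
    public static void closeAll(ResultSet resultSet, Statement statement, Connection connection) {
        try {
            if (resultSet != null) {
                resultSet.close();
            }
            if (statement != null) {
                statement.close();
            }
            if (connection != null) {
                connection.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}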
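One detail of closeAll is worth noting: the three close() calls share a single try block, so if an earlier close() throws, the later ones are skipped. The following is a minimal sketch of an alternative that closes each resource independently. QuietCloser is a hypothetical class name, not part of the original project; it relies only on the fact that ResultSet, Statement and Connection all implement java.lang.AutoCloseable.

```java
public class QuietCloser {

    // Closes every non-null resource in the order given. Each close() has its
    // own try/catch, so one failure does not prevent the remaining closes.
    // Returns how many resources were closed without an exception.
    public static int closeAll(AutoCloseable... resources) {
        int closed = 0;
        for (AutoCloseable resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
                closed++;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        // Stand-ins for JDBC objects: one whose close() fails, one that works.
        AutoCloseable failing = () -> { throw new IllegalStateException("close failed"); };
        AutoCloseable working = () -> System.out.println("closed");
        System.out.println(closeAll(failing, null, working));
    }
}
```

With JDBC objects the call would look like closeAll(resultSet, statement, connection), keeping the same closing order as the utility class above.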

 

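5) Add the POJO class:

Note: these annotations only work if the Lombok plugin is installed in IDEA and the Lombok dependency is declared in the pom. Otherwise, write the getters, setters, toString, no-args constructor and all-args constructor by hand.

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Item {
    // Primary key
    private Long id;

    // Standard product unit (product group)
    private Long spu;

    // Stock keeping unit (smallest product unit)
    private Long sku;

    // Product title
    private String title;

    // Product price
    private Double price;

    // Product image
    private String pic;

    // Product detail URL
    private String url;

    // Creation time
    private String created;

    // Update time
    private String updated;
}

3. Project development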

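1) Send the request and fetch the data

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        // 1. The URL of the first search results page
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

        // 2. Send the request and fetch the data with HttpClient
        // 2.1: create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2.2: create the request object (HttpGet or HttpPost)
        HttpGet httpGet = new HttpGet(indexUrl);

        // 2.3: set the request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        // 2.4: send the request and get the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 2.5: read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            // 2.6: release resources
            response.close();
        }
    }
}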
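The keyword=%E6%89%8B%E6%9C%BA part of the URL above is the search keyword for "mobile phone" in Chinese, percent-encoded as UTF-8. A small sketch of producing that value with the JDK's java.net.URLEncoder follows; SearchKeyword is a hypothetical class name used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchKeyword {

    // Percent-encodes a keyword as UTF-8 for use in a query string.
    public static String encode(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The keyword used in this article, written with Unicode escapes so
        // the source file does not depend on its own encoding.
        String keyword = "\u624b\u673a";
        System.out.println("https://search.jd.com/Search?keyword=" + encode(keyword) + "&page=1&click=0");
    }
}
```

Encoding the keyword in code rather than pasting the encoded form makes it easy to reuse the crawler for another search term.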

 

2) Parse the data. The parsing code (step 3 onward in the listing) is new in this step.

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        // 1. The URL of the first search results page
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

        // 2. Send the request and fetch the data with HttpClient
        // 2.1: create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2.2: create the request object (HttpGet or HttpPost)
        HttpGet httpGet = new HttpGet(indexUrl);

        // 2.3: set the request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        // 2.4: send the request and get the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 2.5: read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            // 2.6: release resources
            response.close();

            // 3. Parse the data with jsoup
            // 3.1: build a Document from the HTML
            Document document = Jsoup.parse(html);

            // 3.2: select the product li elements; each li holds one product
            Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

            List<Item> itemList = new ArrayList<>();
            for (Element li : lis) {
                // 3.3: get the product image URL and download the image
                Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                String imgUrl = "https:" + imgs.attr("src");

                // 3.3.1: request the image and read the response as a byte stream
                HttpGet imgGet = new HttpGet(imgUrl);
                CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                HttpEntity imgEntity = imgResponse.getEntity();
                // Do not use EntityUtils here: it is meant for reading text content
                InputStream inputStream = imgEntity.getContent();

                // 3.3.2: open a local output stream to the target file (the E:\jdImg directory must already exist)
                String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                FileOutputStream outputStream = new FileOutputStream(imgFileName);

                // 3.3.3: copy the image bytes to the local disk
                int len;
                byte[] b = new byte[1024];
                while ((len = inputStream.read(b)) != -1) {
                    outputStream.write(b, 0, len);
                }

                // 3.3.4: release resources
                outputStream.close();
                inputStream.close();
                imgResponse.close();

                // 3.4: parse the spu and sku (fall back to the sku when the spu is missing)
                String skuValue = li.attr("data-sku");
                String spuValue = li.attr("data-spu");
                if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                // 3.5: parse the product title
                Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                String title = ems.text();

                // 3.6: parse the product price
                Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                String price = priceLiEls.text();

                // 3.7: build the product detail URL
                String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                // 3.8: wrap the data in an Item
                Item item = new Item(null,
                        Long.parseLong(spuValue),
                        Long.parseLong(skuValue),
                        title,
                        Double.parseDouble(price),
                        imgFileName,
                        itemUrl,
                        new Date().toLocaleString(),
                        new Date().toLocaleString()
                );
                // 3.9: collect every parsed Item in the list
                itemList.add(item);
            }

            System.out.println("Fetched " + itemList.size() + " items");
        }
    }
}
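The image handling in step 3.3 relies on two small string operations: prefixing "https:" to a protocol-relative src value, and taking everything from the last dot as the file extension. The sketch below isolates that logic in plain Java so it can be tried without any network access. ImageNames and its methods are hypothetical names, and the sample URL in main is made up for illustration.

```java
import java.util.UUID;

public class ImageNames {

    // Turns a protocol-relative URL ("//host/path") into an absolute https URL;
    // URLs that already carry a scheme are returned unchanged.
    public static String toAbsolute(String src) {
        return src.startsWith("//") ? "https:" + src : src;
    }

    // Returns the extension including the dot, or "" when the last path
    // segment contains no dot.
    public static String extensionOf(String url) {
        int slash = url.lastIndexOf('/');
        int dot = url.lastIndexOf('.');
        return dot > slash ? url.substring(dot) : "";
    }

    // Builds a unique local file name inside the given directory prefix.
    public static String localName(String dir, String url) {
        return dir + UUID.randomUUID() + extensionOf(url);
    }

    public static void main(String[] args) {
        // A made-up image address in the protocol-relative form.
        String url = toAbsolute("//img10.360buyimg.com/n7/example.jpg");
        System.out.println(url);
        System.out.println(localName("E:\\jdImg\\", url));
    }
}
```

Checking that the dot comes after the last slash avoids mistaking the dot in the host name for an extension when the path itself has none.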

 

3) Save the data

3.1: First build a JDItemDao that performs the save

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class JDItemDao {

    // Saves every item in the list
    public void saveItem(List<Item> itemList) throws Exception {
        // 1. Borrow a connection from the pool
        Connection connection = C3P0Utils.getConnection();
        PreparedStatement statement = null;
        try {
            // 2. Create the prepared statement from the connection
            String sql = "insert into jd_item VALUES (null,?,?,?,?,?,?,?,?)";
            statement = connection.prepareStatement(sql);

            // 3. Execute the SQL once per item
            for (Item item : itemList) {
                // 3.1: bind the ? placeholders
                statement.setLong(1, item.getSpu());
                statement.setLong(2, item.getSku());
                statement.setString(3, item.getTitle());
                statement.setDouble(4, item.getPrice());
                statement.setString(5, item.getPic());
                statement.setString(6, item.getUrl());
                statement.setString(7, item.getCreated());
                statement.setString(8, item.getUpdated());

                // 3.2: execute the SQL
                statement.executeUpdate();
            }
        } finally {
            // 4. Release resources, even if an insert fails
            C3P0Utils.closeAll(null, statement, connection);
        }
    }
}
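The created and updated columns are stored as plain strings, and the crawler fills them with new Date().toLocaleString(), a method that has long been deprecated and whose output depends on the default locale of the machine. As a hedged alternative, the sketch below formats timestamps with java.time and a fixed pattern. Timestamps is a hypothetical helper name and the pattern is an assumption, not something the original project prescribes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class Timestamps {

    // Fixed, locale-independent pattern (an assumption chosen for this sketch).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats the given time with the fixed pattern.
    public static String format(LocalDateTime time) {
        return time.format(FORMAT);
    }

    // Formats the current local time with the fixed pattern.
    public static String now() {
        return format(LocalDateTime.now());
    }

    public static void main(String[] args) {
        System.out.println(now());
    }
}
```

A fixed pattern keeps the stored strings sortable and identical in shape regardless of where the crawler runs.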

 

 

3.2) Wire the DAO into the crawler. The save call (step 4 in the listing) is new in this step.

public class JdSpider {

    public static void main(String[] args) throws Exception {
        // 1. The URL of the first search results page
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

        // 2. Send the request and fetch the data with HttpClient
        // 2.1: create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // 2.2: create the request object (HttpGet or HttpPost)
        HttpGet httpGet = new HttpGet(indexUrl);

        // 2.3: set the request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        // 2.4: send the request and get the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        // 2.5: read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            // 2.6: release resources
            response.close();

            // 3. Parse the data with jsoup
            // 3.1: build a Document from the HTML
            Document document = Jsoup.parse(html);

            // 3.2: select the product li elements; each li holds one product
            Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

            List<Item> itemList = new ArrayList<>();
            for (Element li : lis) {
                // 3.3: get the product image URL and download the image
                Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                String imgUrl = "https:" + imgs.attr("src");

                // 3.3.1: request the image and read the response as a byte stream
                HttpGet imgGet = new HttpGet(imgUrl);
                CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                HttpEntity imgEntity = imgResponse.getEntity();
                // Do not use EntityUtils here: it is meant for reading text content
                InputStream inputStream = imgEntity.getContent();

                // 3.3.2: open a local output stream to the target file (the E:\jdImg directory must already exist)
                // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                FileOutputStream outputStream = new FileOutputStream(imgFileName);

                // 3.3.3: copy the image bytes to the local disk
                int len;
                byte[] b = new byte[1024];
                while ((len = inputStream.read(b)) != -1) {
                    outputStream.write(b, 0, len);
                }

                // 3.3.4: release resources
                outputStream.close();
                inputStream.close();
                imgResponse.close();

                // 3.4: parse the spu and sku (fall back to the sku when the spu is missing)
                String skuValue = li.attr("data-sku");
                String spuValue = li.attr("data-spu");
                if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                // 3.5: parse the product title
                Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                String title = ems.text();

                // 3.6: parse the product price
                Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                String price = priceLiEls.text();

                // 3.7: build the product detail URL
                String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                // 3.8: wrap the data in an Item
                Item item = new Item(null,
                        Long.parseLong(spuValue),
                        Long.parseLong(skuValue),
                        title,
                        Double.parseDouble(price),
                        imgFileName,
                        itemUrl,
                        new Date().toLocaleString(),
                        new Date().toLocaleString()
                );
                // 3.9: collect every parsed Item in the list
                itemList.add(item);
            }

            System.out.println("Fetched " + itemList.size() + " items");

            // 4. Save the data to MySQL
            JDItemDao jdItemDao = new JDItemDao();
            jdItemDao.saveItem(itemList);
        }
    }
}
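One fragile spot in the listing is Double.parseDouble(price): if the price selector matches nothing, the text is empty and the call throws NumberFormatException, which aborts the whole run. Below is a minimal sketch of a more forgiving parser. Prices is a hypothetical helper name, and returning null for an unusable value is a design assumption made for this example.

```java
public class Prices {

    // Returns the price as a Double, or null when the text is null, blank,
    // or not a valid number.
    public static Double parse(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        try {
            return Double.valueOf(trimmed);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("5999.00"));
        System.out.println(parse(""));
    }
}
```

A caller would still have to decide what to do with null, for example skipping the item, because the DAO's statement.setDouble(4, item.getPrice()) would fail when unboxing a null price.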

 

 

 

4) Pagination. The page counter, the while loop and step 5 are new in this step.

public class JdSpider {

    public static void main(String[] args) throws Exception {
        int page = 1;
        // 1. The URL of the first search results page; page n maps to the page parameter 2n - 1
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

        // 2. Send the request and fetch the data with HttpClient
        // 2.1: create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        while (page <= 100) {
            System.out.println("Processing page: " + page);
            System.out.println("Page URL: " + indexUrl);

            // 2.2: create the request object (HttpGet or HttpPost)
            HttpGet httpGet = new HttpGet(indexUrl);

            // 2.3: set the request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            // 2.4: send the request and get the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);

            // 2.5: read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();
            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                // 2.6: release resources
                response.close();

                // 3. Parse the data with jsoup
                // 3.1: build a Document from the HTML
                Document document = Jsoup.parse(html);

                // 3.2: select the product li elements; each li holds one product
                Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

                List<Item> itemList = new ArrayList<>();
                for (Element li : lis) {
                    // 3.3: get the product image URL and download the image
                    Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                    String imgUrl = "https:" + imgs.attr("src");

                    // 3.3.1: request the image and read the response as a byte stream
                    HttpGet imgGet = new HttpGet(imgUrl);
                    CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                    HttpEntity imgEntity = imgResponse.getEntity();
                    // Do not use EntityUtils here: it is meant for reading text content
                    InputStream inputStream = imgEntity.getContent();

                    // 3.3.2: open a local output stream to the target file (the E:\jdImg directory must already exist)
                    // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                    String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                    FileOutputStream outputStream = new FileOutputStream(imgFileName);

                    // 3.3.3: copy the image bytes to the local disk
                    int len;
                    byte[] b = new byte[1024];
                    while ((len = inputStream.read(b)) != -1) {
                        outputStream.write(b, 0, len);
                    }

                    // 3.3.4: release resources
                    outputStream.close();
                    inputStream.close();
                    imgResponse.close();

                    // 3.4: parse the spu and sku (fall back to the sku when the spu is missing)
                    String skuValue = li.attr("data-sku");
                    String spuValue = li.attr("data-spu");
                    if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                    // 3.5: parse the product title
                    Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                    String title = ems.text();

                    // 3.6: parse the product price
                    Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                    String price = priceLiEls.text();

                    // 3.7: build the product detail URL
                    String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                    // 3.8: wrap the data in an Item
                    Item item = new Item(null,
                            Long.parseLong(spuValue),
                            Long.parseLong(skuValue),
                            title,
                            Double.parseDouble(price),
                            imgFileName,
                            itemUrl,
                            new Date().toLocaleString(),
                            new Date().toLocaleString()
                    );
                    // 3.9: collect every parsed Item in the list
                    itemList.add(item);
                }

                System.out.println("Fetched " + itemList.size() + " items");

                // 4. Save the data to MySQL
                JDItemDao jdItemDao = new JDItemDao();
                jdItemDao.saveItem(itemList);
            } else {
                // Release the connection on a non-200 response as well
                response.close();
            }

            // 5. Move on to the next page. This is done outside the if so that a
            //    non-200 response cannot make the loop request the same page forever.
            page++;
            indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
        }

        // 6. Release resources: do not put this inside the while loop
        httpClient.close();
    }
}
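The expression (page * 2 - 1) in the listing means that the logical pages 1, 2, 3, ... are requested with the page parameter values 1, 3, 5, .... The sketch below pulls that mapping into one place so the URL is not assembled twice. SearchPages and urlFor are hypothetical names introduced for illustration.

```java
public class SearchPages {

    private static final String BASE =
            "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA";

    // Builds the search URL for a 1-based logical page number, using the same
    // mapping as the crawler: page n becomes page parameter 2n - 1.
    public static String urlFor(int page) {
        return BASE + "&page=" + (page * 2 - 1) + "&click=0";
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {
            System.out.println(page + " -> " + urlFor(page));
        }
    }
}
```

With such a helper the loop only has to increment page and call urlFor(page), instead of repeating the string concatenation before and inside the loop.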

 

This completes the basic JD crawler case study.

III. Optimizing the crawler project

Extract the code for each stage into its own method.

• Extract a method that fetches the HTML for a given URL

// Returns the page HTML, or null when the status code is not 200
public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

    // 2.2: create the request object (HttpGet or HttpPost)
    HttpGet httpGet = new HttpGet(indexUrl);

    // 2.3: set the request headers
    httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

    // 2.4: send the request and get the response object
    CloseableHttpResponse response = httpClient.execute(httpGet);
    try {
        // 2.5: read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
        return null;
    } finally {
        // 2.6: release resources on every path, not only on a 200 response
        response.close();
    }
}
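Because getHtml signals failure by returning null, the caller has to decide how often to try again. The sketch below shows one possible shape of a bounded retry in plain Java. Retry and firstNonNull are hypothetical names, and the Supplier in main is a stand-in for a real fetch so the example runs without network access.

```java
import java.util.function.Supplier;

public class Retry {

    // Calls fetch up to maxAttempts times and returns the first non-null
    // result, or null when every attempt returned null.
    public static String firstNonNull(Supplier<String> fetch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String result = fetch.get();
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for a real fetch: fails twice, then succeeds.
        int[] calls = {0};
        String html = firstNonNull(() -> ++calls[0] < 3 ? null : "<html></html>", 5);
        System.out.println(html + " after " + calls[0] + " calls");
    }
}
```

Since getHtml declares throws Exception, wrapping it in a Supplier would need a try/catch inside the lambda; a short pause between attempts would also be a reasonable addition.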

 

 

• Extract a method that parses one page of data

public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {
    // 3. Parse the data with jsoup
    // 3.1: build a Document from the HTML
    Document document = Jsoup.parse(html);

    // 3.2: select the product li elements; each li holds one product
    Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

    List<Item> itemList = new ArrayList<>();
    for (Element li : lis) {
        // 3.3: get the product image URL and download the image
        Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
        String imgUrl = "https:" + imgs.attr("src");

        // 3.3.1: request the image and read the response as a byte stream
        HttpGet imgGet = new HttpGet(imgUrl);
        CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
        HttpEntity imgEntity = imgResponse.getEntity();
        // Do not use EntityUtils here: it is meant for reading text content
        InputStream inputStream = imgEntity.getContent();

        // 3.3.2: open a local output stream to the target file (the E:\jdImg directory must already exist)
        // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
        String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
        FileOutputStream outputStream = new FileOutputStream(imgFileName);

        // 3.3.3: copy the image bytes to the local disk
        int len;
        byte[] b = new byte[1024];
        while ((len = inputStream.read(b)) != -1) {
            outputStream.write(b, 0, len);
        }

        // 3.3.4: release resources
        outputStream.close();
        inputStream.close();
        imgResponse.close();

        // 3.4: parse the spu and sku (fall back to the sku when the spu is missing)
        String skuValue = li.attr("data-sku");
        String spuValue = li.attr("data-spu");
        if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

        // 3.5: parse the product title
        Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
        String title = ems.text();

        // 3.6: parse the product price
        Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
        String price = priceLiEls.text();

        // 3.7: build the product detail URL
        String itemUrl = "https://item.jd.com/" + skuValue + ".html";

        // 3.8: wrap the data in an Item
        Item item = new Item(null,
                Long.parseLong(spuValue),
                Long.parseLong(skuValue),
                title,
                Double.parseDouble(price),
                imgFileName,
                itemUrl,
                new Date().toLocaleString(),
                new Date().toLocaleString()
        );
        // 3.9: collect every parsed Item in the list
        itemList.add(item);
    }
    return itemList;
}
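The manual read/write loop in step 3.3.3 can also be expressed with the JDK's java.nio.file.Files.copy, which copies an InputStream to a file in one call. The sketch below is one possible variant, not the article's own method. StreamSaver is a hypothetical class name, and main uses an in-memory stream and a temporary file so that it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamSaver {

    // Copies all bytes from the stream to the target file, replacing an
    // existing file, and returns the number of bytes written. The stream is
    // closed afterwards, whether or not the copy succeeded.
    public static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // In the crawler the stream would come from imgEntity.getContent().
        Path target = Files.createTempFile("jd-img-", ".jpg");
        long written = save(new ByteArrayInputStream(new byte[]{1, 2, 3}), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

The try-with-resources block guarantees that the stream is closed even when the copy fails, which the manual loop above does not.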

 

 

• The complete code after extraction

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        int page = 1;
        // 1. The URL of the first search results page; page n maps to the page parameter 2n - 1
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

        // 2. Send the request and fetch the data with HttpClient
        // 2.1: create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        while (page <= 100) {
            System.out.println("Processing page: " + page);
            System.out.println("Page URL: " + indexUrl);

            String html = getHtml(indexUrl, httpClient);
            if (html != null) {
                // 3. Parse the data with jsoup
                List<Item> itemList = parseHtmlToListItem(httpClient, html);
                System.out.println("Fetched " + itemList.size() + " items");

                // 4. Save the data to MySQL
                JDItemDao jdItemDao = new JDItemDao();
                jdItemDao.saveItem(itemList);
            }

            // 5. Move on to the next page. This is done outside the if so that a
            //    failed page cannot make the loop request the same page forever.
            page++;
            indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
        }

        // 6. Release resources: do not put this inside the while loop
        httpClient.close();
    }

    // Parses one page of search results into a list of items
    public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {
        // 3. Parse the data with jsoup
        // 3.1: build a Document from the HTML
        Document document = Jsoup.parse(html);

        // 3.2: select the product li elements; each li holds one product
        Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");

        List<Item> itemList = new ArrayList<>();
        for (Element li : lis) {
            // 3.3: get the product image URL and download the image
            Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
            String imgUrl = "https:" + imgs.attr("src");

            // 3.3.1: request the image and read the response as a byte stream
            HttpGet imgGet = new HttpGet(imgUrl);
            CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
            HttpEntity imgEntity = imgResponse.getEntity();
            // Do not use EntityUtils here: it is meant for reading text content
            InputStream inputStream = imgEntity.getContent();

            // 3.3.2: open a local output stream to the target file (the E:\jdImg directory must already exist)
            // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
            String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
            FileOutputStream outputStream = new FileOutputStream(imgFileName);

            // 3.3.3: copy the image bytes to the local disk
            int len;
            byte[] b = new byte[1024];
            while ((len = inputStream.read(b)) != -1) {
                outputStream.write(b, 0, len);
            }

            // 3.3.4: release resources
            outputStream.close();
            inputStream.close();
            imgResponse.close();

            // 3.4: parse the spu and sku (fall back to the sku when the spu is missing)
            String skuValue = li.attr("data-sku");
            String spuValue = li.attr("data-spu");
            if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

            // 3.5: parse the product title
            Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
            String title = ems.text();

            // 3.6: parse the product price
            Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
            String price = priceLiEls.text();

            // 3.7: build the product detail URL
            String itemUrl = "https://item.jd.com/" + skuValue + ".html";

            // 3.8: wrap the data in an Item
            Item item = new Item(null,
                    Long.parseLong(spuValue),
                    Long.parseLong(skuValue),
                    title,
                    Double.parseDouble(price),
                    imgFileName,
                    itemUrl,
                    new Date().toLocaleString(),
                    new Date().toLocaleString()
            );
            // 3.9: collect every parsed Item in the list
            itemList.add(item);
        }
        return itemList;
    }

    // Returns the page HTML, or null when the status code is not 200
    public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

        // 2.2: create the request object (HttpGet or HttpPost)
        HttpGet httpGet = new HttpGet(indexUrl);

        // 2.3: set the request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        // 2.4: send the request and get the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);
        try {
            // 2.5: read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();
            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            }
            return null;
        } finally {
            // 2.6: release resources on every path, not only on a 200 response
            response.close();
        }
    }
}

Original article: https://www.cnblogs.com/shan13936/p/13969718.html