HttpClient Basics

HttpClient is a library used mainly to send requests to third-party servers, fetch the pages they return, and extract the data we need.

Setting the User-Agent request header in HttpClient to imitate a browser

First create a Maven project, then add the httpclient dependency (version 4.5.2 here):

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

Create Demo01:

package com.demo.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo01 {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the response first and then the client, even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) { // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the GET request
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as text
            }
        }
    }
}

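Response:

Page content:

<!DOCTYPE html>
<html>
    <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
        <p>系统检测亲不是真人行为，因系统资源限制，我们只能拒绝你的请求。如果你有疑问，可以通过微博 http://weibo.com/tuicool2012/ 联系我们。</p>
    </body>
</html>

The message in the page body says, roughly: "The system has detected that you are not behaving like a real person. Because of limited system resources, we can only reject your request. If you have questions, you can contact us via Weibo at http://weibo.com/tuicool2012/." In other words, the site has rejected the default HttpClient request as a bot.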

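Let's imitate a browser by setting the User-Agent request header. Add the following line before the request is executed:

httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header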

package com.demo.httpclient.chap02;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Demo01 {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the response first and then the client, even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) { // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the GET request
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) { // execute the GET request
                HttpEntity entity = response.getEntity(); // get the response entity
                System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // read the body as text
            }
        }
    }
}

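Firebug in Firefox also shows the other request headers that the browser sends.

Any of them can be set as a key/value pair with the setHeader method, so that the request looks like it comes from a browser.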
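As a rough sketch of setting several headers at once, the browser-like headers can be kept in one map and pushed into the request through setHeader. The class name BrowserHeaders, the applyTo helper, and the Accept and Accept-Language values are assumptions made for illustration; only the User-Agent value comes from the example above. Copy the real values from your own browser's network panel.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper: keeps browser-like request headers in one place.
final class BrowserHeaders {

    // Insertion-ordered so the headers are applied in the order listed here.
    static final Map<String, String> HEADERS = new LinkedHashMap<>();

    static {
        // User-Agent is the value used in Demo01; the other two are illustrative assumptions.
        HEADERS.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        HEADERS.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        HEADERS.put("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
    }

    // Calls setter once per header with (name, value).
    static void applyTo(BiConsumer<String, String> setter) {
        HEADERS.forEach(setter);
    }

    public static void main(String[] args) {
        // Stand-in for a real request: just print each header that would be set.
        applyTo((name, value) -> System.out.println(name + ": " + value));
    }
}
```

In Demo01, the call would be BrowserHeaders.applyTo(httpGet::setHeader); placed before httpClient.execute(httpGet). Taking a BiConsumer keeps the helper free of any HttpClient import, so it can be tried without the dependency.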

Getting the response Content-Type with HttpClient

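Every response body has a type, which the server reports in the Content-Type header.

Firebug in Firefox shows it among the response headers.

HttpClient exposes it through its API as well.

Calling getContentType().getValue() on the HttpEntity returns the response's content type: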

System.out.println("Content-Type:"+entity.getContentType().getValue());
// System.out.println("Page content: "+EntityUtils.toString(entity, "utf-8")); // page body, commented out here

Output:

Content-Type:text/html

Ordinary web pages are text/html, and some also carry a charset.

For example, requesting www.tuicool.com prints:

Content-Type:text/html; charset=utf-8
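Since the header may or may not carry a charset, one option is to read the charset from it and fall back to a default instead of always hardcoding "utf-8". The following is a minimal sketch that works on the header string alone; the class name ContentTypeCharset and the method charsetOf are made up for this example, and the parsing is deliberately simple (it only looks for a charset= parameter after the first semicolon).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: extracts the charset parameter from a Content-Type value.
final class ContentTypeCharset {

    // Returns the charset named in the header, or fallback if it is missing or unknown.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The string passed in would be the value returned by entity.getContentType().getValue(). Note that getContentType() can return null when the server sends no Content-Type header, so check for that before calling getValue().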

When requesting a JavaScript file, for example http://www.baidu.com/static/js/jQuery.js

Output:

Content-Type:application/javascript

When requesting a file, for example http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar

Output:

Content-Type:application/java-archive

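There are many more Content-Type values. Why does this matter for a crawler? While crawling, we can use the Content-Type to pick out the pages we want to fetch, or to filter out the ones we want to skip.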
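The filtering idea can be sketched as a small check on the header value. This is only one possible approach, assuming the crawler wants HTML pages and nothing else; the class name ContentTypeFilter and its methods are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: decides from the Content-Type value whether a response is an HTML page.
final class ContentTypeFilter {

    // Strips parameters such as "; charset=utf-8" and normalizes case.
    static String mimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // True only for text/html, with or without a charset parameter.
    static boolean isHtml(String contentType) {
        return "text/html".equals(mimeType(contentType));
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html",
            "text/html; charset=utf-8",
            "application/javascript",
            "application/java-archive"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isHtml(sample) ? "crawl" : "skip"));
        }
    }
}
```

With the four values seen in this section, the first two would be crawled and the script and jar file skipped.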

Getting the response status with HttpClient

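When HttpClient sends a request to a server, a successful request normally returns status code 200.

Not every request succeeds, though.

If the requested address does not exist, the server returns 404.

If the server hits an internal error, it returns 500.

Some servers have anti-scraping measures: if you scrape data too frequently, they return 403 and refuse your requests.

There is a way around that, using proxy IPs, which will be covered later.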

To get the status code, call getStatusLine().getStatusCode() on the CloseableHttpResponse object:

System.out.println("Status:"+response.getStatusLine().getStatusCode());

Output:

Status:200

Content-Type:text/html;charset=UTF-8

If we switch to another page, http://www.baidu.com/aaa.jsp, which does not exist, the status returned is 404.
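A crawler usually wants to react differently to each of the status codes described in this section. The sketch below maps the four codes mentioned above to an outcome; the class name StatusCheck, the Outcome enum, and the choice to lump every other code into OTHER are assumptions for illustration, not part of HttpClient.

```java
// Hypothetical helper: maps the status codes discussed above to a crawler decision.
final class StatusCheck {

    enum Outcome { OK, NOT_FOUND, FORBIDDEN, SERVER_ERROR, OTHER }

    static Outcome classify(int statusCode) {
        switch (statusCode) {
            case 200:
                return Outcome.OK;           // request succeeded, body can be parsed
            case 404:
                return Outcome.NOT_FOUND;    // address does not exist
            case 403:
                return Outcome.FORBIDDEN;    // server refused, possibly anti-scraping
            case 500:
                return Outcome.SERVER_ERROR; // server-side failure
            default:
                return Outcome.OTHER;        // anything not covered in this section
        }
    }

    public static void main(String[] args) {
        int[] samples = {200, 404, 403, 500};
        for (int code : samples) {
            System.out.println("Status:" + code + " -> " + classify(code));
        }
    }
}
```

In Demo01, the argument would be response.getStatusLine().getStatusCode(), and the body would only be read when the outcome is OK.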