Web crawler: setting a proxy with HttpClientDownloader

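Starting with version 0.7.1, WebMagic uses a new proxy API, ProxyProvider. Whereas Site holds "configuration", ProxyProvider is positioned more as a "component", so the proxy is no longer set on Site; it is set on HttpClientDownloader instead.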

API description
HttpClientDownloader.setProxyProvider(ProxyProvider proxyProvider) sets the proxy.

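ProxyProvider has one default implementation: SimpleProxyProvider. It is a simple round-robin ProxyProvider with no failure checking. You can configure any number of candidate proxies, and each request picks the next one in order. It is suited to cases where you run your own, reasonably stable proxies.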
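
To make "round-robin without failure checking" concrete, here is a minimal sketch in plain Java. It is not WebMagic's SimpleProxyProvider source; the class name RoundRobinSketch and the "host:port" string representation are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for illustration; not WebMagic's SimpleProxyProvider.
final class RoundRobinSketch {
    private final List<String> proxies;                          // candidates as "host:port"
    private final AtomicInteger pointer = new AtomicInteger(-1); // index of the last proxy handed out

    RoundRobinSketch(List<String> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = List.copyOf(proxies);
    }

    // Returns the next candidate in order, wrapping around at the end.
    // There is no health check: a dead proxy is still returned on its turn.
    String next() {
        int index = Math.floorMod(pointer.incrementAndGet(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch pool = new RoundRobinSketch(
                List.of("101.101.101.101:8888", "102.102.102.102:8888"));
        // Alternates between the two candidates.
        for (int i = 0; i < 4; i++) {
            System.out.println(pool.next());
        }
    }
}
```

Because nothing is ever removed from the rotation, an unreachable proxy keeps failing its share of requests, which is why this strategy is recommended only for stable proxies.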

Proxy examples:

  1. Use a single ordinary HTTP proxy at 101.101.101.101, port 8888, with the username "username" and the password "password"
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
    httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101", 8888, "username", "password")));
    spider.setDownloader(httpClientDownloader);
  2. Use a proxy pool containing the two IPs 101.101.101.101 and 102.102.102.102, with no credentials
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
    httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("101.101.101.101", 8888),
        new Proxy("102.102.102.102", 8888)));
    spider.setDownloader(httpClientDownloader);

If you have suggestions about the proxy feature, you are welcome to join the discussion in #579, "More ProxyProvider implementations".
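
As one possible direction for such an implementation, the sketch below adds the failure check that SimpleProxyProvider lacks: a proxy is dropped after a number of consecutive failures. It is plain Java and deliberately does not implement WebMagic's ProxyProvider interface; the class FailureAwarePoolSketch, its borrow and report methods, and the maxFailures threshold are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are part of WebMagic.
final class FailureAwarePoolSketch {
    private final List<String> alive = new ArrayList<>();           // proxies still in rotation
    private final Map<String, Integer> failures = new HashMap<>();  // consecutive failures per proxy
    private final int maxFailures;                                  // assumed eviction threshold
    private int cursor = 0;

    FailureAwarePoolSketch(List<String> proxies, int maxFailures) {
        this.alive.addAll(proxies);
        this.maxFailures = maxFailures;
    }

    // Hands out live proxies in rotation; returns null when none are left.
    synchronized String borrow() {
        if (alive.isEmpty()) {
            return null;
        }
        cursor = cursor % alive.size();
        return alive.get(cursor++);
    }

    // Records the outcome of a download made through the given proxy.
    synchronized void report(String proxy, boolean success) {
        if (success) {
            failures.remove(proxy);   // a success resets the failure streak
            return;
        }
        int count = failures.merge(proxy, 1, Integer::sum);
        if (count >= maxFailures) {
            alive.remove(proxy);      // evict after too many consecutive failures
            failures.remove(proxy);
        }
    }
}
```

Wiring this into a spider would still require an adapter that implements ProxyProvider and reports each download result back to the pool; that part is left out here.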

package com.mwq.job.task;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

@Component
public class ProxyTest implements PageProcessor {
    // Scheduled entry point; an overload distinct from PageProcessor.process(Page) below.
    @Scheduled(fixedDelay = 1000)
    public void process() {
        // Create the downloader
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        // Configure the downloader with the proxy server
        httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("150.109.32.166", 80)));
        Spider.create(new ProxyTest())
                .addUrl("http://ip.chinaz.com/getip.aspx")
                .setDownloader(httpClientDownloader)
                .run();
    }

    @Override
    public void process(Page page) {
        System.out.println(page.getHtml().toString());
    }

    private Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }
}

Two websites that offer free proxies:

Mimvp Proxy (米扑代理): https://proxy.mimvp.com/free.php

Xici free proxy (西刺免费代理): http://www.xicidaili.com/

Original article: https://www.cnblogs.com/mwq1992/p/14219596.html