Java 使用正则表达式取出图片地址以及跳转的链接地址,来判断死链(一)

任务:通过driver的getPageSource()获取网页的源码内容,在把网页中图片链接地址和跳转的url地址进行过滤,在get每个请求,来判断是否是死链

如图:

获取网页源码中所有的href,以及img src后的链接 

代码实现:

调用代码实现,正则表达式

public void home_page(){
        op.loopGet(home, 40, 3, 60);
       String  source=driver.getPageSource();//获取网页源码
      //  System.out.println(source);       
        String imageSrc="img\s*src="?(http:"?(.*?)("|>|\s+))";//图片的正则表达式   //要注意https的数据是否能loading出来,要注意查看
        String jumpAdders="a\s*href="?(http:"?(.*?)("|>|\s+))";//获取html的地址
        Regular(imageSrc,source);
        Regular(jumpAdders,source);
    }

Regular方法,使用正则表达式

public void  Regular(String expressions, String sourceFile) {
        Map<String, String> result = new HashMap<String, String>();
        Pattern p = Pattern.compile(expressions);
        Matcher m = p.matcher(sourceFile);
        while (m.find()) {
            //System.out.println(m.group());  //需要做对比是否需要全部去出数据更快,
            String regularURL = m.group().replace("img src=", "").replace("a href=", "");
            regularURL=regularURL.substring(1,regularURL.length()-1);//会多引号
            result = Pub.get(regularURL);        
            if (!"200".equals(result.get("Code"))) {
                Log.logError("请求失败,请检查图片或者是网页链接否正常显示,请求地址为:"+regularURL);
            }
        }
        System.out.println("**********************");
    
    }

Pub.get方法,发送get请求

public static Map<String, String> get(String url) {
        int defaultConnectTimeOut = 30000; // 默认连接超时,毫秒
        int defaultReadTimeOut = 30000; // 默认读取超时,毫秒

        Map<String, String> result = new HashMap<String, String>();
        BufferedReader in = null;

        try {
            Log.logInfo("通过java请求访问:["+url+"]");
            // 打开和URL之间的连接
            URLConnection connection = new URL(url).openConnection();
            // 此处的URLConnection对象实际上是根据URL的请求协议(此处是http)生成的URLConnection类的子类HttpURLConnection
            // 故此处最好将其转化为HttpURLConnection类型的对象,以便用到HttpURLConnection更多的API.
            HttpURLConnection httpURLConnection = (HttpURLConnection) connection;

            // 设置通用的请求属性
            httpURLConnection.setRequestProperty("accept", "*/*");
            httpURLConnection.setRequestProperty("connection", "Keep-Alive");
            httpURLConnection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            httpURLConnection.setConnectTimeout(defaultConnectTimeOut);
            httpURLConnection.setReadTimeout(defaultReadTimeOut);
            if (staging != null) {
                httpURLConnection.setRequestProperty("Cookie", staging.toString());
            }
            if (ORIGINDC != null) {
                httpURLConnection.setRequestProperty("Cookie", ORIGINDC.toString());
                ORIGINDC = null;
            }

            // // Fidder监听请求
            // if ((!proxyHost.equals("") && !proxyPort.equals(""))) {
            // System.setProperty("http.proxyHost", proxyHost);
            // System.setProperty("http.proxyPort", proxyPort);
            // }

            // 建立连接
            httpURLConnection.connect();
            result = getResponse(httpURLConnection, in, result);

        } catch (Exception requestException) {
            System.err.println("发送GET请求出现异常!" + requestException);
            // requestException.printStackTrace();
        }
        // 关闭输入流
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception closeException) {
                closeException.printStackTrace();
            }
        }
        
        return result;
    }

 结果展示:

图片正常展示

访问的链接地址,并查到某一处请求失效:

原文地址:https://www.cnblogs.com/chongyou/p/7286447.html