A First Look at Web Crawlers

Once you have learned regular expressions, you can use them to write a simple web crawler.

First, use network programming to load the web page into memory and save it locally.
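The "save it locally" part of this step can be sketched as follows. This is only a minimal illustration: the class name PageSaver, its save and load methods, the temporary file, and the UTF-8 encoding are all assumptions made for the example, and the HTML string stands in for content that would normally come from the network.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original listing.
class PageSaver {
    // Writes the page text to the target file as UTF-8, replacing any existing content.
    static void save(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    // Reads a previously saved page back into memory as a UTF-8 string.
    static String load(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for downloaded content, so the sketch runs without network access.
        String html = "<html><body><a href=\"https://www.163.com\">163</a></body></html>";

        // Assumption: a temporary file is good enough for a demonstration.
        Path file = Files.createTempFile("page", ".html");
        save(html, file);

        System.out.println("Saved " + Files.size(file) + " bytes to " + file);
        System.out.println("Round trip matches: " + load(file).equals(html));
    }
}
```

Keeping a local copy means the extraction step can be rerun and adjusted without downloading the page again.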

Then use regular expressions to extract the useful information, and finally print it to the console.
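The extraction step can be tried on a small in-memory string before touching a real page. The sketch below is an illustration under assumptions: the class name LinkExtractor and the method extractLinks are made up for this example, and the pattern only handles double-quoted href attributes. It uses a capturing group so that only the URL is returned, without the surrounding href="...".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original listing.
class LinkExtractor {
    // href="..." with a reluctant quantifier so each match stops at the first closing quote.
    // Group 1 holds the URL itself.
    private static final Pattern HREF = Pattern.compile("href=\"(.+?)\"");

    // Returns the value of every double-quoted href attribute, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup for illustration only.
        String html = "<a href=\"http://fashion.163.com/\">Fashion</a>"
                    + "<a href=\"http://dy.163.com/\">DY</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
        // prints:
        // http://fashion.163.com/
        // http://dy.163.com/
    }
}
```

A regular expression is enough for a quick exercise like this one, but it will miss single-quoted or unquoted attributes, so it should not be treated as a general HTML parser.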

Crawling all the links on the NetEase (163.com) home page

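import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpiderTest {
    // Downloads the page at toUrl and returns its text as one string
    // (an empty string if the download fails).
    public static String getUrlContent(String toUrl) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        // Assumption: the page is decoded as UTF-8.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(toUrl).openStream(), StandardCharsets.UTF_8))) {
            String temp;
            while ((temp = br.readLine()) != null) {
                sb.append(temp); // line breaks are dropped; the page becomes one long line
            }
        } catch (IOException e) { // also covers MalformedURLException
            e.printStackTrace();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = getUrlContent("https://www.163.com");
        //Pattern p = Pattern.compile("<a[\\s\\S]+?</a>"); // matches each whole <a>...</a> element
        Pattern p2 = Pattern.compile("href=\".+?\""); // matches each href="..." attribute
        //Pattern p2 = Pattern.compile("href=\"(.+?)\""); // group 1 captures only the URL
        Matcher m = p2.matcher(str);
        while (m.find()) {
            System.out.println(m.group());
            //System.out.println(m.group(1)); // use together with the capturing pattern above
        }
    }
}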

Output:

href="https://ent.163.com/19/0628/07/EIOA5VR000038FO9.html"
href="https://ent.163.com/19/0628/07/EIO7VG3U00038FO9.html"
href="http://fashion.163.com/"
href="http://lady.163.com/photoview/00A70026/115916.html#p=EIOGR4FS00A70026NOS"
href="http://lady.163.com/photoview/00A70026/115915.html#p=EIOGI7DD00A70026NOS"
href="http://dy.163.com/"
href="http://dy.163.com/v2/article/detail/EINGAP5J05259Q0E.html"
(many more links follow)

Original article: https://www.cnblogs.com/5aixin/p/11105473.html