A Simple Start with Java Web Crawlers

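The first step with Java is setting up the environment. Once that is done, you can start writing code. The editor I use here is IDEA (said to be very good, though I am not used to it yet, and I even installed a Chinese language pack).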

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Wrapper class and main method added so the snippet compiles; the class name is arbitrary
public class SimpleCrawler
{
    public static void main(String[] args)
    {
        // The link to visit
        String url = "http://www.baidu.com";
        // A string that stores the page content
        String result = "";
        // A buffered character input stream
        BufferedReader in = null;
        try
        {
            // Convert the string into a URL object
            URL realUrl = new URL(url);

            //HttpURLConnection conn = (HttpURLConnection) realUrl.openConnection();
            // Create a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Read timeout in milliseconds; 100 ms is very short, so raise it if reads time out
            connection.setReadTimeout(100);
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
            // Open the actual connection
            connection.connect();
            // Wrap the response stream in a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            // Temporarily holds each line that is read
            String line;
            while ((line = in.readLine()) != null)
            {
                // Append every line to result, restoring the line break that readLine() strips
                result += line + "\n";
            }
        } catch (Exception e)
        {
            System.out.println("Exception while sending the GET request! " + e);
            e.printStackTrace();
        } // Use finally to close the input stream
        finally
        {
            try
            {
                if (in != null)
                {
                    in.close();
                }
            } catch (Exception e2)
            {
                e2.printStackTrace();
            }
        }
        System.out.println(result);
    }
}

The fetching code is heavily commented, which makes it suitable for beginners, so I am sharing it here.

Result: it fetches the source code of the Baidu home page.
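
The read loop in the fetching code can probably be written more compactly. The sketch below is one possible variant, not the only way to do it: it uses try-with-resources instead of the finally block and a StringBuilder instead of repeated string concatenation. The class name ReadAllSketch and the helper readAll are hypothetical names made up for this illustration, and a StringReader stands in for the network stream so that it runs without a connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAllSketch {
    // Hypothetical helper: reads every line from the reader and joins them with '\n'
    static String readAll(Reader source) throws IOException {
        StringBuilder result = new StringBuilder();
        // try-with-resources closes the reader automatically, so no finally block is needed
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                result.append(line).append('\n');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample text standing in for a downloaded page
        String page = readAll(new StringReader("<html>\n<body>hi</body>\n</html>"));
        System.out.print(page);
    }
}
```

To use it with a real page, one would pass new InputStreamReader(connection.getInputStream()) to readAll in place of the StringReader.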

The next step is to format the HTML or to extract data from it with regular expressions.
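
As a rough idea of what the regular-expression step could look like, here is a minimal sketch that pulls the page title and the link targets out of an HTML string. The class RegexExtractSketch, the methods extractTitle and extractLinks, and the sample HTML are all made up for this illustration. Regular expressions only cope with simple, well-behaved markup, so a real HTML parser may be a better fit for messy pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {
    // Matches the text between <title> and </title>, ignoring case and spanning line breaks
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches the value of a double-quoted href attribute
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the trimmed title text, or null when the page has no <title> element
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns every double-quoted href value in document order
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample page, not real output from baidu.com
        String html = "<html><head><title>Example Page</title></head>"
                + "<body><a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/b\">B</a></body></html>";
        System.out.println(extractTitle(html));
        System.out.println(extractLinks(html));
    }
}
```

For the sample string this prints Example Page followed by [http://example.com/a, /b]. In practice the html argument would be the result string collected by the fetching code.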