[Java Crawler] A Web Crawler Approach

This is mainly about crawling the pages of one specific website. There are many ways to do it; here I jot down the general approach.

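Approach 1:

a. Send an HTTP request and take the static page that comes back.

b. Split the returned page string into an array of strings.

c. Iterate over the string array and use regular expressions to pick out the links you need.

d. Assemble the extracted links into full URLs and send requests to fetch those pages.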
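
Steps b through d can be sketched roughly as follows. This is only a minimal illustration, not the exact code used for the crawl: the class name `LinkExtractor`, the `href` pattern, and the choice to split on line breaks are all assumptions made for the example, and a regex like this will miss links that a real HTML parser would find.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class LinkExtractor {
    // Captures the value of href="..." or href='...' (assumed pattern, not exhaustive).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        // Step b: split the page into an array of strings (here: one entry per line).
        String[] lines = html.split("\\r?\\n");
        // Step c: walk the array and pick out links with the regex.
        for (String line : lines) {
            Matcher m = HREF.matcher(line);
            while (m.find()) {
                String href = m.group(1).trim();
                if (href.isEmpty() || href.startsWith("#")) {
                    continue; // in-page anchors are not new pages
                }
                try {
                    // Step d: turn relative links into absolute URLs.
                    URI resolved = base.resolve(href);
                    String scheme = resolved.getScheme();
                    boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
                    String absolute = resolved.toString();
                    // Keep only http(s) links and skip duplicates.
                    if (isHttp && !links.contains(absolute)) {
                        links.add(absolute);
                    }
                } catch (IllegalArgumentException e) {
                    // Malformed href (for example, one containing spaces): skip it.
                }
            }
        }
        return links;
    }
}
```

Each URL in the list returned by `extractLinks(page, pageUrl)` can then be requested in turn to fetch the next batch of pages.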

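In practice:

Problems I ran into: site CAPTCHAs, limits on the number of requests per unit of time, and data filled in via ajax. ajax POST requests are fairly easy to deal with, but CAPTCHAs and request-rate limits left me feeling pretty helpless, Orz...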
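
For the request-rate limit, one thing that may help is simply spacing requests out. The sketch below is an assumption on my part rather than something taken from the crawl itself: `RequestThrottle` is a hypothetical class name, and the right interval depends entirely on the target site. It does not get around a limit; it only keeps the crawler below a threshold you choose.

```java
// Hypothetical helper, named for illustration only.
class RequestThrottle {
    private final long minIntervalMillis;
    private long lastRequestNanos;
    private boolean first = true;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Blocks until at least minIntervalMillis has passed since the previous call.
    synchronized void await() {
        if (!first) {
            long elapsedMillis = (System.nanoTime() - lastRequestNanos) / 1_000_000;
            long waitMillis = minIntervalMillis - elapsedMillis;
            if (waitMillis > 0) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag so the caller can react to it.
                    Thread.currentThread().interrupt();
                }
            }
        }
        first = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

Calling `throttle.await()` right before each page request keeps consecutive requests at least the configured interval apart; the first call returns immediately.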

Method 1: fetch an entire page

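	// Requires Apache HttpClient 4.x on the classpath.
	import java.io.IOException;
	import org.apache.http.HttpEntity;
	import org.apache.http.HttpResponse;
	import org.apache.http.client.ClientProtocolException;
	import org.apache.http.client.HttpClient;
	import org.apache.http.client.methods.HttpGet;
	import org.apache.http.impl.client.DefaultHttpClient;
	import org.apache.http.util.EntityUtils;

	public static String getStringHtml(String url){
		// Create the client (DefaultHttpClient is deprecated since HttpClient 4.3)
		HttpClient client = new DefaultHttpClient();
		HttpGet getHttp = new HttpGet(url);
		// Holds the whole page; stays null if the request fails
		String content = null;
		HttpResponse response;
		
		try {
			response = client.execute(getHttp);
			// Read the response body
			HttpEntity entity = response.getEntity();
			if(entity!=null){
				// "UTF-8" is only the fallback used when the response declares no charset
				// (an assumption; adjust it for the target site)
				content = EntityUtils.toString(entity, "UTF-8");
				//System.out.println(content);
			}
		} catch (ClientProtocolException e) {
			// HTTP protocol error
			e.printStackTrace();
		} catch (IOException e) {
			// Connection problem or aborted request
			e.printStackTrace();
		}finally{
			// Release the connections held by the client
			client.getConnectionManager().shutdown();
		}
		return content;
	}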

Method: write the content out to a file in a specified folder

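	// Requires: java.io.File, java.io.FileOutputStream, java.io.OutputStream,
	// java.nio.charset.StandardCharsets
	public static void writetoFile(String context, String fileName) throws Exception {
		// Build the target file; the E:\htmlfile directory must already exist
		File file = new File("E:" + File.separator + "htmlfile" + File.separator + fileName);
		// Open an output stream for the file; try-with-resources closes it
		// even when opening or writing fails
		try (OutputStream out = new FileOutputStream(file)) {
			// Convert the content to a byte array (UTF-8 is an assumption;
			// use the charset the page was actually served in)
			byte[] data = context.getBytes(StandardCharsets.UTF_8);
			// Write the content to the file
			out.write(data);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}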