Web Crawler Crash Course (3): Encoding Detection

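The problem:
Pages downloaded with the method from the previous section occasionally come out as garbled HTML. The code in that section does only simple encoding detection: it takes the charset from the response headers,
or from the charset in a meta tag such as <meta http-equiv="Content-type" content="text/html; charset=gb2312" />.
Garbled pages still turn up, because the charset given by either source may be wrong.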
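To make the two hints concrete, here is a minimal sketch of how they might be pulled out of a response using only the JDK. The class name CharsetHints and both method names are made up for illustration and are not the code from the previous section. The meta lookup assumes the first part of the page has already been turned into a String, for example by decoding it as ISO-8859-1, which works because the tag itself is plain ASCII in ASCII-compatible encodings.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the declared charset, it does not verify it.
final class CharsetHints {

    // Matches charset=xxx, with or without quotes, ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches one <meta ...> tag so that charset= in body text is ignored.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Charset named in a Content-Type header value such as "text/html; charset=gb2312". */
    static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Charset named in the first meta tag that declares one. */
    static Optional<String> fromMeta(String html) {
        if (html == null) {
            return Optional.empty();
        }
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            Matcher m = CHARSET.matcher(tags.group());
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=gb2312"));
        System.out.println(fromMeta(
                "<meta http-equiv=\"Content-type\" content=\"text/html; charset=gb2312\" />"));
    }
}
```

Both methods only report what the server or the page claims, so their result is a starting guess and nothing more.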
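Solutions:
1 Specify the encoding manually
2 Detect the encoding automatically
If you crawl only one site, specifying its encoding by hand is enough,
but for large-scale crawling you cannot specify the encoding one site at a time.
This section introduces two packages that detect the encoding automatically.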
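The two solutions can also be combined: pin the encoding for the few sites you know well and let a detector handle everything else. The sketch below shows one way to wire that up. EncodingChooser, pin and choose are hypothetical names, and the detector is passed in as a plain function so that either of the packages in this section could be plugged in behind it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: manual per-host encodings first, automatic detection otherwise.
final class EncodingChooser {

    private final Map<String, Charset> pinned = new HashMap<>();
    private final Function<byte[], Charset> detector;

    EncodingChooser(Function<byte[], Charset> detector) {
        this.detector = detector;
    }

    /** Manually fixes the encoding for one host (host names are compared case-insensitively). */
    void pin(String host, Charset charset) {
        pinned.put(host.toLowerCase(Locale.ROOT), charset);
    }

    /** Returns the pinned charset for the host, or whatever the detector reports for the body. */
    Charset choose(String host, byte[] body) {
        Charset fixed = pinned.get(host.toLowerCase(Locale.ROOT));
        return fixed != null ? fixed : detector.apply(body);
    }

    public static void main(String[] args) {
        // Stand-in detector that always answers UTF-8; a real one would inspect the bytes.
        EncodingChooser chooser = new EncodingChooser(body -> StandardCharsets.UTF_8);
        chooser.pin("legacy.example.com", StandardCharsets.ISO_8859_1);
        System.out.println(chooser.choose("legacy.example.com", new byte[0]));
        System.out.println(chooser.choose("other.example.org", new byte[0]));
    }
}
```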

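Below are two Java encoding-detection packages, juniversalchardet and ICU4J, with usage examples. Similar packages exist for .NET, but I have forgotten their names.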

Reference:
http://code.google.com/p/juniversalchardet/



// DetectorDemo.java
package cn.tdt.crawl.encoding;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

public class DetectorDemo {

    public static void main(String[] args) throws IOException {

        // Sample file to inspect; change the path to a file on your machine.
        String fileName = "F:/qq.txt";

        // Method 1 (ICU4J): read the whole file and hand the bytes to Icu4jDetector.
        // byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(fileName));
        // String encoding = Icu4jDetector.getEncode(data);
        // System.out.println(encoding);

        // Method 2 (juniversalchardet): feed the file in chunks until the detector is done.
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        // try-with-resources closes the stream even if reading fails.
        try (InputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        // Signal the end of input so the detector can make its final guess.
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }
        // Reset the detector so the instance can be reused.
        detector.reset();
    }

}

// Icu4jDetector.java
package cn.tdt.crawl.encoding;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jDetector {

    /** Detects the encoding of a byte array; returns null if ICU4J finds no match. */
    public static String getEncode(byte[] data) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        return report(detector);
    }

    /**
     * Detects the encoding of a stream; returns null if ICU4J finds no match.
     * The url parameter plays no part in detection.
     */
    public static String getEncode(InputStream data, String url) throws IOException {
        CharsetDetector detector = new CharsetDetector();
        // setText(InputStream) relies on mark/reset, so wrap streams that lack it.
        detector.setText(data.markSupported() ? data : new BufferedInputStream(data));
        return report(detector);
    }

    // Prints the best match and all candidates with their confidence, then returns the best name.
    private static String report(CharsetDetector detector) {
        CharsetMatch match = detector.detect();
        if (match == null) {
            // detect() returns null when nothing matches; avoid a NullPointerException.
            System.out.println("No encoding detected.");
            return null;
        }
        System.out.println("The Content in " + match.getName());
        System.out.println("All possibilities");
        for (CharsetMatch m : detector.detectAll()) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:" + m.getConfidence());
        }
        return match.getName();
    }

}
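Both detectors return only a charset name, and that name may be null, unknown to the JVM, or simply a wrong guess. As a rough sketch of what to do with it next, the JDK-only helper below resolves the name with a fallback and checks whether the bytes really decode without errors. SafeDecoder, resolve and decodesCleanly are hypothetical names, and the choice of fallback charset is left to the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning a detected charset name into a usable Charset.
final class SafeDecoder {

    /** Resolves a detector-reported name to a Charset, or returns the fallback. */
    static Charset resolve(String name, Charset fallback) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // The name is not a legal charset name; use the fallback below.
        }
        return fallback;
    }

    /** Returns true if the bytes decode in the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] page = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        Charset cs = resolve("UTF-8", StandardCharsets.ISO_8859_1);
        if (decodesCleanly(page, cs)) {
            System.out.println(new String(page, cs));
        }
    }
}
```

A clean decode does not prove the guess is right, since single-byte charsets such as ISO-8859-1 accept any byte sequence, but a failed decode is a cheap signal that the declared or detected charset should be rejected.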
