Web Data Mining (11) - Aperture Data Extraction (7): Important APIs in Aperture

In my view, introducing Aperture's APIs in the abstract would likely leave readers baffled: an abstract API stripped of concrete context feels rather bloodless. From the very start, we come to understand things by moving from the particular to the general, from the concrete to the abstract. For that reason, this article again works through examples that supply that context.

Let's start with a simple data extraction program. The basic flow is:

1. Identify the file's MIME type from an InputStream;

2. Use the identified MIME type to obtain an ExtractorFactory, and from it an Extractor;

3. Call the Extractor's extract method to populate an RDFContainer;

4. Write the Model out as RDF.

The code example follows:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.util.Set;

import org.ontoware.rdf2go.RDF2Go;
import org.ontoware.rdf2go.model.Model;
import org.ontoware.rdf2go.model.Syntax;
import org.ontoware.rdf2go.model.node.URI;
import org.ontoware.rdf2go.model.node.impl.URIImpl;
import org.semanticdesktop.aperture.extractor.Extractor;
import org.semanticdesktop.aperture.extractor.ExtractorFactory;
import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
import org.semanticdesktop.aperture.rdf.RDFContainer;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;
import org.semanticdesktop.aperture.util.IOUtil;
import org.semanticdesktop.aperture.vocabulary.NIE;

public class ExtractorExample {

    public static void main(String[] args) throws Exception {
        // create a MimeTypeIdentifier
        MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

        // create an ExtractorRegistry containing all available ExtractorFactories
        ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

        // read as many bytes of the file as the MIME type identifier needs
        File file = new File("/home/chenying/web/news1.html");
        FileInputStream stream = new FileInputStream(file);
        BufferedInputStream buffer = new BufferedInputStream(stream);
        byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
        stream.close();

        // let the MimeTypeIdentifier determine the MIME type of this file
        String mimeType = identifier.identify(bytes, file.getPath(), null);

        // bail out when the MIME type could not be determined
        if (mimeType == null) {
            System.err.println("MIME type could not be established.");
            return;
        }

        // create the RDFContainer that will hold the RDF model
        URI uri = new URIImpl(file.toURI().toString());
        Model model = RDF2Go.getModelFactory().createModel();
        model.open();
        RDFContainer container = new RDFContainerImpl(model, uri);

        // determine and apply an Extractor that can handle this MIME type
        Set factories = extractorRegistry.getExtractorFactories(mimeType);
        if (factories != null && !factories.isEmpty()) {
            // just fetch the first available Extractor
            ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
            Extractor extractor = factory.get();

            // apply the extractor on the specified file
            // (open a new stream rather than reusing the previous one)
            stream = new FileInputStream(file);
            buffer = new BufferedInputStream(stream, 8192);
            extractor.extract(uri, buffer, Charset.forName("utf-8"), mimeType, container);
            stream.close();
        }

        // add the MIME type as an additional statement to the RDF model
        container.add(NIE.mimeType, mimeType);

        // report the output to System.out (flush, or the output may not appear)
        PrintWriter writer = new PrintWriter(System.out);
        container.getModel().writeTo(writer, Syntax.RdfXml);
        writer.flush();
    }
}
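A side note on why the file is opened twice: the identifier only consumes the first identifier.getMinArrayLength() bytes, so an alternative is to mark/reset a single BufferedInputStream and then hand that same stream to the extractor. A minimal sketch of this variant, as a hypothetical helper using only classes already shown above:

    // alternative to reopening the file: identify over a mark/reset-able stream
    static String identifyWithReset(MimeTypeIdentifier identifier, File file,
            BufferedInputStream input) throws java.io.IOException {
        input.mark(identifier.getMinArrayLength() + 1); // remember the stream start
        byte[] head = IOUtil.readBytes(input, identifier.getMinArrayLength());
        String mimeType = identifier.identify(head, file.getPath(), null);
        input.reset(); // rewind so the extractor later reads the file from the beginning
        return mimeType;
    }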

Running the class above prints the Model in Syntax.RdfXml format to the Eclipse console; my output was:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="file:/home/chenying/web/news1.html">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument"/>
    <plainTextContent xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">本文件为测试的解析文件
 </plainTextContent>
    <mimeType xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">text/html</mimeType>
</rdf:Description>

</rdf:RDF>
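Serializing the whole model is not the only way to get at the results: individual values can also be read back through the container's typed getters (Aperture's RDFContainer interface offers getString, among others). These lines, for instance, could replace the writeTo call at the end of main:

        // pull single values back out of the populated container
        System.out.println("plain text: " + container.getString(NIE.plainTextContent));
        System.out.println("mime type : " + container.getString(NIE.mimeType));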

Much of the core work in the example above is coded by hand; if that were all the intelligence Aperture had to offer, we might well feel deflated. But in my view a simple example is the doorway to the advanced usage; without it, we tend to get lost in the maze of the higher-level features.

Next, a slightly more advanced example. The basic flow is:

1. Create a Model;

2. Create an RDFContainer wrapping the Model;

3. Create a FileSystemDataSource and set its properties;

4. Create a FileSystemCrawler and set its DataSource, DataAccessorRegistry, and CrawlerHandler (the callback handler);

5. Call the FileSystemCrawler's crawl method.

The code example follows:

public class TutorialCrawlingExample {

    public static void main(String[] args) throws Exception {
        // create a new TutorialCrawlingExample instance
        TutorialCrawlingExample crawler = new TutorialCrawlingExample();

        if (args.length != 1) {
            System.err.println("Specify the root folder");
            System.exit(-1);
        }      
        // start crawling and exit afterwards
        crawler.doCrawling(new File(args[0]));
    }

    public void doCrawling(File rootFile) throws Exception {
        // create a model that will store the data source configuration
        Model model = RDF2Go.getModelFactory().createModel();
        // open the model
        model.open();
        // .. and wrap it in an RDFContainer
        RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"), false);

        // now create the data source
        FileSystemDataSource source = new FileSystemDataSource();
        // and set the configuration container
        source.setConfiguration(configuration);
        // now we can call the type-specific setters in each DataSource class
        source.setRootFolder(rootFile.getAbsolutePath());

        // setup a crawler that can handle this type of DataSource
        FileSystemCrawler crawler = new FileSystemCrawler();
        crawler.setDataSource(source);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        crawler.setCrawlerHandler(new TutorialCrawlerHandler());

        // start crawling
        crawler.crawl();
    }
}

class TutorialCrawlerHandler extends CrawlerHandlerBase {

    // our 'persistent' modelSet
    private ModelSet modelSet;

    public TutorialCrawlerHandler() throws ModelException {
        super(new MagicMimeTypeIdentifier(), new DefaultExtractorRegistry(), new DefaultSubCrawlerRegistry());
        modelSet = RDF2Go.getModelFactory().createModelSet();
        modelSet.open();
    }

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try {
            modelSet.writeTo(System.out, Syntax.RdfXml);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }

    public RDFContainer getRDFContainer(URI uri) {
        // we create a new in-memory temporary model for each data object
        Model model = RDF2Go.getModelFactory().createModel(uri);
        // a model needs to be opened before being wrapped in an RDFContainer
        model.open();
        return new RDFContainerImpl(model, uri);
    }

    public void objectNew(Crawler crawler, DataObject object) {
        // first we try to extract the information from the binary file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do proper logging in real applications
            x.printStackTrace();
        }
        // then we add this information to our persistent model
        modelSet.addModel(object.getMetadata().getModel());
        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // first we remove the old information about the data object
        modelSet.removeModel(object.getID());
        // then we try to extract metadata and fulltext from the file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do proper logging in real applications
            x.printStackTrace();
        }
        // and then we add the information from the temporary model to our
        // 'persistent' model
        modelSet.addModel(object.getMetadata().getModel());
        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectRemoved(Crawler crawler, URI uri) {
        // an object has been removed, so we delete it from the RDF store
        modelSet.removeModel(uri);
    }
}

After supplying the root-folder argument, running the class above likewise prints the crawled metadata in Syntax.RdfXml format (this time from the ModelSet).
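For example, reusing the folder from the first example (having the compiled class and the Aperture jars on the classpath is assumed; the path is just an illustration):

java TutorialCrawlingExample /home/chenying/web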

Note that in this example we never open a file's InputStream ourselves; Aperture does that automatically. TutorialCrawlerHandler, declared in the same source file, is the callback class for the FileSystemCrawler instance. This style of processing is quite similar to parsing XML files with SAX under Java's JAXP specification; the two feel like the same idea.
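To make the SAX analogy concrete: in both cases you register a callback object with an engine that drives the traversal and pushes events at you. Purely illustrative, using the standard JAXP classes (fully qualified to spare the imports):

    // the SAX parser walks the XML and calls back into the handler, just as
    // FileSystemCrawler walks the folder tree and calls objectNew()/objectChanged()
    static void saxAnalogy(java.io.File file) throws Exception {
        javax.xml.parsers.SAXParserFactory.newInstance().newSAXParser().parse(
            file,
            new org.xml.sax.helpers.DefaultHandler() {
                public void startElement(String uri, String localName,
                        String qName, org.xml.sax.Attributes atts) {
                    // invoked once per XML element, much as objectNew()
                    // is invoked once per crawled file
                }
            });
    }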

As the TutorialCrawlerHandler class shows, each Model is kept in a ModelSet (roughly, a collection of models). When writing the callback methods we could just as well persist the results to the file system, as sketched below.
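A minimal sketch of that file-system variant, swapping only the crawlStopped body shown above (the output file name is just an example):

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try {
            // persist the accumulated models to disk instead of dumping them to System.out
            java.io.OutputStream out = new java.io.FileOutputStream("crawl-results.rdf");
            modelSet.writeTo(out, Syntax.RdfXml); // same call as above, different sink
            out.close();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }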

--------------------------------------------------------------------------- 

This Web Data Mining series is my original work.

Author: 刺猬的温驯 (cnblogs / 博客园)

Link: http://www.cnblogs.com/chenying99/archive/2013/06/15/3137067.html

Copyright belongs to the author. Reproduction or commercial use without the author's consent is strictly prohibited; violators will be held legally liable.
