Web Data Mining (11) - Aperture Data Extraction (7): Important APIs in Aperture

In my view, introducing Aperture's APIs in the abstract would likely leave readers baffled: an abstract API stripped of concrete context feels rather bloodless. From the very start, we come to understand things by moving from the particular to the general, from the concrete to the abstract. For that reason, this article again works through examples that supply that context.

Let's start with a simple data extraction program. The basic flow is:

1. Identify the file's MIME type from an InputStream;

2. Use the identified MIME type to obtain an ExtractorFactory, and from it an Extractor;

3. Call the Extractor's extract method to populate an RDFContainer;

4. Write the Model out as RDF.

The code example follows:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.util.Set;

import org.ontoware.rdf2go.RDF2Go;
import org.ontoware.rdf2go.model.Model;
import org.ontoware.rdf2go.model.Syntax;
import org.ontoware.rdf2go.model.node.URI;
import org.ontoware.rdf2go.model.node.impl.URIImpl;
import org.semanticdesktop.aperture.extractor.Extractor;
import org.semanticdesktop.aperture.extractor.ExtractorFactory;
import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
import org.semanticdesktop.aperture.rdf.RDFContainer;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;
import org.semanticdesktop.aperture.util.IOUtil;
import org.semanticdesktop.aperture.vocabulary.NIE;

public class ExtractorExample {

    public static void main(String[] args) throws Exception {
        // create a MimeTypeIdentifier
        MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

        // create an ExtractorRegistry containing all available ExtractorFactories
        ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

        // read as many bytes of the file as the MIME type identifier needs
        File file = new File("/home/chenying/web/news1.html");
        FileInputStream stream = new FileInputStream(file);
        BufferedInputStream buffer = new BufferedInputStream(stream);
        byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
        stream.close();

        // let the MimeTypeIdentifier determine the MIME type of this file
        String mimeType = identifier.identify(bytes, file.getPath(), null);

        // bail out when the MIME type could not be determined
        if (mimeType == null) {
            System.err.println("MIME type could not be established.");
            return;
        }

        // create the RDFContainer that will hold the RDF model
        URI uri = new URIImpl(file.toURI().toString());
        Model model = RDF2Go.getModelFactory().createModel();
        model.open();
        RDFContainer container = new RDFContainerImpl(model, uri);

        // determine and apply an Extractor that can handle this MIME type
        Set factories = extractorRegistry.getExtractorFactories(mimeType);
        if (factories != null && !factories.isEmpty()) {
            // just fetch the first available Extractor
            ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
            Extractor extractor = factory.get();

            // apply the extractor on the specified file
            // (open a new stream rather than reusing the previous one)
            stream = new FileInputStream(file);
            buffer = new BufferedInputStream(stream, 8192);
            extractor.extract(uri, buffer, Charset.forName("utf-8"), mimeType, container);
            stream.close();
        }

        // add the MIME type as an additional statement to the RDF model
        container.add(NIE.mimeType, mimeType);

        // report the output to System.out (flush, or the output may not appear)
        PrintWriter writer = new PrintWriter(System.out);
        container.getModel().writeTo(writer, Syntax.RdfXml);
        writer.flush();
    }
}
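A side note on why the file is opened twice: the identifier only consumes the first identifier.getMinArrayLength() bytes, so an alternative is to mark/reset a single BufferedInputStream and then hand that same stream to the extractor. A minimal sketch of this variant, as a hypothetical helper using only classes already shown above:

    // alternative to reopening the file: identify over a mark/reset-able stream
    static String identifyWithReset(MimeTypeIdentifier identifier, File file,
            BufferedInputStream input) throws java.io.IOException {
        input.mark(identifier.getMinArrayLength() + 1); // remember the stream start
        byte[] head = IOUtil.readBytes(input, identifier.getMinArrayLength());
        String mimeType = identifier.identify(head, file.getPath(), null);
        input.reset(); // rewind so the extractor later reads the file from the beginning
        return mimeType;
    }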

Running the class above prints the Model in Syntax.RdfXml format to the Eclipse console; my output was:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="file:/home/chenying/web/news1.html">
    <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument"/>
    <plainTextContent xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">本文件为测试的解析文件
 </plainTextContent>
    <mimeType xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">text/html</mimeType>
</rdf:Description>

</rdf:RDF>
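Serializing the whole model is not the only way to get at the results: individual values can also be read back through the container's typed getters (Aperture's RDFContainer interface offers getString, among others). These lines, for instance, could replace the writeTo call at the end of main:

        // pull single values back out of the populated container
        System.out.println("plain text: " + container.getString(NIE.plainTextContent));
        System.out.println("mime type : " + container.getString(NIE.mimeType));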

Much of the core work in the example above is coded by hand; if that were all the intelligence Aperture had to offer, we might well feel deflated. But in my view a simple example is the doorway to the advanced usage; without it, we tend to get lost in the maze of the higher-level features.

Next, a slightly more advanced example. The basic flow is:

1. Create a Model;

2. Create an RDFContainer wrapping the Model;

3. Create a FileSystemDataSource and set its properties;

4. Create a FileSystemCrawler and set its DataSource, DataAccessorRegistry, and CrawlerHandler (the callback handler);

5. Call the FileSystemCrawler's crawl method.

The code example follows:

public class TutorialCrawlingExample {

    public static void main(String[] args) throws Exception {
        // create a new TutorialCrawlingExample instance
        TutorialCrawlingExample crawler = new TutorialCrawlingExample();

        if (args.length != 1) {
            System.err.println("Specify the root folder");
            System.exit(-1);
        }      
        // start crawling and exit afterwards
        crawler.doCrawling(new File(args[0]));
    }

    public void doCrawling(File rootFile) throws Exception {
        // create a model that will store the data source configuration
        Model model = RDF2Go.getModelFactory().createModel();
        // open the model
        model.open();
        // .. and wrap it in an RDFContainer
        RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"), false);

        // now create the data source
        FileSystemDataSource source = new FileSystemDataSource();
        // and set the configuration container
        source.setConfiguration(configuration);
        // now we can call the type-specific setters in each DataSource class
        source.setRootFolder(rootFile.getAbsolutePath());

        // setup a crawler that can handle this type of DataSource
        FileSystemCrawler crawler = new FileSystemCrawler();
        crawler.setDataSource(source);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        crawler.setCrawlerHandler(new TutorialCrawlerHandler());

        // start crawling
        crawler.crawl();
    }
}

class TutorialCrawlerHandler extends CrawlerHandlerBase {

    // our 'persistent' modelSet
    private ModelSet modelSet;

    public TutorialCrawlerHandler() throws ModelException {
        super(new MagicMimeTypeIdentifier(), new DefaultExtractorRegistry(), new DefaultSubCrawlerRegistry());
        modelSet = RDF2Go.getModelFactory().createModelSet();
        modelSet.open();
    }

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try {
            modelSet.writeTo(System.out, Syntax.RdfXml);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }

    public RDFContainer getRDFContainer(URI uri) {
        // we create a new in-memory temporary model for each data object
        Model model = RDF2Go.getModelFactory().createModel(uri);
        // a model needs to be opened before being wrapped in an RDFContainer
        model.open();
        return new RDFContainerImpl(model, uri);
    }

    public void objectNew(Crawler crawler, DataObject object) {
        // first we try to extract the information from the binary file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do proper logging in real applications
            x.printStackTrace();
        }
        // then we add this information to our persistent model
        modelSet.addModel(object.getMetadata().getModel());
        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // first we remove the old information about the data object
        modelSet.removeModel(object.getID());
        // then we try to extract metadata and fulltext from the file
        try {
            processBinary(crawler, object);
        }
        catch (Exception x) {
            // do proper logging in real applications
            x.printStackTrace();
        }
        // and then we add the information from the temporary model to our
        // 'persistent' model
        modelSet.addModel(object.getMetadata().getModel());
        // don't forget to dispose of the DataObject
        object.dispose();
    }

    public void objectRemoved(Crawler crawler, URI uri) {
        // an object has been removed, so we delete it from the RDF store
        modelSet.removeModel(uri);
    }
}

After supplying the root-folder argument, running the class above likewise prints the crawled metadata in Syntax.RdfXml format (this time from the ModelSet).
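For example, reusing the folder from the first example (having the compiled class and the Aperture jars on the classpath is assumed; the path is just an illustration):

java TutorialCrawlingExample /home/chenying/web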

Note that in this example we never open a file's InputStream ourselves; Aperture does that automatically. TutorialCrawlerHandler, declared in the same source file, is the callback class for the FileSystemCrawler instance. This style of processing is quite similar to parsing XML files with SAX under Java's JAXP specification; the two feel like the same idea.
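To make the SAX analogy concrete: in both cases you register a callback object with an engine that drives the traversal and pushes events at you. Purely illustrative, using the standard JAXP classes (fully qualified to spare the imports):

    // the SAX parser walks the XML and calls back into the handler, just as
    // FileSystemCrawler walks the folder tree and calls objectNew()/objectChanged()
    static void saxAnalogy(java.io.File file) throws Exception {
        javax.xml.parsers.SAXParserFactory.newInstance().newSAXParser().parse(
            file,
            new org.xml.sax.helpers.DefaultHandler() {
                public void startElement(String uri, String localName,
                        String qName, org.xml.sax.Attributes atts) {
                    // invoked once per XML element, much as objectNew()
                    // is invoked once per crawled file
                }
            });
    }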

As the TutorialCrawlerHandler class shows, each Model is kept in a ModelSet (roughly, a collection of models). When writing the callback methods we could just as well persist the results to the file system, as sketched below.
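A minimal sketch of that file-system variant, swapping only the crawlStopped body shown above (the output file name is just an example):

    public void crawlStopped(Crawler crawler, ExitCode exitCode) {
        try {
            // persist the accumulated models to disk instead of dumping them to System.out
            java.io.OutputStream out = new java.io.FileOutputStream("crawl-results.rdf");
            modelSet.writeTo(out, Syntax.RdfXml); // same call as above, different sink
            out.close();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            modelSet.close();
        }
    }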

--------------------------------------------------------------------------- 

This Web Data Mining series is my original work.

Author: 刺猬的温驯 (cnblogs / 博客园)

Link: http://www.cnblogs.com/chenying99/archive/2013/06/15/3137067.html

Copyright belongs to the author. Reproduction or commercial use without the author's consent is strictly prohibited; violators will be held legally liable.
