Nutch插件系统

Nutch 基本情况

Nutch 是 Apache 基金会的一个开源项目，它原本是开源文件索引框架 Lucene 项目的一个子项目，后来渐渐发展成长为一个独立的开源项目。它基于 Java 开发，基于 Lucene 框架，提供 Web 网页爬虫功能。另外很吸引人的一点在于，它提供了一种插件框架，使得其对各种网页内容的解析、各种数据的采集、查询、集群、过滤等功能能够方便的进行扩展，正是由于有此框架，使得 Nutch 的插件开发非常容易，第三方的插件也层出不穷，极大的增强了 Nutch 的功能和声誉。本文就是主要描述这个插件框架内部运行的机制和原理。

回页首

Nutch 的插件体系结构

在 Nutch 的插件体系架构下，有些术语需要在这里解释：

扩展点 ExtensionPoint
扩展点是系统中可以被再次扩展的类或者接口，通过扩展点的定义，可以使得系统的执行过程变得可插入，可任意变化。
扩展 Extension
扩展式插件内部的一个属性，一个扩展是针对某个扩展点的一个实现，每个扩展都可以有自己的额外属性，用于在同一个扩展点实现之间进行区分。扩展必须在插件内部进行定义。
插件 Plugin
插件实际就是一个虚拟的容器，包含了多个扩展 Extension、依赖插件 RequirePlugins 和自身发布的库 Runtime，插件可以被启动或者停止。

Nutch 为了扩展，预留了很多扩展点 ExtenstionPoint，同时提供了这些扩展点的基本实现 Extension，Plugin 用来组织这些扩展，这些都通过配置文件进行控制，主要的配置文件包括了多个定义扩展点和插件（扩展）的配置文件，一个控制加载哪些插件的配置文件。体系结构图如下：

图 1. Nutch 插件体系结构图

回页首

插件的内部结构

图 2. 插件的内部结构

runtime 属性描述了其需要的 Jar 包，和发布的 Jar 包
requires 属性描述了依赖的插件
extension-point 描述了本插件宣布可扩展的扩展点
extension 属性则描述了扩展点的实现

典型的插件定义：

<plugin
    id="query-url" 插件的ID
    name="URL Query Filter"  插件的名字
    version="1.0.0"  插件的版本
    provider-name="nutch.org"> 插件的提供者ID

    <runtime>
        <library name="query-url.Jar"> 依赖的Jar包
            <export name="*"/>  发布的Jar包
        </library>
    </runtime>

    <requires>
        <import plugin="nutch-extensionpoints"/> 依赖的插件
    </requires>

    <extension id="org.apache.nutch.searcher.url.URLQueryFilter" 扩展的ID
        name="Nutch URL Query Filter"  扩展的名字
        point="org.apache.nutch.searcher.QueryFilter"> 扩展的扩展点ID
        <implementation id="URLQueryFilter" 实现的ID
            class="org.apache.nutch.searcher.url.URLQueryFilter"> 实现类
            <parameter name="fields" value="url"/> 实现的相关属性
        </implementation>
    </extension>
</plugin>

回页首

插件主要配置

plugin.folders：插件所在的目录，缺省位置在 plugins 目录下。

<property>
    <name>plugin.folders</name>
    <value>plugins</value>
    <description>Directories where nutch plugins are located.  Each
    element may be a relative or absolute path.  If absolute, it is used
    as is.  If relative, it is searched for on the classpath.
    </description>
</property>

plugin.auto-activation：当被配置为过滤（即不加载），但是又被其他插件依赖的时候，是否自动启动，缺省为 true。

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automaticaly
  activated if they are needed by some actived plugins.
  </description>
</property>

plugin.includes：要包含的插件名称列表，支持正则表达式方式定义。

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)
    |query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|
    urlnormalizer-(pass|regex|basic)
  </value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

plugin.excludes：要排除的插件名称列表，支持正则表达式方式定义。

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.  
  </description>
</property>

回页首

插件主要类 UML 图

图 3. 插件主要类 UML 图（查看大图）

类包括：

PluginRepository 是一个通过加载 Iconfiguration 配置信息初始化的插件库，里面维护了系统中所有的扩展点 ExtensionPoint 和所有的插件 Plugin 实例
ExtensionPoint 是一个扩展点，通过扩展点的定义，插件 Plugin 才能定义实际的扩展 Extension，从而实现扩展，每个 ExtensionPoint 类实例都维护了宣布实现了此扩展点的扩展 Extension.
Plugin 是一个虚拟的组织，提供了一个启动 start 和一个 shutdown 方法，从而实现了插件的启动和停止，他还有一个描述对象 PluginDescriptor，负责保存此插件相关的配置信息，另外还有一个 PluginClassLoader 负责此插件相关类和库的加载。

回页首

插件加载过程

图 4 . 插件加载过程时序图（查看大图）

通过序列图可以发现，Nutch 加载插件的过程需要 actor 全程直接调用每个关联对象，最终得到的是插件的实现对象。详细过程如下：

首先通过 PluginRepository.getConf() 方法加载配置信息，配置的内容包括插件的目录，插件的配置文件信息 plugin.properties 等，此时 pluginrepository 将根据配置信息加载各个插件的 plugin.xml，同时根据 Plugin.xml 加载插件的依赖类。
当 actor 需要加载某个扩展点的插件的时候，他可以：
1. 首先根据扩展点的名称，通过 PluginRepository 得到扩展点的实例，即 ExtensionPoint 类的实例；
2. 然后调用 ExtensionPoint 对象的 getExtensions 方法，返回的是实现此扩展点的实例列表（Extension[]）；
3. 对每个实现的扩展实例 Extension，调用它的 getExtensionInstance() 方法，以得到实际的实现类实例，此处为 Object；
4. 根据实际情况，将 Object 转型为实际的类对象类型，然后调用它们的实现方法，例如 helloworld 方法。

回页首

插件的典型调用方式

得到某个语言例如“GBK”扩展点的实例：

this.extensionPoint.getExtensions();// 得到扩展点的所有扩展
    for (int i=0; i<extensions.length; i++) {// 遍历每个扩展
        if (“GBK”.equals(extensions[i].getAttribute("lang"))) {// 找到某个属性的扩展
            return extensions[i];// 返回
        } 
    } 
} 
extension.getExtensionInstance()// 得到此扩展实现的实例对象

回页首

插件类加载机制

实际整个系统如果使用了插件架构，则插件类的加载是由 PluginClassLoader 类完成的，每个 Plugin 都有自己的 classLoader，此 classloader 继承自 URLClassLoader，并没有做任何事情：

public class PluginClassLoader extends URLClassLoader { 
    /** 
    * Construtor 
    * 
    * @param urls 
    *          Array of urls with own libraries and all exported libraries of 
    *          plugins that are required to this plugin 
    * @param parent 
    */ 
    public PluginClassLoader(URL[] urls, ClassLoader parent) { 
        super(urls, parent); 
    } 
}

这个 classloader 是属于这个插件的，它只负责加载本插件相关的类、本地库和依赖插件的发布 (exported) 库，也包括一些基本的配置文件例如 .properties 文件。

此类的实例化过程：

if (fClassLoader != null) 
    return fClassLoader; 
ArrayList<URL> arrayList = new ArrayList<URL>(); 
arrayList.addAll(fExportedLibs); 
arrayList.addAll(fNotExportedLibs); 
arrayList.addAll(getDependencyLibs()); 
File file = new File(getPluginPath()); 
try { 
    for (File file2 : file.listFiles()) { 
        if (file2.getAbsolutePath().endsWith("properties")) 
            arrayList.add(file2.getParentFile().toURL()); 
    } 
} catch (MalformedURLException e) { 
    LOG.debug(getPluginId() + " " + e.toString()); 
} 
URL[] urls = arrayList.toArray(new URL[arrayList.size()]); 
fClassLoader = new PluginClassLoader(urls, PluginDescriptor.class 
    .getClassLoader()); 
return fClassLoader;

首先判断缓存是否存在
加载需要的 Jar 包、自身需要的 Jar 包，依赖插件发布的 Jar 包
加载本地的 properties 文件
构造此 classloader，父 classloader 为 PluginDescriptor 的加载者，通常是 contextClassLoader

回页首

总结

Nutch 是一个非常出色的开源搜索框架，它的插件架构更加是它的一个技术亮点，通过此架构，可以保证 Nutch 方便的被灵活的扩展而不用修改原来的代码，通过配置文件可以简单方便的控制加载或者不加载哪些插件，而且这些都不需要额外的容器支持。这些都是我们在系统架构设计的时候可以学习和参考的有益经验。