Apache Tika Source Code Study (Part 3)

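The previous post established that Tika parses XHTML documents with a SAXParser. In this post I start from one concrete parser class, HtmlParser, and follow how a web page is parsed.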

First, the class hierarchy: HtmlParser extends the abstract class AbstractParser, and AbstractParser implements the Parser interface.

The Parser interface declares the following methods:

/**
 * Tika parser interface.
 */
public interface Parser extends Serializable {

    /**
     * Returns the set of media types supported by this parser when used
     * with the given parse context.
     *
     * @since Apache Tika 0.7
     * @param context parse context
     * @return immutable set of media types
     */
    Set<MediaType> getSupportedTypes(ParseContext context);

    /**
     * Parses a document stream into a sequence of XHTML SAX events.
     * Fills in related document metadata in the given metadata object.
     * <p>
     * The given document stream is consumed but not closed by this method.
     * The responsibility to close the stream remains on the caller.
     * <p>
     * Information about the parsing context can be passed in the context
     * parameter. See the parser implementations for the kinds of context
     * information they expect.
     *
     * @since Apache Tika 0.5
     * @param stream the document stream (input)
     * @param handler handler for the XHTML SAX events (output)
     * @param metadata document metadata (input and output)
     * @param context parse context
     * @throws IOException if the document stream could not be read
     * @throws SAXException if the SAX events could not be processed
     * @throws TikaException if the document could not be parsed
     */
    void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException;

}

The first method returns the set of supported media types.

The second method is the actual parsing method.

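The abstract class AbstractParser only adds a thin wrapper around the interface's parse() method: a three-argument overload that delegates to the four-argument version with an empty ParseContext. It is loosely similar to a template method and exists for the convenience of callers; as its Javadoc notes, it is deprecated and kept only for backwards compatibility. Its code is as follows: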

public abstract class AbstractParser implements Parser {

    /**
     * Serial version UID.
     */
    private static final long serialVersionUID = 7186985395903074255L;

    /**
     * Calls the
     * {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)}
     * method with an empty {@link ParseContext}. This method exists as a
     * leftover from Tika 0.x when the three-argument parse() method still
     * existed in the {@link Parser} interface. No new code should call this
     * method anymore, it's only here for backwards compatibility.
     *
     * @deprecated use the {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)} method instead
     */
    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }

}
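
The delegation pattern used by AbstractParser can be shown without any Tika classes. The sketch below is only an illustration: MiniParser, MiniContext, AbstractMiniParser and UpperCaseParser are hypothetical names invented here, and strings stand in for streams and handlers.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class OverloadSketch {

    /** Hypothetical stand-in for Tika's ParseContext. */
    static class MiniContext {
        final Map<String, Object> values = new HashMap<>();
    }

    /** Hypothetical stand-in for the Parser interface: only the full method. */
    interface MiniParser {
        String parse(String input, MiniContext context);
    }

    /** Mirrors AbstractParser: adds the shorter overload and nothing else. */
    static abstract class AbstractMiniParser implements MiniParser {
        public String parse(String input) {
            // Delegate to the full method with an empty context
            return parse(input, new MiniContext());
        }
    }

    /** A concrete parser only has to implement the full method. */
    static class UpperCaseParser extends AbstractMiniParser {
        @Override
        public String parse(String input, MiniContext context) {
            return input.toUpperCase(Locale.ROOT)
                    + " (context entries: " + context.values.size() + ")";
        }
    }

    public static void main(String[] args) {
        // Calls the inherited one-argument overload
        System.out.println(new UpperCaseParser().parse("tika"));
    }
}
```

A subclass such as HtmlParser therefore never has to implement the short overload itself; it inherits it from the abstract base class.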

Now let us analyze the key parts of the HtmlParser class. Part of its source is shown below:

/**
 * HTML parser. Uses TagSoup to turn the input document to HTML SAX events,
 * and post-processes the events to produce XHTML and metadata expected by
 * Tika clients.
 */
public class HtmlParser extends AbstractParser {

    /** Serial version UID */
    private static final long serialVersionUID = 7895315240498733128L;

    private static final Set<MediaType> SUPPORTED_TYPES =
        Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                MediaType.text("html"),
                MediaType.application("xhtml+xml"),
                MediaType.application("vnd.wap.xhtml+xml"),
                MediaType.application("x-asp"))));

    private static final ServiceLoader LOADER =
            new ServiceLoader(HtmlParser.class.getClassLoader());

    /**
     * HTML schema singleton used to amortise the heavy instantiation time.
     */
    private static final Schema HTML_SCHEMA = new HTMLSchema();

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Automatically detect the character encoding
        AutoDetectReader reader = new AutoDetectReader(
                new CloseShieldInputStream(stream), metadata, LOADER);
        try {
            Charset charset = reader.getCharset();
            String previous = metadata.get(Metadata.CONTENT_TYPE);
            if (previous == null || previous.startsWith("text/html")) {
                MediaType type = new MediaType(MediaType.TEXT_HTML, charset);
                metadata.set(Metadata.CONTENT_TYPE, type.toString());
            }
            // deprecated, see TIKA-431
            metadata.set(Metadata.CONTENT_ENCODING, charset.name());

            // Get the HTML mapper from the parse context
            HtmlMapper mapper =
                    context.get(HtmlMapper.class, new HtmlParserMapper());

            // Parse the HTML document
            org.ccil.cowan.tagsoup.Parser parser =
                    new org.ccil.cowan.tagsoup.Parser();

            // TIKA-528: Reuse share schema to avoid heavy instantiation
            parser.setProperty(
                    org.ccil.cowan.tagsoup.Parser.schemaProperty, HTML_SCHEMA);
            // TIKA-599: Shared schema is thread-safe only if bogons are ignored
            parser.setFeature(
                    org.ccil.cowan.tagsoup.Parser.ignoreBogonsFeature, true);

            parser.setContentHandler(new XHTMLDowngradeHandler(
                    new HtmlHandler(mapper, handler, metadata)));

            parser.parse(reader.asInputSource());
        } finally {
            reader.close();
        }
    }

    // other methods omitted

}

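As the class comment makes clear, HtmlParser relies on the TagSoup component to parse the HTML and turn it into well-formed XHTML.

Set<MediaType> getSupportedTypes(ParseContext context) returns the set of supported media types.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) is the method that actually parses the HTML document.

The encoding detection class: AutoDetectReader
AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, LOADER);
This class extends BufferedReader and wraps the input stream. The stream is wrapped in a CloseShieldInputStream so that reader.close() does not close the caller's stream, in line with the Parser contract quoted above. The source of AutoDetectReader is as follows: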
/**
 * An input stream reader that automatically detects the character encoding
 * to be used for converting bytes to characters.
 *
 * @since Apache Tika 1.2
 */
public class AutoDetectReader extends BufferedReader {

    private static final ServiceLoader DEFAULT_LOADER =
            new ServiceLoader(AutoDetectReader.class.getClassLoader());

    private static Charset detect(
            InputStream input, Metadata metadata,
            List<EncodingDetector> detectors)
            throws IOException, TikaException {
        // Ask all given detectors for the character encoding
        for (EncodingDetector detector : detectors) {
            Charset charset = detector.detect(input, metadata);
            if (charset != null) {
                return charset;
            }
        }

        // Try determining the encoding based on hints in document metadata
        MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
        if (type != null) {
            String charset = type.getParameters().get("charset");
            if (charset != null) {
                try {
                    return CharsetUtils.forName(charset);
                } catch (Exception e) {
                    // ignore
                }
            }
        }

        throw new TikaException(
                "Failed to detect the character encoding of a document");
    }

    private final Charset charset;

    private AutoDetectReader(InputStream stream, Charset charset)
            throws IOException {
        super(new InputStreamReader(stream, charset));
        this.charset = charset;

        // TIKA-240: Drop the BOM if present
        mark(1);
        if (read() != '\ufeff') { // zero-width no-break space
            reset();
        }
    }

    private AutoDetectReader(
            BufferedInputStream stream, Metadata metadata,
            List<EncodingDetector> detectors)
            throws IOException, TikaException {
        this(stream, detect(stream, metadata, detectors));
    }

    public AutoDetectReader(
            InputStream stream, Metadata metadata,
            ServiceLoader loader) throws IOException, TikaException {
        this(new BufferedInputStream(stream), metadata,
                loader.loadServiceProviders(EncodingDetector.class));
    }

    public AutoDetectReader(InputStream stream, Metadata metadata)
            throws IOException, TikaException {
        this(new BufferedInputStream(stream), metadata, DEFAULT_LOADER);
    }

    public AutoDetectReader(InputStream stream)
            throws IOException, TikaException {
        this(stream, new Metadata());
    }

    public Charset getCharset() {
        return charset;
    }

    public InputSource asInputSource() {
        InputSource source = new InputSource(this);
        source.setEncoding(charset.name());
        return source;
    }

}

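The key method here is

static Charset detect(InputStream input, Metadata metadata, List<EncodingDetector> detectors)

which determines the character encoding of the document. It asks each detector in turn and returns the first non-null answer. If no detector answers, it falls back to the charset parameter of the Content-Type metadata, and throws a TikaException if that fails as well.

List<EncodingDetector> is the collection of encoding detectors. It comes from loader.loadServiceProviders(EncodingDetector.class), which loads the list of detector classes.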
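
The "first non-null answer wins, then fall back to a metadata hint" logic can be sketched with JDK classes only. This is not Tika code: the detectors are modelled as plain functions over a byte array, utf8Bom is a hypothetical detector invented for the example, and the content type is passed as a string rather than through a Metadata object.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class DetectChainSketch {

    /** Ask each detector in turn; the first non-null answer wins. */
    static Charset detect(byte[] input, String contentType,
                          List<Function<byte[], Charset>> detectors) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset charset = detector.apply(input);
            if (charset != null) {
                return charset;
            }
        }
        // Fall back to a hint such as "text/html; charset=ISO-8859-1"
        Charset hinted = charsetFromContentType(contentType);
        if (hinted != null) {
            return hinted;
        }
        throw new IllegalStateException("Failed to detect the character encoding");
    }

    /** Extracts the charset parameter of a content type, or returns null. */
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                try {
                    return Charset.forName(p.substring("charset=".length()).trim());
                } catch (Exception e) {
                    return null; // unknown or illegal charset name: ignore the hint
                }
            }
        }
        return null;
    }

    /** Hypothetical detector: recognises only a UTF-8 byte order mark. */
    static Charset utf8Bom(byte[] b) {
        boolean bom = b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF;
        return bom ? StandardCharsets.UTF_8 : null;
    }

    /** A one-element detector list for the demonstration. */
    static List<Function<byte[], Charset>> sampleDetectors() {
        List<Function<byte[], Charset>> list = new ArrayList<>();
        list.add(DetectChainSketch::utf8Bom);
        return list;
    }

    public static void main(String[] args) {
        byte[] plain = "<html></html>".getBytes(StandardCharsets.US_ASCII);
        // No BOM, so the detector returns null and the content type hint is used
        System.out.println(detect(plain, "text/html; charset=ISO-8859-1", sampleDetectors()));
    }
}
```

Because the loop returns on the first non-null result, the order of the detector list decides which detector gets the final say.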
Next, let us look at the source of the ServiceLoader class:
/**
 * Internal utility class that Tika uses to look up service providers.
 *
 * @since Apache Tika 0.9
 */
public class ServiceLoader {

    /**
     * The default context class loader to use for all threads, or
     * <code>null</code> to automatically select the context class loader.
     */
    private static volatile ClassLoader contextClassLoader = null;

    /**
     * The dynamic set of services available in an OSGi environment.
     * Managed by the {@link TikaActivator} class and used as an additional
     * source of service instances in the {@link #loadServiceProviders(Class)}
     * method.
     */
    private static final Map<Object, Object> services =
            new HashMap<Object, Object>();

    /**
     * Returns the context class loader of the current thread. If such
     * a class loader is not available, then the loader of this class or
     * finally the system class loader is returned.
     *
     * @see <a href="https://issues.apache.org/jira/browse/TIKA-441">TIKA-441</a>
     * @return context class loader, or <code>null</code> if no loader
     *         is available
     */
    static ClassLoader getContextClassLoader() {
        ClassLoader loader = contextClassLoader;
        if (loader == null) {
            loader = ServiceLoader.class.getClassLoader();
        }
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader;
    }

    /**
     * Sets the context class loader to use for all threads that access
     * this class. Used for example in an OSGi environment to avoid problems
     * with the default context class loader.
     *
     * @param loader default context class loader,
     *               or <code>null</code> to automatically pick the loader
     */
    public static void setContextClassLoader(ClassLoader loader) {
        contextClassLoader = loader;
    }

    static void addService(Object reference, Object service) {
        synchronized (services) {
            services.put(reference, service);
        }
    }

    static Object removeService(Object reference) {
        synchronized (services) {
            return services.remove(reference);
        }
    }

    private final ClassLoader loader;

    private final LoadErrorHandler handler;

    private final boolean dynamic;

    public ServiceLoader(
            ClassLoader loader, LoadErrorHandler handler, boolean dynamic) {
        this.loader = loader;
        this.handler = handler;
        this.dynamic = dynamic;
    }

    public ServiceLoader(ClassLoader loader, LoadErrorHandler handler) {
        this(loader, handler, false);
    }

    public ServiceLoader(ClassLoader loader) {
        this(loader, LoadErrorHandler.IGNORE);
    }

    public ServiceLoader() {
        this(getContextClassLoader(), LoadErrorHandler.IGNORE, true);
    }

    /**
     * Returns an input stream for reading the specified resource from the
     * configured class loader.
     *
     * @param name resource name
     * @return input stream, or <code>null</code> if the resource was not found
     * @see ClassLoader#getResourceAsStream(String)
     * @since Apache Tika 1.1
     */
    public InputStream getResourceAsStream(String name) {
        if (loader != null) {
            return loader.getResourceAsStream(name);
        } else {
            return null;
        }
    }

    /**
     * Loads and returns the named service class that's expected to implement
     * the given interface.
     *
     * @param iface service interface
     * @param name service class name
     * @return service class
     * @throws ClassNotFoundException if the service class can not be found
     *                                or does not implement the given interface
     * @see Class#forName(String, boolean, ClassLoader)
     * @since Apache Tika 1.1
     */
    @SuppressWarnings("unchecked")
    public <T> Class<? extends T> getServiceClass(Class<T> iface, String name)
            throws ClassNotFoundException {
        if (loader == null) {
            throw new ClassNotFoundException(
                    "Service class " + name + " is not available");
        }
        Class<?> klass = Class.forName(name, true, loader);
        if (klass.isInterface()) {
            throw new ClassNotFoundException(
                    "Service class " + name + " is an interface");
        } else if (!iface.isAssignableFrom(klass)) {
            throw new ClassNotFoundException(
                    "Service class " + name
                    + " does not implement " + iface.getName());
        } else {
            return (Class<? extends T>) klass;
        }
    }

    /**
     * Returns all the available service resources matching the
     *  given pattern, such as all instances of tika-mimetypes.xml 
     *  on the classpath, or all org.apache.tika.parser.Parser 
     *  service files.
     */
    public Enumeration<URL> findServiceResources(String filePattern) {
       try {
          Enumeration<URL> resources = loader.getResources(filePattern);
          return resources;
       } catch (IOException ignore) {
          // We couldn't get the list of service resource files
          List<URL> empty = Collections.emptyList();
          return Collections.enumeration( empty );
      }
    }

    /**
     * Returns all the available service providers of the given type.
     *
     * @param iface service provider interface
     * @return available service providers
     */
    public <T> List<T> loadServiceProviders(Class<T> iface) {
        List<T> providers = new ArrayList<T>();
        providers.addAll(loadDynamicServiceProviders(iface));
        providers.addAll(loadStaticServiceProviders(iface));
        return providers;
    }

    /**
     * Returns the available dynamic service providers of the given type.
     * The returned list is newly allocated and may be freely modified
     * by the caller.
     *
     * @since Apache Tika 1.2
     * @param iface service provider interface
     * @return dynamic service providers
     */
    @SuppressWarnings("unchecked")
    public <T> List<T> loadDynamicServiceProviders(Class<T> iface) {
        List<T> providers = new ArrayList<T>();

        if (dynamic) {
            synchronized (services) {
                for (Object service : services.values()) {
                    if (iface.isAssignableFrom(service.getClass())) {
                        providers.add((T) service);
                    }
                }
            }
        }

        return providers;
    }

    /**
     * Returns the available static service providers of the given type.
     * The providers are loaded using the service provider mechanism using
     * the configured class loader (if any). The returned list is newly
     * allocated and may be freely modified by the caller.
     *
     * @since Apache Tika 1.2
     * @param iface service provider interface
     * @return static service providers
     */
    @SuppressWarnings("unchecked")
    public <T> List<T> loadStaticServiceProviders(Class<T> iface) {
        List<T> providers = new ArrayList<T>();

        if (loader != null) {
            List<String> names = new ArrayList<String>();

            String serviceName = iface.getName();
            Enumeration<URL> resources =
                    findServiceResources("META-INF/services/" + serviceName);
            for (URL resource : Collections.list(resources)) {
                try {
                    collectServiceClassNames(resource, names);
                } catch (IOException e) {
                    handler.handleLoadError(serviceName, e);
                }
            }

            for (String name : names) {
                try {
                    Class<?> klass = loader.loadClass(name);
                    if (iface.isAssignableFrom(klass)) {
                        providers.add((T) klass.newInstance());
                    }
                } catch (Throwable t) {
                    handler.handleLoadError(name, t);
                }
            }
        }

        return providers;
    }

    private static final Pattern COMMENT = Pattern.compile("#.*");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private void collectServiceClassNames(URL resource, Collection<String> names)
            throws IOException {
        InputStream stream = resource.openStream();
        try {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, "UTF-8"));
            String line = reader.readLine();
            while (line != null) {
                line = COMMENT.matcher(line).replaceFirst("");
                line = WHITESPACE.matcher(line).replaceAll("");
                if (line.length() > 0) {
                    names.add(line);
                }
                line = reader.readLine();
            }
        } finally {
            stream.close();
        }
    }

}
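The main job of ServiceLoader is to load service classes. Loading comes in two kinds: dynamic, which returns services registered at runtime through addService (for example in an OSGi environment), and static, which reads provider names from META-INF/services files. They are implemented by List<T> loadDynamicServiceProviders(Class<T> iface) and List<T> loadStaticServiceProviders(Class<T> iface) respectively.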
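
To make the dynamic/static distinction concrete, here is a minimal sketch that uses only JDK classes. ProviderLookupSketch and its members are hypothetical names; a fixed list stands in for the providers that the real class instantiates from service files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProviderLookupSketch {

    /** Shared registry, standing in for the static services map. */
    private static final Map<Object, Object> REGISTRY = new LinkedHashMap<>();

    static void addService(Object reference, Object service) {
        synchronized (REGISTRY) {
            REGISTRY.put(reference, service);
        }
    }

    private final boolean dynamic;
    private final List<Object> staticProviders; // stands in for service-file providers

    ProviderLookupSketch(boolean dynamic, List<Object> staticProviders) {
        this.dynamic = dynamic;
        this.staticProviders = staticProviders;
    }

    /** Dynamic providers first (only if enabled), then static ones. */
    <T> List<T> loadProviders(Class<T> iface) {
        List<T> providers = new ArrayList<>();
        if (dynamic) {
            synchronized (REGISTRY) {
                for (Object service : REGISTRY.values()) {
                    if (iface.isInstance(service)) {
                        providers.add(iface.cast(service));
                    }
                }
            }
        }
        for (Object service : staticProviders) {
            if (iface.isInstance(service)) {
                providers.add(iface.cast(service));
            }
        }
        return providers;
    }

    public static void main(String[] args) {
        addService("ref-1", "registered at runtime");
        List<Object> fixed = Arrays.<Object>asList(new StringBuilder("from a service file"), 42);
        // With dynamic == false the registry is never consulted
        System.out.println(new ProviderLookupSketch(false, fixed).loadProviders(CharSequence.class).size());
        System.out.println(new ProviderLookupSketch(true, fixed).loadProviders(CharSequence.class).size());
    }
}
```

A loader created with the dynamic flag switched off simply skips the registry, so only the static providers are returned.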

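The private member of HtmlParser,
static final ServiceLoader LOADER = new ServiceLoader(HtmlParser.class.getClassLoader()), is created with the one-argument constructor, so this.dynamic is false. loadDynamicServiceProviders therefore returns an empty list, and in practice only
List<T> loadStaticServiceProviders(Class<T> iface) contributes providers.
It loads the file at META-INF/services/org.apache.tika.detect.EncodingDetector inside the jar: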
#  Licensed to the Apache Software Foundation (ASF) under one or more
#  contributor license agreements.  See the NOTICE file distributed with
#  this work for additional information regarding copyright ownership.
#  The ASF licenses this file to You under the Apache License, Version 2.0
#  (the "License"); you may not use this file except in compliance with
#  the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

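Loading this file yields the list of encoding detector classes, which are instantiated in the order listed. Finally, AutoDetectReader calls static Charset detect(InputStream input, Metadata metadata, List<EncodingDetector> detectors) to obtain the document's character encoding.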
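
The way such a service file is reduced to a list of class names (strip # comments, strip whitespace, skip empty lines) can be tried out in isolation. The sketch below reuses the two regular expressions from ServiceLoader but works on a string instead of a URL; ServiceFileSketch and classNames are names made up for this example, and the licence header in the sample input is abbreviated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ServiceFileSketch {

    private static final Pattern COMMENT = Pattern.compile("#.*");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the class names listed in the text of a META-INF/services file. */
    static List<String> classNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\r?\n")) {
            line = COMMENT.matcher(line).replaceFirst("");   // drop "# ..." comments
            line = WHITESPACE.matcher(line).replaceAll("");  // drop all whitespace
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String file = "#  Licensed to the Apache Software Foundation (ASF) ...\n"
                + "\n"
                + "org.apache.tika.parser.html.HtmlEncodingDetector\n"
                + "org.apache.tika.parser.txt.UniversalEncodingDetector\n"
                + "org.apache.tika.parser.txt.Icu4jEncodingDetector\n";
        for (String name : classNames(file)) {
            System.out.println(name);
        }
    }
}
```

In the real class each collected name is then loaded with loader.loadClass(name) and instantiated with newInstance(), provided it is assignable to the requested interface.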

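As for the TagSoup component, I am reposting a blog article here for reference:

TagSoup Development Guide

http://cactus-jing.iteye.com/blog/1070620

TagSoup may be unfamiliar to some readers. It is a small tool written in Java that uses a SAX engine to parse badly structured, non-conforming HTML documents. TagSoup can convert an HTML document into a well-formed XML document (close to XHTML), which makes it much easier for developers to process the HTML they have fetched. TagSoup also ships with a command-line program for parsing HTML documents.

The drawback of TagSoup is that its official website (http://home.ccil.org/~cowan/XML/tagsoup/) offers neither a link to API documentation nor a developer guide. It only provides a 40-slide presentation (http://home.ccil.org/~cowan/XML/tagsoup/) given at Extreme Markup Languages 2004. This makes integrating TagSoup into your own application quite a challenge.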

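The development workflow with TagSoup:

  • Create a Parser instance;
  • Provide your own SAX2 content handler;
  • Provide an InputSource instance pointing to the HTML to be parsed;
  • Call parse()!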
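
The second step, writing your own SAX2 content handler, might look like the sketch below. To keep it runnable without extra jars it is driven by the JDK's built-in SAX parser on already well-formed input, which is an assumption made for this example; with TagSoup the same handler would be passed to the TagSoup Parser through setContentHandler instead. LinkCollectorSketch and LinkCollector are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LinkCollectorSketch {

    /** Collects the href of every <a> element plus all character data. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Runs the handler over a well-formed document using the JDK SAX parser. */
    static LinkCollector collect(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LinkCollector handler = new LinkCollector();
            reader.setContentHandler(handler);                        // step 2: the content handler
            reader.parse(new InputSource(new StringReader(xhtml)));  // steps 3 and 4
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        LinkCollector result = collect(
                "<html><body><p>Hello <a href=\"http://tika.apache.org/\">Tika</a></p></body></html>");
        System.out.println(result.links);
        System.out.println(result.text);
    }
}
```

The handler never sees raw markup, only SAX events, which is why the same class can sit behind any SAX event source.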

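With my limited experience, those few lines left me completely lost, so I decided to study the library more closely.

TagSoup contains 2 packages and 16 class files (not many files, but very capable). Its core classes include Parser, PYXScanner and XMLWriter.

  • org.ccil.cowan.tagsoup.Parser extends org.xml.sax.helpers.DefaultHandler and implements org.xml.sax.XMLReader, which is what makes it a SAX-style parser;
  • org.ccil.cowan.tagsoup.PYXScanner implements the Scanner interface. It scans input written in PYX notation instead of HTML, so it is not needed for ordinary HTML parsing or for reading back the result;
  • org.ccil.cowan.tagsoup.XMLWriter extends org.xml.sax.helpers.XMLFilterImpl and is therefore an org.xml.sax.ContentHandler (this is the important part). In other words, XMLWriter is the default implementation TagSoup provides for turning parsed HTML into an XML document.
With these classes identified, we can follow the workflow above. Here is a small example of my own. XMLWriter is given a StringWriter so that the serialized result can be read back as a string:

import java.io.StringReader;
import java.io.StringWriter;
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;

public class TagSoupDemo {
    public static void main(String[] args) throws Exception {
        String html = "<p>Hello<br>world";           // deliberately malformed sample input
        InputSource src = new InputSource(new StringReader(html)); // build the InputSource
        Parser parser = new Parser();                // instantiate the TagSoup Parser
        StringWriter out = new StringWriter();       // receives the serialized output
        XMLWriter writer = new XMLWriter(out);       // XMLWriter is the SAX content handler
        parser.setContentHandler(writer);            // register the content handler
        parser.parse(src);                           // parse
        System.out.println(out);                     // print the well-formed document
    }
}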


Original article: https://www.cnblogs.com/chenying99/p/2948588.html