TikaEntityProcessor examples

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <script><![CDATA[
    // Build an id from the file path and mark the document type as 'file'
    function setIdType(row) {
        row.put('id', 'file::' + row.get('fileAbsolutePath'));
        row.put('type', 'file');
        return row;
    }
  ]]></script>
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:\Users\Administrator\Desktop\测试素材\URL URI.pdf"
            format="text"
            transformer="script:setIdType">
      <field name="file_author" column="Author" meta="true" />
      <field name="file_title" column="title" meta="true" />
      <field name="file_text" column="text" />
    </entity>
  </document>
</dataConfig>

<dataConfig>
    <script><![CDATA[
        // Counter used to assign sequential ids to the imported rows
        var id = 1;
        function GenerateId(row) {
            row.put('id', (id++).toFixed());
            return row;
        }
    ]]></script>
    <dataSource type="BinURLDataSource" name="data"/>
    <dataSource type="URLDataSource" baseUrl="http://localhost/tmp/bin/" name="main"/>
    <document>
        <entity name="rec" processor="XPathEntityProcessor" url="data.xml" forEach="/albums/album" dataSource="main" transformer="script:GenerateId">
            <field column="title" xpath="//title" />
            <field column="description" xpath="//description" />
            <entity processor="TikaEntityProcessor" url="http://localhost/tmp/bin/${rec.description}" dataSource="data">
                <field column="text" name="content" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
            </entity>
        </entity>
    </document>
</dataConfig>

Configuring a Clob field in Solr

<document name="bulletin">
  <entity name="item" pk="uuid" transformer="ClobTransformer" query="select * from no_bulletin">
    <field column="UUID" name="id"/>
    <field column="CONTENT" name="content" clob="true"/>
  </entity>
</document>

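Note: transformer="ClobTransformer" on the entity and clob="true" on the field are required to configure a Clob field. The column name CONTENT must be uppercase; otherwise ClobTransformer will not be applied to it. (Replace the SQL statement in query with your own.)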
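For context, the fragment above could sit inside a complete data-config roughly as follows. This is only a sketch: the JDBC connection values are assumptions borrowed from the Oracle data source used in the Blob section of these notes, and the explicit column list in query assumes the table has columns named uuid and content.

```xml
<dataConfig>
  <!-- Assumption: connection details are placeholders; replace them with your own -->
  <dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@192.168.196.253:1521:orcl"
              user="sample_bus" password="sample_bus"/>
  <document name="bulletin">
    <!-- transformer="ClobTransformer" enables Clob-to-String conversion for this entity -->
    <entity name="item" dataSource="orcle" pk="uuid" transformer="ClobTransformer"
            query="select uuid, content from no_bulletin">
      <field column="UUID" name="id"/>
      <!-- clob="true" marks the column to convert; the column name is written in uppercase -->
      <field column="CONTENT" name="content" clob="true"/>
    </entity>
  </document>
</dataConfig>
```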

Configuring a Blob field in Solr

<dataSource name="f1" type="FieldStreamDataSource"/>
<dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@192.168.196.253:1521:orcl" user="sample_bus" password="sample_bus"/>
<document>
  <entity dataSource="orcle" name="attach" query="select att_id,content from no_bul_attcontent where att_id='645cf16b40d4472ca649084c6aa099fe'">
    <field column="ATT_ID" name="id"/>
    <entity dataSource="f1" processor="TikaEntityProcessor" url="content" dataField="attach.CONTENT">
      <field column="text" name="docContent"/>
    </entity>
  </entity>
</document>

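Note: the url attribute has no effect here and can be removed. (If the dataSource is a local file rather than a database, url is the file path, for example url="d:/path ${f.fileAbsolutePath}", where f is the name of the parent entity.)

If the url is wrong, an "invalid SQL statement" error is reported.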
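The local-file case mentioned in the note can be sketched as below. This is a minimal example under stated assumptions: the parent entity name f, the baseDir value d:/path, the fileName pattern, and the target field docContent are illustrative only. The fileAbsolutePath variable is one of the implicit fields that FileListEntityProcessor exposes for each file it lists.

```xml
<dataConfig>
  <!-- Binary file source used by the Tika entity to read each listed file -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- Parent entity "f" lists files; baseDir and fileName are hypothetical values -->
    <entity name="f" processor="FileListEntityProcessor" dataSource="null"
            baseDir="d:/path" fileName=".*\.pdf" recursive="true" rootEntity="false">
      <field column="fileAbsolutePath" name="id"/>
      <!-- Child entity parses the file whose path is supplied by the parent entity f -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${f.fileAbsolutePath}" format="text">
        <field column="text" name="docContent"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Setting rootEntity="false" on the file-listing entity means each parsed file, not the listing itself, becomes a Solr document.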

In dataField, attach is the name of the parent entity. attach.CONTENT must be written in uppercase; otherwise the import fails with: No field available for name : attach.content Processing Document # 1.

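Important: the Blob column name in the database must not be the same as the name of the corresponding field in schema.xml. Otherwise, the value imported for the Blob field is the object reference rather than the content, for example <str name="abc">oracle.sql.BLOB@1042c25</str>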
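One way to avoid such a name clash, sketched here as an assumption rather than a confirmed fix, is to alias the Blob column in the SQL query so that it no longer matches any schema.xml field name. The alias att_blob is hypothetical; since Oracle returns unquoted aliases in uppercase, dataField refers to it as attach.ATT_BLOB. The data sources f1 and orcle are the ones declared in the Blob example above.

```xml
<document>
  <!-- Assumption: "att_blob" is an illustrative alias that matches no schema.xml field -->
  <entity dataSource="orcle" name="attach"
          query="select att_id, content as att_blob from no_bul_attcontent">
    <field column="ATT_ID" name="id"/>
    <!-- The Blob is read through FieldStreamDataSource from the aliased parent column -->
    <entity dataSource="f1" processor="TikaEntityProcessor" url="att_blob" dataField="attach.ATT_BLOB">
      <field column="text" name="docContent"/>
    </entity>
  </entity>
</document>
```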