1.6.4 Uploading Structured Data Store Data with the Data Import Handler

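1. Uploading Structured Data with DIH

　　Many search applications index structured data, such as data held in a relational database. The Data Import Handler (DIH) provides a mechanism for importing content from such a data store and indexing it. Besides relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML (where XPath can be used to generate fields).

　　For more information, see https://wiki.apache.org/solr/DataImportHandler.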

1.1 Concepts and Terminology

　　Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific ways.

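Term	Definition
Datasource	For a database, it is a DSN. For an HTTP data source, it is the base URL.
Entity	Conceptually, an entity is processed to generate a set of documents. For an RDBMS data source, an entity is a view or a table.
Processor	An entity processor extracts content from a data source, transforms it, and adds it to the index. Custom entity processors can be written to extend or replace the ones supplied.
Transformer	Each set of fields fetched by the entity may optionally be transformed. A transformation can modify fields, create new fields, or generate multiple rows/documents from a single row. DIH has several built-in transformers that, for example, modify dates and strip HTML tags. Custom transformers can be written using the publicly available interface.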

1.2 Configuration

　　1.2.1 Configuring solrconfig.xml

　　The Data Import Handler has to be registered in solrconfig.xml:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">/path/to/my/DIHconfigfile.xml</str>
    </lst>
</requestHandler>

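　　The only required parameter is config, which specifies the location of the DIH configuration file. That file specifies the data source, what data to fetch, how to fetch it, and how to process it into the Solr documents that are posted to the index.

　　1.2.2 Configuring the DIH Configuration File

　　There is a sample DIH application under example/example-DIH; the file hsqldb.README.txt there has details on how to run it. The corresponding configuration file is example/example-DIH/solr/db/conf/db-data-config.xml.

　　The annotated configuration file below extracts data from four tables.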

<dataConfig>
    <!-- The first element is the dataSource, in this case an HSQLDB database. 
        The path to the JDBC driver and the JDBC URL and login credentials are all 
        specified here. Other permissible attributes include whether or not to autocommit 
        to Solr,the batchsize used in the JDBC connection, a 'readOnly' flag -->
    <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex"
        user="sa" />
    <!-- a 'document' element follows, containing multiple 'entity' elements. 
        Note that 'entity' elements can be nested, and this allows the entity relationships 
        in the sample database to be mirrored here, so that we can generate a denormalized 
        Solr record which may include multiple features for one item, for instance -->
    <document>
        <!-- The possible attributes for the entity element are described below. 
            Entity elements may contain one or more 'field' elements, which map the data 
            source field names to Solr fields, and optionally specify per-field transformations -->
        <!-- this entity is the 'root' entity. -->
        <entity name="item" query="select * from item"
            deltaQuery="select id from item where last_modified  > '${dataimporter.last_index_time}'">
            <field column="NAME" name="name" />
            <!-- This entity is nested and reflects the one-to-many relationship between 
                an item and its multiple features. Note the use of variables; ${item.ID} 
                is the value of the column 'ID' for the current item ('item' referring to 
                the entity name) -->
            <entity name="feature"
                query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
                deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
                <field name="features" column="DESCRIPTION" />
            </entity>
            <entity name="item_category"
                query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
                deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
                <entity name="category"
                    query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                    deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
                    <field column="description" name="cat" />
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

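　　As of Solr 4.1, a new propertyWriter element is available. It defines the date format and locale used for delta queries.

　　The reload-config command is still supported. It is useful for validating a new configuration file, or when you want to specify a file and load it without having it reloaded again on import.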

　　1.3 Data Import Handler Commands

　　DIH commands are sent to Solr via HTTP requests. The following operations are supported.

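Command	Description
abort	Aborts an ongoing operation. The URL is http://<host>:<port>/solr/dataimport?command=abort
delta-import	For incremental imports and change detection. The command has the form http://<host>:<port>/solr/dataimport?command=delta-import. It supports the same clean, commit, optimize and debug parameters as full-import.
full-import	Starts a full import. The command has the form http://<host>:<port>/solr/dataimport?command=full-import. The operation runs in a new thread, and the status attribute in the response is shown as busy while it runs. How long it takes depends mainly on the size of the dataset. Queries to Solr are not blocked during a full import. When full-import is executed, it stores the start time of the operation in conf/dataimport.properties; this timestamp is used by later delta queries.
reload-config	If the configuration file has changed, this reloads it without restarting Solr. Run the command http://<host>:<port>/solr/dataimport?command=reload-config.
status	The URL is http://<host>:<port>/solr/dataimport?command=status. It returns statistics on the number of documents created and deleted, queries run, rows fetched, the current status, and so on.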

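　　1.3.1 Parameters for the full-import Command

Parameter	Description
clean	Default is true. Tells whether to clean up the index before indexing starts.
commit	Default is true. Tells whether to commit after the operation.
debug	Default is false. In debug mode, documents are never committed automatically. If you want to run in debug mode and commit the results too, add commit=true as a request parameter.
entity	Use this parameter to execute one or more entities selectively. If none is specified, all entities are executed.
optimize	Default is true. Tells Solr whether to optimize the index after the operation.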
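
　　Because these are ordinary request parameters, one way to avoid repeating them on every request is to set them as defaults on the request handler. The sketch below assumes that the handler's defaults list is merged into each DIH request in the same way as for other Solr request handlers. The config path is the placeholder used earlier, and the chosen values are illustrative only.

```xml
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">/path/to/my/DIHconfigfile.xml</str>
        <!-- illustrative choice: keep existing documents instead of wiping the index first -->
        <str name="clean">false</str>
        <!-- commit when the import finishes (already the documented default) -->
        <str name="commit">true</str>
        <!-- illustrative choice: skip the optimize step after each import -->
        <str name="optimize">false</str>
    </lst>
</requestHandler>
```

　　Under the same assumption, a request such as http://<host>:<port>/solr/dataimport?command=full-import&clean=true would override the default for that single run.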

　　1.4 Property Writer

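　　The propertyWriter element defines the date format and locale used with delta queries. It is optional. Add the element to the DIH configuration file, directly under the dataConfig element.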

<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
    directory="data" filename="my_dih.properties" locale="en_US" />

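　　Available parameters:

Parameter	Description
dateFormat	A java.text.SimpleDateFormat pattern used when converting the date to text. The default is "yyyy-MM-dd HH:mm:ss".
type	The implementation class. Use SimplePropertiesWriter for non-SolrCloud installations and ZKPropertiesWriter when using SolrCloud. If not specified, the appropriate class is chosen depending on whether SolrCloud mode is enabled.
directory	Used with SimplePropertiesWriter only. The directory for the properties file. The default is conf.
filename	Used with SimplePropertiesWriter only. The name of the properties file. If not specified, the default is the requestHandler name (as defined in solrconfig.xml) with ".properties" appended (for example, "dataimport.properties").
locale	The locale. If not specified, the ROOT locale is used.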
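
　　For contrast with the SimplePropertiesWriter example above, here is a minimal sketch of the element for a SolrCloud installation. The type is set explicitly for illustration even though, per the table, it would be chosen automatically. The attribute values are assumptions copied from the earlier example.

```xml
<!-- placed directly under <dataConfig>; directory and filename are omitted
     because they apply only to SimplePropertiesWriter -->
<propertyWriter type="ZKPropertiesWriter" dateFormat="yyyy-MM-dd HH:mm:ss"
    locale="en_US" />
```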

1.5 Data Sources

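　　A data source specifies the origin of the data and its type. You can create a custom data source by extending org.apache.solr.handler.dataimport.DataSource.

　　The mandatory attributes of a data source definition are its name and type. The name identifies the data source so that an entity can refer to it.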

　　1.5.1 ContentStreamDataSource

　　This data source takes the POST data of the request as its source. It can be used with any EntityProcessor that uses a DataSource<Reader>.
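
　　A minimal sketch of such a configuration follows. It assumes the posted body is an XML document shaped like <books><book>...</book></books>; the data source name, entity name, element names and field names are all hypothetical.

```xml
<dataConfig>
    <!-- the body of the HTTP POST request becomes the data for this source -->
    <dataSource name="posted" type="ContentStreamDataSource" />
    <document>
        <!-- no url attribute: the XML comes from the posted content stream -->
        <entity name="book" dataSource="posted" processor="XPathEntityProcessor"
            forEach="/books/book">
            <field column="id" xpath="/books/book/id" />
            <field column="title" xpath="/books/book/title" />
        </entity>
    </document>
</dataConfig>
```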

　　1.5.2 FieldReaderDataSource

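　　This can be used where a database field contains XML that you want to process with the XPathEntityProcessor. You configure both a JDBC data source and a FieldReader data source, and two entities, as follows: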

<dataSource name="a1" driver="org.hsqldb.jdbcDriver" . />
<dataSource name="a2" type="FieldReaderDataSource" />
<!-- processor for database -->
<entity name="e1" dataSource="a1" processor="SqlEntityProcessor"
    pk="docid" query="select * from t1 ...">
    <!-- nested XPathEntity; the field in the parent which is to be used for 
        XPath is set in the "dataField" attribute in place of the "url" attribute -->
    <entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
        dataField="e1.fieldToUseForXPath">
        <!-- XPath configuration follows -->
        ...
    </entity>
</entity>

　　The FieldReaderDataSource takes an encoding parameter, which defaults to UTF-8.

　　1.5.3 FileDataSource

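　　This data source is much like URLDataSource, but it fetches content from files on disk. The difference from URLDataSource is that you specify a pathname to the file on disk.

　　It takes two parameters: encoding, which defaults to UTF-8, and basePath, the base path against which relative file paths are resolved.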
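
　　As an illustration, the sketch below reads one XML file relative to a base path. The directory, file name, element names and field names are hypothetical. The trailing slash on basePath is a precaution, in case the base path and the relative name are joined by plain concatenation.

```xml
<dataConfig>
    <!-- relative entity URLs are resolved against basePath -->
    <dataSource name="files" type="FileDataSource" basePath="/data/feeds/"
        encoding="UTF-8" />
    <document>
        <!-- intended to read /data/feeds/products.xml -->
        <entity name="product" dataSource="files" processor="XPathEntityProcessor"
            url="products.xml" forEach="/products/product">
            <field column="id" xpath="/products/product/id" />
            <field column="name" xpath="/products/product/name" />
        </entity>
    </document>
</dataConfig>
```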

　　1.5.4 JdbcDataSource

　　This is the default data source. It is used with the SqlEntityProcessor. See the example in the FieldReaderDataSource section for details.
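
　　A standalone definition might look like the sketch below. The MySQL driver, URL, credentials and data source name are hypothetical. batchSize and readOnly are the JDBC options mentioned in the annotated configuration earlier, and the values shown are illustrative.

```xml
<!-- type could be omitted, since JdbcDataSource is the default -->
<dataSource name="shopdb" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://localhost/shop" user="db_user" password="db_password"
    batchSize="500" readOnly="true" />
```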

　　1.5.5 URLDataSource

　　This data source is used with the XPathEntityProcessor to fetch content from a location such as file:// or http://. Here is an example:

<dataSource name="a" type="URLDataSource" baseUrl="http://host:port/"
    encoding="UTF-8" connectionTimeout="5000" readTimeout="10000" />

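　　Parameters:

Parameter	Description
baseUrl	Specifies the base URL for pathnames. Relative entity URLs are resolved against it.
connectionTimeout	Specifies the connection timeout in milliseconds. The default is 5000 ms.
encoding	By default, the encoding given in the response header is used. Use this parameter to override it.
readTimeout	Specifies the read timeout in milliseconds. The default is 10000 ms.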

1.6 Entity Processors

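　　Entity processors extract data, transform it, and add it to the index.

Attribute	Description
dataSource	The name of a data source.
name	Required. The unique name used to identify the entity.
pk	The primary key for the entity. It is optional, and required only when using delta-imports. It has no relation to the uniqueKey defined in schema.xml, although the two can be the same. It is referenced as ${dataimporter.delta.<column-name>}.
processor	Default is SqlEntityProcessor. Required if the data source is not an RDBMS.
onError	Permissible values are (abort|skip|continue). The default is 'abort'. 'skip' skips the current document. 'continue' ignores the error and processing continues.
preImportDeleteQuery	Before a full-import, this query is used to clean up the index. It is honored only on an entity that is an immediate child of <document>.
postImportDeleteQuery	Similar to the above, but executed after the full-import completes.
rootEntity	By default, the entities immediately under <document> are root entities. If this is set to false, the entity directly under that entity is treated as the root entity. A document is created for every row returned by the root entity.
transformer	Optional. One or more transformers to be applied to this entity.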
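
　　To show how several of these attributes combine, here is a hedged sketch of a single root entity. The data source name, table, columns, and the "source" field used in the delete query are all hypothetical and would need to exist in your own setup.

```xml
<document>
    <!-- 'product' sits directly under <document>, so it is a root entity and
         preImportDeleteQuery is honored on it -->
    <entity name="product" dataSource="shopdb" pk="ID"
        query="select ID, NAME from product"
        onError="skip"
        preImportDeleteQuery="source:shopdb"
        transformer="HTMLStripTransformer">
        <field column="ID" name="id" />
        <!-- stripHTML is read by HTMLStripTransformer -->
        <field column="NAME" name="name" stripHTML="true" />
    </entity>
</document>
```

　　With onError="skip", a row that fails is dropped and the import carries on, which may suit bulk loads where a few bad rows are acceptable.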

　　1.6.1 The SQL Entity Processor

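　　The SqlEntityProcessor is the default processor. The associated data source should be a JDBC URL.

　　The attributes of this entity are as follows:

Attribute	Description
query	Required. The SQL query used to select rows.
deltaQuery	SQL query used if the operation is delta-import. It selects the primary keys of the rows that have changed. These keys are available to deltaImportQuery through ${dataimporter.delta.<column-name>}.
parentDeltaQuery	SQL query used if the operation is delta-import. It selects the primary keys of the parent entity's rows that are affected by a change in this entity.
deletedPkQuery	SQL query used if the operation is delta-import. It selects the primary keys of rows that have been deleted.
deltaImportQuery	SQL query used if the operation is delta-import. If it is not present, DIH tries to construct the import query by modifying 'query'. The variable ${dataimporter.delta.<column-name>} can be used in it, for example: select * from tbl where id=${dataimporter.delta.id}.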
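
　　The sketch below puts the delta attributes together on one entity, reusing the item table from the annotated configuration earlier. The "deleted" flag column is an assumption made for this illustration and is not part of the sample database.

```xml
<entity name="item" pk="ID"
    query="select * from item"
    deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
    deltaImportQuery="select * from item where ID='${dataimporter.delta.ID}'"
    deletedPkQuery="select ID from item where deleted = 1">
    <!-- deltaQuery finds changed keys; deltaImportQuery then re-reads each changed row -->
    <field column="NAME" name="name" />
</entity>
```

　　The column name in ${dataimporter.delta.ID} is written here in the same case as the column selected by deltaQuery, which seems the safest choice.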

　　1.6.2 The XPathEntityProcessor

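　　This processor is used when indexing XML-formatted data. The data source is typically URLDataSource or FileDataSource. XPath can also be used with the FileListEntityProcessor to generate a document from each file.

　　The attributes of this entity are as follows:

Attribute	Description
processor	Required. Must be set to "XPathEntityProcessor".
url	Required. The HTTP URL or file location.
stream	Optional. Set to true for a large file or download.
forEach	Required unless you define useSolrAddSchema. The XPath expression that demarcates each record. It sets up the processing loop.
xsl	Optional. Its value (a URL or filesystem path) is the name of a resource used as a preprocessor for applying an XSL transformation.
useSolrAddSchema	Set this to true if the content is in the form of the standard Solr update XML schema.
flatten	Optional. If set to true, the text from under all the tags is extracted into one field.

　　The field elements under this entity processor have the following attributes:

Attribute	Description
xpath	Required. The XPath expression that extracts content from the record into the field. Only a subset of XPath syntax is supported.
commonField	Optional. If this field is encountered in a record, it is copied to future records when documents are created.

　　Example: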

<!-- slashdot RSS Feed - -->
<dataConfig>
    <dataSource type="HttpDataSource" />
    <document>
    <!-- forEach  sets up a processing loop ; here there are two expressions -->
        <entity name="slashdot" pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot" processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">
            <field column="source" xpath="/RDF/channel/title" commonField="true" />
            <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
            <field column="subject" xpath="/RDF/channel/subject" commonField="true" />
            <field column="title" xpath="/RDF/item/title" />
            <field column="link" xpath="/RDF/item/link" />
            <field column="description" xpath="/RDF/item/description" />
            <field column="creator" xpath="/RDF/item/creator" />
            <field column="item-subject" xpath="/RDF/item/subject" />
            <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
            <field column="slash-department" xpath="/RDF/item/department" />
            <field column="slash-section" xpath="/RDF/item/section" />
            <field column="slash-comments" xpath="/RDF/item/comments" />
        </entity>
    </document>
</dataConfig>

　　See: http://wiki.apache.org/solr/MailEntityProcessor

　　1.6.3 The TikaEntityProcessor

　　The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but uses the Data Import Handler instead.

　　Example:

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
            url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
            format="text">
            <field column="Author" name="author" meta="true" />
            <field column="title" name="title" meta="true" />
            <field column="text" name="text" />
        </entity>
    </document>
</dataConfig>

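　　Parameters for this processor:

Parameter	Description
dataSource	Defines the data source, which may be referred to in later parts of the configuration. The data sources available for this processor are: BinURLDataSource, used for HTTP resources but also usable for files; BinContentStreamDataSource, used for uploading content as a stream; BinFileDataSource, used for content on the local filesystem.
url	The path to the source file. It can be a file path or a traditional internet URL. This parameter is required.
htmlMapper	Controls how Tika parses HTML. The "default" mapper strips much of the HTML from documents, while the "identity" mapper passes all HTML through unmodified. The value must be "default" or "identity". The default is "default".
format	The output format: text, xml, html or none. If not defined, the default is "text". Use "none" if only metadata should be indexed and not the body of the documents.
parser	The default parser is org.apache.tika.parser.AutoDetectParser. To use a custom parser, give its fully qualified class name.
fields	The list of fields from the input documents that can be mapped to Solr document fields. If the attribute meta is set to true, the field is obtained from the document's metadata rather than parsed from the body of the text.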

　　1.6.4 The FileListEntityProcessor

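　　This processor is a wrapper: it generates a set of files that satisfy the given conditions and passes them to another processor, such as the XPathEntityProcessor. It generates four fields: fileAbsolutePath, fileSize, fileLastModified and fileName. This processor does not use a data source.

　　The attributes of this processor are as follows:

Attribute	Description
fileName	Required. A regular expression pattern that identifies the files to include.
baseDir	Required. The base directory (absolute path).
recursive	Whether to search directories recursively. Default is false.
excludes	A regular expression pattern that identifies the files to exclude.
newerThan	A date in the format yyyy-MM-dd HH:mm:ss, or a date math expression (NOW-2YEARS).
olderThan	A date, in the same formats as newerThan.
rootEntity	This should be set to false, so that the nested entity is treated as the root entity and a document is created for each row it emits.
dataSource	Must be set to null.

　　The example below shows the FileListEntityProcessor combined with another processor: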

<dataConfig>
    <dataSource type="FileDataSource" />
    <document>
        <!-- this outer processor generates a list of files satisfying the conditions 
            specified in the attributes -->
        <entity name="f" processor="FileListEntityProcessor" fileName=".*xml"
            newerThan="'NOW-30DAYS'" recursive="true" rootEntity="false"
            dataSource="null" baseDir="/my/document/directory">
            <!-- this processor extracts content using Xpath from each file found -->
            <entity name="nested" processor="XPathEntityProcessor"
                forEach="/rootelement" url="${f.fileAbsolutePath}">
                <field column="name" xpath="/rootelement/name" />
                <field column="number" xpath="/rootelement/number" />
            </entity>
        </entity>
    </document>
</dataConfig>

　　1.6.5 LineEntityProcessor

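　　This processor reads content from the data source line by line and returns a field called rawLine for each line read. The content is not parsed in any way. However, you can add transformers to manipulate the data in the rawLine field, or to create additional fields.

　　The lines read can be filtered with two regular expressions, given in the acceptLineRegex and omitLineRegex attributes. The attributes of the LineEntityProcessor are as follows:

Attribute	Description
url	Required. Specifies the location of the input file. FileDataSource and URLDataSource can be used as the data source.
acceptLineRegex	Optional. If present, any line that does not match the regular expression is discarded.
omitLineRegex	Optional. Applied after acceptLineRegex. Any line that matches the regular expression is discarded.

　　Example: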

<entity name="jc"
    processor="LineEntityProcessor"
    acceptLineRegex="^.*\.xml$"
    omitLineRegex="/obsolete"
    url="file:///Volumes/ts/files.lis"
    rootEntity="false"
    dataSource="myURIreader1"
    transformer="RegexTransformer,DateFormatTransformer"
>
    ...

</entity>

　　Consider this processor when you want to create one document for each line of a file.

　　1.6.6 PlainTextEntityProcessor

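　　This processor reads all content from the data source into a single field called plainText. The content is not parsed in any way. However, you can add transformers to manipulate the data in the plainText field, or to create additional fields.

　　Example: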

<entity processor="PlainTextEntityProcessor" name="x"
    url="http://abc.com/a.txt" dataSource="data-source-name">
    <!-- copies the text to a field called 'text' in Solr -->
    <field column="plainText" name="text" />
</entity>

　　Make sure that the data source of this entity is of type DataSource<Reader> (FileDataSource, URLDataSource).

1.7 Transformers

　　Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or modify existing ones.

<entity name="abcde" transformer="org.apache.solr....,my.own.transformer,..." />

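　　Once the transformers are specified, the specific transformation rules are added as attributes of the <field> elements. Transformers are applied in the order in which they are listed in the transformer attribute.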
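
　　To make the ordering concrete, the sketch below chains two built-in transformers so that the second consumes the output of the first. The table and column names are hypothetical, and it assumes the "published" column holds text such as 2007/07/14.

```xml
<entity name="article" query="select id, published from article"
    transformer="RegexTransformer,DateFormatTransformer">
    <!-- RegexTransformer is listed first: rewrite 2007/07/14 as 2007-07-14 -->
    <field column="published_clean" sourceColName="published"
        regex="/" replaceWith="-" />
    <!-- DateFormatTransformer is listed second: parse the cleaned value as a date -->
    <field column="published_date" sourceColName="published_clean"
        dateTimeFormat="yyyy-MM-dd" />
</entity>
```

　　Listing the two transformers in the opposite order would presumably leave the date step with nothing to parse, because published_clean would not exist yet when it runs.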

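　　The Data Import Handler contains several built-in transformers. You can also write your own custom transformers, as described at http://wiki.apache.org/solr/DIHCustomTransformer. The ScriptTransformer offers an alternative way of writing your own transformers.

　　Solr includes the following built-in transformers:

Transformer	Description
ClobTransformer	Creates a string from a CLOB type in a database.
DateFormatTransformer	Parses date/time instances.
HTMLStripTransformer	Strips HTML tags from a field.
LogTransformer	Logs data to log files or to the console.
NumberFormatTransformer	Uses the NumberFormat class to parse a string into a number.
RegexTransformer	Uses regular expressions to manipulate fields.
ScriptTransformer	Lets you write a transformer in JavaScript or any other scripting language supported by Java. Requires Java 6.
TemplateTransformer	Transforms data using a template.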

　　1.7.1 ClobTransformer

　　The ClobTransformer creates a string from a CLOB in a database.

<entity name="e" transformer="ClobTransformer" .>
    <field column="hugeTextField" clob="true" />
    ...
</entity>

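　　The ClobTransformer accepts these attributes:

Attribute	Description
clob	Boolean value that signals whether the ClobTransformer should process this field. If this attribute is omitted, the field is not transformed.
sourceColName	The name of the column used as input. If this attribute is absent, source and target are the same.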

　　1.7.2 DateFormatTransformer

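　　This transformer converts dates from one format to another.

　　It recognizes the following attributes:

Attribute	Description
dateTimeFormat	The format used for parsing this field. It must comply with the syntax of the Java SimpleDateFormat class.
sourceColName	The column to which the date format is applied. If absent, source and target are the same.
locale	The locale used for date transformations. If not specified, the ROOT locale is used. It must be given as language-country, for example en-US.

　　Here is an example that returns "2007-JUL":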

<entity name="en" pk="id" transformer="DateFormatTransformer" .>
    ...
    <field column="date" sourceColName="fulldate" dateTimeFormat="yyyy-MMM" />
</entity>

　　1.7.3 HTMLStripTransformer

　　You can use this transformer to strip HTML tags from a field:

<entity name="e" transformer="HTMLStripTransformer" .>
    <field column="htmlText" stripHTML="true" />
    ...
</entity>

　　1.7.4 LogTransformer

　　You can use this transformer to log data to log files or to the console.

<entity . transformer="LogTransformer" logTemplate="The name is ${e.name}"
    logLevel="debug">
    ....
</entity>

　　1.7.5 NumberFormatTransformer

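　　Use this transformer to parse a string into a number.

Attribute	Description
formatStyle	The format used for parsing this field. It must be one of (number|percent|integer|currency).
sourceColName	The column to which the NumberFormat is applied. If absent, source and target are the same.
locale	The locale. If not specified, the ROOT locale is used. It must be given as language-country, for example en-US.

　　For example: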

<entity name="en" pk="id" transformer="NumberFormatTransformer" .>
    ...
    <!-- treat this field as UK pounds -->
    <field name="price_uk" column="price" formatStyle="currency" locale="en-GB" />
</entity>

　　1.7.6 RegexTransformer

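　　The regex transformer uses regular expressions to extract or manipulate values from fields of the source data. The actual class is org.apache.solr.handler.dataimport.RegexTransformer. Because it is in the default package, the package name can be omitted.

　　The regex transformer recognizes the following attributes:

Attribute	Description
regex	The regular expression matched against the value of column or sourceColName. If replaceWith is absent, each regex group is taken as a value and a list of values is returned.
sourceColName	The column to which the regex is applied. If absent, source and target are the same.
splitBy	Used to split a string. It returns a list of values.
groupNames	A comma-separated list of field names, one per regex group. Each group's value is saved to the correspondingly named field.
replaceWith	Used together with regex. It is equivalent to the method call new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>).

　　Example: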

<entity name="foo" transformer="RegexTransformer"
    query="select full_name, emailids from foo" ...>
    <field column="full_name" />
    <field column="firstName" regex="Mr(\w*)\b.*" sourceColName="full_name" />
    <field column="lastName" regex="Mr.*?\b(\w*)" sourceColName="full_name" />
    <!-- another way of doing the same -->
    <field column="fullName" regex="Mr(\w*)\b(.*)" groupNames="firstName,lastName" />
    <field column="mailId" splitBy="," sourceColName="emailids" />
</entity>
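
　　The example above does not exercise replaceWith, so here is a small sketch of it. The table and column names are hypothetical. Per the attribute table, the rule amounts to calling replaceAll("\s+", "-") on the source value.

```xml
<entity name="city" transformer="RegexTransformer"
    query="select id, name from city">
    <!-- replace each run of whitespace with one dash: "New York City" becomes "New-York-City" -->
    <field column="slug" sourceColName="name" regex="\s+" replaceWith="-" />
</entity>
```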

　　1.7.7 ScriptTransformer

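　　The script transformer lets you write transformer functions in any scripting language supported by Java, such as JavaScript, JRuby, Jython, Groovy, or BeanShell. JavaScript is integrated into Java 6; you need to integrate other languages yourself.

　　Each function you write must accept a row variable (which corresponds to a Java Map<String,Object>, and so permits get, put and remove operations).

　　The script is placed at the top level of the configuration file and is called once for each row.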

<dataConfig>
    <!-- simple script to generate a new row, converting a temperature from 
        Fahrenheit to Centigrade -->
    <script>
        <![CDATA[
            function f2c(row) { 
                var tempf, tempc; 
                tempf = row.get('temp_f'); 
                if (tempf !=null) { 
                    tempc = (tempf - 32.0) * 5.0 / 9.0;
                    row.put('temp_c', tempc);
                }
                return row;
            }
        ]]>
        </script>
    <document>
        <!-- the function is specified as an entity attribute -->
        <entity name="e1" pk="id" transformer="script:f2c" query="select * from X">
            ....
        </entity>
    </document>
</dataConfig>

　　1.7.8 TemplateTransformer

　　You can use the template transformer to construct or modify a field value:

<entity name="en" pk="id" transformer="TemplateTransformer" .>
    ...
    <!-- generate a full address from fields containing the component parts -->
    <field column="full_address" template="${en.street},${en.city},${en.zip}" />
</entity>

1.8 Special Commands for the Data Import Handler

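　　Special commands can be passed to the DIH by adding any of the variables below to a row returned by any component.

Variable	Description
$skipDoc	Skip the current document; that is, do not add it to Solr. The value is true|false.
$skipRow	Skip the current row. The value is true|false.
$docBoost	Boost the current document. The value can be a number or the toString form of a number.
$deleteDocById	Delete a document by ID. The value must be the Solr uniqueKey of the document.
$deleteDocByQuery	Delete documents by query. The value must be a Solr query.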
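
　　One way to set such a variable is from a ScriptTransformer function, since it receives the row and can put extra keys into it. The sketch below is an assumption-laden illustration: the ARTICLE table, its STATUS column and the "draft" value are hypothetical, and the data source is the HSQLDB one from the earlier sample.

```xml
<dataConfig>
    <script>
        <![CDATA[
            // hypothetical rule: rows whose STATUS is 'draft' are not indexed
            function skipDrafts(row) {
                if ('draft' == String(row.get('STATUS'))) {
                    row.put('$skipDoc', 'true');
                }
                return row;
            }
        ]]>
    </script>
    <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex"
        user="sa" />
    <document>
        <entity name="article" pk="ID" transformer="script:skipDrafts"
            query="select ID, TITLE, STATUS from ARTICLE">
            <field column="ID" name="id" />
            <field column="TITLE" name="title" />
        </entity>
    </document>
</dataConfig>
```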