Handling Special Delimiters in Hive

By default, Hive only supports single-byte field delimiters in data files; the default delimiter is the control character \001 (Ctrl+A). You can also specify a custom delimiter when creating a table (see the sketch after the sample data below). The problem arises when the delimiter in the data file is a multi-character sequence, for example:

01||zhangsan
02||lisi
03||wangwu
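
For reference, this is how a single-character delimiter is declared at table creation time. A minimal sketch (the table name t_demo and the comma delimiter are illustrative):

create table t_demo(id string,name string)
row format delimited
fields terminated by ','
stored as textfile;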

 

Supplement: how Hive reads data

1. First, a concrete implementation of InputFormat reads the file data and returns records one at a time (a record can be a physical line, or whatever counts as a "line" in your own logic).

2. Then a concrete implementation of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each returned record into fields.
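
You can check which InputFormat and SerDe a given table actually uses with DESCRIBE FORMATTED; for a plain TEXTFILE table the storage section reports the defaults described above (output abridged; t_demo is the illustrative table from the sketch earlier):

hive> desc formatted t_demo;
...
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
...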

 

Using RegexSerDe to extract fields with a regular expression

1. Create the table

create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(.*)\\|\\|(.*)',
'output.format.string'='%1$s%2$s'
)
stored as textfile;
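
Note the doubled backslashes in input.regex: Hive unescapes string literals in the DDL before the pattern reaches the regex engine, so \\| in the statement arrives at the SerDe as the regex \| (a literal pipe character).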

 

2. Load the data. The file /root/lianggang.txt contains:

01||zhangsan
02||lisi
03||wangwu

 

load data local inpath '/root/lianggang.txt' into table t_bi_reg;

3. Query

hive> select * from t_bi_reg;
OK
01      zhangsan
02      lisi
03      wangwu
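
One caveat worth knowing: lines that do not match input.regex do not raise an error; RegexSerDe simply emits NULL for every column of such a row.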


Solving the special-delimiter problem with a custom InputFormat

The idea: while the InputFormat reads each line, it replaces the multi-byte delimiter in the data with Hive's default delimiter (Ctrl+A, i.e. \001) or with some other single-character substitute, so that when Hive runs its SerDe step it can extract fields using an ordinary single-byte delimiter.

package cn.gec.bigdata.hive.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the standard LineRecordReader so that every line can be
        // rewritten before Hive's SerDe sees it.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            boolean next = reader.next(key, text);
            if (next) {
                // Replace the two-character delimiter "||" with a single '|'
                // (the pipes are escaped because '|' is a regex metacharacter).
                String replaceText = text.toString().replaceAll("\\|\\|", "\\|");
                value.set(replaceText);
            }
            return next;
        }
    }
}
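
Note that the class is written against the old org.apache.hadoop.mapred API rather than org.apache.hadoop.mapreduce; Hive expects an input format declared via STORED AS INPUTFORMAT to implement this old-style interface.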

 

1. Package the class into a jar and place it under $HIVE_HOME/lib (alternatively, register the jar per session; see the note after the table definition below).

2. Create the table, specifying the custom InputFormat

create table t_lianggang(id string,name string)
row format delimited
fields terminated by '|'
stored as inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
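
If you would rather not copy the jar into $HIVE_HOME/lib, it can instead be registered for the current session with ADD JAR (the jar path below is illustrative):

add jar /root/bi-inputformat.jar;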

 

3. Load the data. The file /root/lianggang.txt contains:

01||zhangsan
02||lisi
03||wangwu

 

load data local inpath '/root/lianggang.txt' into table t_lianggang;

 

4. Query

hive> select * from t_lianggang;
OK
01      zhangsan
02      lisi
03      wangwu
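
Note that the file on disk still contains the two-character || delimiter; the substitution to a single | happens at read time inside the custom record reader, before the SerDe splits each record into fields.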


Original source: https://www.cnblogs.com/Transkai/p/10628508.html