Hive file formats

Hive's default storage format is plain text (TEXTFILE). Text files are easy to inspect directly and easy to share with other tools, but they take up more space than the binary formats. (A quick way to confirm the data really is plain text is sketched after the example below.)

hive> create table tb_test(id int,name string) stored as textfile;
OK
Time taken: 0.968 seconds
hive> show create table tb_test;
OK
createtab_stmt
CREATE TABLE `tb_test`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1536636132')
Time taken: 0.275 seconds, Fetched: 18 row(s)
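
Because the table is stored as plain text, the file on HDFS can be read directly. A minimal sketch, assuming a local file /tmp/tb_test.txt whose fields are separated by Hive's default delimiter (Ctrl-A, \001); the file name and its contents are hypothetical and only for illustration:

hive> load data local inpath '/tmp/tb_test.txt' into table tb_test;
[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_test/tb_test.txt

The cat output is human-readable, which is exactly the TEXTFILE trade-off: easy to inspect and exchange, but larger on disk than a binary format.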

SEQUENCEFILE is a binary key/value file format that Hadoop itself supports as a standard format. It can be compressed at the record or block level, which matters a great deal for disk utilization and I/O, and the files can be split at block boundaries, which makes parallel processing easier. (Typical compression settings are sketched after the example below.)

hive> create table tb_test2(id int,name string) stored as sequencefile;
OK
Time taken: 0.264 seconds
hive> show create table tb_test2;
OK
createtab_stmt
CREATE TABLE `tb_test2`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test2'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1536636180')
Time taken: 0.222 seconds, Fetched: 18 row(s)
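
Block-level compression is not turned on just by choosing STORED AS SEQUENCEFILE; it is controlled by session settings at write time. A rough sketch of typical settings (the codec choice is an assumption, and on newer Hadoop versions the mapreduce.output.fileoutputformat.compress.* property names supersede the mapred.* ones shown here):

hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.type=BLOCK;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> insert overwrite table tb_test2 select id, name from tb_test;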

RCFile is another efficient binary storage format supported by Hive. Most Hadoop and Hive data is stored row by row, which is efficient in the majority of cases, but for certain kinds of data and applications a columnar layout works better. If a table has hundreds or thousands of columns and most queries read only a handful of them, scanning every full row just to discard most of the data is wasteful; with columnar storage only the needed columns are read, which improves performance.

Columnar storage also tends to compress well, and some columnar formats do not even need to physically store columns that contain only NULL values.

One of Hive's strengths is how simple it is to convert data between storage formats: an INSERT INTO ... SELECT from a table in one format into a table in another performs the conversion automatically (see the sketch after the RCFile example below).

hive> create table tb_test3(id int,name string) stored as RCfile;
OK
Time taken: 0.438 seconds
hive> show create table tb_test3;
OK
createtab_stmt
CREATE TABLE `tb_test3`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test3'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1536645316')
Time taken: 0.244 seconds, Fetched: 18 row(s)
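
With both tables defined, the format conversion mentioned earlier is a single statement; Hive rewrites the rows into the RCFile layout of the target table automatically. A minimal sketch, assuming tb_test already holds some rows (the original post does not show how tb_test3 was actually populated):

hive> insert into table tb_test3 select id, name from tb_test;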

Hive provides the rcfilecat tool to display the contents of an RCFile:

[root@host ~]# hdfs dfs -ls /user/hive/warehouse/gamedw.db/tb_test3
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 4 items
-rwx-wx-wx   1 root supergroup         82 2018-09-11 13:57 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-3b1d31b4-cde8-4054-b5d3-28179d2a4cc8-c000
-rwx-wx-wx   1 root supergroup         87 2018-09-11 14:05 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-565683b3-738e-4627-8c7a-53d43d819a0e-c000
-rwx-wx-wx   1 root supergroup         88 2018-09-11 14:05 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-b2ccd667-954d-4f8e-8110-04915d810e17-c000
-rwx-wx-wx   1 root supergroup         82 2018-09-11 13:57 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-bc3be7c3-86de-4873-8390-a850872fe0c7-c000

[root@host ~]# hive --service rcfilecat /user/hive/warehouse/gamedw.db/tb_test3/part-00000-565683b3-738e-4627-8c7a-53d43d819a0e-c000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/spark/spark-2.2.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
       wang    2

Original article: https://www.cnblogs.com/playforever/p/9627118.html