hive 上篇

hive 是以hadoop为基础的数据仓库,使用HQL查询存放在HDFS上面的数据,HSQL是一种类SQL的语句,最终会被编译成map/reduce,HSQL可以查询HDFS上面的数据;不支持记录级别的插入,更新或者,但用户可以通过查询生产新表或者将查询文件保存在文件中。同时由于hadoop是面向批处理的系统,而mapreduce任务(job)的启动过程需要消耗较多的时间,所以hive查询延迟比较严重,因此提交查询和返回查询结果可能会有较大的延迟,不能进行交互性查询。在传统数据中秒级的查询,在hive中,即便是数据集相对较少,往往也会需要更长的时间;hive不支持事务。因此不支持OLTP,如果想对大规模的数据进行OLTP功能,则可以选择一种NOSQL数据库,如hbase等

hive替代工具最有名的就是PIG了,pig是一种数据流语言,而不是一种查询语言。常用于ETL过程的一部分,也就是将部数据装载到HADOOP集群中,然后转换为期望的数据格式。

hadoop团队通常会将hive和pig结合使用。

如果用户需要hive无法提供的特性(行级更新,快速查询响应以及支持事务),可以用hbase;

hbase是一种分布式,可伸缩的数据存储,支持行级别的更新,快速查询,行级事务(不支持多行事务)。

hbase支持的一个重要特性是列存储,其中列可以组织成列族,列族在分布式存储中物理上是存储在一起的,这使得查询场景涉及的列只是所有列的一个子集时,读写速度会快很多,因为不需要读取所有的行然后丢弃大部分列,而是读取需要的列。

可以像键-值存储一样使用hbase,其每一行都使用了唯一键来提供非常快速的读写这一行的列或者列族,hbase还可以对每个列保存多个版本的值(按时间戳进行标记),版本数据是可以配置的.

hbase使用HDFS持久化存储数据

hbase没有提供类似于sql的查询语言,可以和hive结合使用

变量和属性

hivevar 用户自定义变量  可以读写

hiveconf  hive相关配置属性 可以读写

system java定义的配置属性 可以读写

env  shell环境定义的环境变量 只读

在cli中,可以使用set显示或者修改变量值

查看env的一个变量

hive> set env:HOME;
env:HOME=/root

如果只有set 会打印出命名空间hivevar,hiveconf,system,env中的所有变量

hive> set hivevar:foo;
hivevar:foo is undefined as a hive variable
Query returned non-zero code: 1, cause: null
hive> set foo='aa';
hive> set foo;
foo='aa'

hive> set hivevar:foo='bbb';
hive> set foo;
foo='aa'

hive> set foo='ccc';
hive> set foo;
foo='ccc'

hive> set hivevar:num=20;
hive> set num;
num=20

hive> select * from gamedw.cust where nianling = ${num};
OK
zhangsan        1       20
zhangsan        1       20
Time taken: 0.337 seconds, Fetched: 2 row(s)

hive> set hivevar:name='nihao'
    > ;
hive> select * from gamedw.cust where custname= ${name};
OK
nihao   1       5
nihao   1       5
Time taken: 0.34 seconds, Fetched: 2 row(s)
hive> select * from gamedw.cust where custname= ${hivevar:name};
OK
nihao   1       5
nihao   1       5
Time taken: 0.39 seconds, Fetched: 2 row(s)

hive> set name1='hello';
hive> select * from gamedw.cust where custname= ${hiveconf:name1};
OK
hello   0       100
hello   0       100
Time taken: 0.362 seconds, Fetched: 2 row(s)

由上面可知定义变量前缀hivevar:是可选的

查看建库语句:

hive> show create database gamedw;
OK
CREATE DATABASE `gamedw`
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db'
WITH DBPROPERTIES (
  'creator'='tianyongtaao')
Time taken: 0.01 seconds, Fetched: 5 row(s)

创建一个表:

hive> create table tb_test(id int,str string);
OK
Time taken: 0.379 seconds
hive> show create table tb_test;
OK
CREATE TABLE `tb_test`(
  `id` int,
  `str` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1521107014')
Time taken: 0.049 seconds, Fetched: 18 row(s)

hive中引用部变量:  在代码中引用时使用${...}; 

[root@host bin]#  hive -d name=username;

hive> set name;
name=username

hive> use gamedw;
OK
Time taken: 0.016 seconds

hive> create table taa(id int,${name} string);
OK
Time taken: 0.516 seconds
hive> desc taa;
OK
id                      int                                        
username                string                                     
Time taken: 0.061 seconds, Fetched: 2 row(s)

hive一次查询命令:

[root@host bin]# hive -e "select * from gamedw.gamerole;"

hive 一次查询命令并保存查询结果(输出到本地文件而不是HDFS):

[root@host bin]# hive -e "select * from gamedw.gamerole">/root/tmpdata/gamerole.txt

查看数据仓库路径:

[root@host bin]# hive -S -e "set" |grep ware
which: no hbase in (/root/app/apache-maven-3.5.2/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.8.0_101/bin:/root/hadoop/hadoop-2.7.4/bin:/root/hive/apache-hive-2.1.1/bin:/root/spark/spark-2.2.0-bin-hadoop2.7/bin:/root/scala/scala-2.12.3/bin:/root/scala/sbt:/opt/netcat/bin:/usr/java/jdk1.8.0_101/bin:/root/hadoop/hadoop-2.7.4/bin:/root/hive/apache-hive-2.1.1/bin:/root/spark/spark-2.2.0-bin-hadoop2.7/bin:/root/scala/scala-2.12.3/bin:/root/scala/sbt/bin:/root/bin:/root/hadoop/hadoop-2.7.4/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true

执行文件中的一个或者多个查询语句:

[root@host ~]# cat /root/tmpdata/getgamerole.hql
select * from gamedw.gamerole;
use gamedw;
show tables;

[root@host bin]# hive -f /root/tmpdata/getgamerole.hql

.......................

OK
Time taken: 2.104 seconds
OK
Time taken: 0.035 seconds
OK
gamerole
taa
tb_test
tb_test2
Time taken: 0.027 seconds, Fetched: 4 row(s)

hive shell里面执行脚本文件:

hive> source /root/tmpdata/getgamerole.hql;
OK
Time taken: 0.144 seconds
OK
Time taken: 0.009 seconds
OK
gamerole
taa
tb_test
tb_test2
Time taken: 0.011 seconds, Fetched: 4 row(s)

从本地文件加载数据:

hive> create table src(ss string);
OK
Time taken: 0.21 seconds

[root@host ~]# cat /root/tmpdata/tian.txt
name tianyongtao
sex man
game wzry
age 100
score 120
home linying

hive> load data local inpath '/root/tmpdata/tian.txt' into table src;
Loading data to table gamedw.src
OK
Time taken: 0.999 seconds
hive> select * from src;
OK
name tianyongtao
sex man
game wzry
age 100
score 120
home linying
Time taken: 0.13 seconds, Fetched: 6 row(s)

hive cli更多功能

tab键 自动补全功能

执行shell命令:在命令前加上!,后面加上;就可以了(不支持管道功能)

hive> !ls /root/app/;
apache-maven-3.5.2
eclipse

hive> !pwd;
/root

hive 使用hadoop的dfs命令(去掉前面的hdfs):

[root@host ~]# hdfs dfs -ls /tmp;
Found 5 items
drwxr-xr-x   - root supergroup          0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x   - root supergroup          0 2017-12-07 16:35 /tmp/20171024
drwxr-xr-x   - root supergroup          0 2017-12-07 16:04 /tmp/20171206
drwxr-xr-x   - root supergroup          0 2017-12-07 16:20 /tmp/20171207
drwx-wx-wx   - root supergroup          0 2017-11-02 16:42 /tmp/hive

hive>  dfs -ls /tmp;
Found 5 items
drwxr-xr-x   - root supergroup          0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x   - root supergroup          0 2017-12-07 16:35 /tmp/20171024
drwxr-xr-x   - root supergroup          0 2017-12-07 16:04 /tmp/20171206
drwxr-xr-x   - root supergroup          0 2017-12-07 16:20 /tmp/20171207
drwx-wx-wx   - root supergroup          0 2017-11-02 16:42 /tmp/hive

hive使用-- 进行注释

 hive 显示字段名称

hive> set hive.cli.print.header=true;

hive> select * from gamerole;
OK
gamerole.id     gamerole.rolename       gamerole.createtime     gamerole.zoneid
Time taken: 0.123 seconds

显示当前库名:

hive> set hive.cli.print.current.db=true;
hive (gamedw)>

数据类型 int,float,double,string,timestamp等

集合数据类型struct ,map,array等

HQL

列举以g开头的数据库

hive> show databases like 'g.*';
OK
gamedw
Time taken: 0.008 seconds, Fetched: 1 row(s)

创建数据库

hive> create database testdb with DBPROPERTIES('creater'='tyt')
    > ;
OK
Time taken: 0.27 seconds
hive> desc database testdb;
OK
testdb          hdfs://localhost:9000/user/hive/warehouse/testdb.db     root    USER
Time taken: 0.009 seconds, Fetched: 1 row(s)

hive> desc database extended testdb;
OK
testdb          hdfs://localhost:9000/user/hive/warehouse/testdb.db     root    USER    {creater=tyt}
Time taken: 0.009 seconds, Fetched: 1 row(s)

修改数据库的属性,其他数据库的元数据都是不可以修改的

hive> alter database testdb set  DBPROPERTIES('creater'='tyt','createtime'='2018-03-19');
OK
Time taken: 0.065 seconds
hive> desc database extended testdb;
OK
testdb          hdfs://localhost:9000/user/hive/warehouse/testdb.db     root    USER    {creater=tyt, createtime=2018-03-19}
Time taken: 0.009 seconds, Fetched: 1 row(s)

修改表属性

hive> alter table employees set tblproperties('creator'='tianyongtao');
OK
Time taken: 0.132 seconds

拷贝一个表(不拷贝数据)

hive> create table if not exists role like gamerole;
OK
Time taken: 0.355 seconds
hive> show tables like 'r.*';
OK
role
Time taken: 0.011 seconds, Fetched: 1 row(s)

列举某个数据库的表

hive> show tables in userdb;
OK
name
Time taken: 0.011 seconds, Fetched: 1 row(s)

我们可以通过的desc extended tbname来查看详细的表结构但可读性不强

hive> desc extended employees;
OK
name                    string                                     
salary                  float                                      
subdinates              array<string>                              
deducation              map<string,float>                          
address                 struct<street:string,city:string,state:string,zip:int>                     
                
Detailed Table Information      Table(tableName:employees, dbName:gamedw, owner:root, createTime:1521184118, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:salary, type:float, comment:null), FieldSchema(name:subdinates, type:array<string>, comment:null), FieldSchema(name:deducation, type:map<string,float>, comment:null), FieldSchema(name:address, type:struct<street:string,city:string,state:string,zip:int>, comment:null)], location:hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{creator=tianyongtao, last_modified_time=1521447397, totalSize=0, numRows=0, rawDataSize=0, numFiles=0, transient_lastDdlTime=1521447397, last_modified_by=root}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.032 seconds, Fetched: 7 row(s)

我们通常会使用formatted 替代extended,这样可以获得更加可读的信息。

hive> desc formatted employees;
OK
# col_name              data_type               comment            
                
name                    string                                     
salary                  float                                      
subdinates              array<string>                              
deducation              map<string,float>                          
address                 struct<street:string,city:string,state:string,zip:int>                     
                
# Detailed Table Information            
Database:               gamedw                  
Owner:                  root                    
CreateTime:             Fri Mar 16 15:08:38 CST 2018    
LastAccessTime:         UNKNOWN                 
Retention:              0                       
Location:               hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees   
Table Type:             MANAGED_TABLE           
Table Parameters:               
        creator                 tianyongtao        
        last_modified_by        root               
        last_modified_time      1521447397         
        numFiles                0                  
        numRows                 0                  
        rawDataSize             0                  
        totalSize               0                  
        transient_lastDdlTime   1521447397         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        serialization.format    1                  
Time taken: 0.032 seconds, Fetched: 36 row(s)

通常我们在hive创建的表为管理表,也叫内部表,存储位置与hive配置信息有关,管理表时,hive也会表里的数据。

管理表不方便与其他工作共享数据,如果一份数据, 想使用hive对其进行查询,我们可以创建一个部表来指向这份数据,而不用对其具有所有权。

创建一个部表

HDFS有一个文件如下:

[root@host ~]# hdfs dfs -cat /tmp/20180321/name.csv
?1,爷爷        ,1
2,大爷        ,1
3,叔叔        ,1
4,堂哥        ,2
5,堂姐        ,2
6,堂妹        ,3

创建部表:

hive> create external  table if not exists t_name
    > (id int,name string,level int)
    > row format delimited fields terminated by ','
    > location '/tmp/20180321';
OK
Time taken: 0.108 seconds
hive> select * from t_name;
OK
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
Time taken: 0.127 seconds, Fetched: 6 row(s)

[root@host ~]# hdfs dfs -cp /tmp/20180321/name.csv /tmp/20180321/name1.csv

[root@host ~]# hdfs dfs -ls /tmp/2018*
Found 2 items
-rw-r--r--   1 root supergroup        123 2017-10-26 11:10 /tmp/20180321/name.csv
-rw-r--r--   1 root supergroup        123 2018-03-21 17:57 /tmp/20180321/name1.csv

hive> select * from t_name;
OK
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3

 hive> set hive.cli.print.header=true;
hive> select * from t_name where id=1;
OK
t_name.id       t_name.name     t_name.level
1       爷爷            1
1       爷爷            1
Time taken: 0.131 seconds, Fetched: 2 row(s)

hive> desc formatted t_name;
OK
col_name        data_type       comment
# col_name              data_type               comment            
                
id                      int                                        
name                    string                                     
level                   int                                        
                
# Detailed Table Information            
Database:               gamedw                  
Owner:                  root                    
CreateTime:             Wed Mar 21 17:52:22 CST 2018    
LastAccessTime:         UNKNOWN                 
Retention:              0                       
Location:               hdfs://localhost:9000/tmp/20180321      
Table Type:             EXTERNAL_TABLE          
Table Parameters:               
        EXTERNAL                TRUE               
        numFiles                1                  
        totalSize               123                
        transient_lastDdlTime   1521625942         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        field.delim             ,                  
        serialization.format    ,                  
Time taken: 0.034 seconds, Fetched: 31 row(s)

可以复制部表结构,生成一个管理表(内部表),也可以复制部表为部表,但需要加external修饰

以下复制部表生成一个管理表:

hive> create table t_name1  like t_name;
OK
Time taken: 0.15 seconds

如果想复制部表生成一个部表,格式如下:

需要添加external修饰,可以指定部数据源

hive> create external table if not exists t_name2 like t_name location '/tmp/20180321'
    > ;
OK
Time taken: 0.297 seconds
hive> select * from t_name2;
OK
t_name2.id      t_name2.name    t_name2.level
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
Time taken: 0.206 seconds, Fetched: 12 row(s)

分区表,管理表

数据分区可以水平分散压力,也可以将数据物理转移到距离使用者近的地方。

 创建分区表,分区列为city:

hive> CREATE TABLE `customers`(
    >   `custname` string,
    >   `sex` int,
    >   `age` int)
    > PARTITIONED BY (
    >   `city` string)
    > ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ' '
    >   LINES TERMINATED BY ' ' ;
OK
Time taken: 0.462 seconds

添加分区(导入数据的时候,会自动创建分区,因此此步骤也可以忽略):

hive> alter table customers add partition(city='shenzhen');
OK
Time taken: 0.233 seconds
hive> alter table customers add partition(city='beijing');
OK
Time taken: 0.136 seconds

查看数据表文件目录:

drwx-wx-wx   - root supergroup          0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers/city=beijing
drwx-wx-wx   - root supergroup          0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers/city=shenzhen

导入数据(由于分区列city的值已经包含在文件目录里面了,因此也没必要把值的内容存放在它们目录下面的文件中了)

[root@host tmpdata]# cat /root/tmpdata/customers.txt
tianyongtao 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47

hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='shanghai');
Loading data to table gamedw.customers partition (city=shanghai)
OK
Time taken: 0.567 seconds
hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='beijing');
Loading data to table gamedw.customers partition (city=beijing)
OK
Time taken: 0.447 seconds

 查看HDFS的表文件目录:

drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx   1 root supergroup         80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx   1 root supergroup         80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt

从数据表查数据:

hive> set hive.cli.print.header=true;

hive> select * from customers;
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      beijing
wangwu  1       85      beijing
zhangsan        1       20      beijing
liuqin  0       56      beijing
wangwu  0       47      beijing
liuyang 1       32      beijing
tianyongtao     1       50      shanghai
wangwu  1       85      shanghai
zhangsan        1       20      shanghai
liuqin  0       56      shanghai
wangwu  0       47      shanghai
liuyang 1       32      shanghai
Time taken: 0.152 seconds, Fetched: 12 row(s)
hive> select * from customers where city='beijing';
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      beijing
wangwu  1       85      beijing
zhangsan        1       20      beijing
liuqin  0       56      beijing
wangwu  0       47      beijing
liuyang 1       32      beijing
Time taken: 0.153 seconds, Fetched: 6 row(s)

在分区不存在的情况下,导入一个新的分区(在管理表中,可以通过载入数据的方式创建分区):

hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='shenzhen');
Loading data to table gamedw.customers partition (city=shenzhen)
OK
Time taken: 0.439 seconds

再次查看HDFS数据表文件目录:

drwx-wx-wx   - root supergroup          0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx   1 root supergroup         80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx   1 root supergroup         80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
drwx-wx-wx   - root supergroup          0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

hive> select * from customers where city='shenzhen';
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      shenzhen
wangwu  1       85      shenzhen
zhangsan        1       20      shenzhen
liuqin  0       56      shenzhen
wangwu  0       47      shenzhen
liuyang 1       32      shenzhen
Time taken: 0.169 seconds, Fetched: 6 row(s)

hive> select * from customers where city='shenzhen' limit 2;
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      shenzhen
wangwu  1       85      shenzhen
Time taken: 0.175 seconds, Fetched: 2 row(s)

查看分区表的分区:

hive> show partitions customers;
OK
partition
city=beijing
city=shanghai
city=shenzhen
Time taken: 0.067 seconds, Fetched: 3 row(s)

hive> show partitions customers partition(city='shenzhen');
OK
partition
city=shenzhen
Time taken: 0.078 seconds, Fetched: 1 row(s)

hive> desc formatted customers;
OK
col_name        data_type       comment
# col_name              data_type               comment            
                
custname                string                                     
sex                     int                                        
age                     int                                        
                
# Partition Information         
# col_name              data_type               comment            
                
city                    string                                     
                
# Detailed Table Information            
Database:               gamedw                  
Owner:                  root                    
CreateTime:             Fri Mar 23 14:55:22 CST 2018    
LastAccessTime:         UNKNOWN                 
Retention:              0                       
Location:               hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers   
Table Type:             MANAGED_TABLE           
Table Parameters:               
        transient_lastDdlTime   1521788122         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        field.delim                                
        line.delim                               
        serialization.format                       
Time taken: 0.026 seconds, Fetched: 34 row(s)

一个分区:

hive> alter table customers drop partition(city='shanghai');
Dropped the partition city=shanghai
OK
Time taken: 0.143 seconds
hive> show partitions customers;
OK
partition
city=beijing
city=shenzhen
Time taken: 0.042 seconds, Fetched: 2 row(s)

再次查看hdfs数据表目录:

drwx-wx-wx   - root supergroup          0 2018-03-23 15:17 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx   1 root supergroup         80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx   - root supergroup          0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

部分区表

部表同样可以分区,这是管理大型生产数据集最常见的情况。给用户提供了与其他工具共享数据的方式,也可以优化查询性能,逻辑数据管理等。

用户可以自己定义文件结构,因此用户对于目录的使用具有更多的灵活性。

本地文件

[root@host tmpdata]# cat log_message0.txt
1 server1 1 aggaggaggewag
2 server2 1 ageggeqagegge
3 server1 2 agegagag4degg
4 server1 5 agetetgdgagggeg
5 server3 7 gahewgjaglewgjew
6 server1 8 agegagegggfewg
7 server2 2 agegwagwggew
8 server1 3 agewgagggfew
9 server1 2 geawtgggggagg
[root@host tmpdata]# cat log_message1.txt
100 server1 2 dsaggegagg
101 server1 3 gaggg
102 server2 5 ageggge
103 server1 1 sagegeg
104 server2 3 agegawggge
105 server2 5 agfeggeg

在hdfs上创建文件夹

[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/
[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/1
[root@host ~]# hdfs dfs -ls -R /tmp/20180325                                     
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018/1
[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/1/1 /tmp/20180325/2018/1/2
[root@host ~]# hdfs dfs -ls -R /tmp/20180325                               
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x   - root supergroup          0 2018-03-26 11:56 /tmp/20180325/2018/1
drwxr-xr-x   - root supergroup          0 2018-03-26 11:56 /tmp/20180325/2018/1/1
drwxr-xr-x   - root supergroup          0 2018-03-26 11:56 /tmp/20180325/2018/1/2

上传本地文件到hdfs

[root@host ~]# hdfs dfs -put /root/tmpdata/log_message0.txt /tmp/20180325/2018/1/1
[root@host ~]# hdfs dfs -put /root/tmpdata/log_message1.txt /tmp/20180325/2018/1/2
[root@host ~]# hdfs dfs -ls -R /tmp/20180325                                     
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x   - root supergroup          0 2018-03-26 11:56 /tmp/20180325/2018/1
drwxr-xr-x   - root supergroup          0 2018-03-26 11:57 /tmp/20180325/2018/1/1
-rw-r--r--   1 root supergroup        238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x   - root supergroup          0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r--   1 root supergroup        137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt

创建分区部表,列分隔符为空格:

 hive>  create external table log_message
    >  (id int,server string,level int,message string)
    >  partitioned by(year int,month int,day int)
    >  row format delimited fields terminated by ' ';
OK
Time taken: 0.099 seconds

添加分区
hive> alter table log_message add  partition(year=2018,month=1,day=1)
    > location '/tmp/20180325/2018/1/1';
OK
Time taken: 0.066 seconds
hive> show partitions log_message;
OK
partition
year=2018/month=1/day=1
Time taken: 0.042 seconds, Fetched: 1 row(s)
hive> select * from log_message;
OK
log_message.id  log_message.server      log_message.level       log_message.message     log_message.year        log_message.month       log_message.day
1       server1 1       aggaggaggewag   2018    1       1
2       server2 1       ageggeqagegge   2018    1       1
3       server1 2       agegagag4degg   2018    1       1
4       server1 5       agetetgdgagggeg 2018    1       1
5       server3 7       gahewgjaglewgjew        2018    1       1
6       server1 8       agegagegggfewg  2018    1       1
7       server2 2       agegwagwggew    2018    1       1
8       server1 3       agewgagggfew    2018    1       1
9       server1 2       geawtgggggagg   2018    1       1
Time taken: 0.122 seconds, Fetched: 9 row(s)
hive> alter table log_message add  partition(year=2018,month=1,day=2)
    > location '/tmp/20180325/2018/1/2';
OK
Time taken: 0.116 seconds
hive> show partitions log_message;
OK
partition
year=2018/month=1/day=1
year=2018/month=1/day=2
Time taken: 0.046 seconds, Fetched: 2 row(s)
hive> select * from log_message;
OK
log_message.id  log_message.server      log_message.level       log_message.message     log_message.year        log_message.month       log_message.day
1       server1 1       aggaggaggewag   2018    1       1
2       server2 1       ageggeqagegge   2018    1       1
3       server1 2       agegagag4degg   2018    1       1
4       server1 5       agetetgdgagggeg 2018    1       1
5       server3 7       gahewgjaglewgjew        2018    1       1
6       server1 8       agegagegggfewg  2018    1       1
7       server2 2       agegwagwggew    2018    1       1
8       server1 3       agewgagggfew    2018    1       1
9       server1 2       geawtgggggagg   2018    1       1
100     server1 2       dsaggegagg      2018    1       2
101     server1 3       gaggg   2018    1       2
102     server2 5       ageggge 2018    1       2
103     server1 1       sagegeg 2018    1       2
104     server2 3       agegawggge      2018    1       2
105     server2 5       agfeggeg        2018    1       2
Time taken: 0.14 seconds, Fetched: 15 row(s)

hive不关心一个分区对应的分区目录是否存在或者分区目录下是否有文件,这个功能的好处,可以将新数据写入专有目录,并与其它目录的数据有明显的区别。

 不管用户将旧数据存档或者直接,新数据被篡改的风险都降低了,因为新数据的子集位于不同的目录下。

与非分区部表一样,hive并不控制这些数据,即使表被,数据也不会被

查看表的信息:

hive> desc formatted log_message;
OK
# col_name              data_type               comment            
                
id                      int                                        
server                  string                                     
level                   int                                        
message                 string                                     
                
# Partition Information         
# col_name              data_type               comment            
                
year                    int                                        
month                   int                                        
day                     int                                        
                
# Detailed Table Information            
Database:               gamedw                  
Owner:                  root                    
CreateTime:             Mon Mar 26 13:29:08 CST 2018    
LastAccessTime:         UNKNOWN                 
Retention:              0                       
Location:               hdfs://localhost:9000/user/hive/warehouse/gamedw.db/log_message 
Table Type:             EXTERNAL_TABLE          
Table Parameters:               
        EXTERNAL                TRUE               
        transient_lastDdlTime   1522042148         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        field.delim                                
        serialization.format                       
Time taken: 0.048 seconds, Fetched: 37 row(s)

查看表某个分区的信息:

hive> desc formatted log_message partition(year=2018,month=1,day=1);
OK
# col_name              data_type               comment            
                
id                      int                                        
server                  string                                     
level                   int                                        
message                 string                                     
                
# Partition Information         
# col_name              data_type               comment            
                
year                    int                                        
month                   int                                        
day                     int                                        
                
# Detailed Partition Information                
Partition Value:        [2018, 1, 1]            
Database:               gamedw                  
Table:                  log_message             
CreateTime:             Mon Mar 26 13:29:39 CST 2018    
LastAccessTime:         UNKNOWN                 
Location:               hdfs://localhost:9000/tmp/20180325/2018/1/1     
Partition Parameters:           
        numFiles                1                  
        totalSize               238                
        transient_lastDdlTime   1522042179         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        field.delim                                
        serialization.format                       
Time taken: 0.244 seconds, Fetched: 37 row(s)

同样可以使用load insert命令指定分区路径

[root@host ~]# hdfs dfs -ls -R /tmp/20180325                                     
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x   - root supergroup          0 2018-03-26 14:08 /tmp/20180325/2018/1
drwxr-xr-x   - root supergroup          0 2018-03-26 14:24 /tmp/20180325/2018/1/1
-rw-r--r--   1 root supergroup        238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x   - root supergroup          0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r--   1 root supergroup        137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt
drwxr-xr-x   - root supergroup          0 2018-03-26 14:25 /tmp/20180325/2018/1/3
-rw-r--r--   1 root supergroup        238 2018-03-26 14:25 /tmp/20180325/2018/1/3/log_message2.txt

hive> load data inpath '/tmp/20180325/2018/1/3' into table log_message partition (year=2018,month=1,day=3);
Loading data to table gamedw.log_message partition (year=2018, month=1, day=3)
OK
Time taken: 0.85 seconds

[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x   - root supergroup          0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x   - root supergroup          0 2018-03-26 14:25 /tmp/20180325/2018/1
drwxr-xr-x   - root supergroup          0 2018-03-26 14:24 /tmp/20180325/2018/1/1
-rw-r--r--   1 root supergroup        238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x   - root supergroup          0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r--   1 root supergroup        137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt

hive> desc formatted log_message partition(year=2018,month=1,day=3);
OK
# col_name              data_type               comment            
                
id                      int                                        
server                  string                                     
level                   int                                        
message                 string                                     
                
# Partition Information         
# col_name              data_type               comment            
                
year                    int                                        
month                   int                                        
day                     int                                        
                
# Detailed Partition Information                
Partition Value:        [2018, 1, 3]            
Database:               gamedw                  
Table:                  log_message             
CreateTime:             Mon Mar 26 14:25:48 CST 2018    
LastAccessTime:         UNKNOWN                 
Location:               hdfs://localhost:9000/user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3 
Partition Parameters:           
        numFiles                1                  
        numRows                 0                  
        rawDataSize             0                  
        totalSize               238                
        transient_lastDdlTime   1522045548         
                
# Storage Information           
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
InputFormat:            org.apache.hadoop.mapred.TextInputFormat        
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:            
        field.delim                                
        serialization.format                       
Time taken: 0.153 seconds, Fetched: 39 row(s)

在hdfs文件查看log_message2.txt被转移到了hive,如下:

drwx-wx-wx   - root supergroup          0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message
drwx-wx-wx   - root supergroup          0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message/year=2018
drwx-wx-wx   - root supergroup          0 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1
drwx-wx-wx   - root supergroup          0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=1
drwx-wx-wx   - root supergroup          0 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3
-rwx-wx-wx   1 root supergroup        238 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3/log_message2.txt

 hive> show partitions log_message;
OK
year=2018/month=1/day=1
year=2018/month=1/day=2
year=2018/month=1/day=3
Time taken: 0.063 seconds, Fetched: 3 row(s)

外部分区表,查看 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3/log_message2.txt 是否会被

hive> drop table log_message;
OK
Time taken: 1.005 seconds

查看hdfs发现加载的数据依然存在;

我们需要记住并非所有的表数据都放在hive(warehouse)目录下,管理表(外部表)时,这些数据不会被连带

 外部表的数据不由hive自身负责管理,虽然数据会被加载到/user/hive/warehouse/,但是不由hive管理,因此drop掉外部表后,加载到/user/hive/warehouse/的数据会依然存在.

内部表的数据由hive自身负责管理,无论是分区表还是非分区表,drop table后,会连同数据一块

自定义表的存储格式

hive存储格式是文本文件格式,也可以通过可选子句stored as textfile显示指定;同时用户在创建表时可以指定各种各样的分割

terminate 终止delimite定义,划定;terminated 终止,delimited分割

hive记录和字符分割符如下:

 :对于文本来说,每行是一条记录,换行符来分割记录

^A:用于分割字段,建表语句使用八进制 01表示

^B:用来分割ARRAY,STRUCT中的元素,或者用于MAP键-值对之间的分割,建表语句使用八进制 02表示

^C:用来MAP键和值之间的分割,建表语句使用八进制 03表示

hive 默认分隔符的查看,输入,修改:

查看:

[root@host tmpdata]# cat employees.txt
tianyongtao100wangZHANGLIUaaa10bb5CC8HENANLUOHELINYINGTAICHEN
[root@host tmpdata]# cat -A employees.txt
tianyongtao^A100^Awang^BZHANG^BLIU^Aaaa^C10^Bbb^C5^BCC^C8^AHENAN^BLUOHE^BLINYING^BTAICHEN$

输入,修改

[root@host tmpdata]# vim employees.txt set list

ctrl + V 可以输入 ^符号
ctrl + a 可以输入A---'01'
ctrl + b 可以输入A---'02'
ctrl + c 可以输入A---'03'

完整建表语句如下:

 CREATE TABLE `employees`(
  `name` string,
  `salary` float,
  `subdinates` array<string>,
  `deducation` map<string,float>,
  `address` struct<street:string,city:string,state:string,zip:int>)

row formatted delimited

fields delimited by '01'

collection items delimited by '02'

map keys delimited by '03'

lines terminated by ' '

stored as textfile;

textfile 意味着所有字段都使用数字,字母,字符编码,每一行被认为是一个单独记录。

用户可以将textfile替换成其他hive支持的内置文件格式,包括sequncefile和rcfile,这两种格式都是采用二进制编码和压缩(可选)来优化磁盘空间的使用以及I/O带宽性能。

对于记录是如何被编码成文件的,以及列是如何被编码成记录的,hive指出了他们的不同,用户可以分别自定义这些行为。

记录编码是通过一个inputformat对象来控制的。

记录的解析是通过序列化器和反序列化器(缩写成SerDe)来控制的.

Serializable  可序列化的
Deserialization 反序列化的

为了保持完整性,hive还使用了一个叫做outputformat的对象来将查询的输出写入到文件或者输出到控制台。

 hive使用一个inputformat对象将输入流分割成记录,然后使用一个outputformat对象将记录格式为输出流(如查询的结果),再使用一个SerDe在读数据时将记录解析成列,在写数据时将列编码成记录。

用户可以指定第三方的输入输出格式以及SerDe

hive> show create table  employees;
OK
CREATE TABLE `employees`(
  `name` string,
  `salary` float,
  `subdinates` array<string>,
  `deducation` map<string,float>,
  `address` struct<street:string,city:string,state:string,zip:int>)
ROW FORMAT SERDE  
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees'
TBLPROPERTIES (
  'creator'='tianyongtao',
  'last_modified_by'='root',
  'last_modified_time'='1521447397',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1521447397')
Time taken: 0.032 seconds, Fetched: 23 row(s)

以上建表语句ROW FORMAT SERDE   指定了使用的SerDe

STORED AS INPUTFORMAT ........OUTPUTFORMAT.............指定了用于输出格式和输入格式对应的java类

修改表名

hive> show tables;
OK
......
role
...............
Time taken: 0.013 seconds, Fetched: 5 row(s)
hive> alter table role rename to roleinfor;
OK
Time taken: 0.162 seconds
hive> show tables;
OK
.........
roleinfor
........................
Time taken: 0.013 seconds, Fetched: 5 row(s)

 修改分区路径

修改前:

drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx   1 root supergroup         80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
-rwx-wx-wx   1 root supergroup         80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_1.txt
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_2.txt
drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

修改:

hive> alter table customers partition(city='shanghai')set location 'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers/city=wudu'
    > ;
OK
Time taken: 0.174 seconds

查看修改结果:
hive> show partitions customers;
OK
partition
city=shanghai
city=shenzhen
Time taken: 0.053 seconds, Fetched: 2 row(s)

hive> select * from customers;
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      shenzhen
wangwu  1       85      shenzhen
zhangsan        1       20      shenzhen
liuqin  0       56      shenzhen
wangwu  0       47      shenzhen
liuyang 1       32      shenzhen
Time taken: 0.115 seconds, Fetched: 6 row(s)

由于调整了分区的位置,新的位置没数据,所有city=shanghai分区没数据。

查看hdfs:

drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx   1 root supergroup         80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
-rwx-wx-wx   1 root supergroup         80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_1.txt
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_2.txt
drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

由此可见,重新设置分区位置,旧的数据不会被转移的新的位置,旧的数据也不会被

还原分区位置:

hive> alter table customers partition(city='shanghai')set location 'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers/city=shanghai';
OK
Time taken: 0.151 seconds

查看表中的数据:发现上海的数据又回来了

hive> select * from customers;
OK
customers.custname      customers.sex   customers.age   customers.city
tianyongtao     1       50      shanghai
wangwu  1       85      shanghai
zhangsan        1       20      shanghai
liuqin  0       56      shanghai
wangwu  0       47      shanghai
liuyang 1       32      shanghai
tianyongtao     1       50      shanghai
wangwu  1       85      shanghai
zhangsan        1       20      shanghai
liuqin  0       56      shanghai
wangwu  0       47      shanghai
liuyang 1       32      shanghai
tianyongtao     1       50      shanghai
wangwu  1       85      shanghai
zhangsan        1       20      shanghai
liuqin  0       56      shanghai
wangwu  0       47      shanghai
liuyang 1       32      shanghai
tianyongtao     1       50      shenzhen
wangwu  1       85      shenzhen
zhangsan        1       20      shenzhen
liuqin  0       56      shenzhen
wangwu  0       47      shenzhen
liuyang 1       32      shenzhen
Time taken: 0.114 seconds, Fetched: 24 row(s)

修改列将字段名age修改为niangling:

hive> alter table tb_cust change column age nianling int COMMENT 'nianling alter at 20180329';
OK
Time taken: 0.093 seconds
hive> select * from tb_cust;
OK
tb_cust.custname        tb_cust.sex     tb_cust.nianling        tb_cust.city
tianyongtao     1       50      beijing
wangwu  1       85      beijing
zhangsan        1       20      beijing
liuqin  0       56      beijing
wangwu  0       47      beijing
liuyang 1       32      beijing
Time taken: 0.103 seconds, Fetched: 6 row(s)

hive> desc tb_cust;
OK
col_name        data_type       comment
custname                string                                     
sex                     int                                        
nianling                int                     nianling alter at 20180329
city                    string                                     
                
# Partition Information         
# col_name              data_type               comment            
                
city                    string                                     
Time taken: 0.015 seconds, Fetched: 9 row(s)

-----------------------------------------------------------------------

调整字段位置将字段sex调整到字段niangling后面:

hive> alter table tb_cust change column sex sex int after nianling;
OK
Time taken: 0.113 seconds

hive> desc tb_cust;
OK
col_name        data_type       comment
custname                string                                     
nianling                int                     nianling alter at 20180329
sex                     int                                        
city                    string                                     
                
# Partition Information         
# col_name              data_type               comment            
                
city                    string                                     
Time taken: 0.015 seconds, Fetched: 9 row(s)

与上次查询相比变换了位置。

修改字段,就算没有修改字段名,字段类型没有改变,用户也要指定旧的字段名。

重要的一点,字段位置做了变动,只是修改元数据信息,数据内容的位置没变;如果需要调整包括数据的表的字段位置,对应的数据的位置也应该做调整。

修改表属性

hive> alter table mess set tblproperties('note'='message table');
OK
Time taken: 0.173 seconds

CREATE TABLE `mess`(
  `aa` string,
  `cc` string,
  `bb` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/mess'
TBLPROPERTIES (
  'last_modified_by'='root',
  'last_modified_time'='1522393964',
  'note'='message table',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1522393964')
Time taken: 0.039 seconds, Fetched: 21 row(s)

修改存储属性

hive> alter table mess set fileformat sequencefile ;
OK
Time taken: 0.12 seconds

hive> show create table mess;
OK
createtab_stmt
CREATE TABLE `mess`(
  `aa` string,
  `cc` string,
  `bb` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/mess'
TBLPROPERTIES (
  'last_modified_by'='root',
  'last_modified_time'='1522394155',
  'note'='message table',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
  'transient_lastDdlTime'='1522394155')
Time taken: 0.045 seconds, Fetched: 21 row(s)

修改表的存储属性:

hive> alter table t_name2 set fileformat sequencefile;
OK
Time taken: 0.09 seconds
hive> select * from t_name2;
OK
t_name2.id      t_name2.name    t_name2.level
Failed with exception java.io.IOException:java.io.IOException: hdfs://localhost:9000/tmp/20180321/name.csv not a SequenceFile
Time taken: 0.104 seconds 

修改成功,外部表t_name2数据文件不是SequenceFile格式报错

hive> alter table tb_cust clustered by(city,custname) sorted by(nianling) into 10 buckets;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Invalid column reference //分区无效
hive> alter table tb_cust clustered by(custname) sorted by(nianling) into 10 buckets;
OK
Time taken: 0.126 seconds

上述中sorted by是可以选的 clustered by....into ....buckets是必选的。

 用户可以指定一个新的SerDe,并为其指定SerDe属性,或者修改已经存在SerDe属性;

hive "执行钩子"

alter table .... touch .....用于触发钩子

如 alter table t_cust touch partition(city='shenzhen') 其中partition子句用于分区表.这个语句的典型应用场景就是,当表的数据文件在hive以外做了修改,就会触发钩子的执行

如果某个分区写入了新的日志信息可以执行:

hive -e 'alter table tb_cust touch partition(city='shenzhen');'

暂时没有验证成功

如果表或者分区不存在,这个语句也不会建表或者创建分区.

archive  ['arkaiv] n  档案,档案馆  vt 存档

altere table .... archive partition会将这个分区的文件打成一个hadoop压缩包(har)文件,这样仅仅是降低文件数据,降低namenode的压力,而不会减少任何存储空间。

hive> set hive.archive.enabled=true;
hive> alter table tb_cust archive partition(city='shenzhen');
intermediate.archived is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen
in hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen/.hive-staging_hive_2018-04-02_15-54-07_256_1308807572832742090-1/-ext-10000/partlevel
Please wait... (this may take a while)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives

报错,待处理

使用unarchive可以反向操作

hive提供了各种保护,可以防止分区被查询或者被删除:

 alter table tb_cust  partition(city='shenzhen') ENABLE NO_DROP;

alter table tb_cust  partition(city='shenzhen') ENABLE OFFLINE;

使用disable进行反向操作,以上验证失败;这些操作只能用于分区表;

数据操作

向管理表装载数据

hive没有行级别的插入,更新,删除操作,往表中装载数据唯一途径使用一种大量的数据装载操作。或通过其他方式,将文件写入正确的目录下。

hive> show partitions tb_cust;
OK
city=beijing
city=shenzhen
Time taken: 0.063 seconds, Fetched: 2 row(s)
hive> load data local inpath '/root/tmpdata/customers.txt' into table tb_cust partition(city='guangzhou');
Loading data to table gamedw.tb_cust partition (city=guangzhou)
OK
Time taken: 0.781 seconds

hive> show partitions tb_cust;
OK
city=beijing
city=guangzhou
city=shenzhen
Time taken: 0.059 seconds, Fetched: 3 row(s)

如果分区不存在load语句会先创建分区,然后再将数据拷贝到这个目录下。

如果不是分区表,则去掉partition子句

 通常指定的路径是一个目录,而不是一个单独的文件,hive会将所有文件拷贝到这个目录下:

[root@host ~]# ls /root/tmpdata/suzhou
customers2.txt  customers.txt

hive> load data local inpath '/root/tmpdata/suzhou' into table tb_cust partition(city='suzhou');
Loading data to table gamedw.tb_cust partition (city=suzhou)
OK
Time taken: 0.596 seconds

hive> show partitions tb_cust;
OK
city=beijing
city=guangzhou
city=shenzhen
city=suzhou
Time taken: 0.055 seconds, Fetched: 4 row(s)

在hdfs下查看suzhou分区:

drwx-wx-wx   - root supergroup          0 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou
-rwx-wx-wx   1 root supergroup         96 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou/customers.txt
-rwx-wx-wx   1 root supergroup         96 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou/customers2.txt

 由于使用了local,使用的本地系统的路径,省略掉local,使用的就是分布式系统的路径。

hive要求源文件 目标文件以及目录应该在同一个文件系统中 ,例如,load data不可以将数据从一个集群的hdfs转移到另个一个集群的hdfs中.

 指定全路径,也可以指定相对路径

 如果用户指定了overwrite关键字,那么目标文件夹中的之前存在的数据将会被删除掉。如果没有这个关键字,仅仅会把新增的文件增加到目标文件夹而不删除之前的数据。然而如果目标文件夹中存在与装载的文件同名的文件,那么旧的文件就会覆盖重写。(事实上,没有overwrite关键字,同名的文件会被保留,新文件会被改名为 文件名_序列号)

 inpath路径中不能包含任何文件夹

hive不会验证装载的数据与表的模式是否匹配,然而hive会验证文件格式是否与表定义的一致。

通过查询语句向表中插入数据

hive> select * from cust;
OK
cust.custname   cust.sex        cust.nianling
tianyt_touch100 1       50
wangwu  1       85
zhangsan        1       20
liuqin  0       56
wangwu  0       47
liuyang 1       32
hello   0       100
Time taken: 0.151 seconds, Fetched: 7 row(s)

hive> create external table tb_cust_externl
    > (custname string,sex int,nianling int) row format delimited fields terminated by ' '
    > location '/root/tmpdata/customers.txt';
OK
Time taken: 0.288 seconds

[root@host ~]# hdfs dfs -ls -R /tmp/201804*
-rw-r--r--   1 root supergroup         14 2018-04-12 16:50 /tmp/20180410/customer1.txt

 hive> alter  table tb_cust_externl set location '/tmp/20171024/customers1.txt';
OK
Time taken: 0.146 seconds

hive> select * from tb_cust_externl;
OK
tb_cust_externl.custname        tb_cust_externl.sex     tb_cust_externl.nianling
nihao   1       5
Time taken: 0.176 seconds, Fetched: 1 row(s)

查看HDFS:

 drwxr-xr-x   - root supergroup          0 2018-04-13 11:41 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt

通过spark插入数据:

scala> hivecon.sql("insert into table gamedw.cust select * from gamedw.tb_cust_externl").show
18/04/13 11:50:56 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
++
||
++
++

hive> select * from cust;
OK
cust.custname   cust.sex        cust.nianling
tianyt_touch100 1       50
wangwu  1       85
zhangsan        1       20
liuqin  0       56
wangwu  0       47
liuyang 1       32
hello   0       100
nihao   1       5
Time taken: 0.137 seconds, Fetched: 8 row(s)

发现tb_cust_externl外部表的nihao成功插入

然后查看HDFS:

drwxr-xr-x   - root supergroup          0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
-rwxr-xr-x   1 root supergroup         10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000

发现cust多了一个文件,咱们来查看一下文件内容:

[root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000
nihao 1 5

也就是说插入的数据会生成一个新的文件

如果插入使用overwrite而不是into语句,管理表的数据将会被覆盖。

原文地址:https://www.cnblogs.com/playforever/p/8568640.html