Hive-02 DDL| DML

Type conversion

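You can convert between data types explicitly with the CAST operation.
For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the conversion fails, as in CAST('X' AS INT), the expression returns NULL. In the query below, '1'+2 returns 3.0 because Hive implicitly converts the string to DOUBLE, while the explicit cast yields the integer 3.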

0: jdbc:hive2://hadoop101:10000> select '1'+2, cast('1'as int) + 2;
+------+------+--+
| _c0  | _c1  |
+------+------+--+
| 3.0  | 3    |
+------+------+--+
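To make the two behaviours above concrete, here is a small Python sketch (not Hive code). The function names cast_as_int and string_plus_number are made up for illustration, and the sketch only models the cases shown in the transcript; Hive's real parsing rules differ in other cases (for example, for strings such as '1.5').

```python
from typing import Optional


def cast_as_int(text: str) -> Optional[int]:
    """Rough stand-in for CAST(text AS INT): None plays the role of NULL."""
    try:
        return int(text)
    except ValueError:
        return None


def string_plus_number(text: str, number: float) -> Optional[float]:
    """Rough stand-in for '1' + 2: the string operand is widened to DOUBLE first."""
    try:
        return float(text) + number
    except ValueError:
        return None


print(cast_as_int("1"))            # 1
print(cast_as_int("X"))            # None
print(string_plus_number("1", 2))  # 3.0
print(cast_as_int("1") + 2)        # 3
```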

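Hive's STRING type is comparable to VARCHAR in a relational database: it is a variable-length string, but you cannot declare a maximum number of characters for it. In theory it can hold up to 2 GB of characters.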

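Collection data types

Hive has three complex data types: ARRAY, MAP and STRUCT. ARRAY and MAP are similar to Array and Map in Java, and STRUCT is similar to a struct in C: it wraps a set of named fields. Complex data types can be nested to any depth.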

[kris@hadoop101 datas]$ vim test.txt
songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:16, chao yang_beijing

hive (default)> create table test(
              > name string,
              > friends array<string>,
              > children map<string, int>,
              > address struct<street:string, city:string>)
              > row format delimited fields terminated by ','
              > collection items terminated by '_'
              > map keys terminated by ':'
              > lines terminated by '\n';
OK
Time taken: 0.249 seconds
hive (default)> load data local inpath '/opt/module/datas/test.txt/' into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, totalSize=145]
OK
Time taken: 1.365 seconds
0: jdbc:hive2://hadoop101:10000> select * from test;
0: jdbc:hive2://hadoop101:10000> select friends[1], children['xiao song'], address.city from test where name="songsong";
+-------+------+----------+--+
|  _c0  | _c1  |   city   |
+-------+------+----------+--+
| lili  | 18   | beijing  |
+-------+------+----------+--+
1 row selected (0.321 seconds)
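As a rough illustration of how the three delimiters in the table definition carve up one line of test.txt, here is a Python sketch (not Hive code). The Row and Address classes and parse_line are hypothetical names used only for this example; the sketch assumes well-formed input and does no error handling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Row:
    name: str
    friends: List[str]
    children: Dict[str, int]
    address: Address


def parse_line(line: str) -> Row:
    # fields terminated by ','
    name, friends, children, address = line.split(",")
    # collection items terminated by '_'
    friend_list = friends.split("_")
    child_map = {}
    for item in children.split("_"):
        # map keys terminated by ':'
        key, value = item.split(":")
        child_map[key] = int(value)
    street, city = address.split("_")
    return Row(name, friend_list, child_map, Address(street, city))


row = parse_line("songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing")
# Same lookups as friends[1], children['xiao song'], address.city in the query above
print(row.friends[1], row.children["xiao song"], row.address.city)  # lili 18 beijing
```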

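DDL: data definition

Creating a database

When you create a database, its default storage path on HDFS is /user/hive/warehouse/*.db.

Modifying a database

Use the ALTER DATABASE command to set key-value pairs in a database's DBPROPERTIES that describe the database. The rest of the database's metadata cannot be changed, including the database name and the directory where the database is located.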

① Create a database
0: jdbc:hive2://hadoop101:10000> create database if not exists db_hive;  -- IF NOT EXISTS avoids an error when the database already exists (the recommended form)
No rows affected (0.032 seconds)
0: jdbc:hive2://hadoop101:10000> create database if not exists db_hive2 location '/db_hive2.db';  -- specify where the database is stored on HDFS

② Modify a database

hive (db_hive)> alter database db_hive set dbproperties('createtime'='20190215');
OK
Time taken: 0.031 seconds
③ View a database | switch databases with use xx;
hive (db_hive)> desc database extended db_hive;  -- show detailed database information; omit EXTENDED for the basic information
OK
db_name comment location        owner_name      owner_type      parameters
db_hive         hdfs://hadoop101:9000/user/hive/warehouse/db_hive.db    kris    USER    {createtime=20190215}
Time taken: 0.016 seconds, Fetched: 1 row(s)
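④ Drop a database
hive (db_hive)> drop database db_hive2;
hive (db_hive)> drop database if exists db_hive2;  -- IF EXISTS avoids an error when the database does not exist
hive (db_hive)> drop database db_hive cascade;  -- if the database is not empty, CASCADE forces the drop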

Creating a table

hive (default)> create table if not exists student2(
              > id int, name string)
              > row format delimited fields terminated by '\t'
              > stored as textfile
              > location '/user/hive/warehouse/student2';
OK

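Managed tables | internal tables

A managed table is sometimes called an internal table, because Hive controls (more or less) the lifecycle of its data. By default, Hive stores the data of these tables in a subdirectory of the directory defined by the configuration property hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When a managed table is dropped, Hive also deletes the table's data. Managed tables are not well suited to sharing data with other tools.

For an external table, Hive does not consider itself the sole owner of the data. Dropping the table does not delete the data, although the metadata describing the table is removed.

Typical use: website logs are collected every day and flow into text files on HDFS at regular intervals. Most of the statistical analysis is built on an external table (the raw log table); the intermediate and result tables are stored as managed tables, and data enters them through SELECT + INSERT.

Managed table: Hive owns both the metadata and the data on HDFS; dropping the table removes both.
External table: Hive owns only the metadata; dropping the table leaves the data on HDFS untouched.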
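The difference in drop behaviour can be pictured with a toy Python model (not Hive code). ToyWarehouse is a hypothetical class invented for this sketch: a dict stands in for the metastore and a set stands in for the directories on HDFS. The table names and paths are the ones used in these notes.

```python
class ToyWarehouse:
    """Toy model: `metastore` holds table metadata, `hdfs` holds data directories."""

    def __init__(self):
        self.metastore = {}  # table name -> (location, is_external)
        self.hdfs = set()    # directories that currently exist

    def create_table(self, name, location, external=False):
        self.metastore[name] = (location, external)
        self.hdfs.add(location)

    def drop_table(self, name):
        # The metadata is always removed.
        location, external = self.metastore.pop(name)
        if not external:
            # Managed table: the data directory goes away with the metadata.
            self.hdfs.discard(location)


wh = ToyWarehouse()
wh.create_table("student2", "/user/hive/warehouse/student2")
wh.create_table("stu_external", "/student", external=True)
wh.drop_table("student2")
wh.drop_table("stu_external")
print(sorted(wh.hdfs))  # ['/student']
```

Only the external table's directory survives; both tables are gone from the metastore.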

① Create tables the ordinary way
hive (default)> create table if not exists student3 as select id, name from student;
hive (default)> create table if not exists student4 like student;  -- create a table from the structure of an existing table
hive (default)> desc formatted student2;  -- show the table type and other formatted details
OK
col_name        data_type       comment
② External table
hive (default)> dfs -mkdir /student;
hive (default)> dfs -put /opt/module/datas/student.txt /student;
hive (default)> create external table stu_external(  -- create an external table
id int, 
name string) 
row format delimited fields terminated by '\t' 
location '/student';

0: jdbc:hive2://hadoop101:10000> select * from stu_external;
0: jdbc:hive2://hadoop101:10000> desc formatted stu_external; 
 Table Type:                   | EXTERNAL_TABLE    
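0: jdbc:hive2://hadoop101:10000> drop table stu_external;
After the external table is dropped, the data is still on HDFS, but the metadata for stu_external has been removed from the metastore.

③ Converting between managed and external tables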
  desc formatted student2;
  Table Type:                   | MANAGED_TABLE        
0: jdbc:hive2://hadoop101:10000> alter table student2 set tblproperties('EXTERNAL'='TRUE');
   Table Type:                   | EXTERNAL_TABLE      
0: jdbc:hive2://hadoop101:10000> alter table student2 set tblproperties('EXTERNAL'='FALSE');

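Partitioned tables

Each partition of a partitioned table corresponds to a separate directory on HDFS, and that directory holds all of the partition's data files. In Hive, partitioning simply means splitting into directories: a large data set is divided into smaller ones according to business needs. A query can then select only the partitions it needs through expressions in the WHERE clause, which makes it much more efficient.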
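The mapping from partition values to directories, and the way a WHERE filter narrows the directories to read, can be sketched in Python (not Hive code). partition_dir and prune are hypothetical helper names, and the table location assumes the default warehouse path mentioned earlier.

```python
from typing import Dict, List


def partition_dir(table_location: str, spec: Dict[str, str]) -> str:
    """Build the directory for one partition: <table>/<key>=<value>/..."""
    segments = [f"{key}={value}" for key, value in spec.items()]
    return "/".join([table_location.rstrip("/")] + segments)


def prune(partitions: List[Dict[str, str]], **wanted: str) -> List[Dict[str, str]]:
    """Keep only the partitions whose keys match the WHERE-style filter."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]


location = "/user/hive/warehouse/dept_partition"  # assumed default warehouse path
partitions = [{"month": m} for m in ("201707", "201708", "201709")]

# Only the matching directory would need to be scanned for month='201708'
for spec in prune(partitions, month="201708"):
    print(partition_dir(location, spec))
```

With three monthly partitions, a filter on one month leaves a single directory to read instead of the whole table.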

① Create a partitioned table
hive (default)> create table dept_partition(
              > deptno int, dname string, loc string)
              > partitioned by (month string)
              > row format delimited fields terminated by '\t';
OK
Load data
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
Loading data to table default.dept_partition partition (month=201709)
Partition default.dept_partition{month=201709} stats: [numFiles=1, numRows=0, totalSize=71, rawDataSize=0]
OK
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');
② Query a single partition
0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201708';
+------------------------+-----------------------+---------------------+-----------------------+--+
| dept_partition.deptno  | dept_partition.dname  | dept_partition.loc  | dept_partition.month  |
+------------------------+-----------------------+---------------------+-----------------------+--+
| 10                     | ACCOUNTING            | 1700                | 201708                |
| 20                     | RESEARCH              | 1800                | 201708                |
| 30                     | SALES                 | 1900                | 201708                |
| 40                     | OPERATIONS            | 1700                | 201708                |
+------------------------+-----------------------+---------------------+-----------------------+--+
Query several partitions with UNION
0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201707'
0: jdbc:hive2://hadoop101:10000> union
0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201708'
0: jdbc:hive2://hadoop101:10000> union
0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201709';

③ Add partitions | one or several; separate multiple partitions with spaces
0: jdbc:hive2://hadoop101:10000> alter table dept_partition add partition(month='201705') partition(month='201704');

④ Drop partitions | one or several; separate multiple partitions with commas
0: jdbc:hive2://hadoop101:10000> alter table dept_partition drop partition(month='201705'), partition(month='201704');

⑤ List the table's partitions
0: jdbc:hive2://hadoop101:10000> show partitions dept_partition;
+---------------+--+
|   partition   |
+---------------+--+
| month=201707  |
| month=201708  |
| month=201709  |
+---------------+--+

⑥ Show the structure of the partitioned table
0: jdbc:hive2://hadoop101:10000> desc formatted dept_partition;

⑦ Create a table with two-level partitions

hive (default)> create table dept_partition2(
              > deptno int, dname string, loc string)
              > partitioned by (month string, day string)
              > row format delimited fields terminated by '\t';

Load data into a two-level partition
0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');
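0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='13';  -- view the partition's data

Three ways to associate data uploaded directly into a partition directory with the partitioned table

Method 1: upload the data, then repair the table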
0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
0: jdbc:hive2://hadoop101:10000> msck repair table dept_partition2;  -- the data can only be queried after the repair
No rows affected (0.15 seconds)
0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='12';
alter table dept_partition2 drop partition(month='201709', day='11');  -- drop a partition
Method 2: upload the data, then add the partition
0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;  -- do not put quotes around the values in the path
0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
0: jdbc:hive2://hadoop101:10000> alter table dept_partition2 add partition(month='201709', day='11');
0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='11';
Method 3: create the directory, then load data into the partition
0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;
0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709',day='10');
0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='10';
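Why a directory uploaded with dfs -put is invisible until it is repaired or added: the metastore only knows the partitions that were registered with it. The Python sketch below (not Hive code, and only a conceptual model of what a repair has to discover) compares the directories on HDFS with the registered partitions. parse_partition_path and unregistered are hypothetical names.

```python
from typing import Iterable, List, Set, Tuple

PartitionSpec = Tuple[Tuple[str, str], ...]


def parse_partition_path(table_location: str, path: str) -> PartitionSpec:
    """Turn <table>/month=201709/day=12 into (("month", "201709"), ("day", "12"))."""
    prefix = table_location.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path} is not under {table_location}")
    segments = path[len(prefix):].strip("/").split("/")
    return tuple(tuple(segment.split("=", 1)) for segment in segments)


def unregistered(table_location: str, hdfs_dirs: Iterable[str],
                 registered: Set[PartitionSpec]) -> List[PartitionSpec]:
    """List partition directories that exist on HDFS but are unknown to the metastore."""
    found = {parse_partition_path(table_location, d) for d in hdfs_dirs}
    return sorted(found - registered)


location = "/user/hive/warehouse/dept_partition2"
hdfs_dirs = [location + "/month=201709/day=13",   # created by LOAD DATA
             location + "/month=201709/day=12"]   # created by dfs -put
registered = {(("month", "201709"), ("day", "13"))}

print(unregistered(location, hdfs_dirs, registered))
# [(('month', '201709'), ('day', '12'))]
```

Methods 1 and 2 both close this gap after the upload; in method 3 the load statement registers the partition itself.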

Altering tables

Rename a table
0: jdbc:hive2://hadoop101:10000> alter table teacher rename to new_teacher;
Add columns
0: jdbc:hive2://hadoop101:10000> alter table dept_partition add columns(deptdesc string);
Change a column
0: jdbc:hive2://hadoop101:10000> alter table dept_partition change column deptdesc desc int;
No rows affected (0.112 seconds)
0: jdbc:hive2://hadoop101:10000> desc dept_partition;
Replace columns
0: jdbc:hive2://hadoop101:10000> alter table dept_partition replace columns(deptid int, name string, loc string);
Drop a table
0: jdbc:hive2://hadoop101:10000> drop table new_teacher;

DML: data manipulation

Importing data

Loading data into a table (LOAD)

① Load data into a table
From the local file system into Hive
0: jdbc:hive2://hadoop101:10000> create table student(id int, name string) row format delimited fields terminated by '\t';
0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/student.txt' into table default.student;  -- load a local file into Hive
From HDFS into Hive
0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/kris/hive;
0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/student.txt /user/kris/hive;
0: jdbc:hive2://hadoop101:10000> load data inpath '/user/kris/hive/student.txt' into table default.student;  -- load data from HDFS; this moves the file on HDFS

0: jdbc:hive2://hadoop101:10000> load data inpath '/user/kris/hive/student.txt' overwrite into table default.student;  -- overwrite the data already in the table (the previous load moved the file, so upload it again first)


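② Insert data into a table with a query (INSERT)
create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';  -- create a partitioned table (drop the earlier student table first if it still exists)
0: jdbc:hive2://hadoop101:10000> insert into table student partition(month='201902') values (1, "kris"), (2, "egon");  -- insert rows
-rwxrwxr-x    kris    supergroup    14 B    2019/2/15 7:16:26 PM    3    128 MB    000000_0
Insert from a query on a single table: INSERT INTO appends to the table or partition, and the existing data is not deleted;
INSERT OVERWRITE replaces the data already in the table or partition.

0: jdbc:hive2://hadoop101:10000> insert overwrite table student partition(month="201905") select id,name from student where month='201902';  -- overwrite partition 201905 with the rows from partition 201902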
0: jdbc:hive2://hadoop101:10000> select * from student;
+-------------+---------------+----------------+--+
| student.id  | student.name  | student.month  |
+-------------+---------------+----------------+--+
| 1           | kris          | 201902         |
| 2           | egon          | 201902         |
| 1           | kris          | 201905         |
| 2           | egon          | 201905         |
+-------------+---------------+----------------+--+
Multi-insert: several INSERT clauses sharing one FROM
hive (default)> from student insert overwrite table student partition(month="201904")
              > select id, name where month="201905"
              > insert overwrite table student partition(month="201903")
              > select id, name where month="201905";
0: jdbc:hive2://hadoop101:10000> select * from student;
+-------------+---------------+----------------+--+
| student.id  | student.name  | student.month  |
+-------------+---------------+----------------+--+
| 1           | kris          | 201902         |
| 2           | egon          | 201902         |
| 1           | kris          | 201903         |
| 2           | egon          | 201903         |
| 1           | kris          | 201904         |
| 2           | egon          | 201904         |
| 1           | kris          | 201905         |
| 2           | egon          | 201905         |
+-------------+---------------+----------------+--+
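The append versus replace semantics, and the sequence of inserts that leads to the eight rows above, can be replayed with a toy Python model (not Hive code). The dict stands in for the partitioned student table; insert_into and insert_overwrite are hypothetical helper names.

```python
from collections import defaultdict

# partition value -> list of (id, name) rows
table = defaultdict(list)


def insert_into(partition, rows):
    """INSERT INTO: append to whatever the partition already holds."""
    table[partition].extend(rows)


def insert_overwrite(partition, rows):
    """INSERT OVERWRITE: replace the partition's contents."""
    table[partition] = list(rows)


insert_into("201902", [(1, "kris"), (2, "egon")])
insert_overwrite("201905", table["201902"])

# Multi-insert: one read of partition 201905 feeds two target partitions
source_rows = list(table["201905"])
insert_overwrite("201904", source_rows)
insert_overwrite("201903", source_rows)

for month in sorted(table):
    for row_id, name in table[month]:
        print(row_id, name, month)
```

Running an insert_overwrite a second time leaves a partition unchanged, whereas repeating insert_into would duplicate its rows.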
③ Create a table and load data from a query (AS SELECT)

create table if not exists student3 as select id, name from student;
create table if not exists student4 like student;

④ Specify the data location with LOCATION when creating the table
0: jdbc:hive2://hadoop101:10000> create external table if not exists stu(id int, name string) row format delimited fields terminated by '\t' location '/student';

⑤ Import data into a Hive table with IMPORT; the data must first be exported with EXPORT before it can be imported

export table student to '/hive_data/student';
import table student from '/hive_data/student';

create table student22(
id int, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- student22 must be partitioned for this import to succeed
import table student22 partition(month='201904') from
 '/user/hive/warehouse/export/student';

Exporting data (none of these methods are supported by Impala)

① Export with INSERT
Export the query result to the local directory /opt/module/datas/export/student
0: jdbc:hive2://hadoop101:10000> insert overwrite local directory '/opt/module/datas/export/student' select * from student;
Export the result to the local file system with formatting
hive (default)> insert overwrite local directory '/opt/module/datas/export/student1'
              > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;
Export the result to HDFS; only OVERWRITE can be used here, not INTO
hive (default)> insert overwrite directory '/user/kris/student2'
              > row format delimited fields terminated by '\t'
              > select * from student;
② Export to the local file system with a Hadoop command
hive (default)> dfs -get /user/hive/warehouse/student/month=201902/000000_0 /opt/module/datas/export/student3.txt;
[kris@hadoop101 export]$ cat student3.txt 
1       kris
2       egon
[kris@hadoop101 export]$ pwd
/opt/module/datas/export
③ Export to the local file system with a shell command
[kris@hadoop101 hive]$ bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt
④ Export to HDFS with EXPORT
hive (default)> export table default.student to '/user/hive/warehouse/export/student';
⑤ Export (or import) with Sqoop
　　https://www.cnblogs.com/shengyang17/p/10512510.html

Exporting a Hive table to a CSV file

hive -e "
set hive.cli.print.header=true; 
select * from student where sex = 'male';
" | sed 's/[	]/,/g'  > /opt/module/student.csv

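The sed step above replaces every tab with a comma, which is enough as long as no field contains a comma itself. As a hedged alternative, the Python sketch below does the same conversion with the standard csv module, which quotes such fields. tsv_to_csv is a hypothetical helper and the sample rows (including the name "Smith, John") are invented for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tsv_text.splitlines():
        writer.writerow(line.split("\t"))
    return out.getvalue()


# Invented sample standing in for the output of hive -e
sample = "student.id\tstudent.name\n1\tkris\n2\tSmith, John\n"
print(tsv_to_csv(sample), end="")
# student.id,student.name
# 1,kris
# 2,"Smith, John"
```
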
Clearing the data in a table (TRUNCATE)

Note: TRUNCATE can only clear managed tables; it cannot delete the data of an external table.

　　hive (default)> truncate table student;