Basic Hive operations

Basic clauses of a Hive CREATE TABLE statement

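Clause	Meaning
external	Keyword for an external table; if omitted, the table is an internal table, also called a managed table
if not exists	Checks whether the table already exists; if it does, it is not created again
comment	Table comment: a description of the table
row format	Specifies how each row of the data files is parsed
delimited fields terminated by ' '	Sets the delimiter between columns within a row
lines terminated by ''	Sets the delimiter between rows
collection items terminated by ''	Sets the delimiter between elements inside collection types (array, map, struct)
map keys terminated by ''	Sets the delimiter between a key and its value inside a map
partitioned by()	Partitioned table: splits the table's data into separate directories; a partition column must not be one of the columns already defined in the table
into num_buckets buckets	Bucketed table (used together with clustered by): splits the table's data into a fixed number of buckets, based on a column that already exists in the table
stored as textfile	Storage file format; the default is text
location ''	Sets the location of the table's data files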
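
To see how these clauses fit together, here is a minimal sketch of one statement that uses most of them. The table name t_user_profile, its columns, the chosen delimiters, and the path /demo/user_profile are all hypothetical and only serve as illustration.

```sql
-- Hypothetical table combining the clauses listed above.
CREATE EXTERNAL TABLE IF NOT EXISTS t_user_profile (
  id      INT,
  name    STRING,
  hobbies ARRAY<STRING>,      -- elements separated by '-'
  scores  MAP<STRING, INT>    -- entries separated by '-', key and value by ':'
)
COMMENT 'hypothetical demo table'
PARTITIONED BY (dt STRING)            -- dt is not one of the columns above
CLUSTERED BY (id) INTO 4 BUCKETS      -- id is one of the columns above
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '-'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/demo/user_profile';
```

With these delimiters, a data line such as 1,alice,reading-chess,math:90-art:85 would be parsed as id 1, name alice, an array of two hobbies, and a map with two entries.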

Creating a database

　　　　Hive has a default database:

　　　　Database name: default

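　　　　Database directory: hdfs://192.168.14.101:50070/user/hive/warehouse (note: 50070 is normally the NameNode web UI port in Hadoop 2.x; an hdfs:// URI uses the NameNode RPC port configured in fs.defaultFS)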

　　Creating a new database:

　　　　　　create database hive_test;

　　　　　　Once the database has been created, a database directory is generated in HDFS:

　　　　　　hdfs://192.168.14.101:50070/user/hive/warehouse/hive_test.db
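
The following sketch shows a few statements that are commonly used to inspect and select a database, assuming the database is named hive_test as in the directory path above.

```sql
-- List all databases; the built-in `default` database is always present.
show databases;

-- Create the database only if it does not exist yet (avoids an error on re-run).
create database if not exists hive_test;

-- Show the database's HDFS location and other metadata.
describe database hive_test;

-- Switch to it so that later unqualified table names resolve to this database.
use hive_test;
```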

Creating tables

Basic CREATE TABLE statement

　　　　create table t_order(id string,create_time string,amount float,uid string);

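　　　　Once the table has been created, a table directory is generated under its database directory: /user/hive/warehouse/hive_test.db/t_order

　　　　However, with a table created this way, Hive assumes that the field delimiter in the table's data files is ^A (the control character \001, Hive's default)

　　　　For comma-separated data files, the delimiter has to be specified explicitly: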

　　　　create table t_order(id string,create_time string,amount float,uid string)

　　　　row format delimited

　　　　fields terminated by ',';

　　　　This declares that the field delimiter in the table's data files is ","
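
A short sketch of how the delimiter matters in practice. The local file /root/order.dat and its sample contents are hypothetical.

```sql
-- Hypothetical contents of /root/order.dat (one order per line, comma-separated):
--   o001,2017-08-04 10:00:00,35.5,u01
--   o002,2017-08-04 11:30:00,12.0,u02

-- Copy the local file into the table directory of t_order.
load data local inpath '/root/order.dat' into table t_order;

-- With fields terminated by ',', each line is split into the four columns.
select id, create_time, amount, uid from t_order;

-- Total order amount per user.
select uid, sum(amount) as total_amount
from t_order
group by uid;
```

If the table had been created without the row format clause, these lines would contain no ^A character, so the whole line would end up in the id column and the remaining columns would be NULL.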

Dropping a table

　　　　drop table t_order;

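　　　　Dropping a table has the following effects:

　　　　Hive removes the information about this table from the metastore database;

　　　　Hive also deletes the table's directory from HDFS (for an internal table);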

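　　Internal and external tables

　　　　Internal table (MANAGED_TABLE): the table directory follows Hive's conventions and lives in Hive's warehouse directory /user/hive/warehouse

　　　　External table (EXTERNAL_TABLE): the table directory is specified by the user who creates the table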

　　　　create external table t_access(ip string,url string,access_time string)

　　　　row format delimited

　　　　fields terminated by ','

　　　　location '/access/log';

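　　　　Differences between external and internal tables:

　　　　1. The directory of an internal table is inside Hive's warehouse directory, whereas the directory of an external table is specified by the user

　　　　2. When an internal table is dropped, Hive removes the related metadata and deletes the table's data directory

　　　　3. When an external table is dropped, Hive removes only the related metadata; the data directory is left in place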
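
The difference in drop behaviour can be checked with a sketch like the one below. It assumes the external table t_access defined above and a session in the Hive CLI, where the dfs command is available.

```sql
-- Check whether a table is managed or external:
-- the "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted t_access;

-- Dropping the external table removes only its metadata.
drop table t_access;

-- The data directory given in `location` should still be listed afterwards.
dfs -ls /access/log;
```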

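　　　　In a Hive data warehouse, the lowest-level tables always originate from external systems. To avoid interfering with how those systems work, external tables can be created in Hive to map onto the data directories that those systems produce;

　　　　the tables produced by the subsequent ETL steps are then best created as managed tables (MANAGED_TABLE)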
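
As a rough illustration of this layering, the sketch below derives a managed table from the external t_access table. The target table name t_access_daily_pv is hypothetical, and the query assumes that access_time starts with a date in the form yyyy-MM-dd.

```sql
-- A table created without the `external` keyword is a managed table,
-- so its data is stored under the warehouse directory.
-- Assumption: access_time looks like '2017-08-04 10:00:00'.
create table t_access_daily_pv
as
select substr(access_time, 1, 10) as access_day,
       count(*)                   as pv
from t_access
group by substr(access_time, 1, 10);
```

Dropping t_access_daily_pv later removes both its metadata and its data, while the raw files under /access/log stay untouched.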

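Partitioned tables

　　　　A partitioned table essentially creates partition subdirectories for the data files inside the table directory, so that at query time the MapReduce job can process only the data in the relevant partition subdirectories, which reduces the amount of data read.

　　　　For example, a website produces browsing records every day, and these records should be stored in one table. Sometimes, however, only the records of one particular day need to be analyzed

　　　　In that case the table can be created as a partitioned table, with each day's data loaded into its own partition;

　　　　Naturally, each day's partition directory needs a directory name (derived from the partition column)

　　An example: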

　　　　create table t_access(ip string,url string,access_time string)

　　　　partitioned by(dt string)

　　　　row format delimited

　　　　fields terminated by ',';

　　　　Note: a partition column must not be a column that already exists in the table definition
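
The rule can be illustrated with the following sketch. The table names t_access_bad and t_access_ok are hypothetical.

```sql
-- Rejected by Hive: dt appears both as a regular column and as a partition column.
-- create table t_access_bad(ip string, url string, dt string)
-- partitioned by (dt string);

-- Accepted: dt exists only as a partition column.
create table t_access_ok(ip string, url string, access_time string)
partitioned by (dt string)
row format delimited
fields terminated by ',';

-- `describe` still lists dt after the regular columns,
-- and repeats it in a separate partition information section.
describe t_access_ok;
```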

　　Loading data into partitions

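　　　　load data local inpath '<local file path on the VM>' into table t_access partition(dt='20170804');

　　　　load data local inpath '<local file path on the VM>' into table t_access partition(dt='20170805');

　　　　Note: local means the file is read from the local filesystem and copied into the table; without local, the path refers to HDFS and the file is moved from its HDFS location into the table directory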
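
After loading, the partitions and their directories can be inspected with a sketch like the following. It assumes that the partitioned t_access table lives in the hive_test database, that the session runs in the Hive CLI, and that /tmp/access.log.0806 is a hypothetical file already stored in HDFS.

```sql
-- List the partitions that now exist for the table.
show partitions t_access;

-- Each partition is a subdirectory named <partition column>=<value>
-- inside the table directory, for example .../t_access/dt=20170804
dfs -ls /user/hive/warehouse/hive_test.db/t_access;

-- Without `local`, the path refers to HDFS and the file is moved, not copied.
load data inpath '/tmp/access.log.0806' into table t_access partition(dt='20170806');
```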

　　Querying partitioned data

　　　　a. Total PV (page views) for August 4:

　　　　select count(*) from t_access where dt='20170804';

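　　　　In essence, the partition column is used like a regular table column, so a where clause can select the partition

　　　　b. Total PV over all the data in the table: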

　　　　select count(*) from t_access;

　　　　In essence: simply omit the partition condition
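
Because the partition column behaves like a regular column, it can also be used for grouping and range filters. A minimal sketch against the t_access table above:

```sql
-- PV per day: the partition column can be used in group by like any other column.
select dt, count(*) as pv
from t_access
group by dt;

-- PV for a range of days: only the matching partition directories need to be read.
-- String comparison works here because dt uses the fixed-width yyyyMMdd form.
select count(*) as pv
from t_access
where dt >= '20170804' and dt <= '20170805';
```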

　　Example with multiple partition columns

　　Create the table:

　　　　create table t_partition(id int,name string,age int)

　　　　partitioned by(department string,sex string,howold int)

　　　　row format delimited fields terminated by ',';

　　Load the data:

　　　　load data local inpath '/root/p1.dat' into table t_partition partition(department='xiangsheng',sex='male',howold=20);
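
With several partition columns, each column adds one directory level. The sketch below shows how the loaded partition could be inspected and queried; the extra partition added at the end uses hypothetical values.

```sql
-- Partition directories are nested in declaration order:
--   .../t_partition/department=xiangsheng/sex=male/howold=20/p1.dat
show partitions t_partition;

-- Filter on any subset of the partition columns.
select id, name, age
from t_partition
where department = 'xiangsheng' and sex = 'male';

-- Add another, still empty, partition explicitly.
alter table t_partition
add if not exists partition (department='xiangsheng', sex='female', howold=20);
```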