2. Getting Started with Hive

Hive: The Basics

Note: I am still an intern, so please point out anything that is incomplete or wrong.

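Chap1: Basic Concepts

1. What is Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query facility, translating SQL statements into MapReduce jobs for execution. [From Baidu Baike.]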

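2. How to work with Hive

You work with Hive through HQL, a dialect of SQL. In other words, if you have written SQL before, whether for Oracle, DB2, MySQL, or SQL Server, you can pick up HQL with very little friction.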

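3. How Hive stores data

Storage is split into two parts: the metadata (kept in the metastore) and the actual data.

a) Metadata:

By default it is stored in an embedded Derby database; Hive can also be configured to keep it in MySQL. Metadata here means the basic properties of Hive tables: database name, table name, column names and types, partition columns and types, the table's partitions, and partition properties such as location.

b) Actual data:

Stored as files on Hadoop's HDFS, that is, on the big data platform itself.

Hive is what maps the metadata onto the file system, which gives a good sense of the role it plays.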
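One way to look at this mapping, as a small sketch: describe formatted prints what the metastore holds for a table, including the HDFS location of its files. The table temp_tab is the one created in Chap2 below; the exact output layout depends on the Hive version.

```sql
-- Print the metastore entry for a table: columns, table type, and the
-- "Location" field that points at the table's directory on HDFS.
describe formatted temp_tab;
```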

Chap2: Basic Table Creation

1. Log in to the Linux host

Type hive or beeline.

You should see the prompt: hive> ### start working from here.

2. Create your first table

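create table temp_tab(id int ,name string)

row format delimited fields terminated by '\t'

stored as textfile;

Note: the field delimiter is declared as a tab, so the text file you load must also be tab-separated; otherwise the loaded columns come back as NULL. With the default row format, Hive accepts only a single-character delimiter, and the default field delimiter is \001 (Ctrl-A).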

Load data: hive> load data local inpath '/home/ocdc/gxs/hive1.txt' overwrite into table temp_tab;

With overwrite, the existing data in the table is replaced; without it, the new data is appended.
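The difference between the two modes can be checked with a quick sketch like the following, which reuses the file and table from the load statement above. Loading the same file twice is only an assumption made for the demonstration.

```sql
-- First load replaces whatever temp_tab currently holds.
load data local inpath '/home/ocdc/gxs/hive1.txt' overwrite into table temp_tab;
select count(*) from temp_tab;

-- Second load, without "overwrite", appends the same file again,
-- so the row count should double.
load data local inpath '/home/ocdc/gxs/hive1.txt' into table temp_tab;
select count(*) from temp_tab;

-- Spot-check a few rows; NULL columns usually mean the file's delimiter
-- does not match the one declared in the table definition.
select * from temp_tab limit 5;
```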

3. Table types and how to create them

a) Internal (managed) table

create table temp_tab(id int ,name string)

row format delimited fields terminated by ','

stored as textfile;

b) External table

create external table temp_tab(id int ,name string)

row format delimited fields terminated by ','

stored as textfile;
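An external table is usually pointed at an existing HDFS directory with a location clause, which the statement above leaves out. A minimal sketch; the table name temp_ext and the directory /tmp/hive_demo/temp_ext are hypothetical and only serve as an illustration.

```sql
-- Hypothetical table name and HDFS path, for illustration only.
-- Hive reads whatever files already sit in this directory and leaves
-- them in place when the table is dropped.
create external table temp_ext(id int, name string)
row format delimited fields terminated by ','
stored as textfile
location '/tmp/hive_demo/temp_ext';
```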

c) Partitioned table

create table test1_gxs1(id int ,name string)partitioned by(sex string)

row format delimited fields

terminated by ',' stored as textfile;

Load data: load data local inpath '/home/ocdc/gxs/hive2.txt'

overwrite into table test1_gxs1 partition(sex='unknown');

d) Bucketed table (I have not worked with these yet)
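For reference, a bucketed table is declared with a clustered by clause. This is only a rough sketch of the syntax, not something taken from real use; the table name temp_bucket and the bucket count of 4 are hypothetical.

```sql
-- Hypothetical table: rows are hashed on id and spread over 4 bucket files.
create table temp_bucket(id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;
```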

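4. Differences between internal and external tables

a) A table created without the external keyword is an internal table; one created with it is an external table;

b) Hive manages the data of an internal table itself, whereas for an external table Hive only tracks the metadata and the files on HDFS are managed outside Hive;

c) Internal table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse), while the location of external table data is specified by the user;

d) Dropping an internal table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left in place;

e) Changes made to an internal table through Hive are synchronized to the metadata directly; when the partition directories of an external table are added or changed directly on HDFS, the metadata has to be repaired (MSCK REPAIR TABLE table_name;).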
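The repair step in e) might look like the sketch below. It assumes a partition directory (for example one named sex=girl, a hypothetical value) was created under the table's location directly on HDFS, so the metastore does not know about it yet. The partitioned table test1_gxs1 from the previous section stands in for any partitioned table.

```sql
-- Scan the table's directory on HDFS and register any partition
-- directories that are missing from the metastore.
msck repair table test1_gxs1;

-- The newly registered partition should now be listed.
show partitions test1_gxs1;
```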

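5. Understanding partitions

Partitioning means classifying data into groups. It can greatly speed up queries that filter on the partition key, because Hive only has to read the matching partitions. In practice, "region" or "time" is commonly used as the partition key, which also makes the data easier to manage and maintain.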
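As a small illustration, a query that filters on the partition column of the test1_gxs1 table from Chap2 only needs to touch one partition directory. The value 'boy' is assumed to exist as a partition.

```sql
-- Filtering on the partition column lets Hive read only the sex='boy'
-- directory instead of scanning every partition of the table.
select id, name from test1_gxs1 where sex = 'boy';
```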

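6. Importing and exporting data

a) Load data (from the local file system): load data local inpath '/home/ocdc/gxs/hive3.txt' overwrite into table temp_external1;

b) Load data (from HDFS; the file is moved into the table's directory rather than copied): load data inpath '***/hive3.txt' into table temp_external1;

c) Export data (to the local file system; the target path is treated as a directory, so ss1.txt becomes a directory holding the output files): hive> insert overwrite local directory '/home/ocdc/gxs/ss1.txt' select id,name from temp_external1;

d) Export data (to HDFS): hive> insert overwrite directory 'hdfs://master:9090/**/mate_load' select * from temp_external1;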
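Exported files use Hive's default \001 field delimiter unless told otherwise. In Hive 0.11 and later, a row format clause can be added to the export, as in this sketch. The output directory /tmp/hive_demo/export is hypothetical.

```sql
-- Hypothetical output directory; its existing contents are overwritten.
-- The row format clause makes the exported files comma-separated.
insert overwrite local directory '/tmp/hive_demo/export'
row format delimited fields terminated by ','
select id, name from temp_external1;
```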

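Chap3: Basic Commands, Functions, and Queries

1. Basic commands:

a) Show where a table is stored on HDFS: show create table table_name;

b) List a table's partitions: hive> show partitions test1_gxs1;

c) Rename a table: hive> alter table temp_tab rename to temp_tab1;

d) Add a column: hive> alter table temp_tab1 add columns(sex string);

e) Rename a column: hive> alter table temp_tab1 change sex gender string;

f) List built-in functions: hive> show functions;

g) Show what a function does: hive> desc function year;

h) Kill a job (run from the shell, not from the hive> prompt): hadoop job -kill job_×;

i) Show a table's structure: desc tab_name;

j) Replace the column list: hive> alter table temp_tab1 replace columns(id string,name string);

Before the replacement: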

hive> desc temp_tab1;

id int

name string

gender string

After the replacement:

hive> desc temp_tab1;

id string

name string

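2. Commonly used built-in functions:

count() row count; avg() average;

distinct (a keyword rather than a function) removes duplicates; min(); max(); substr();

Convert hexadecimal to decimal: conv(cell_id,16,10);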
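A few of these in use, as a sketch against the temp_tab(id, name) table from Chap2. The last statement has no from clause, which assumes Hive 0.13 or later.

```sql
-- Aggregates over the whole table.
select count(*), count(distinct name), min(id), max(id), avg(id) from temp_tab;

-- substr is 1-based: this takes the first two characters of name.
select id, substr(name, 1, 2) from temp_tab limit 10;

-- conv works on strings: hexadecimal '1F' becomes decimal '31'.
select conv('1F', 16, 10);
```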

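3. Insert statement:

insert into table test1_gxs1 partition(sex)

select id ,name,'boy' from temp_tab;

Note: partition(sex) without a value is a dynamic partition insert. In Hive versions where hive.exec.dynamic.partition.mode defaults to strict, the statement is rejected unless you first run set hive.exec.dynamic.partition.mode=nonstrict; alternatively, write partition(sex='boy') and drop 'boy' from the select list.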

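4. Query statements:

a) Limit the number of rows: tables in a big data center are naturally very large, so always bound your queries. Otherwise they run slowly and consume so many resources that other people's jobs can stall.

eg: select * from temp_tab limit 10;

eg: select * from temp_tab order by id desc limit 5; (Top 5; order by gives a global ordering, whereas sort by orders rows only within each reducer)

b) Joins follow the "small table first" rule: putting the small table first produces less intermediate cached data and avoids overflowing the in-memory buffer. (left join)

c) This is also where the value of partitions becomes clear.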
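Points b) and c) can be combined in one sketch. It assumes temp_tab is the smaller table and that test1_gxs1 has a partition sex='boy'; both tables come from Chap2.

```sql
-- The smaller table goes first: Hive buffers the earlier tables of a join
-- and streams the last one through.
-- The partition filter sits in the ON clause so the left join still keeps
-- every row of temp_tab while only the sex='boy' partition is read.
select a.id, a.name, b.sex
from temp_tab a
left join test1_gxs1 b
  on a.id = b.id and b.sex = 'boy';
```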

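Note: Hive does not support modifying existing data in place; row-level update and delete are available only on ACID transactional tables (Hive 0.14 and later).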

------- divider --------- NOT END ------- to be continued ---------
