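Hive Database Operations

Contents:

1. Hive data structures

2. DDL operations

3. DML operations

4. User-defined functions (UDF)

Hive data structures

In addition to the primitive data types (similar to Java's), Hive supports three collection types.

Hive collection types

array, map, struct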

hive (default)> create table test(
              > name string,
              > friends array<string>,      -- array column
              > children map<string,int>,   -- map column
              > address struct<street:string,city:string>)   -- struct column
              > row format delimited        -- declares the delimiters listed below
              > fields terminated by ','    -- columns are separated by ','
              > collection items terminated by '_'  -- items inside a collection (array, map, and struct) are separated by '_'
              > map keys terminated by ':'          -- a map key and its value are separated by ':'
              > lines terminated by '\n';           -- rows are separated by a newline '\n'

Write a data file that follows the table's format and upload it to the directory of the test table on HDFS:

Lili,bingbing_xinxin,Lucifa:18_Jack:19,Nanjing_Beijing

Then query the table to get the result:

test.name       test.friends        test.children           test.address
Lili        ["bingbing","xinxin"]   {"Lucifa":18,"Jack":19} {"street":"Nanjing","city":"Beijing"}

So the data behind a Hive table must be laid out exactly in the declared format, otherwise it cannot be read correctly!
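To make the role of each delimiter concrete, here is a minimal plain-Java sketch that splits the sample row the same way the table definition describes it. It is only an illustration and not Hive's actual SerDe code; the class name `RowFormatSketch` and its helper methods are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT Hive code: mimics how the declared delimiters split one row.
class RowFormatSketch {
    // array<string>: items separated by '_'
    static List<String> parseArray(String field) {
        return Arrays.asList(field.split("_"));
    }

    // map<string,int>: entries separated by '_', key and value separated by ':'
    static Map<String, Integer> parseMap(String field) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String item : field.split("_")) {
            String[] kv = item.split(":");
            result.put(kv[0], Integer.parseInt(kv[1]));
        }
        return result;
    }

    // struct<...>: member values separated by '_', matched to member names by position
    static Map<String, String> parseStruct(String field, String... memberNames) {
        String[] values = field.split("_");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < memberNames.length; i++) {
            // a missing value is stored as null in this sketch
            result.put(memberNames[i], i < values.length ? values[i] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Lili,bingbing_xinxin,Lucifa:18_Jack:19,Nanjing_Beijing";
        String[] fields = line.split(",");   // columns are separated by ','
        System.out.println(fields[0]);                                  // Lili
        System.out.println(parseArray(fields[1]));                      // [bingbing, xinxin]
        System.out.println(parseMap(fields[2]));                        // {Lucifa=18, Jack=19}
        System.out.println(parseStruct(fields[3], "street", "city"));   // {street=Nanjing, city=Beijing}
    }
}
```

If a row used a different delimiter, for example a space instead of ',', the first split would return a single field and the remaining columns could not be recovered, which is why the file layout has to match the declaration.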

Querying array data

hive (default)> select friends[0] from test;    -- indexed access, like a Java array
OK
bingbing

Querying map data

hive (default)> select children['Lucifa'] from test;    -- values can only be accessed by key
OK
18

Querying struct data

hive (default)> select address.street from test;        -- access a member with address.street
OK
street
Nanjing

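DDL operations

Creating, dropping, altering, and inspecting databases and tables.

Databases

Creating a database

Apart from the location parameter, the syntax is the same as in MySQL; like and desc are also supported.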

hive (default)> create database if not exists hive;
OK
-- this also creates the directory /user/hive/warehouse/hive.db on HDFS
hive (default)> create database if not exists hive location '/hive';
OK
-- location sets a custom HDFS path for the new database (the path must be quoted)
-- show database information
hive (default)> desc database hive;
OK
db_name comment location    owner_name  owner_type  parameters
hive        hdfs://master:9000/user/hive/warehouse/hive.db  whr USER

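Altering a database

The database name and its directory location cannot be changed; alter database is limited to other metadata, such as dbproperties.

Dropping a database

-- the database must be empty
hive (default)> drop database test;
-- cascade forces the drop even if the database still contains tables
hive (default)> drop database test cascade;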

Tables

Viewing a table's definition

hive (default)> show create table test;

CREATE TABLE `test`(
  `name` string, 
  `friends` array<string>, 
  `children` map<string,int>, 
  `address` struct<street:string,city:string>)
ROW FORMAT DELIMITED    -- delimiters
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '_' 
  MAP KEYS TERMINATED BY ':' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT   -- input format
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT            -- output format
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION                -- storage location
  'hdfs://master:9000/user/hive/warehouse/test'
TBLPROPERTIES (
  'transient_lastDdlTime'='1569750450')

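Internal (managed) tables and external tables

Internal table: dropping it deletes both the metadata and the data on HDFS.

External table: dropping it deletes only the metadata; the data on HDFS is kept.

Convert the internal table student to an external table:
```
alter table student set tblproperties('EXTERNAL'='TRUE'); -- 'EXTERNAL'='TRUE' must be written in uppercase
```
Convert the external table back to an internal table:
```
alter table student set tblproperties('EXTERNAL'='FALSE');
```
Check the table type:
```
desc formatted student;
```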

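Partitioned tables

Partitioning avoids brute-force full-table scans.

Each partition is a separate directory on HDFS; in other words, Hive partitions a table by splitting its data into HDFS directories.

Create a partitioned table (the partition information is stored in the PARTITIONS table of the metastore):

hive (default)> create table dept_partition(
              > deptno int,dname string, loc string)
              > partitioned by (month string)   -- partition by month; month can also be queried like a regular column
              > row format delimited fields terminated by '\t';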

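Loading data:

load data [local] inpath 'path' [overwrite] into table table_name [partition (partcol=val, ...)];
local:
    present ==> load the file from the local Linux filesystem
    absent  ==> load the file from HDFS, which is effectively a mv ("absent" means the local keyword is omitted, not that the file is missing locally)
overwrite:
    present ==> replace the data already in the table
    absent  ==> append the new data to the existing data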

Querying:

-- filter on the partition column with where
hive (default)> select * from dept_partition where month = '2019-9-31';

Adding partitions

-- several partitions can be added at once
hive (default)> alter table dept_partition add partition(month='2019-9-29') partition(month='2019-9-28');

Dropping partitions

-- replace add with drop, and separate the partitions with ','
hive (default)> alter table dept_partition drop partition(month='2019-9-29'),partition(month='2019-9-28');

Listing the partitions

hive (default)> show partitions dept_partition;

Two-level partitioned tables:

This simply means partitioning by two columns.

hive (default)> create table dept_2(
              > deptno int,dname string,loc string)
              > partitioned by (month string,day string)
              > row format delimited fields terminated by '\t';

Uploading data

On HDFS this appears as two directory levels: /user/hive/warehouse/dept_2/month=2019-9/day=30/dept.txt

-- both partition columns must be specified here
hive (default)> load data local inpath '/home/whr/Desktop/dept.txt' into table dept_2 partition(month='2019-9',day='30');
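The directory layout shown above follows a simple rule: one `column=value` directory level per partition column, in the order the columns were declared. The plain-Java sketch below reproduces that rule for illustration only. The class `PartitionPathSketch` and the method `partitionPath` are hypothetical names, and the sketch ignores the escaping Hive applies to special characters in partition values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT Hive code: builds the HDFS directory for a partition spec.
class PartitionPathSketch {
    // Appends one "column=value" directory level per partition column, in insertion order.
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> entry : partitionSpec.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the declared order: month first, then day
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("month", "2019-9");
        spec.put("day", "30");
        // prints /user/hive/warehouse/dept_2/month=2019-9/day=30
        System.out.println(partitionPath("/user/hive/warehouse/dept_2", spec));
    }
}
```

A query that filters on month and day only needs to read the matching directory, which is how partitioning avoids scanning the whole table.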

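Three ways to load data into a partitioned table:

(1) Use the load command, which creates the partition directory and the metadata automatically. (common)

(2) Create the partition directory and upload the data manually, then repair the metadata before querying. (good to know)

The command below repairs the partition metadata automatically based on the directories found on HDFS; it is useful when a large amount of data has no metadata yet:
```
hive (default)> msck repair table dept_partition;
```
(3) Create the partition directory and upload the data manually, then run the add partition command to register the metadata. (common)

Example of the third way:
```
# create the directory with the hadoop command and upload the data
$ hadoop fs -mkdir -p /user/hive/warehouse/dept_partition/month=2019-9-17
$ hadoop fs -put '/home/whr/Desktop/dept.txt' /user/hive/warehouse/dept_partition/month=2019-9-17
```
Adding the partition

-- register the partition in the metastore
hive (default)> alter table dept_partition add partition(month='2019-9-17');
OK
Time taken: 0.1 seconds
-- list all partitions
hive (default)> show partitions dept_partition;
OK
partition
month=2019-9-17 -- now present
month=2019-9-30
month=2019-9-31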

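DML operations

Adding data:

(1) load

(2) insert (always appends, whether or not the rows already exist, so running the same insert repeatedly appends duplicate rows)

hive (default)> insert into table test
              > select id,name from mess;   -- query mess and insert the result into test
-- this runs a MapReduce job
2019-09-30 14:53:55,879 Stage-1 map = 0%,  reduce = 0%
2019-09-30 14:54:01,365 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.68 sec

(3) insert overwrite (replaces the existing data in the table, or in the target partition, with the query result)

hive (default)> insert overwrite table test
              > select * from mess;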

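(4) as select (load data while creating the table)

hive (default)> create table pika
              > as select id,name from test;

(5) location (specify where the data lives with location when creating the table)

hive (default)> create table jieni(
              > id int,name string)
              > location '/user/hive/warehouse/mess';   -- HDFS directory that holds the data files (a directory, not a file)

(6) import (loads data into a Hive table; rarely used, and it requires data previously written by export, in the export format)

The columns must match exactly and the target table must be empty for the import to succeed.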

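Exporting data

(1) insert (writing query results to a local directory is effectively an export)

-- export to the local filesystem; drop local to export to HDFS instead
hive (default)> insert overwrite local directory '/home/whr/Desktop/data' select * from test;
-- the export is a directory; the data is in the file 000000_0, with fields separated by Hive's default delimiter '\001' (Ctrl-A), which is not visible

(2) Download the data with the hadoop command

(3) export (rarely used)

(4) Sqoop: imports and exports data between MySQL and HDFS (Hive)

Emptying a table: truncate

truncate removes only the data and leaves the table structure unchanged; it works only on internal (managed) tables, not on external tables.

As for queries, the syntax is largely the same as MySQL's and is not repeated here.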

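User-defined functions (UDF)

There are three kinds:

UDF: user-defined function; one row in, one value out

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {
    // Hive calls evaluate once per input row
    public int evaluate(int data){
        return data+5;
    }
}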

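UDTF: user-defined table function; one row in, multiple rows out

package UDF;    // matches the class name 'UDF.MyUDTF' used in create function

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDTF extends GenericUDTF {
    // reused for every output row
    private List<String> dataList = new ArrayList<>();
    // Define the column names and data types of the output
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs)
            throws UDFArgumentException {
        // Output column names
        List<String> fieldName = new ArrayList<>();
        fieldName.add("word");
        // Output column types
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // Combine names and types into the output row schema
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldName, fieldOIs);
    }
    /**
     * Processing logic. The function takes two arguments:
     * 1. args[0]: the string to split
     * 2. args[1]: the separator
     * Usage: select myudtf('hello,word,qqq,new',',');
     */
    @Override
    public void process(Object[] args) throws HiveException {
        /**
         * 1. Get the data
         * 2. Get the separator
         * 3. Split the data
         * 4. Emit one output row per word
         */
        String data = args[0].toString();
        String splitKey = args[1].toString();
        // note: String.split treats the separator as a regular expression
        String[] words = data.split(splitKey);
        for (String word : words) {
            dataList.clear();
            dataList.add(word);
            forward(dataList);
        }
    }
    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}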

UDAF: user-defined aggregate function; multiple rows in, one value out
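The notes give no UDAF code, so the following is only a plain-Java sketch of the "many in, one out" idea, under the assumption that aggregation happens in two stages: each data split is first reduced to a partial result, and the partial results are then merged into the final value. It does not use Hive's UDAF classes; the class `SumAggregateSketch` and its method names are hypothetical and chosen only to label the stages.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, NOT Hive's UDAF API: a sum aggregated in two stages.
class SumAggregateSketch {
    private long sum = 0;

    // Stage 1: consume one input row
    void iterate(int value) { sum += value; }

    // Stage 1 result: the partial aggregate of one split
    long terminatePartial() { return sum; }

    // Stage 2: fold another split's partial aggregate into this one
    void merge(long partial) { sum += partial; }

    // Final result: a single value for all input rows
    long terminate() { return sum; }

    // Runs both stages over a list of splits (each split is a list of rows).
    static long aggregate(List<List<Integer>> splits) {
        SumAggregateSketch finalAggregate = new SumAggregateSketch();
        for (List<Integer> split : splits) {
            SumAggregateSketch partial = new SumAggregateSketch();
            for (int value : split) {
                partial.iterate(value);
            }
            finalAggregate.merge(partial.terminatePartial());
        }
        return finalAggregate.terminate();
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5));
        System.out.println(aggregate(splits));   // prints 15
    }
}
```

Five input rows go in and a single value comes out, which is the contrast with the one-in-one-out UDF and the one-in-many-out UDTF.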

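Usage:

-- add the jar; it is recommended to place it under hive/lib instead, in which case add is not needed and the jar can be used directly
hive (default)> add jar /home/whr/Desktop/notes/Hadoop_notes/Hive_code/target/MyUDTF.jar;
-- create the function
hive (default)> create function myudtf as 'UDF.MyUDTF';
-- call it with arguments
hive (default)> select myudtf('hello,word,qqq,new',',');
OK
word    -- the custom column name
hello
word
qqq
new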