BG.Hive

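1. Hive Architecture

　　What is Hive?　　Originally developed at Facebook; see https://en.wikipedia.org/wiki/Apache_Hive

　　a> A tool that gives easy access to data through SQL and supports data warehousing tasks such as ETL, reporting, and data analysis

　　b> A mechanism for imposing structure on a variety of data formats

　　c> Data access: files in HDFS or in other data storage systems (e.g. HBase)

　　d> Query language: HiveQL, which is SQL-like

　　　　The default engine is MapReduce; a simple Select * From .. is not converted into an MR job

　　e> Query execution engines: MapReduce, Spark, Tez

　　f> Stored procedures, implemented through HPL/SQL

　　　　HPL/SQL began as a separate open-source project and ships with Hive since 2.0.0

　　g> LLAP (Live Long And Process), which gives Hive in-memory computation

　　　　Data is cached in the memory of multiple servers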

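2. Hive Features and Supported Formats

　　Hive provides standard SQL functions, and HiveQL can be extended with user-defined functions (UDFs)

　　Hive has built-in support for these formats

　　　　a> Comma-separated and tab-separated text files

　　　　b> Apache Parquet files, https://parquet.apache.org/

　　　　c> Apache ORC files. ORC: Optimized Row Columnar file; RC: Record Columnar file

　　　　d> Other formats

3. Metastore modes: single-user mode (Derby, an embedded database), multi-user mode (MySQL or another RDBMS), remote mode (a MetaStore Server runs on the server side and clients reach it over Thrift)

4. Why Hive Exists

　　Writing MR programs is tedious; with HQL the same task can be expressed very simply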

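5. Environment Setup

　　Prerequisites: CentOS, Hadoop, MySQL

　　Download Hive and place it under /opt on the virtual machine, https://mirrors.tuna.tsinghua.edu.cn/apache/hive/

　　tar zxf apache-hive-2.1.1-bin.tar.gz　　# extract

　　mv apache-hive-2.1.1-bin hive-2.1.1　　# rename

　　cd /opt/hive-2.1.1/conf/　　# enter the conf directory

　　cp hive-env.sh.template hive-env.sh　　# copy the configuration file

　　cp hive-default.xml.template hive-site.xml　　# copy the configuration file

　　vim /etc/profile　　# configure environment variables

　　source /etc/profile　　# apply the environment variables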
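
　　The lines added to /etc/profile are not listed in these notes. A minimal sketch, assuming the conventional HIVE_HOME variable name and the /opt/hive-2.1.1 install path used in the steps above:

```shell
# Assumed install location, taken from the extraction steps above.
export HIVE_HOME=/opt/hive-2.1.1
# Put the Hive launcher scripts (hive, beeline, schematool) on the PATH.
export PATH="$PATH:$HIVE_HOME/bin"
```

　　With these in place, hive, beeline and schematool can be called without the /opt/hive-2.1.1/bin/ prefix.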

　　vim hive-env.sh　　# configure hive-env.sh

　　　　HADOOP_HOME=/opt/hadoop-2.7.3　　# set HADOOP_HOME

　　/opt/hive-2.1.1/bin/schematool -dbType derby -initSchema　　# use Derby as the metastore and initialize it (fixes the error "message:Version information not found in metastore.")

　　vim hive-site.xml

　　　　Fix for the error that mentions ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D: set the two properties below to concrete local paths

<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/opt/hive-2.1.1/hivetmp/scratchdir/</value>
    <description>Local scratch space for Hive jobs</description>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/opt/hive-2.1.1/hivetmp/resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>

　　Check single-user mode (Derby): hive

　　Only one user can have a Hive session open at any given time

　　Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

　　Hive metadata (table information, column attributes, partitions, columns, table owner, and so on) is stored in metastore_db

　　The actual Hive data is stored on HDFS

　　vim /opt/hive-2.1.1/conf/hive-site.xml

　　　　Properties to change: javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionUserName, javax.jdo.option.ConnectionPassword

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!--<value>jdbc:derby:;databaseName=/opt/hive-2.1.1/conf/metastore_db;create=true</value>-->
    <value>jdbc:mysql://bigdata.mysql:3306/hive?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <!--<value>org.apache.derby.jdbc.EmbeddedDriver</value>-->
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <!--<value>APP</value>-->
    <value>bigdata</value>
    <description>Username to use against metastore database</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <!--<value>mine</value>-->
    <value>pas$w0rd</value>
    <description>password to use against metastore database</description>
  </property>

　　cp mysql-connector-java-5.1.41-bin.jar /opt/hive-2.1.1/lib/　　# copy the JDBC driver into lib; fixes the error ("com.mysql.jdbc.Driver") was not found.

　　/opt/hive-2.1.1/bin/schematool -dbType mysql -initSchema　　# initialize the metastore db; fixes Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

　　Test multi-user mode (MySQL): hive

6. Hive CLI

　　Hive Command Line Interface (CLI)

usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

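　　Three ways to set a configuration property, in decreasing precedence: 1. set property=value in the Hive CLI > 2. --hiveconf property=value > 3. hive-site.xml

　　hive> create database HelloHive;　　# create a database; its files are stored in Hadoop (HDFS)

　　hive> show databases;　　# list all databases

　　hive> use HelloHive;　　# switch to the HelloHive database

　　hive> create table T1(id int, name varchar(30));　　# create a table

　　hive> show tables;　　# list all tables

　　hive> insert into t1(id,name) values(1,'Niko'),(2,'Jim');　　# insert rows into T1

　　hive> select * from t1;　　# query T1

　　hive -d col=id --database HelloHive　　# start Hive with the variable col defined as id, connected to the HelloHive database

　　hive> select ${col},name from T1;　　# col stands in for id, so the id column is returned

　　hive> select '${col}',name from T1;　　# ${col} expands to id inside the quotes, so the literal string "id" is returned

　　hive> set mapred.reduce.tasks;　　# with no value, prints the current number of reduce tasks
　　　　[output] mapred.reduce.tasks=-1　　# the default, -1, lets Hive choose the number of reduce tasks

　　hive --hiveconf mapred.reduce.tasks=3　　# start Hive with the number of reduce tasks set to 3

　　hive> set mapred.reduce.tasks=5;　　# change the number of reduce tasks to 5 from within the Hive CLI

　　hive -e "select * from T1;" --database HelloHive;　　# pass a query to Hive with -e and print the result

　　vim t1.hql　　# create the file t1.hql

　　　　use HelloHive;　　# every SQL statement in the file must end with ;
　　　　Select * From T1 Where id < 4;

　　hive -f t1.hql　　# run the SQL statements in the file with hive

　　hive -S -e "select count(1) from T1;" --database HelloHive;　　# -S suppresses non-essential output such as MR progress messages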
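
　　The -f, -S and --hivevar options can be combined so that one query file serves several values. A minimal sketch, assuming the HelloHive database and T1 table created above; the file name t1_filtered.hql and the variable max_id are made up for illustration:

```shell
# Write a parameterised query file. The quoted 'EOF' stops the shell from
# expanding ${hivevar:max_id}, so Hive performs the substitution instead.
cat > t1_filtered.hql <<'EOF'
USE HelloHive;
SELECT * FROM T1 WHERE id < ${hivevar:max_id};
EOF

# Run the file only on machines where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
    hive -S --hivevar max_id=4 -f t1_filtered.hql
fi
```

　　Changing the value passed to --hivevar reruns the same file with a different filter, without editing the HQL.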

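7. Hive Shell

　　hive> quit;　　hive> exit;　　# leave the interactive Hive shell

　　hive> reset;　　# reset all Hive settings to the values in hive-site.xml

　　hive> set XXX;　　hive> set XXX=Y;　　# show or set a configuration property

　　hive> set -v;　　# show all Hadoop and Hive configuration properties

　　hive> !ls;　　# run a shell command from within hive

　　hive> dfs -ls;　　# run a dfs command from within hive

　　hive> add file t1.hql;　　# add t1.hql to the distributed cache

　　hive> list file;　　# list the files currently in the distributed cache

　　hive> delete file t1.hql;　　# remove the given file from the distributed cache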

8. Beeline

　　The CLI for HiveServer2; a JDBC client.

　　Embedded mode: runs an embedded Hive, similar to the Hive CLI (beeline)

　　Remote mode: connects over Thrift to a separate HiveServer2 process (for example, when connecting to HiveServer2 from code)

　　HiveServer2 configuration in hive-site.xml

　　　　hive.server2.thrift.min.worker.threads　　# minimum number of worker threads, default 5 (the maximum, hive.server2.thrift.max.worker.threads, defaults to 500)

　　　　hive.server2.thrift.port　　# TCP listening port, default 10000

　　　　hive.server2.thrift.bind.host　　# host the TCP listener binds to, default localhost

　　　　hive.server2.transport.mode　　# default binary (TCP); http is the alternative

　　　　hive.server2.thrift.http.port　　# HTTP listening port, default 10001

　　Start HiveServer2 (either command)

　　　　hive --service hiveserver2

　　　　hiveserver2

　　Start Beeline (either command)

　　　　hive --service beeline

　　　　beeline

　　Check whether the service is running: ps -ef | grep hive

　　cp /opt/hive-2.1.1/jdbc/hive-jdbc-2.1.1-standalone.jar /opt/hive-2.1.1/lib/　　# fixes the error hive-jdbc-*-standalone.jar: No such file or directory

　　beeline　　# start beeline

　　beeline> !connect jdbc:hive2://localhost:10000/HelloHive　　# connect to the Hive database with beeline

　　　　Enter username for jdbc:hive2://localhost:10000/HelloHive: root

　　　　Enter password for jdbc:hive2://localhost:10000/HelloHive: ********

　　Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/HelloHive: Failed to open new session: java.lang.RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)

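　　Solution:

　　　　kill -9 15544　　# stop the hiveserver2 process (substitute the PID reported by ps)

　　　　/opt/hadoop-2.7.3/sbin/stop-all.sh　　# stop the Hadoop cluster

　　　　vim /opt/hadoop-2.7.3/etc/hadoop/core-site.xml　　# edit Hadoop's core-site.xml and add the two properties below, which let the root user impersonate any user from any host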

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>

  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave1:/opt/hadoop-2.7.3/etc/hadoop/　　# distribute core-site.xml to every slave in the Hadoop cluster

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave2:/opt/hadoop-2.7.3/etc/hadoop/

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave3:/opt/hadoop-2.7.3/etc/hadoop/

　　　　/opt/hadoop-2.7.3/sbin/start-all.sh　　# start the Hadoop cluster

　　　　hiveserver2　　# start HiveServer2

　　　　beeline　　# start beeline

　　　　beeline> !connect jdbc:hive2://localhost:10000/HelloHive　　# connect to the Hive database, then enter the username and password

　　　　17/03/07 14:01:55 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ　　# warning only

　　　　0: jdbc:hive2://localhost:10000/HelloHive> set autoCommit=false;　　# beeline is connected; set autoCommit to false
　　　　　　[output]No rows affected (0.286 seconds)　　# the setting was applied

　　　　0: jdbc:hive2://localhost:10000/HelloHive> show tables;　　# list tables

　　　　0: jdbc:hive2://localhost:10000/HelloHive> select * from t1;　　# query the table

　　　　0: jdbc:hive2://localhost:10000/HelloHive> !quit　　# leave beeline

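9. Hive Data Types

　　Numeric:

　　　　TINYINT, 1 byte, -128 ~ 127, e.g. 1　　Postfix: Y　　100Y

　　　　SMALLINT, 2 bytes, -32768 ~ 32767, e.g. 1　　Postfix: S　　100S

　　　　INT/INTEGER, 4 bytes, -2,147,483,648 ~ 2,147,483,647, e.g. 1

　　　　BIGINT, 8 bytes, e.g. 1　　Postfix: L　　100L

　　　　FLOAT, 4-byte single precision, e.g. 1.0　　Floating-point literals are DOUBLE by default; use CAST(1.0 AS FLOAT) to get a FLOAT.

　　　　DOUBLE, 8-byte double precision (DOUBLE PRECISION is available from Hive 2.2.0), e.g. 1.0　　Neither FLOAT nor DOUBLE supports scientific notation

　　　　DECIMAL, up to 38 digits of precision (introduced in Hive 0.11.0), accepts scientific and non-scientific notation; the default is decimal(10,0), or give precision and scale explicitly, e.g. decimal(10,2)

　　Date and time:

　　　　TIMESTAMP, introduced in 0.8.0, e.g. 2017-03-07 14:00:00; supports traditional Unix timestamps, with precision down to nanoseconds.

　　　　DATE, introduced in 0.12.0, 0001-01-01 ~ 9999-12-31, e.g. 2017-03-07

　　String:

　　　　STRING, a string enclosed in single or double quotes

　　　　VARCHAR, introduced in 0.12.0, length 1 ~ 65535 characters

　　　　CHAR, introduced in 0.13.0, fixed length, maximum length 255

　　Misc

　　　　BOOLEAN, TRUE or FALSE

　　　　BINARY, introduced in 0.8.0, binary data

　　Array

　　　　ARRAY<TYPE>, e.g. ARRAY<INT>; element indexes start at 0

　　Map

　　　　MAP<PRIMITIVE_TYPE,DATA_TYPE>, e.g. MAP<STRING,INT>

　　Struct

　　　　STRUCT<COL_NAME:DATA_TYPE,...>, e.g. STRUCT<a:STRING,b:INT,c:DOUBLE>

　　Union

　　　　UNIONTYPE<DATA_TYPE,DATA_TYPE,...>, e.g. UNIONTYPE<STRING,INT,DOUBLE...>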

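CREATE TABLE complex(
    col1 ARRAY<STRING>,
    col2 MAP<INT,STRING>,
    col3 STRUCT<a:INT,b:STRING>,
    col4 UNIONTYPE<STRING,INT,ARRAY<STRING>,MAP<INT,STRING>>
);

-- Example values and element access (column types above match these values)
-- col1 = array('Hadoop','spark','hive','hbase','sqoop')
-- col1[1] = 'spark'          (indexes start at 0)

-- col2 = map(1,'hadoop',2,'sqoop',3,'hive')
-- col2[1] = 'hadoop'         (lookup by key)

-- col3 = named_struct('a',5,'b','five')
-- col3.b = 'five'            (field access with a dot)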

Allowed implicit type conversions (row: source type, column: target type)
	void	boolean	tinyint	smallint	int	bigint	float	double	decimal	string	varchar	timestamp	date	binary
void to	true	true	true	true	true	true	true	true	true	true	true	true	true	true
boolean to	false	true	false	false	false	false	false	false	false	false	false	false	false	false
tinyint to	false	false	true	true	true	true	true	true	true	true	true	false	false	false
smallint to	false	false	false	true	true	true	true	true	true	true	true	false	false	false
int to	false	false	false	false	true	true	true	true	true	true	true	false	false	false
bigint to	false	false	false	false	false	true	true	true	true	true	true	false	false	false
float to	false	false	false	false	false	false	true	true	true	true	true	false	false	false
double to	false	false	false	false	false	false	false	true	true	true	true	false	false	false
decimal to	false	false	false	false	false	false	false	false	true	true	true	false	false	false
string to	false	false	false	false	false	false	false	true	true	true	true	false	false	false
varchar to	false	false	false	false	false	false	false	true	true	true	true	false	false	false
timestamp to	false	false	false	false	false	false	false	false	false	true	true	true	false	false
date to	false	false	false	false	false	false	false	false	false	true	true	false	true	false
binary to	false	false	false	false	false	false	false	false	false	false	false	false	false	true
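
Where the table says false, the conversion has to be requested explicitly with CAST. A minimal sketch, written as a query file in the same way as t1.hql; the file name cast_demo.hql is made up for illustration:

```shell
# Write a few conversion queries to a file; none of them needs a table.
cat > cast_demo.hql <<'EOF'
-- string -> int is not implicit (false in the table), so cast explicitly
SELECT CAST('42' AS INT);
-- string -> date is not implicit either
SELECT CAST('2017-03-07' AS DATE);
-- int -> string is implicit (true in the table): concat() takes the int as is
SELECT concat('id-', 1);
EOF

# Run the file only on machines where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
    hive -S -f cast_demo.hql
fi
```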

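10. Basic Hive Table Operations and Concepts

　　Managed table (internal table): its data files, metadata, and statistics are all managed by Hive itself. The data of a managed table is stored under the path given by hive.metastore.warehouse.dir.

　　External table: metadata (a schema) describes the structure of external files; the data can be accessed and managed by processes outside Hive, for example directly on HDFS.

　　hive> desc formatted t1;　　# show table details; Table Type reads Managed Table or External Table

　　hive> create external table t2(id int,name string);　　# create an external table

　　hive> desc t1;　　# show the table's columns and their types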

11. Hive Data File Storage Formats

　　STORED AS TEXTFILE, the default file format (unless overridden by hive.default.fileformat in hive-site.xml)

　　STORED AS SEQUENCEFILE, a compressed Sequence File

　　STORED AS ORC, the ORC file format; supports ACID transactions and the CBO (Cost-Based Optimizer)

　　STORED AS PARQUET, Parquet files

　　STORED AS AVRO, Avro files

　　STORED AS RCFILE, RC (Record Columnar) files

　　STORED BY, storage in a non-built-in table format, for example data held in HBase/Druid/Accumulo

　　Creating a table

　　hive> create external table users(
    > id int comment 'id of user',
    > name string comment 'name of user',
    > city varchar(30) comment 'city of user',
    > industry varchar(20) comment 'industry of user')
    > comment 'external table, users'
    > row format delimited　　# delimited format; three SerDe-based alternatives are listed below
    > fields terminated by ','
    > stored as textfile
    > location '/user/hive/warehouse/hellohive.db/users/';

　　Built-in row format SerDes: Regex (regular expressions), JSON, CSV/TSV

　　row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as textfile

　　row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties ("input.regex"="<regex>") stored as textfile

　　row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerDe' stored as textfile

　　Alternatively, create the table with hive -f

vim t_users_ext.hql

create external table users(
id int,
name string,
city varchar(30),
industry varchar(20))
row format delimited
fields terminated by ','
stored as textfile
location '/user/hive/warehouse/hellohive.db/users';

insert into users(id,name,city,industry)
values(1,'Niko','Shanghai','Bigdata'),
(2,'Eric','Beijing','NAV'),
(3,'Jim','Guangzhou','IT');

hive -f t_users_ext.hql --database hellohive

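12. Hive Tables

　　Partitioned Table

　　A Hive SELECT normally scans the whole table (every file under the table's HDFS directory), which can take a long time; partitioning lets a query read only the matching partitions

　　The partition columns are declared when the table is created; partitions are coarser-grained than buckets

　　Syntax: partitioned by (par_col par_type)

　　Static partitioning: e.g. partition by year-month　　# check the current setting with: set hive.exec.dynamic.partition;

　　Dynamic partitioning: e.g. partition by product category, where new categories keep appearing　　# dynamic partitioning is enabled by default; if it is set to false, dynamic partitions cannot be created

　　　　Dynamic partition mode: set hive.exec.dynamic.partition.mode = strict/nonstrict　　# the default is strict; in strict mode at least one partition column must be given a static value

　　With partitioning, each partition value gets its own partition directory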
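
　　The static and dynamic cases can be sketched in one query file. This is only an illustration: the tables sales and staged_sales, their columns, and the file name partition_demo.hql are all made up, and the default strict mode is assumed.

```shell
# Write the partitioning example to a query file.
cat > partition_demo.hql <<'EOF'
-- Hypothetical table partitioned by year-month (dt) and product category.
CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING, category STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical staging table that carries the category as an ordinary column.
CREATE TABLE IF NOT EXISTS staged_sales (id INT, amount DOUBLE, category STRING);
INSERT INTO staged_sales VALUES (2, 5.0, 'toys');

-- Static partition: both partition values are written out in the statement.
INSERT INTO TABLE sales PARTITION (dt='2017-03', category='books')
VALUES (1, 9.99);

-- Dynamic partition: dt is static (which strict mode requires), while
-- category is taken from the last column of the SELECT.
INSERT INTO TABLE sales PARTITION (dt='2017-03', category)
SELECT id, amount, category FROM staged_sales;
EOF

# Run the file only on machines where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f partition_demo.hql
fi
```

　　After both inserts, the table directory should contain one subdirectory per partition value, of the form dt=2017-03/category=books and dt=2017-03/category=toys.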

　　Bucketed Sorted Table
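
　　Bucketing hashes rows on a column into a fixed number of files, optionally sorted within each file. A minimal sketch, assuming a hypothetical table users_bucketed and file name bucket_demo.hql; the bucket count of 4 is arbitrary:

```shell
# Write the bucketed-table DDL to a query file.
cat > bucket_demo.hql <<'EOF'
-- Rows are hashed on id into 4 bucket files; each file is sorted by id.
CREATE TABLE IF NOT EXISTS users_bucketed (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
EOF

# Run the file only on machines where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f bucket_demo.hql
fi
```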

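　　Skewed Table

　　　　Values that are heavily skewed in a column are stored apart from the rest, with each skewed value assigned its own directory or file; queries that filter on those values can then avoid a costly full table scan

　　　　Skewed by (field) on (value)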
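
　　　　In a full CREATE TABLE statement the clause looks roughly like this. A sketch only: the table page_views, its columns, the skewed values 1, 5 and 6, and the file name skew_demo.hql are made up for illustration.

```shell
# Write the skewed-table DDL to a query file.
cat > skew_demo.hql <<'EOF'
-- user_id values 1, 5 and 6 are assumed to dominate the data, so each of
-- them is stored in its own directory.
CREATE TABLE IF NOT EXISTS page_views (user_id INT, url STRING)
SKEWED BY (user_id) ON (1, 5, 6)
STORED AS DIRECTORIES;
EOF

# Run the file only on machines where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f skew_demo.hql
fi
```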

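　　Temporary Table

　　　　A temporary table is visible only in the current session; its data is kept in a tmp directory on HDFS

　　DROP TABLE [IF EXISTS] TABLE_NAME [PURGE];　　# for a managed table, PURGE deletes the metadata and the table data together, bypassing the trash. For an external table, only the metadata is deleted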