Hadoop-Hive

1. Configuration

  1) Unpack to /opt/moduels

  2) Set HIVE_HOME

  3) Set HADOOP_HOME and HIVE_CONF_DIR in hive-env.sh

  4) Create the Hive data directories on HDFS and grant them write permission (steps 2-4 are sketched below)

  5) bin/hive  --> run SQL statements
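
  A condensed sketch of steps 2)-4); the /opt/moduels paths follow the layout above, and the exact directory names are assumptions:

    # step 2: e.g. in /etc/profile
    export HIVE_HOME=/opt/moduels/hive
    # step 3: in $HIVE_HOME/conf/hive-env.sh
    HADOOP_HOME=/opt/moduels/hadoop
    export HIVE_CONF_DIR=/opt/moduels/hive/conf
    # step 4: data directories on HDFS, group-writable for Hive
    bin/hdfs dfs -mkdir -p /tmp /user/hive/warehouse
    bin/hdfs dfs -chmod g+w /tmp /user/hive/warehouse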

2. Install and configure MySQL

  1)unzip mysql

  2) rpm -e --nodeps mysql  --> remove any preinstalled MySQL first

  3)rpm -ivh mysql-server

  4) cat /root/.mysql_secret  --> the randomly generated root password

  5)rpm -ivh mysql-client

  6)mysql -uroot -p[password]

  7)set password=password('123456');

  8) update user set Host='%' where User='root' and Host='localhost';  --> run against the mysql database

  9)flush privileges;

  10)tar -zxvf mysql-connector

  11) cp mysql-connector-java /opt/moduels/hive/lib

  12) Configure the connection URL, DriverName, UserName and Password in hive-site.xml (sketched below)    --> port 3306, database = metastore

  -->https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
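
  A minimal hive-site.xml sketch for step 12; hostname and the credential values are placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://hostname:3306/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>123456</value>
    </property>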

3. Basic Hive operations

  1) Column delimiter

    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

  2) Load local data

    load data local inpath '/opt/datas/student.txt' [overwrite] into table student;

  3) desc formatted (or extended) student;

  4) show functions;  --> desc function [extended] substring;

  5) Clearing data:  truncate table table_name [partition partition_spec];  (example below)
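
  For example, on the partitioned table from section 7 below (the partition value is just for illustration):

    truncate table dept_partition partition (event_month='201509');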

4. Miscellaneous configuration

  1) Set hive.cli.print.header and hive.cli.print.current.db to show column headers and the current database

  2) Log file configuration

  3) set;  --> list current settings  --> bin/hive --hiveconf hive.root.logger=INFO,console prints log output to the console (logging is initialized at startup, so setting it inside a session has no effect)

  4) Common command-line options: bin/hive -help (-i, -f, -e); usage sketched below
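
  Usage sketches for the three options; the script paths are placeholders:

    bin/hive -e "select * from student;"   # run an inline query and exit
    bin/hive -f /opt/datas/query.sql       # run a script file
    bin/hive -i /opt/datas/init.sql        # run an init file, then enter the CLI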

5. Three ways to create a table

  1)create table test01(ip string comment '...',user string)

    comment 'access log'

    row format delimited fields terminated by ' '

    stored as textfile

    location '/user/hive/warehouse/logs';

  2)create table test02

    as select ip,user from test01;  --> CTAS: the new table gets the query's schema and data

  3)create table test03

    like test01;  --> copies only the table definition, no data

6. Hive's role in ETL

  1) table, load     --> E (extract)

  2) select, python  --> T (transform)

  3) sub table       --> L (load)
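
  A tiny HiveQL sketch of this E-T-L flow; the table and file names are made up for illustration:

    -- E: define a source table and load the raw file into it
    create table log_src (line string);
    load data local inpath '/opt/datas/log.txt' into table log_src;
    -- T + L: transform with a select and land the result in a sub table
    create table log_clean as select lower(line) as line from log_src;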

7. Table types in Hive

  1) Managed tables  -->  dropping the table deletes both the table data and the metadata

  2) External tables (external)  -->  dropping the table deletes only the metadata, not the table data

  3) Partitioned tables (partitioned tables)  -->  queries can select specific partitions via the where clause

      create table dept_partition(deptno int,dname string,loc string)

        partitioned by(event_month string[,event_day string])  --> two-level partitioning

        row format delimited fields terminated by ' ';

    Load data:

      load data local inpath '/opt/datas/emp.txt' into table emp_partition partition (month='201509');

    Query:

      where month = '201509';

    Notes:

      a. If you create a partition directory by hand and put data into it, the partition metadata is never written to the metastore, so the data cannot be queried; repair it manually (see the sketch after these notes):

        msck repair table dept_partition;

        or  alter table dept_part add partition(day='20150913');

      b. List a table's partitions: show partitions dept_partition;
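
      A sketch of the scenario in note a; the warehouse path assumes the default layout, and the data file name is made up:

        # create the partition directory by hand and put a data file into it
        hdfs dfs -mkdir -p /user/hive/warehouse/dept_partition/event_month=201509
        hdfs dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_partition/event_month=201509
        # the new partition stays invisible until the metastore is repaired
        hive -e "msck repair table dept_partition;"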

8. Ways to export a table

  1) insert overwrite local directory '/opt/datas/hive_exp_emp'  --> drop local to export to HDFS instead

      row format delimited fields terminated by ' '

      collection items terminated by ' '

       select * from db_hive.emp;

  2) bin/hive -e "select * from db_hive.emp" > /opt/datas/hive_exp_exp.txt  --> no MapReduce job is run

  3) Sqoop: hdfs/hive -> rdbms  or  rdbms -> hdfs/hive/hbase  (sketch below)
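
  A hedged Sqoop sketch for the rdbms -> hdfs direction; the connection string and target directory are placeholders:

    sqoop import \
      --connect jdbc:mysql://hostname:3306/db_hive \
      --username root --password 123456 \
      --table emp \
      --target-dir /user/sqoop/emp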

9. Common Hive queries

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...  --> all rows vs. de-duplicated rows
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT number]

  1) select t.empno,t.ename,t.deptno from emp t;  --> t is a table alias

  2) between ... and  --> inclusive range

    select t.empno,t.ename,t.deptno from emp t where  t.sal between 800 and 1500;

  3)is null/is not null

    select t.empno,t.ename,t.deptno from emp t where comm is null;

  4)group by/having 

    Average salary of each department:

    select deptno,avg(sal) avg_sal from emp

    group by deptno;

    Highest salary for each job within each department:

    select t.deptno,t.job,max(t.sal) max_sal from emp t group by t.deptno,t.job;  --> grouping on two columns

    Difference between having and where (example below)

      where filters individual rows, before grouping

      having filters the grouped results, after aggregation
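
    A sketch combining both filters on the emp table from above; the thresholds are arbitrary:

      select t.deptno,avg(t.sal) avg_sal
      from emp t
      where t.sal > 0              -- where: filters rows before grouping
      group by t.deptno
      having avg(t.sal) > 2000;    -- having: filters groups after aggregation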

10. Export/Import

  1) Export  --> exports the data of a Hive table

    EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
      TO 'export_target_path' [ FOR replication('eventid') ]    --> the path is a location on HDFS
  2)Import   
    IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
      FROM 'source_path'
      [LOCATION 'import_target_path']
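    A usage sketch; the export path and the new table name are made up:
      export table emp to '/user/hive/export/emp';
      import table emp_imported from '/user/hive/export/emp';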
11. Sorting
  1) order by  --> global sort, runs with a single reducer
  2) sort by  --> sorts within each reducer; the result is not globally sorted
  3) distribute by  --> like the partitioner in MR, spreads rows across reducers; use together with sort by
    insert overwrite local directory '/opt/datas/dist_emp' 
    select * from emp 
    distribute by deptno
    sort by empno asc;
  4) cluster by  --> use when the distribute by and sort by columns are the same (equivalence sketched below)
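  The equivalence in 4), as a sketch:
    -- these two statements produce the same result
    select * from emp distribute by deptno sort by deptno;
    select * from emp cluster by deptno;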
12. Built-in functions and UDF programming  --> user-defined functions
  1)https://cwiki.apache.org/confluence/display/Hive/HivePlugins
  2) Programming steps:
    a. Extend org.apache.hadoop.hive.ql.exec.UDF
    b. Implement one or more evaluate() methods
    ps. evaluate() must have a return type; Hadoop types such as Text/LongWritable are preferred over plain Java types
  3) Add the hive-jdbc and hive-exec dependencies to pom.xml
  4)使用:  add jar /opt/datas/udf-tolower.jar;
        create temporary function my_lower as "com.cnblog.hive.udf.LowerUDF";  --> fully-qualified class name
      or  create function myfunc as 'myclass' using jar 'hdfs://hostname/path/to/jar';  --> the jar must be on HDFS
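  A minimal sketch of the LowerUDF registered above, following steps a and b; the package name is taken from the create temporary function line:

    package com.cnblog.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class LowerUDF extends UDF {
        // Hive calls evaluate() once per input row; returning null yields NULL
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().toLowerCase());
        }
    }
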
Original post: https://www.cnblogs.com/bigger-class/p/6648484.html