Hive Installation Notes

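1. Installing Hive
Extract the release archive; nothing else needs to be built.
Configure the environment variables in /etc/profile.
Start HDFS.
Start Hive.
If the Hive CLI fails on startup with a jline class conflict, copy Hive's jline jar into Hadoop's YARN lib directory and start Hive again:
cp $HIVE_HOME/lib/jline.xxxxx $HADOOP_HOME/share/hadoop/yarn/lib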
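2. show databases; lists the databases.
3. show tables; lists the tables in the current database.
4. create database xxxxx; creates a database.
5. desc tablename; shows the columns of a table.
6. create table tablename(column columnType....) creates a table.
Column types: tinyint smallint int bigint string float double array struct map timestamp binary
7. show create table tablename; shows the full definition of a table.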
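8. The output of show create table contains these entries:
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' is the class that splits each line into columns. Hive uses it whenever it reads the table's file in HDFS. The split happens at query time, not at load time, which is why it is called lazy.
'org.apache.hadoop.mapred.TextInputFormat' is the input format. A query is submitted as a MapReduce job, and the mappers read the HDFS data through TextInputFormat.
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' is the output format. When the MapReduce job writes its result to a file, it writes only the value and drops the key.
'hdfs://master:9000/user/hive/warehouse/student' is the table location, a directory in HDFS. Every file in this directory is data of this table.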
 
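9. If no metastore database connection is configured, Hive uses its embedded Derby database, and the metadata is stored in the directory Hive was started from.
10. Starting Hive from a different directory therefore loses the existing metadata. To make the metadata persistent, we store it in MySQL.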
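11. Install MySQL. First find and remove any preinstalled MySQL packages:
rpm -qa|grep mysql
rpm -e --nodeps xxx.mysql.xxxx
Then install the server and client packages, start the service, and run the setup script:
rpm -ivh MysqlServer ....
rpm -ivh MysqlClient...
service mysql start
/usr/bin/mysql_secure_installation
Follow the prompts to configure the installation.
Note: do NOT choose to disallow remote root login.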

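12. mysql -uroot -p -h <host> logs in to the server from the client.
13. In the user table of the mysql system database, set host to % so that the account matches every IP address (run the SQL statements inside the mysql database, then restart the service from the shell):
grant all privileges on *.* to root@"%" identified by "123456";
flush privileges;
delete from user where host!="%";
service mysql restart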

14. MySQL basics
Never delete the original root user.

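15. Configure remote access to the database. When creating the hive database, choose the latin1 character set; in Navicat this is latin1 with the latin1_swedish_ci collation.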

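16. All Hive table data is stored in HDFS, and each table corresponds to a directory.
/user/hive/warehouse is Hive's base directory for table data; by default everything goes under it.
This is the root directory. Each database you create gets a directory named xxx.db under it, and each table gets a directory with the same name as the table.
To use a different location: create .... location "hdfs://xx:9000/xxxxx"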

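17. You do not need to select a database when entering Hive, because there is a built-in database named default.
A new session starts in default, whose directory is /user/hive/warehouse/ itself.
A table created in default gets its directory directly under /user/hive/warehouse/.
A table in a database you created gets its directory under /user/hive/warehouse/xx.db/.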

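18. Inserting data
stored as textfile means the table's data is kept in HDFS as plain text.
The Hive session keeps a connection to HDFS open, so dfs -command works inside Hive with the same commands as the regular HDFS shell.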

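19. Field delimiters in HDFS files
If no delimiter is specified, the default is ^A (Ctrl-A, \001). When the table definition does not say how to split, Hive splits each record on ^A.
Data written with insert shows no visible delimiter, but the fields are in fact separated by ^A.
create table stu120(id int,name string) row format delimited fields terminated by " ";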
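The effect of the delimiter can be illustrated with a small Python sketch. It only mimics how a text line is cut into columns; the helper name split_fields and the sample lines are made up for illustration and are not part of Hive.

```python
CTRL_A = "\x01"  # Hive's default field delimiter, displayed as ^A

def split_fields(line, delimiter=CTRL_A):
    """Cut one text line into column values (hypothetical helper)."""
    return line.rstrip("\n").split(delimiter)

# A row written with the default delimiter, and one written for stu120 (space).
default_row = split_fields("1\x01tom\n")
space_row = split_fields("1 tom\n", delimiter=" ")
# Reading a space-delimited line with the default delimiter yields one column.
wrong_delimiter = split_fields("1 tom\n")
```

When the delimiter of the file and the delimiter of the table definition disagree, the whole line ends up in the first column, which is why the remaining columns come back empty.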
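20. Loading data
Option 1: put the file directly into the table's directory in HDFS.
Option 2: use the Hive command load data [local] inpath "path" [overwrite] into table tablename;
With local, the file is copied from the local filesystem. Without local, the path is an HDFS path and the file is moved into the table's directory, so it disappears from its original location.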

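21. Hive DDL: dropping tables
drop table tableName; drops a table.
For a managed table this deletes both the metadata and the table's data.
A table created without specifying a type is a managed (internal) table; dropping it deletes its metadata together with its data in HDFS.
In practice most tables are created as external tables.
The table type is either MANAGED_TABLE or EXTERNAL_TABLE.
Dropping an external table deletes only the metadata; the data in HDFS is kept.
External tables are typically used for shared data.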

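22. Hive DDL: altering tables
Rename a table: alter table tablename rename to newName;
Change a column: alter table tablename change column oldColumn newColumn columnType;
Note: mind the column type. If it does not match the data in the file, the value is shown as NULL and no error is raised.
Add columns: alter table tablename add columns(column columnType....);
If the table defines more columns than the file contains, the extra columns have no data (NULL), and no error is raised.
Replace columns: alter table tablename replace columns(col colType.....); replaces all columns at once. A single column cannot be replaced on its own.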

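23. DML: SQL statements
insert update delete select
Hive is essentially a data warehouse: it stores data for querying.
Early versions did not support row-level insert (insert ... values) at all. Current versions do, but we do not recommend using it.
Clause order: select column from table where ... group by ... having ... order by ... limit ...;
In SQL generally, subqueries are compared with = or in.
The examples follow the order of the query clauses and use two tables: t_emp (employees) and t_dept (departments).
t_emp:
empno      int
ename      string
job        string
salary     double
bonus      double
hiredate   string
mgr        int
deptno     int
t_dept:
deptno     int
dname      string
location   string

The where clause supports = != <> < > is null is not null.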

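Who works in the R&D department?
 select * from t_emp e,(select deptno from t_dept where dname="yfabu")t where e.deptno = t.deptno;
In the Hive versions covered here, the where clause cannot compare a column to a subquery with =, so we generally use a join instead, as in this query.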

Who earns the highest salary in each department?
select e.ename,t.max,t.deptno
from t_emp e,
(select max(salary) max,deptno from t_emp group by deptno)t
where e.salary = t.max and e.deptno = t.deptno;
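The join-against-a-derived-table pattern can be tried without a cluster. The sketch below uses Python's built-in sqlite3 as a stand-in for Hive, so it only demonstrates the SQL logic, not Hive's execution. The four sample rows are made up, and the alias max_salary replaces max purely for readability.

```python
import sqlite3

# In-memory database standing in for Hive; rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("create table t_emp(ename text, salary real, deptno integer)")
conn.executemany(
    "insert into t_emp values (?, ?, ?)",
    [("tom", 8000.0, 10), ("amy", 6000.0, 10),
     ("bob", 7000.0, 20), ("joe", 5000.0, 20)],
)

# The inner query finds each department's top salary; the join then picks
# the employees whose salary equals that value within the same department.
query = """
select e.ename, t.max_salary, t.deptno
from t_emp e
join (select max(salary) as max_salary, deptno
      from t_emp group by deptno) t
  on e.salary = t.max_salary and e.deptno = t.deptno
order by t.deptno
"""
top_earners = conn.execute(query).fetchall()
```

Matching on both salary and deptno matters: matching on salary alone would also return an employee from another department who happens to earn the same amount.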
Which departments have more than three employees?
select count(*),deptno from t_emp group by deptno having count(*)>3;
Which departments have an average salary above 5000?
select avg(salary),deptno from t_emp group by deptno having avg(salary)>5000;
Who works in the department of the highest-paid employee?
select e1.*
from t_emp e1,
(select deptno from t_emp e,(select max(salary) max from t_emp)t where e.salary=t.max)t2
where e1.deptno = t2.deptno;
Who earns more than tom?
select * from t_emp where salary >(select salary from t_emp where ename="tom")
select * from t_emp e,(select salary from t_emp where ename="tom")t where e.salary>t.salary;
The first form is rejected: a where-clause subquery cannot be compared with >, = or <, whether as an equality or an inequality condition. Use the second form.

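Who works in the departments of the two highest-paid employees in the company?
limit 0,2 is MySQL syntax. In the Hive versions covered here, write limit 2: there is no start offset, only a row count.
Note: to run an HDFS command inside Hive, use dfs -command file;
Note: to run a Linux command inside Hive, use !command ;
select e.* from
t_emp e,
(select deptno,salary from t_emp order by salary desc limit 2)t
where e.deptno = t.deptno;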

Who reports to tom?
select e.*
from t_emp e,
(select empno from t_emp where ename="tom")t
where e.mgr=t.empno
======================== Internals and Optimization ==================================
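group by: grouping can cause data skew.
1. Fix: spread the map output randomly across the reducers for a first, partial aggregation
set hive.groupby.skewindata=true;
2. Aggregate on the map side, like a combiner, before the data goes to the reducers
set hive.map.aggr=true;
3. Map-side aggregation only pays off if it actually shrinks the data, so Hive checks this after a number of rows
set hive.groupby.mapaggr.checkinterval=100000; the check is made after 100000 rows
4. The ratio used in that check: if the aggregated output is larger than this fraction of the input rows, map-side aggregation is turned off
set hive.map.aggr.hash.min.reduction=0.5;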
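A rough Python sketch of what map-side aggregation buys for a count(*) ... group by. The function names and the two input splits are hypothetical; real Hive does this inside the mapper with a hash table.

```python
from collections import Counter

def partial_counts(split):
    """Map side: count keys within one input split (combiner-like step)."""
    return Counter(split)

def merge_counts(partials):
    """Reduce side: add up the partial counts from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two made-up input splits with a skewed key "a".
splits = [["a", "a", "b"], ["a", "c", "a"]]
partials = [partial_counts(s) for s in splits]
totals = merge_counts(partials)

# Records sent through the shuffle with and without map-side aggregation.
shuffled_with_aggr = sum(len(p) for p in partials)
shuffled_without_aggr = sum(len(s) for s in splits)
```

Here four records are shuffled instead of six. When nearly every key is unique the partial counts are almost as large as the input, which is the case the reduction-ratio check is meant to detect.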

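order by: a global sort, which forces a single reducer.
That one reducer sorts all of the data, so it is under heavy load and can easily fail.
So do not run an unbounded global sort: add limit when using order by.
set hive.mapred.mode = strict;
In strict mode, a query that uses order by must also specify a limit.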

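sort by: not a global sort; it sorts within each reducer.
Problem: sort the rows of each department by salary in descending order.

set mapreduce.job.reduces=3;
select * from t_emp distribute by deptno sort by salary desc;
If one reducer receives several departments, sort by deptno, salary desc keeps each department's rows together.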

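Cluster troubleshooting:
1. Hive fails to start: connection refused ....
The NameNode enters safe mode on startup ("namenode is in safe mode"), and on some machines it does not leave safe mode by itself. Leave it manually:
hdfs dfsadmin -safemode leave
2. hive connection failed
Hive connects to HDFS on startup and takes the address from $HADOOP_HOME/etc/hadoop/core-site.xml.
The error occurs when the NameNode runs on one machine but fs.defaultFS in core-site.xml points to a different machine.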

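sort by: sorts each reducer's output file separately.
distribute by: distributes the map output to the reducers according to the given column.
set mapreduce.job.reduces=3;
Unlike order by, which is a global sort, sort by sorts only within a reducer. With a single reducer, sort by is effectively a global sort too.
cluster by: distribute + sort, equivalent to distribute by plus sort by on the same column. cluster by uses the same column for both steps and sorts in ascending order only.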
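A small Python sketch of distribute by deptno sort by salary desc with three reducers. The modulo on the integer key stands in for Hive's hash partitioning, and the employee rows are made up.

```python
def distribute_and_sort(rows, num_reducers, dist_key, sort_key, descending=False):
    """Send each row to a reducer by key, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys, partitioned by simple modulo.
        reducers[row[dist_key] % num_reducers].append(row)
    for part in reducers:
        part.sort(key=lambda r: r[sort_key], reverse=descending)
    return reducers

emps = [
    {"ename": "a", "deptno": 10, "salary": 3000.0},
    {"ename": "b", "deptno": 20, "salary": 5000.0},
    {"ename": "c", "deptno": 10, "salary": 8000.0},
    {"ename": "d", "deptno": 30, "salary": 4000.0},
    {"ename": "e", "deptno": 20, "salary": 6000.0},
]
parts = distribute_and_sort(emps, 3, "deptno", "salary", descending=True)
names_per_reducer = [[r["ename"] for r in part] for part in parts]
```

Each inner list corresponds to one reducer's output file: rows are ordered within a file, but there is no ordering across files, which is the difference from order by.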

==========================================================
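union, union all, distinct
select * from t_emp
union
select * from t_emp;
union combines two result sets and removes duplicate rows.

union all combines two result sets and keeps every row, duplicates included.

distinct removes duplicates.
Problem: how many distinct jobs are there among the employees of the R&D and finance departments?
select count(distinct(job)) from t_emp where deptno in
(select deptno from t_dept where dname in ("yfabu","cwubu"));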

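join: joining tables
Inner joins: a where condition on a comma join, join, and inner join.
left join and left outer join are both left outer joins in Hive.
right join and right outer join are both right outer joins in Hive.
left semi join: equivalent to in
select * from t_emp e left semi join t_dept d on e.deptno =d.deptno;
select * from t_emp where deptno in (select deptno from t_dept)
The two statements do the same job, but left semi join is optimized: for each left row it can stop at the first match instead of traversing all of the data.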
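The difference between a semi join and an inner join shows up when the right table has duplicate keys. A minimal Python sketch, with hypothetical helper names and made-up rows:

```python
def left_semi_join(left_rows, right_rows, key):
    """Keep each left row once if its key exists on the right side."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]

def inner_join(left_rows, right_rows, key):
    """Emit one combined row for every matching (left, right) pair."""
    return [{**l, **r} for l in left_rows for r in right_rows if l[key] == r[key]]

emps = [{"ename": "tom", "deptno": 10}, {"ename": "amy", "deptno": 40}]
# deptno 10 appears twice on purpose.
depts = [{"deptno": 10, "dname": "yfabu"}, {"deptno": 10, "dname": "duplicate"}]

semi = left_semi_join(emps, depts, "deptno")
inner = inner_join(emps, depts, "deptno")
```

The semi join returns tom once and only with the left table's columns, while the inner join returns him twice, once per matching right row.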

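full join is available in Hive but not in MySQL.
map join: the two tables are two files. Both files cannot simply be handed to mappers, because records handled by different mapper processes never meet. So one file is placed in the distributed cache
and the other file is processed by the mappers, which join each record against the cached data. The cached data is stored as a map.
In a hand-written map join we would load the data into a local collection stored as a map. Hive does this itself with a hash table (equivalent to a HashMap).
select /*+MAPJOIN(d)*/ * from t_emp e join t_dept d on e.deptno = d.deptno;
Hive places the small table in the distributed cache. Before the MapReduce tasks run, the data is copied to a local directory on each machine that runs a task.
This hint-based map join is no longer needed! It was how map joins were written before Hive 0.7.
Map join optimization:
set hive.auto.convert.join=true; automatically converts a join into a map-side join
set hive.mapjoin.smalltable.filesize=25000000; tables smaller than about 25 MB count as small tables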
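The idea behind a map join is a plain hash join: build a lookup map from the small table, then stream the big table past it. A minimal Python sketch, assuming both tables fit the shapes shown (the rows and the function name are illustrative):

```python
def map_join(big_rows, small_rows, key):
    """Hash join: small table in memory, big table streamed row by row."""
    # Build phase: this dict plays the role of the cached small table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: each big-table row is joined without any shuffle.
    joined = []
    for row in big_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

emps = [
    {"ename": "tom", "deptno": 10},
    {"ename": "amy", "deptno": 20},
    {"ename": "bob", "deptno": 99},  # no matching department
]
depts = [{"deptno": 10, "dname": "yfabu"}, {"deptno": 20, "dname": "cwubu"}]

joined = map_join(emps, depts, "deptno")
pairs = [(r["ename"], r["dname"]) for r in joined]
```

Because every mapper holds its own copy of the lookup map, the approach only works while the small table fits in memory, which is what the file-size threshold guards.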

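reduce join: we generally avoid it because of the expensive shuffle. All rows with the same join key are sent to one reducer, which puts that reducer under heavy load.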


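Partitioned tables: the partition column is also a column of the table, but it cannot be manipulated directly. For example, when loading data you cannot load values from the file into this column.
We have to specify its value by hand, because the value is stored as a directory name in HDFS.
This kind of partition is called a static partition: the partition value is fixed by hand.
create table tablename(col coltype...) partitioned by(col coltype...) row format delimited fields terminated by " ";
load data local inpath "path" into table tablename partition(col=value);
When querying, use the partition column like any ordinary column. A query that filters on it only reads a directory such as /user/hive/warehouse/temperature/month=5.
The directory layout plus the table metadata in MySQL greatly reduces the amount of data a query has to read.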

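Multi-level partitions: temperature data, for example, can be partitioned by month and then by day.
show partitions tablename; lists the partitions. The partition entries are kept in MySQL in the form month=5/day=10.
select * from tmp1 where month=5 and day=10 ----> partition metadata in MySQL ----> month=5/day=10
----> /user/hive/warehouse/tmp1/month=5/day=10. This greatly improves query efficiency, because the query goes straight to that one small directory.
********** More and finer partitions make queries more selective, but more is not always better: a very large number of partitions puts too much pressure on the metadata, so we pick a middle ground. *****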
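How a partition filter turns into a directory can be sketched in a few lines of Python. The helper partition_path is hypothetical; Hive resolves the path from the metastore rather than by string building.

```python
def partition_path(table_dir, partition_spec):
    """Build the HDFS directory for a list of (column, value) partition pairs."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)

# where month=5 and day=10 on table tmp1
path = partition_path("/user/hive/warehouse/tmp1", [("month", 5), ("day", 10)])
# Filtering only on the parent partition selects the whole month directory.
month_only = partition_path("/user/hive/warehouse/tmp1", [("month", 5)])
```

The order of the pairs follows the order in partitioned by, so month is always the parent directory of day.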

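Example: 50 years of temperature data partitioned by month and day gives more than 300 partitions.
With that many partitions, specifying each one by hand is impractical, so we use Hive's dynamic partitioning.
We do not specify the partitions by hand. We only insert the data into the table, and Hive creates the partitions from the column values.
create table tablename(col coltype...) partitioned by(col coltype...) row format delimited fields terminated by " ";
set hive.exec.dynamic.partition = true;
Because the partitions are dynamic we do not give partition values; Hive derives them from the data. load cannot be used here, because with load the table cannot tell which field is the partition column.
With insert, the table does know which column is the partition column: insert into table values(1,2,3,4)
So load the data into a temporary table first, then select from the temporary table and insert into the partitioned table (d_p here):
create tmptable;
load data into tmptable;
insert into tablename partition(col...) select * from tmptable;
If the mode is strict, a fully dynamic insert like this is rejected; set hive.exec.dynamic.partition.mode=nonstrict; allows it.
Hive creates the partitions automatically. Partition values are taken from the select list by position (the last columns), not by column name.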
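The by-position rule can be illustrated with a short Python sketch. The function and the sample rows (id, temperature, month) are made up; the point is only that the trailing columns of each selected row decide the partition.

```python
from collections import defaultdict

def dynamic_partition(rows, num_partition_cols):
    """Group rows by their trailing columns, as a dynamic-partition insert does."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]
        part_values = row[-num_partition_cols:]  # taken by position, not by name
        partitions[part_values].append(data)
    return dict(partitions)

# (id, temperature, month): month is the last column, so it is the partition value.
rows = [(1, 20.5, 5), (2, 21.0, 5), (3, 18.2, 6)]
parts = dynamic_partition(rows, 1)
```

If the select list put temperature last instead, the temperatures would silently become the partition values, so the column order of the select must match the table definition.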

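Mixed partitions:
When the data volume is large, the number of partitions becomes very large. If every partition is dynamic, the partition count puts heavy pressure on the NameNode.
We cannot let dynamic partitioning create everything on its own: set hive.exec.dynamic.partition.mode=strict;
insert into d_p partition(month=6,day) select a,b,c from tp_tmp;
Strict mode requires at least one partition column to have a fixed value, which keeps dynamic inserts from generating too many partitions.
The fixed value must be given for the parent partition, not for the child partition.
(month=6,day): allowed; creates /month=6/ with day=01, 02, 03, 04 ... under it
(month,day=20): rejected, because a dynamic partition cannot be the parent of a static partition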

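Bucketing:
create table tablename(col coltype..) clustered by(col) into N buckets;
By default the data is not actually split: the number of buckets corresponds to the number of output files, and by default there is only one reducer.
So set the number of reducers before populating a bucketed table. With the setting below, Hive matches the number of reducers to the number of buckets automatically.
set hive.enforce.bucketing=true;
insert into table tablename select * from tmptable;

Benefits:
1. The data is divided into smaller, more regular units.
2. Joins produce fewer Cartesian products.
When two bucketed tables are joined, each bucket only has to be matched with the corresponding bucket of the other table. This optimizes joins considerably and effectively prevents Cartesian products.
set hive.optimize.bucketmapjoin=true;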
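A Python sketch of how rows land in buckets. It assumes an int bucketing column whose hash is the value itself and a bucket number of (hash & Integer.MAX_VALUE) % N; treat that formula as an assumption about Hive's default behaviour rather than a guarantee for every version or column type.

```python
def bucket_for(int_key, num_buckets):
    """Bucket number for an int key (assumed hashing rule, see lead-in)."""
    return (int_key & 0x7FFFFFFF) % num_buckets

def assign_buckets(keys, num_buckets):
    """Place every key into its bucket; each bucket becomes one file."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets

buckets = assign_buckets([1, 2, 3, 4, 5, 6, 7], 3)
```

Two tables bucketed on the same column with the same N place equal keys in the same bucket number, which is why a join only has to pair bucket i with bucket i.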

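Indexes:
1. Partitioning and bucketing optimize large tables, but they cannot be pushed too far, because every partition and bucket adds metadata.
Instead we can build one additional file that records where each record is located in the data files.
This refines the optimization on top of partitioning and bucketing.
Because it is a single file recording the location of every value in the table, it adds little pressure to the NameNode's metadata.
Steps:
1. Sort the data (a-z).
2. Build the index, which records id=1 ----> /user/hive/warehouse/student/000000_0/200
select * from t_emp where id = 1 --- id=1 --- /user/hive/warehouse/student/000000_0/200

To create an index you first need a table. The table's data is sorted and the index is built; the index itself is stored as a file.
table ---- table_index, which is also a table
create index id_index on table t_cluster(id) as "org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler" with deferred rebuild;
alter index indexname on tablename rebuild; rebuilds the index: sorts the data and writes the index entries for the indexed column into a new table
HDFS files + index table: a query consults the index table first and then reads the Hive table's files.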
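What such an index records can be sketched in Python as a map from a column value to the byte offsets where matching lines start. This is a simplified illustration with made-up lines, not the actual layout of Hive's compact index table.

```python
def build_index(lines, column, field_sep=" "):
    """Map each value of one column to the byte offsets of its lines."""
    index = {}
    offset = 0
    for line in lines:
        value = line.rstrip("\n").split(field_sep)[column]
        index.setdefault(value, []).append(offset)
        offset += len(line.encode("utf-8"))  # next line starts after this one
    return index

# Hypothetical file content: "id name" per line.
lines = ["1 tom\n", "2 amy\n", "1 bob\n"]
index = build_index(lines, 0)
```

A lookup for id 1 then reads only the lines at offsets 0 and 12 instead of scanning the whole file.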

Complex data types:
tinyint smallint int bigint string timestamp binary map array struct
array: an array
create table teen(id int,name string,gf Array<string>)
row format delimited fields terminated by " "
collection items terminated by ",";
Columns are separated by a space; the content of the third column is split on commas and stored in the array.
 select id,name,gf[0] from teen;
 select * from teen where gf[0]='cls';
Array elements can be used in the select list and in filter conditions.

map: a key-value collection
create table t_map(id int,name string,info Map<string,string>)
row format delimited fields terminated by " "
collection items terminated by ","
map keys terminated by ":";
Three delimiters: between fields, between key-value pairs, and between a key and its value.
select * from t_map where info["address"]='fs';
select info from t_map;
select info["address"] from t_map;
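The three delimiters of t_map can be traced with a small Python sketch that parses one text line the way the table definition describes. The function name and the sample line are made up for illustration.

```python
def parse_map_row(line, field_sep=" ", item_sep=",", key_sep=":"):
    """Parse one 'id name k:v,k:v' line into (id, name, info dict)."""
    id_text, name, info_text = line.rstrip("\n").split(field_sep)
    # Split into key-value pairs first, then split each pair into key and value.
    info = dict(item.split(key_sep, 1) for item in info_text.split(item_sep))
    return int(id_text), name, info

row = parse_map_row("1 tom address:fs,age:20\n")
address = row[2]["address"]  # corresponds to info["address"] in the query
```

All map values stay strings here, matching Map<string,string> in the table definition.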

struct: a structure with named fields
create table t_struct(id int,name string,address struct<province:string,city:string,country:string>)
row format delimited fields terminated by " "
collection items terminated by ",";
select * from t_struct where address.province = 'sx';
select address.province from t_struct;