Hive Key Points Summary

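1. What is Hive?

　　1. Hive is a tool in the Hadoop ecosystem. It provides a SQL-like structured query language (HiveQL) for querying files stored on HDFS or other file systems.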

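2. Hive operations:

　　1. Run a one-off query from the shell: hive -S -e "select * from mytable limit 3"
　　　　Handy for quick ad hoc checks. -S turns on silent mode, which drops the "OK" and "Time taken" lines from the output.

　　2. Run the queries stored in a file: hive -f /path/to/file/file.sql

　　3. Filter databases with a regular expression: show databases like 'h.*';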

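　　4. Drop a database: drop database if exists user;
　　　　By default Hive refuses to drop a database that still contains tables. Either drop the tables first or append cascade to the statement.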

　　5. Copy a table definition (schema only, no data): create table if not exists mydb.new_table like mydb.existing_table;

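　　6. Load data into a table:
        load data local inpath '${env:HOME}/california-employees'
        overwrite into table employees
        partition (country = 'us', state = 'ca');
　　　　The partition directory is created if it does not exist. With overwrite, any data already in the target directory is removed first. With the local keyword the file is copied from the local file system to the target path; without it the file is moved within the distributed file system.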

3. Special (collection) data types in Hive: map, array, struct

　　1. Table creation example:
        create table employees (
          name string,
          salary float,
          subordinates array<string>,
          deductions map<string, float>,
          address struct<street:string, city:string, state:string, zip:int>
        )
        row format delimited
          fields terminated by '\001'            -- \001 is the octal code for ^A
          collection items terminated by '\002'  -- \002 is the octal code for ^B
          map keys terminated by '\003'          -- \003 is the octal code for ^C
          lines terminated by '\n'
        stored as textfile;
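
As a rough sketch of how these three types are read back, the query below assumes the employees table above has been loaded with data. The map key 'Federal Taxes' is a made-up value used only for illustration.

```sql
-- Assumes the employees table defined above contains data.
-- 'Federal Taxes' is a hypothetical map key.
select name,
       -- array elements are addressed by a 0-based index
       subordinates[0] as first_subordinate,
       -- map values are looked up by key
       deductions['Federal Taxes'] as federal_tax,
       -- struct fields use dot notation
       address.city as city
from employees;
```

An index past the end of the array, or a key that is not in the map, yields NULL rather than an error.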

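4. Hive's schema on read:

　　1. Schema on read: the data Hive queries can be created, modified, or even corrupted in many ways outside Hive's control. Hive therefore does not validate data when it is loaded; it applies the schema when the data is queried.

　　2. If the schema does not match the file contents, the query returns many NULL values.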

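5. HQL data definition:

　　1. Hive characteristics: classic Hive does not support row-level insert, update, or delete, and does not support transactions. (Hive 0.14 and later add these for ACID transactional tables.)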

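7. Types of tables:

　　1. Managed tables

　　　　1. Hive controls (more or less) the lifecycle of the table's data.

　　　　2. Dropping a managed table also deletes the table's data.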

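　　2. External tables:

　　　　1. The data comes from an outside source, for example files already on HDFS.

　　　　2. Dropping the table removes only the metadata; the data is left in place.

　　　　3. Table creation example:
            create external table if not exists stocks (  -- external marks this as an external table
              `exchange` string,   -- backticks because exchange is a reserved word in newer Hive versions
              symbol string,
              price_open string)
            row format delimited fields terminated by ','
            location '/data/stocks';   -- location is the path to the data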

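　　3. Partitioned tables:

　　　　1. Partitioning organizes the data in a logical way, for example hierarchically, with one directory level per partition column.

　　　　2. Table creation example:
            create table employees (
              name string,
              salary float)
            partitioned by (country string, state string);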

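　　4. External partitioned tables:

　　　　1. This is the most common setup for managing large volumes of production data, such as log files for analysis.

　　　　2. Table creation example:
            create external table if not exists log_messages (
              hms int,
              severity string,
              server string,
              process_id int,
              message string)
            partitioned by (year int, month int, day int)
            row format delimited fields terminated by '\t';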
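
An external partitioned table has no partitions until they are attached. The sketch below shows one way to do that and then query a single partition. It assumes the log_messages table above; the location is a hypothetical path.

```sql
-- Attach an existing directory as one partition of the external table.
-- The location below is a hypothetical path.
alter table log_messages add if not exists
  partition (year = 2012, month = 1, day = 2)
  location '/data/log_messages/2012/01/02';

-- List the partitions Hive currently knows about.
show partitions log_messages;

-- Filtering on the partition columns lets Hive read only the matching directories.
select hms, message
from log_messages
where year = 2012 and month = 1 and day = 2;
```

Dropping the partition or the table later removes only the metadata, as with any external table.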

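8. Aggregate functions (partial list):

　　1. count(*): counts all rows, including rows that contain NULL values.

　　2. count(expr): counts the rows for which expr is not NULL.

　　3. sum(distinct col): sums the distinct values of col.

　　4. set hive.map.aggr=true: enables map-side aggregation, which speeds up aggregation at the cost of more memory.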
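
The difference between the three functions is easiest to see side by side. This is a minimal sketch that assumes the employees table from section 3.

```sql
-- Enable partial aggregation in the map phase.
set hive.map.aggr=true;

-- all_rows counts every row,
-- rows_with_salary skips rows where salary is NULL,
-- distinct_total adds each distinct salary value once.
select count(*) as all_rows,
       count(salary) as rows_with_salary,
       sum(distinct salary) as distinct_total
from employees;
```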

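9. Table-generating functions:

　　1. explode(array): returns zero or more rows, one per element of the array.

　　2. explode(map): likewise, one row per key-value pair of the map.

　　3. inline(array<struct[,struct]>): expands an array of structs into a table, one row per struct.

　　4. json_tuple(string json_str, p1, p2, ..., pn): takes a JSON string and several key names, and returns the values for those keys as a tuple.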
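
The following sketch shows typical calls, assuming the employees table from section 3. The lateral view clause is Hive's way to keep other columns alongside the generated rows. The raw_events table, its js column, and the JSON keys 'user' and 'type' are hypothetical.

```sql
-- One output row per element of the subordinates array.
select explode(subordinates) as subordinate from employees;

-- lateral view keeps other columns next to the exploded values.
select name, subordinate
from employees
lateral view explode(subordinates) sub_view as subordinate;

-- explode on a map yields two columns: the key and the value.
select name, deduction_name, deduction_rate
from employees
lateral view explode(deductions) ded_view as deduction_name, deduction_rate;

-- json_tuple extracts several top-level keys from a JSON string in one call.
-- raw_events and its string column js are hypothetical.
select t.user_name, t.event_type
from raw_events e
lateral view json_tuple(e.js, 'user', 'type') t as user_name, event_type;
```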

10. case when then:

　　1. Classify the values of a single column in the query result:
        select name, salary,
          case
            when salary < 50000.0 then 'low'
            when salary >= 50000.0 and salary < 70000.0 then 'middle'
            when salary >= 70000.0 and salary < 1000000.0 then 'high'
            else 'very high'
          end as bracket
        from employees;

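11. like & rlike:

　　1. like: matches on the start or end of a string, or on a specific substring, using the % and _ wildcards.

　　2. rlike: matches using Java regular expressions.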
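
A small sketch of both operators, assuming the employees table from section 3. The street suffix and city names are made-up values.

```sql
-- like: % matches any run of characters, _ matches exactly one character.
select name, address.street
from employees
where address.street like '%Ave.';

-- rlike: the pattern is a Java regular expression.
select name, address.city
from employees
where address.city rlike '.*(Chicago|Ontario).*';
```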

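12. join statements:

　　1. inner join: a row is kept only when both joined tables contain data matching the join condition.

　　2. join optimization: when three or more tables are joined and every on clause uses the same join key, Hive produces only one MapReduce job.

　　　　Hive also assumes the last table in the query is the largest. While joining each record it tries to buffer the other tables and streams the last one through. Tables should therefore be ordered from smallest to largest, left to right.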

　　3. left outer join: returns every left-table record that satisfies the where clause; right-table columns are NULL where there is no match.

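　　4. outer join: partition filters placed in the on clause of an outer join are ignored. One workaround is to apply them in nested selects before joining.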

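　　5. right outer join: returns every right-table record that satisfies the where clause; left-table columns are NULL where there is no match.

　　6. full outer join: returns all records from both tables that satisfy the where clause; columns from the side without a match are NULL.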

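　　7. left semi join: returns left-table records, provided a matching right-table record satisfies the on condition.

　　　　A semi join is usually more efficient than an inner join: for a given left-table record, Hive stops scanning the right table as soon as it finds a match.

　　8. Cartesian product join: every left-table row is paired with every right-table row, so the result has (left row count × right row count) rows.

　　9. map-side join: if all but one of the tables are small, the small tables can be held entirely in memory while the largest table passes through the mappers, so the join runs on the map side. This improves Hive performance.

　　　　From Hive 0.7 onward, enable it with: set hive.auto.convert.join=true;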
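
The join variants above can be sketched as follows. The stocks table is the external table from section 7; dividends(symbol string, dividend float) is a hypothetical second table, assumed here to be the smaller of the two.

```sql
-- Inner join with the smaller table on the left, so the larger one is streamed.
select s.symbol, s.price_open, d.dividend
from dividends d
join stocks s on s.symbol = d.symbol;

-- Left outer join: stocks without a dividend row get NULL in d.dividend.
select s.symbol, s.price_open, d.dividend
from stocks s
left outer join dividends d on s.symbol = d.symbol;

-- Left semi join: only left-table columns may appear in select and where.
select s.symbol, s.price_open
from stocks s
left semi join dividends d on s.symbol = d.symbol;

-- Let Hive convert a join to a map-side join when one side is small enough.
set hive.auto.convert.join=true;
```

The table order in the inner join follows the smallest-to-largest rule from item 2; if the size assumption is wrong, swap the two tables.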

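13. order by & sort by

　　1. order by: global sort. All data passes through a single reducer, so it can take a long time.

　　2. sort by: local sort. The data within each reducer is sorted, which makes a later global sort cheaper.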
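
A short comparison, assuming the stocks table from section 7. The distribute by clause is added in the second query to control which reducer each row goes to, so that all rows for one symbol are sorted together.

```sql
-- order by: one reducer produces a single, totally ordered result.
select symbol, price_open
from stocks
order by symbol asc, price_open desc;

-- sort by: each reducer sorts only its own share of the rows.
-- distribute by sends all rows with the same symbol to the same reducer.
select symbol, price_open
from stocks
distribute by symbol
sort by symbol asc, price_open desc;
```

With more than one reducer, the second query's output is ordered within each reducer's output file but not across files.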

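14. union all

　　1. Combines the results of two or more tables (subqueries). Every subquery in the union must produce the same number of columns, with matching types.

　　2. union all can also combine several queries over the same source table.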
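
As a sketch of the second point, the query below runs two selects over the same employees table and merges them. The union is wrapped in a subquery, a form that older Hive versions also accept; the 50000.0 threshold and bracket labels are illustrative.

```sql
-- Both branches return the same two columns (string, string).
select u.name, u.bracket
from (
  select name, 'low' as bracket from employees where salary < 50000.0
  union all
  select name, 'high' as bracket from employees where salary >= 50000.0
) u;
```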

15. Using views to simplify complex queries:

　　1. create view shorter_join as select * from people join cart on (cart.people_id = people.id) where firstname = 'john';

　　　　select lastname from shorter_join where id = 3;

16. Indexes (note: index support was removed in Hive 3.0):

　　1. Create an index:
        create index employees_index
        on table employees (country)
        as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  -- as names the index handler
        with deferred rebuild
        idxproperties ('creator' = 'me', 'created_at' = 'some_time')
        in table employees_index_table
        partitioned by (country, name);

　　2. Bitmap index:
        create index employees_index
        on table employees (country)
        as 'bitmap'
        with deferred rebuild
        idxproperties ('creator' = 'me', 'created_at' = 'some_time')
        in table employees_index_table
        partitioned by (country, name);

　　3. Rebuild an index: if with deferred rebuild is specified, the new index starts out empty. It can be built for the first time, or rebuilt at any later point, with alter index.
        alter index employees_index
        on employees
        partition (country = 'us') rebuild;

　　4. Show indexes: show formatted index on employees;

　　5. Drop an index: drop index if exists employees_index on employees;

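17. Hive optimization:

　　1. explain: shows how Hive translates an HQL query into MapReduce jobs.

　　2. explain extended: produces more detailed output.

　　3. Parallel execution: Hive splits a query into stages, such as MapReduce stages, sampling stages, merge stages, and limit stages. By default Hive runs one stage at a time. A job may contain many stages that do not depend on one another, and running those in parallel shortens the overall run time. Set hive.exec.parallel to true to enable this. Higher parallelism also raises cluster resource utilization.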

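　　4. Strict mode: with hive.mapred.mode set to strict, three kinds of queries are rejected:

　　　　1. A query on a partitioned table must have a partition-column filter in its where clause to limit the data range; otherwise it is rejected. In other words, users may not scan every partition, because doing so consumes enormous resources.

　　　　2. A query that uses order by must also use limit. order by sends all data to a single reducer for sorting, and forcing a limit keeps that reducer from running for an excessively long time.

　　　　3. Cartesian product queries are restricted.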

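　　5. Tuning the number of mappers and reducers: too many adds unnecessary overhead; too few leaves the cluster's parallelism underused.

　　　　1. Use the dfs -count command to find the size of the input data. Hive estimates the reducer count by dividing the input size by hive.exec.reducers.bytes.per.reducer, which defaults to 1 GB. For example, with roughly 2.7 GB of input the default gives 3 reducers, and lowering the property to 750 MB gives 4.

　　　　2. To override Hive's estimate (3 in the example above), set mapred.reduce.tasks explicitly.

　　　　3. When the cluster runs large jobs, cap hive.exec.reducers.max to control resource usage. A Hadoop cluster has a limited number of map and reduce slots, and one large job that takes every slot blocks other jobs from running. Setting hive.exec.reducers.max keeps a single job from consuming too many resources. A rule of thumb for its value: (total reduce slots in the cluster × 1.5) / (average number of queries running).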

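　　6. JVM reuse: by default Hadoop forks a new JVM for each map or reduce task. JVM startup adds significant overhead, especially when a job contains hundreds or thousands of tasks. With JVM reuse, a JVM instance is reused up to n times within the same job. n is configured in mapred-site.xml: <name>mapred.job.reuse.jvm.num.tasks</name>

　　　　Drawback: with JVM reuse enabled, the reserved task slots stay occupied for reuse until the job finishes.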
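
The settings in this section can be tried from a Hive session roughly as shown below. This is a sketch, not a tuning recommendation: it assumes the partitioned employees table from section 7, and the numeric values are placeholders.

```sql
-- Inspect the plan Hive generates for a query.
explain select sum(salary) from employees;
explain extended select sum(salary) from employees;

-- Allow independent stages of one query to run at the same time.
set hive.exec.parallel=true;

-- Strict mode: the query below passes because it filters on a partition
-- column and pairs order by with limit.
set hive.mapred.mode=strict;
select name, salary
from employees
where country = 'us'
order by salary desc
limit 10;

-- Reducer sizing. All three values are placeholders.
set hive.exec.reducers.bytes.per.reducer=750000000;
set hive.exec.reducers.max=32;
set mapred.reduce.tasks=4;
```

Setting mapred.reduce.tasks to a fixed number bypasses Hive's size-based estimate, so it is usually left alone unless the estimate is clearly off.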

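18. Hive functions

　　1. Built-in functions:

　　　　1. show functions; lists them: abs, acos, and, array, ...

　　2. UDF: user-defined (standard) function. Takes one or more columns from a single row and returns a single value.

　　3. UDAF: user-defined aggregate function. Takes zero or more columns from one or more rows and returns a single value.

　　4. UDTF: user-defined table-generating function. Takes zero or more inputs and produces multiple columns or rows of output.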
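
For reference, this is roughly how functions are inspected and how a custom UDF is registered for a session. The jar path, the class name com.example.hive.udf.MyLower, and the function name my_lower are all hypothetical; the class would have to exist in the jar.

```sql
-- List the available functions and read the help text for one of them.
show functions;
describe function concat;
describe function extended concat;

-- Register a custom UDF for the current session.
-- The jar path and class name below are hypothetical.
add jar /path/to/my-udfs.jar;
create temporary function my_lower as 'com.example.hive.udf.MyLower';

-- Once registered, the UDF is called like a built-in function.
select my_lower(name) from employees;
```

A temporary function disappears when the session ends, so the add jar and create statements are typically kept in a script run with hive -f.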