Hive简记

在大数据工作中难免遇到数据仓库(OLAP)架构，以及通过Hive SQL简化分布式计算的场景。所以想通过这篇博客对Hive使用有一个大致总结，希望道友多多指教！

摘要：

　　1.Hive安装

　　2.Hive DDL命令

　　3.Hive DML初步

　　4.Hive DML高级

　　5.Hive与HBASE,MongoDB等整合

　　6.Hive 优化与配置参数

附：SQL执行顺序

　　7.Hive 复杂数据类型

　　8.Hive group by聚合增强

内容：

　　1.Hive安装

　　依赖：mysql,jdk,hadoop

　　安装文档参考：官方文档；注意这里hive默认使用Derby数据库，只支持单用户登录。修改具体配置请参考官网说明：

Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, see Hive Using Derby in Server Mode.

　　2.Hive DDL命令

　　建表语句：

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]  --分区
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] --分桶
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [　　　　　　　　　　　　　　　　--存储格式
   [ROW FORMAT row_format] 　　　　　
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]　　　　　　--外部表指定存储路径
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name　　--复制表
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

　　删除表：DROP TABLE [IF EXISTS] table_name [PURGE];

　　截断表：TRUNCATE TABLE table_name [PARTITION partition_spec];

　　查看表结构：

　　DESCRIBE [EXTENDED|FORMATTED]

table_name[.col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];

　其他请查看官网DDL文档

　　3.Hive DML初步

　　加载数据到Hive表：LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

　　插入数据：
　　INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
　　INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
　　INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES ( value [, value ...] ) [, ( value [, value ...] )
　　Hive数据导出
　　INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...
　　举一个例子：导出hive数据到本地的/tmp/out目录，并制定分隔符是'	'：

insert overwrite local DIRECTORY '/tmp/out/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '	'
select 
　　更新数据：UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
　　删除数据：DELETE FROM tablename [WHERE expression]
　　查询数据：

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]
　　详细部分请参考官网DML部分（load/insert/update/delete/merge, import/export, explain plan）
Hive内置函数

int

year(string date)

Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970

string

upper(string A)

returns the string resulting from converting all characters of A to upper case, for example, upper('fOoBaR') results in 'FOOBAR'

string

ucase(string A)

Same as upper

string

trim(string A)

returns the string resulting from trimming spaces from both ends of A, for example, trim(' foobar ') results in 'foobar'

string

to_date(string timestamp)

Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"

string

substr(string A, int start, int length)

returns the substring of A starting from start position with the given length, for example, 

  substr('foobar', 4, 2) results in 'ba'

string

substr(string
  A, int start)

returns the
  substring of A starting from start position till the end of string A. For example,
  substr('foobar', 4) results in 'bar'

int

size(Map<K.V>)

returns the
  number of elements in the map type

int

size(Array<T>)

returns the
  number of elements in the array type

string

rtrim(string
  A)

returns the
  string resulting from trimming spaces from the end(right hand side) of A. For
  example, rtrim(' foobar ') results in ' foobar'

BIGINT

round(double
  a)

returns the
  rounded BIGINT value of the double

string

regexp_replace(string
  A, string B, string C)

returns the
  string resulting from replacing all substrings in B that match the Java
  regular expression syntax(See Java regular expressions syntax) with C. For
  example, regexp_replace('foobar', 'oo|ar', ) returns 'fb'

double

rand(),
  rand(int seed)

returns a
  random number (that changes from row to row). Specifiying the seed will make
  sure the generated random number sequence is deterministic.

int

month(string
  date)

Return the
  month part of a date or a timestamp string: month("1970-11-01
  00:00:00") = 11, month("1970-11-01") = 11

string

ltrim(string
  A)

returns the
  string resulting from trimming spaces from the beginning(left hand side) of
  A. For example, ltrim(' foobar ') results in 'foobar '

string

lower(string
  A)

returns the
  string resulting from converting all characters of B to lower case, for
  example, lower('fOoBaR') results in 'foobar'

string

lcase(string
  A)

Same as lower

string

get_json_object(string
  json_string, string path)

Extract json
  object from a json string based on json path specified, and return json
  string of the extracted json object. It will return null if the input json
  string is invalid.

string

from_unixtime(int
  unixtime)

convert the
  number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string
  representing the timestamp of that moment in the current system time zone in
  the format of "1970-01-01 00:00:00"

BIGINT

floor(double
  a)

returns the
  maximum BIGINT value that is equal or less than the double

int

day(string
  date)

Return the day
  part of a date or a timestamp string: day("1970-11-01 00:00:00") =
  1, day("1970-11-01") = 1

string

concat(string
  A, string B,...)

returns the
  string resulting from concatenating B after A. For example, concat('foo',
  'bar') results in 'foobar'. This function accepts arbitrary number of
  arguments and return the concatenation of all of them.

BIGINT

ceil(double a)

returns the
  minimum BIGINT value that is equal or greater than the double

BIGINT

count(*), count(expr), count(DISTINCT expr[, expr_.])

count(*)—Returns the total number of retrieved rows, including rows containing NULL values; count(expr)—Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr])—Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

DOUBLE

avg(col), avg(DISTINCT col)

returns the average of the elements in the group or the average of the distinct values of the column in the group

DOUBLE

max(col)

returns the maximum value of the column in the group

DOUBLE

min(col)

returns the minimum value of the column in the group

DOUBLE

sum(col), sum(DISTINCT col)

returns the sum of the elements in the group or the sum of the distinct values of the column in the group

Built-in Aggregate Functions (UDAF)The following built-in aggregate functions are supported in Hive:

Return Type
Name(Signature)
Description

BIGINT

count(*), count(expr), count(DISTINCT expr[, expr...])

count(*) - Returns the total number of retrieved rows, including rows containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression is non-NULL.
count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite.

DOUBLE

sum(col), sum(DISTINCT col)

Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.

DOUBLE

avg(col), avg(DISTINCT col)

Returns the average of the elements in the group or the average of the distinct values of the column in the group.

DOUBLE

min(col)

Returns the minimum of the column in the group.

DOUBLE

max(col)

Returns the maximum value of the column in the group.

DOUBLE

variance(col), var_pop(col)

Returns the variance of a numeric column in the group.

DOUBLE

var_samp(col)

Returns the unbiased sample variance of a numeric column in the group.

DOUBLE

stddev_pop(col)

Returns the standard deviation of a numeric column in the group.

DOUBLE

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group.

DOUBLE

covar_pop(col1, col2)

Returns the population covariance of a pair of numeric columns in the group.

DOUBLE

covar_samp(col1, col2)

Returns the sample covariance of a pair of a numeric columns in the group.

DOUBLE

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.

DOUBLE

percentile(BIGINT col, p)

Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

array<double>

percentile(BIGINT col, array(p1 [, p2]...))

Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). pimust be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

DOUBLE

percentile_approx(DOUBLE col, p [, B])

Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.

array<double>

percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])

Same as above, but accepts and returns an array of percentile values instead of a single one.

double
regr_avgx(independent, dependent)

Equivalent to avg(dependent). As of Hive 2.2.0.

double
regr_avgy(independent, dependent)

Equivalent to avg(independent). As of Hive 2.2.0.

double
regr_count(independent, dependent)

Returns the number of non-null pairs used to fit the linear regression line. As of Hive 2.2.0.

double
regr_intercept(independent, dependent)

Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b. As of Hive 2.2.0.

double
regr_r2(independent, dependent)

Returns the coefficient of determination for the regression. As of Hive 2.2.0.

double
regr_slope(independent, dependent)

Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b. As of Hive 2.2.0.

double
regr_sxx(independent, dependent)

Equivalent to regr_count(independent, dependent) * var_pop(dependent). As of Hive 2.2.0.

double
regr_sxy(independent, dependent)

Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). As of Hive 2.2.0.

double
regr_syy(independent, dependent)
Equivalent to regr_count(independent, dependent) * var_pop(independent). As of Hive 2.2.0.

array<struct {'x','y'}>

histogram_numeric(col, b)

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights

array

collect_set(col)

Returns a set of objects with duplicate elements eliminated.

array

collect_list(col)

Returns a list of objects with duplicates. (As of Hive 0.13.0.)

INTEGER
ntile(INTEGER x)
Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.)

Built-in Table-Generating Functions (UDTF)Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.

Row-set columns types
Name(Signature)
Description

T

explode(ARRAY<T> a)

Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array.

Tkey,Tvalue

explode(MAP<Tkey,Tvalue> m)

Explodes a map to multiple rows. Returns a row-set with a two columns (key,value) , one row for each key-value pair from the input map. (As of Hive 0.8.0.).

int,T
posexplode(ARRAY<T> a)
Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array.

T1,...,Tn

inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a)

Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.)

T1,...,Tn/r
stack(int r,T1 V1,...,Tn/r Vn)
Breaks up n values V1,...,Vn into r rows. Each row will have n/r columns. r must be constant.

string1,...,stringn

json_tuple(string jsonStr,string k1,...,string kn)

Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.

string 1,...,stringn

parse_url_tuple(string urlStr,string p1,...,string pn)

Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

下面列举几个常用的操作:
使用UDF:
　　官网UDF文档
　　UDF简记

LateralView:行转列语句
　　官网文档
抽样：
　　官网文档 
Hive窗口分析函数：
　　Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
　　更多例子参考官网窗口分析函数文档
　　
举一个例子：

select 1,tag_id
from (
    select parse_json_array(get_json_object('{"data":[{"id":1082},{"id":1082},{"id":1082}]}','$.data'),'id') as ids
    from dual
) a
lateral view explode(split(a.ids,'_')) s as tag_id
;

<!--
+------+---------+--+
| _c0  | tag_id  |
+------+---------+--+
| 1    | 1082    |
| 1    | 1082    |
| 1    | 1082    |
+------+---------+--+
-->
　　5.Hive与HBASE,MongoDB等整合
　　官网文档：HBaseIntegration
　　以下做简单步骤记录：
　　　1.添加mongodb整合包和驱动包：

add jar /data/dmp/hive/lib/hive-mongo-1.0.0-jar-with-dependencies.jar;
add jar /data/dmp/hive/lib/mongo-java-driver-3.2.2.jar;
add jar /data/dmp/hive/lib/mongo-hadoop-hive-1.5.1.jar;
add jar /data/dmp/hive/lib/mongo-hadoop-core-1.5.1.jar;

　　  2.创建hive外部表

DROP TABLE IF EXISTS mongodb_table;
CREATE EXTERNAL TABLE mongodb_table
(
  id int
)
stored by 'com.mongodb.hadoop.hive.MongoStorageHandler'
with serdeproperties(
  'mongo.columns.mapping'='{"id":"_id"}'
 )
TBLPROPERTIES('mongo.uri'='mongodb://username:passwd@ip:port/db.table');

　　这里遇到一个问题：

 If the username or password contains a colon (:) or an at-sign (@) then it must be urlencoded (state=,code=0)

　　按照说明，把用户名或者密码的特殊字符用url编码就可以了
　　6.Hive 优化与配置参数
附：SQL执行顺序
当一个查询语句同时出现了where,group by,having,order by的时候，执行顺序和编写顺序是： 
1.执行where xx对全表数据做筛选，返回第1个结果集。 2.针对第1个结果集使用group by分组，返回第2个结果集。 
3.针对第2个结果集中的每1组数据执行select xx，有几组就执行几次，返回第3个结果集。 
4.针对第3个结集执行having xx进行筛选，返回第4个结果集。 5.针对第4个结果集排序。 
7.Hive 复杂数据类型 7.1 array
 7.2 map
 7.3 struct 
Complex Type ConstructorsThe following functions construct instances of complex types.

Constructor Function
Operands
Description

map

(key1, value1, key2, value2, ...)

Creates a map with the given key/value pairs.

struct

(val1, val2, val3, ...)

Creates a struct with the given field values. Struct field names will be col1, col2, ....

named_struct

(name1, val1, name2, val2, ...)

Creates a struct with the given field names and values. (As of Hive 0.8.0.)

array

(val1, val2, ...)

Creates an array with the given elements.

create_union

(tag, val1, val2, ...)

Creates a union type with the value that is being pointed to by the tag parameter.

 Operators on Complex TypesThe following operators provide mechanisms to access elements in Complex Types.
 
Operator
Operand types
Description

A[n]

A is an Array and n is an int

Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.

M[key]

M is a Map<K, V> and key has type K

Returns the value corresponding to the key in the map. For example, if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'.

S.x

S is a struct

Returns the x field of S. For example for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.

Collection FunctionsThe following built-in collection functions are supported in Hive:

Return Type
Name(Signature)
Description

int

size(Map<K.V>)

Returns the number of elements in the map type.

int

size(Array<T>)

Returns the number of elements in the array type.

array<K>

map_keys(Map<K.V>)

Returns an unordered array containing the keys of the input map.

array<V>

map_values(Map<K.V>)

Returns an unordered array containing the values of the input map.

boolean

array_contains(Array<T>, value)

Returns TRUE if the array contains value.

array<t>

sort_array(Array<T>)

Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0).

8.Hive group by聚合增强8.1.grouping setsgrouping sets子句都可以根据UNION连接的多个GROUP BY查询进行逻辑表示
```
SELECT a，b，SUM（c）FROM tab1 GROUP BY a，b GROUPING SETS（（a，b），a，b，（））
等价于
SELECT a，b，SUM（c）FROM tab1 GROUP BY a，b
union
SELECT a，null，SUM（c）FROM tab1 GROUP BY a，null
union
SELECT null，b，SUM（c）FROM tab1 GROUP BY null，b
union
SELECT null，null，SUM（c）FROM tab1
```
8.2.GROUPING__ID  注意是两个下划线相连,说明聚合结果是属于(grouping sets)哪一个子集的的。
```
SELECT key, value, GROUPING__ID,count(*)
FROM T1
GROUP BY key, value
GROUPING SETS（（key，value），key，value）
;
等价于
SELECT key, value,1,count(*) -- 属于第1个GROUPING SETS子集,即（key，value）
FROM T1
GROUP BY key, value
union 
SELECT key, NULL,2,count(*) -- 属于第2个GROUPING SETS子集,即key
FROM T1
GROUP BY key
union 
SELECT NULL, value,3,count(*) -- 属于第3个GROUPING SETS子集,即value
FROM T1
GROUP BY value

```
8.3 WITH CUBECUBE是是group by字段的所有组合
```
GROUP BY a，b，c WITH CUBE
等同于
GROUP BY a,b,c GROUPING SETS（（a,b,c），（a,b）,（b,c）, （a,c）,（a），（b），（c），（）
```
8.4 WITH ROLLUPROLLUP子句与GROUP BY一起用于计算维度的层次结构级别的聚合。
```
GROUP BY a，b，c，WITH ROLLUP
等同于
GROUP BY a，b，c GROUPING SETS（（a，b，c），（a，b），（a），（））。
```
官网文档:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

Row-set columns types	Name(Signature)	Description
T	explode(ARRAY<T> a)	Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array.
T_key,T_value	explode(MAP<T_key,T_value> m)	Explodes a map to multiple rows. Returns a row-set with a two columns (key,value) , one row for each key-value pair from the input map. (As of Hive 0.8.0.).
int,T	posexplode(ARRAY<T> a)	Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array.
T₁,...,T_n	inline(ARRAY<STRUCT<f₁:T₁,...,f_n:T_n>> a)	Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.)
T₁,...,T_n/r	stack(int r,T₁V₁,...,T_n/rV_n)	Breaks up n values V₁,...,V_n into r rows. Each row will have n/r columns. r must be constant.

string₁,...,string_n	json_tuple(string jsonStr,string k₁,...,string k_n)	Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the `get_json_object` UDF because it can get multiple keys with just one call.
string ₁,...,string_n	parse_url_tuple(string urlStr,string p₁,...,string p_n)	Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the `parse_url()` UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

int	year(string date)	Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
string	upper(string A)	returns the string resulting from converting all characters of A to upper case, for example, upper('fOoBaR') results in 'FOOBAR'
string	ucase(string A)	Same as upper
string	trim(string A)	returns the string resulting from trimming spaces from both ends of A, for example, trim(' foobar ') results in 'foobar'
string	to_date(string timestamp)	Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
string	substr(string A, int start, int length)	returns the substring of A starting from start position with the given length, for example, substr('foobar', 4, 2) results in 'ba'
string	substr(string A, int start)	returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar'
int	size(Map<K.V>)	returns the number of elements in the map type
int	size(Array<T>)	returns the number of elements in the array type
string	rtrim(string A)	returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'
BIGINT	round(double a)	returns the rounded BIGINT value of the double
string	regexp_replace(string A, string B, string C)	returns the string resulting from replacing all substrings in B that match the Java regular expression syntax(See Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo\|ar', ) returns 'fb'
double	rand(), rand(int seed)	returns a random number (that changes from row to row). Specifiying the seed will make sure the generated random number sequence is deterministic.
int	month(string date)	Return the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
string	ltrim(string A)	returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '
string	lower(string A)	returns the string resulting from converting all characters of B to lower case, for example, lower('fOoBaR') results in 'foobar'
string	lcase(string A)	Same as lower
string	get_json_object(string json_string, string path)	Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid.
string	from_unixtime(int unixtime)	convert the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00"
BIGINT	floor(double a)	returns the maximum BIGINT value that is equal or less than the double
int	day(string date)	Return the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string	concat(string A, string B,...)	returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts arbitrary number of arguments and return the concatenation of all of them.
BIGINT	ceil(double a)	returns the minimum BIGINT value that is equal or greater than the double

BIGINT	count(*), count(expr), count(DISTINCT expr[, expr_.])	count(*)—Returns the total number of retrieved rows, including rows containing NULL values; count(expr)—Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr])—Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE	avg(col), avg(DISTINCT col)	returns the average of the elements in the group or the average of the distinct values of the column in the group
DOUBLE	max(col)	returns the maximum value of the column in the group
DOUBLE	min(col)	returns the minimum value of the column in the group
DOUBLE	sum(col), sum(DISTINCT col)	returns the sum of the elements in the group or the sum of the distinct values of the column in the group

Return Type	Name(Signature)	Description
BIGINT	count(*), count(expr), count(DISTINCT expr[, expr...])	count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite.
DOUBLE	sum(col), sum(DISTINCT col)	Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE	avg(col), avg(DISTINCT col)	Returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE	min(col)	Returns the minimum of the column in the group.
DOUBLE	max(col)	Returns the maximum value of the column in the group.
DOUBLE	variance(col), var_pop(col)	Returns the variance of a numeric column in the group.
DOUBLE	var_samp(col)	Returns the unbiased sample variance of a numeric column in the group.
DOUBLE	stddev_pop(col)	Returns the standard deviation of a numeric column in the group.
DOUBLE	stddev_samp(col)	Returns the unbiased sample standard deviation of a numeric column in the group.
DOUBLE	covar_pop(col1, col2)	Returns the population covariance of a pair of numeric columns in the group.
DOUBLE	covar_samp(col1, col2)	Returns the sample covariance of a pair of a numeric columns in the group.
DOUBLE	corr(col1, col2)	Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.
DOUBLE	percentile(BIGINT col, p)	Returns the exact p^th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
array<double>	percentile(BIGINT col, array(p₁ [, p₂]...))	Returns the exact percentiles p₁, p₂, ... of a column in the group (does not work with floating point types). p_imust be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
DOUBLE	percentile_approx(DOUBLE col, p [, B])	Returns an approximate p^th percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
array<double>	percentile_approx(DOUBLE col, array(p₁ [, p₂]...) [, B])	Same as above, but accepts and returns an array of percentile values instead of a single one.
double	regr_avgx(independent, dependent)	Equivalent to avg(dependent). As of Hive 2.2.0.
double	regr_avgy(independent, dependent)	Equivalent to avg(independent). As of Hive 2.2.0.
double	regr_count(independent, dependent)	Returns the number of non-null pairs used to fit the linear regression line. As of Hive 2.2.0.
double	regr_intercept(independent, dependent)	Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b. As of Hive 2.2.0.
double	regr_r2(independent, dependent)	Returns the coefficient of determination for the regression. As of Hive 2.2.0.
double	regr_slope(independent, dependent)	Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b. As of Hive 2.2.0.
double	regr_sxx(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(dependent). As of Hive 2.2.0.
double	regr_sxy(independent, dependent)	Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). As of Hive 2.2.0.
double	regr_syy(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(independent). As of Hive 2.2.0.
array<struct {`'x','y'`}>	histogram_numeric(col, b)	Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights
array	collect_set(col)	Returns a set of objects with duplicate elements eliminated.
array	collect_list(col)	Returns a list of objects with duplicates. (As of Hive 0.13.0.)
INTEGER	ntile(INTEGER x)	Divides an ordered partition into `x` groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.)

Constructor Function	Operands	Description
map	(key1, value1, key2, value2, ...)	Creates a map with the given key/value pairs.
struct	(val1, val2, val3, ...)	Creates a struct with the given field values. Struct field names will be col1, col2, ....
named_struct	(name1, val1, name2, val2, ...)	Creates a struct with the given field names and values. (As of Hive 0.8.0.)
array	(val1, val2, ...)	Creates an array with the given elements.
create_union	(tag, val1, val2, ...)	Creates a union type with the value that is being pointed to by the tag parameter.

Operator	Operand types	Description
A[n]	A is an Array and n is an int	Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.
M[key]	M is a Map<K, V> and key has type K	Returns the value corresponding to the key in the map. For example, if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'.
S.x	S is a struct	Returns the x field of S. For example for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.