Hive Essential (4): DML - project, filter, join, union

1. Project data with SELECT

The most common use case for Hive is to query data in Hadoop. To achieve this, we write and execute a SELECT statement. The typical work done by a SELECT statement is to project the whole row (with SELECT * ) or specified columns (with SELECT column1, column2, ... ) from a table, with or without conditions. Most simple SELECT statements do not trigger a YARN job. Instead, a fetch task is created just to dump the data, much like the hdfs dfs -cat command.

The SELECT statement is quite often used with the FROM and DISTINCT keywords. The FROM keyword, followed by a table, specifies where SELECT projects data from. The DISTINCT keyword, used after SELECT , ensures that only unique rows or unique combinations of columns are returned from the table. In addition, SELECT supports columns combined with user-defined functions, IF() , a CASE WHEN THEN ELSE END expression, and regular expressions. The following are examples of projecting data with a SELECT statement:

SELECT * FROM employee; -- Project the whole row
SELECT name FROM employee; -- Project specified columns

-- List all columns matching a Java regular expression
SET hive.support.quoted.identifiers = none; -- Required for regex column names
SELECT `^work.*` FROM employee; -- All columns starting with work

SELECT DISTINCT name, work_place FROM employee;

SELECT
    CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
         ELSE 'Mr.' 
    END as title,
    name,
    IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;

Multiple SELECT statements can work together to build a complex query using nested queries or common table expressions (CTEs). A nested query, also called a subquery, is a query that projects data from the result of another query. Nested queries can be rewritten as CTEs with the WITH and AS keywords. When using a nested query in the FROM clause, the inner query must be given an alias (see t1 in the following example), or else Hive reports an exception. The following are a few examples of using nested queries in HQL:

--1. A nested query example with the mandatory alias:
SELECT
name, gender_age.gender as gender
FROM (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
) t1; -- t1 here is mandatory

--2. A nested query can be rewritten with CTE as follows. 
--This is the recommended way of writing a complex single HQL query
WITH t1 as (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
)
SELECT name, gender_age.gender as gender
FROM t1;
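
Several CTEs can also be chained in a single WITH clause, where a later CTE can refer to one defined before it. The following is a minimal sketch against the same employee table; the CTE names males and older_males and the age threshold of 30 are made up for illustration:

```sql
-- Sketch: two chained CTEs (names and threshold are illustrative only)
WITH males AS (
    SELECT name, gender_age.age AS age
    FROM employee
    WHERE gender_age.gender = 'Male'
),
older_males AS (
    SELECT name
    FROM males          -- refers to the CTE defined just above
    WHERE age > 30
)
SELECT name FROM older_males;
```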

In addition, a special SELECT followed by a constant expression can work without the FROM table clause. It returns the result of the expression. This is equivalent to querying a dummy table with one dummy record.

SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
+-------+
| res   |
+-------+
| 1+3=4 |
+-------+

2. Filtering data with conditions

It is quite common to narrow down the result set by using a condition clause, such as LIMIT , WHERE , IN / NOT IN , and EXISTS / NOT EXISTS . The LIMIT keyword caps the number of rows returned; without an ORDER BY clause, which rows are returned is arbitrary. Compared with LIMIT , WHERE is a more powerful and generic condition clause that limits the returned result set by expressions, functions, and nested queries, as in the following examples:

SELECT name FROM employee LIMIT 2;

SELECT name, work_place FROM employee WHERE name = 'Michael';

-- Conditions can be used together; LIMIT comes after WHERE
SELECT name, work_place FROM employee WHERE name = 'Michael' LIMIT 1;
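
Because LIMIT on its own returns an arbitrary subset of rows, a repeatable "first N" result usually pairs it with ORDER BY . A minimal sketch against the same employee table (the choice of sorting by name is only an example):

```sql
-- Sketch: sort first, then keep the first two rows,
-- so the same two names come back on every run
SELECT name
FROM employee
ORDER BY name
LIMIT 2;
```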

IN / NOT IN is used as an expression to check whether values belong to the set specified after IN or NOT IN . As of Hive v2.1.0, IN and NOT IN expressions support more than one column:

SELECT name FROM employee WHERE gender_age.age in (27, 30);

-- Multiple-column support since v2.1.0
SELECT
name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN
(('Female', 27), ('Male', 27 + 3));          -- Expressions are also supported

In addition, data can be filtered with a subquery in the WHERE clause using IN / NOT IN and EXISTS / NOT EXISTS . A subquery used with EXISTS or NOT EXISTS must be correlated, that is, it must refer to columns from both the inner and the outer query:

SELECT
    name, gender_age.gender as gender
FROM 
    employee
WHERE name IN
(
    SELECT 
        name 
    FROM 
        employee 
    WHERE 
        gender_age.gender = 'Male'
);

SELECT
    name, gender_age.gender as gender
FROM 
    employee a
WHERE EXISTS (
    SELECT *
    FROM 
        employee b
    WHERE
        a.gender_age.gender = b.gender_age.gender AND
        b.gender_age.gender = 'Male'
); -- This is like joining tables a and b on the gender column

There are additional restrictions for subqueries used in WHERE clauses:

Subqueries can only appear on the right-hand side of an expression in the WHERE clause.
Nested subqueries are not allowed.
IN / NOT IN with a subquery supports only a single column, although multiple columns are supported when comparing against a literal list of values (as in the multi-column example above).

3. Linking data with JOIN

JOIN is used to link rows from two or more tables together. Hive supports most SQL JOIN operations, such as INNER JOIN and OUTER JOIN . In addition, HQL supports some special joins, such as MapJoin and Semi-Join . In earlier versions, Hive only supported equal joins. Since v2.2.0, unequal joins are also supported. However, be careful with unequal joins unless you know what to expect, since they are likely to return many rows by producing something close to a Cartesian product of the joined tables.

When you want to restrict the output of a join, apply a WHERE clause after the join, as JOIN occurs before the WHERE clause. If possible, put filter conditions in the join conditions rather than in the WHERE clause so that data is filtered earlier. What's more, all types of left/right joins are not commutative and are always evaluated left to right (left-associative), while INNER and FULL OUTER JOINs are both commutative and associative.

3.1 INNER JOIN

INNER JOIN , or simply JOIN , returns rows meeting the join conditions from both sides of the joined tables. The JOIN keyword can also be omitted by listing comma-separated table names; this is called an implicit join. Here are examples of the HQL JOIN operation:

--1. First, prepare a table to join with and load data to it:
CREATE TABLE IF NOT EXISTS employee_hr (
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';


LOAD DATA INPATH '/tmp/hivedemo/data/employee_hr.txt'
OVERWRITE INTO TABLE employee_hr;

--2. Perform an INNER JOIN between two tables with equal and unequal join
--conditions, along with complex expressions as well as a post join WHERE
--condition. Usually, we need to add a table name or table alias before columns in
--the join condition, although Hive always tries to resolve them:
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph 
ON emp.name = emph.name; -- Equal Join
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+

SELECT
emp.name, emph.sin_number
FROM employee emp 
-- Unequal join supported since v2.2.0 returns more rows
JOIN employee_hr emph 
ON emp.name != emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
+-----------+------------------+

-- Join with complex expressions in the join condition
-- This is also a way to implement a conditional join
-- Below, rows with name = 'Will' are conditionally excluded
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON IF(emp.name = 'Will', '1', emp.name) =
   CASE WHEN emph.name = 'Will' THEN '0' ELSE emph.name END;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Lucy      | 577-928-094      |
+-----------+------------------+


-- Use where/limit to limit the output of join
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE  emp.name = 'Will';
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Will      | 527-948-090      |
+-----------+------------------+
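
For an INNER JOIN , the same filter can instead be placed in the join condition, which is what pushing a filter into the join conditions means. The sketch below rewrites the previous query this way. For an inner join it should return the same single row, whereas for outer joins moving a filter between ON and WHERE can change the result:

```sql
-- Sketch: filter moved from WHERE into the ON clause (inner join only)
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON emp.name = emph.name AND emp.name = 'Will';
```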

--3. The JOIN operation can be performed on more tables (such as tables A, B, and C) with sequential joins.
--The tables can either join from A to B and B to C, or join from A to B and A to C
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+

--4. A self-join is where one table joins itself. When doing such joins,
--different aliases should be given to distinguish the two references to the same table
SELECT
emp.name -- Use alias before column name
FROM employee as emp
JOIN employee as emp_b -- Here, use a different alias
ON emp.name = emp_b.name;
+-----------+
| emp.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
+-----------+

--5. Perform an implicit join without using the JOIN keyword. 
--This is only applicable to the INNER JOIN
SELECT
emp.name, emph.sin_number
FROM
employee emp, employee_hr emph -- Only applies for inner join
WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+

--6. The join condition uses different columns, which will create an additional job
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+

If a JOIN uses different columns in its join conditions, it requires an additional job to complete the join. If the JOIN operation uses the same column in all of its join conditions, the join is done in one job.

When JOIN is performed between multiple tables, YARN/MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested to put the big table at the end of the JOIN statement for better performance and to avoid Out Of Memory (OOM) exceptions. This is because the last table in the JOIN sequence is streamed through the reducers, whereas the others are buffered in the reducers by default. Also, a hint, /*+ STREAMTABLE(table_name) */ , can be specified to say which table should be streamed, overriding the default decision, as in the following example:

SELECT /*+ STREAMTABLE(employee_hr) */
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
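
One way to check how many stages a join needs is to prefix the query with EXPLAIN , which prints the stage plan without running the query. The sketch below reuses the query above. The plan itself is not shown here, because it depends on the execution engine and on settings such as automatic MapJoin conversion:

```sql
-- Sketch: print the stage plan instead of executing the join
EXPLAIN
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
```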

3.2 OUTER JOIN

Besides INNER JOIN , HQL also supports the regular OUTER JOINs: LEFT JOIN , RIGHT JOIN , and FULL JOIN . The logic of these joins is the same as in standard SQL. The following table summarizes the differences between common joins. Here, we assume table_m has m rows and table_n has n rows with one-to-one mapping.

Join type	Logic	Rows returned
table_m JOIN table_n	This returns all rows matched in both tables.	m ∩ n
table_m LEFT JOIN table_n	This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL for the right table's columns.	m
table_m RIGHT JOIN table_n	This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL for the left table's columns.	n
table_m FULL JOIN table_n	This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead.	m + n - m ∩ n
table_m CROSS JOIN table_n	This returns all row combinations in both tables to produce a Cartesian product.	m*n

The following examples demonstrate the different OUTER JOINs:

SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in left table returned
LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Shelley   | NULL             | -- NULL for mismatch
| Lucy      | 577-928-094      |
+-----------+------------------+

SELECT
emp.name, emph.sin_number
FROM employee emp        -- All rows in right table returned
RIGHT JOIN employee_hr emph 
ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| NULL      | 647-968-598      | -- NULL for mismatch
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (34.485 seconds)

SELECT
emp.name, emph.sin_number
FROM employee emp           -- Rows from both sides returned
FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Lucy      | 577-928-094      |
| Michael   | 547-968-091      |
| Shelley   | NULL             | -- NULL for mismatch
| NULL      | 647-968-598      | -- NULL for mismatch
| Will      | 527-948-090      |
+-----------+------------------+
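
The row counts in the summary table can be checked against the sample data, where employee and employee_hr each have four rows and three names match. The following sketch uses count(*) ; the expected counts in the comments are taken from the result sets shown in this section and assume that same sample data:

```sql
-- m = 4 (employee), n = 4 (employee_hr), matched rows = 3
SELECT count(*) FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name;        -- 3 (matched rows)

SELECT count(*) FROM employee emp
LEFT JOIN employee_hr emph ON emp.name = emph.name;   -- 4 (m)

SELECT count(*) FROM employee emp
RIGHT JOIN employee_hr emph ON emp.name = emph.name;  -- 4 (n)

SELECT count(*) FROM employee emp
FULL JOIN employee_hr emph ON emp.name = emph.name;   -- 5 (m + n - matched)

SELECT count(*) FROM employee emp
CROSS JOIN employee_hr emph;                          -- 16 (m * n)
```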

The CROSS JOIN statement does not have a join condition. It can also be written as a JOIN without a condition or with an always-true condition, such as 1 = 1. In this way, we can join any datasets with cross joins. However, we should only consider such joins when we have to link data that has no natural relationship, such as adding headers with a row count to a table. The following are three equivalent ways of writing CROSS JOIN :

SELECT
emp.name, emph.sin_number
FROM employee  as emp
CROSS JOIN 
employee_hr    as emph;


SELECT
emp.name, emph.sin_number
FROM employee  as emp
JOIN 
employee_hr    as emph;

SELECT
emp.name, emph.sin_number
FROM employee   as emp
JOIN 
    employee_hr as emph 
on 1=1;

Although earlier versions of Hive did not support unequal joins explicitly, there is a workaround using CROSS JOIN and WHERE , as in this example:

SELECT
emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph
WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
+-----------+------------------+

3.3 Special joins

HQL also supports some special joins that we usually do not see in relational databases, such as MapJoin and Semi-join .

MapJoin means doing the join operation only in the map phase, without a reduce phase. The MapJoin statement reads all the data from the small table into memory and broadcasts it to all mappers. During the map phase, the join is performed by comparing each row of the big table with the small tables against the join conditions. Because no reduce phase is needed, this kind of join usually performs better. In newer versions of Hive, joins are automatically converted to MapJoin at runtime if possible. However, you can also manually specify the broadcast table by providing a join hint, /*+ MAPJOIN(table_name) */ . In addition, MapJoin can be used for unequal joins to improve performance, since both MapJoin and WHERE are performed in the map phase. The following is an example of using a MapJoin hint with CROSS JOIN :

SELECT
/*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee    as emp
CROSS JOIN 
employee_hr      as emph
WHERE emp.name <> emph.name;
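
The automatic conversion mentioned above is controlled by configuration rather than by hints. As a sketch, the two properties below are the ones usually involved. The values shown are believed to be the common defaults and should be checked against your Hive version:

```sql
-- Let Hive convert a common join to a MapJoin when one side is small enough
SET hive.auto.convert.join = true;
-- Size threshold (in bytes) below which a table counts as "small"
SET hive.mapjoin.smalltable.filesize = 25000000;
```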

The MapJoin operation does not support the following:

Using MapJoin after UNION ALL , LATERAL VIEW , GROUP BY / JOIN / SORT BY / CLUSTER BY / DISTRIBUTE BY
Using MapJoin before UNION , JOIN , and another MapJoin

Bucket MapJoin is a special type of MapJoin that uses bucket columns (the columns specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as a regular MapJoin does, bucket MapJoin only fetches the required bucket data. To enable bucket MapJoin , we need to enable some settings and make sure the bucket numbers of the joined tables are multiples of each other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all small tables in memory:

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
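
For these settings to matter, the joined tables must actually be bucketed (and, for the sort-merge variant, sorted) on the join column. A minimal sketch of such table definitions follows. The table names employee_bkt and employee_hr_bkt , their columns, and the bucket counts are hypothetical and only illustrate the CLUSTERED BY ... SORTED BY ... INTO n BUCKETS syntax:

```sql
-- Hypothetical tables, bucketed and sorted on the join key.
-- 2 and 4 buckets: the bucket numbers are multiples of each other,
-- which fits bucket MapJoin; a sort-merge join would need equal counts.
CREATE TABLE IF NOT EXISTS employee_bkt (
name string,
employee_id int
)
CLUSTERED BY (employee_id) SORTED BY (employee_id) INTO 2 BUCKETS;

CREATE TABLE IF NOT EXISTS employee_hr_bkt (
name string,
employee_id int,
sin_number string
)
CLUSTERED BY (employee_id) SORTED BY (employee_id) INTO 4 BUCKETS;

-- A join on the bucket column is then a candidate for bucket MapJoin
SELECT a.name, b.sin_number
FROM employee_bkt a
JOIN employee_hr_bkt b ON a.employee_id = b.employee_id;
```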

In addition, LEFT SEMI JOIN is another special join. It returns rows from the left table that have a match in the right table, which gives the same result as a subquery with IN / EXISTS (supported in the WHERE clause since Hive v0.13.0). However, it is not recommended for use since it is not part of standard SQL:

SELECT a.name 
FROM employee   as a
LEFT SEMI JOIN 
employee_id     as b 
ON a.name = b.name;
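
For comparison, the following sketch expresses the same query with an IN subquery, the form described in the filtering section. It uses the same employee and employee_id tables as the example above:

```sql
-- Sketch: IN subquery equivalent of the LEFT SEMI JOIN above
SELECT a.name
FROM employee a
WHERE a.name IN (
    SELECT b.name
    FROM employee_id b
);
```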

4. Union

When we want to combine data with the same schema, we often use set operations. Regular set operations in relational databases are INTERSECT , MINUS , and UNION / UNION ALL . HQL only supports UNION and UNION ALL . The difference between them is that UNION ALL does not remove duplicate rows while UNION does. In addition, all unioned data must have the same column names and data types, or else an implicit conversion will be done, which may cause a runtime exception. If ORDER BY , SORT BY , CLUSTER BY , DISTRIBUTE BY , or LIMIT is used, it is applied to the whole result set after the union:

SELECT a.name as nm 
FROM employee a
UNION ALL        -- Use column alias to make the same name for union
SELECT b.name as nm 
FROM employee_hr b;
+-----------+
|nm         |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
| Michael   |
| Will      |
| Steven    |
| Lucy      |
+-----------+


SELECT a.name as nm FROM employee a
UNION                   -- UNION removes duplicate names and is slower
SELECT b.name as nm FROM employee_hr b;
+----------+
|nm        |
+----------+
| Lucy     |
| Michael  |
| Shelley  |
| Steven   |
| Will     |
+----------+

-- ORDER BY applies to the whole unioned data
-- To order only one data set,
-- use ORDER BY in a subquery
SELECT a.name as nm FROM employee a
UNION ALL
SELECT b.name as nm FROM employee_hr b
ORDER BY nm;
+----------+
|nm        |
+----------+
| Lucy     |
| Lucy     |
| Michael  |
| Michael  |
| Shelley  |
| Steven   |
| Will     |
| Will     |
+----------+

For other set operations that HQL does not support yet, such as INTERSECT and MINUS , we can use an inner join or a left join to implement them as follows:

-- Use join for set intersect
SELECT a.name
FROM employee a
JOIN employee_hr b 
ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Lucy     |
+----------+

-- Use left join for set minus
SELECT a.name
FROM employee a
LEFT JOIN employee_hr b 
ON a.name = b.name
WHERE b.name IS NULL;
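
The same set minus can also be written with the NOT EXISTS subquery form described in the filtering section. The sketch below should return the names in employee that have no match in employee_hr :

```sql
-- Sketch: set minus using a correlated NOT EXISTS subquery
SELECT a.name
FROM employee a
WHERE NOT EXISTS (
    SELECT b.name
    FROM employee_hr b
    WHERE a.name = b.name
);
```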