Apache Drill – Querying Complex Data & Data Definition Statements & Querying Data

8. Apache Drill – Querying Complex Data

In this chapter, we will discuss in detail about which all composite data types does Apache Drill supports.
 Array - An array is a repeated list of values.

A value in an array can be a scalar type, such as string or int, or an array can be a complex type, such as a map or another array.
 Map - A map is a set of name/value pairs. A value in a map can be a scalar type, such as string or int, or a complex type, such as an array or another map.

Apache Drill uses map and array data types internally for reading complex and nested data structures from data sources.

FLATTEN
FLATTEN separates the elements in a repeated field into individual records.
Syntax:
FLATTEN(x)
Where,
x - JSON array.

{
"num1" : 10,
"num2" : [10,20,30],
"num3" : " simple json array",
"num4" : 50.5
}

Now we can execute this query in Drill.
The following program shows the query for this function:
0: jdbc:drill:zk=local> select *,flatten(num2) as flatten from dfs.`/Users/../workspace/Drill-samples/array.json`;

KVGEN
This function returns a list of the keys that exist in the map.
Syntax:
KVGEN(column name)

{
"student_ID" : {
"001" : "Adam",
"002" : "Amit"
}
}
{
"student_ID" : {
"003" : "Bob",
"004" : "David"
}
}

The following program shows the query for this function:
0: jdbc:drill:zk=local> select kvgen(student_ID) from dfs.`/Users/../workspace/Drill-samples/student_map.json`;

66
Result:
num1 num2 num3 num4 flatten
10 [10,20,30] simple json array 50.5 10
10 [10,20,30] simple json array 50.5 20
10 [10,20,30] simple json array 50.5 30
KVGEN
This function returns a list of the keys that exist in the map.
Syntax:
KVGEN(column name)
Create a simple JSON map file named “student_map” as shown in the following program.
{
"student_ID" : {
"001" : "Adam",
"002" : "Amit"
}
}
{
"student_ID" : {
"003" : "Bob",
"004" : "David"
}
}
The following program shows the query for this function:
0: jdbc:drill:zk=local> select kvgen(student_ID) from dfs.`/Users/../workspace/Drill-samples/student_map.json`;

Result:
EXPR$0
[{"key":"001","value":"Adam"},{"key":"002","value":"Amit"}]
[{“key":"003","value":"Bob"},{"key":"004","value":"David"}]

REPEATED_COUNT (array)
The following program shows the query for this function:
0: jdbc:drill:zk=local> select REPEATED_COUNT(num2) from dfs.`/Users/../workspace/Drill-samples/array.json`;
Result:
EXPR$0
3

REPEATED CONTAINS
Searches for a keyword in an array. If the keyword is present in an array, the result will be true otherwise false.
Syntax:
REPEATED_CONTAINS(array_name, keyword)
The array_name is a simple array. The following program shows the query for this function:
0: jdbc:drill:zk=local> select REPEATED_CONTAINS(num2,10) from dfs.`/Users/../workspace/Drill-samples/array.json`;
Result:
true

Query:
0: jdbc:drill:zk=local> select REPEATED_CONTAINS(num2,40) from dfs.`/Users/deiva/workspace/Drill-samples/array.json`;
Result:
false

9. Apache Drill – Data Definition Statements

Create Statement
You can create tables in Apache Drill by using the following two CTAS commands.
Method 1
Syntax:
CREATE TABLE name [ (column list) ] AS query;

Where,
Query - select statement.

Method 2
Syntax:
CREATE TABLE name [ ( <column list> ) ] [ PARTITION BY ( <column_name> [ , ... ] ) ] AS <select statement>
Where,
 name - unique directory name
 column list - optional list of column names or aliases in the new table.
 PARTITION BY - partitions the data by the first column_name

To create a table, you should adhere to the following steps:
 Set the workspace to a writable workspace.
 You can only create new tables in df.tmp workspace. You cannot create tables using storage plugins, such as Hive and HBase

For example:
"tmp": {
"location": "/tmp",
"writable": true,
}

Example Query:
0: jdbc:drill:zk=local> use dfs.tmp;
Result:
+-------+--------------------------------------+
| ok | summary |
+-------+--------------------------------------+
| true | Default schema changed to [dfs.tmp] |
+-------+--------------------------------------+

Query:
0: jdbc:drill:zk=local> create table students as select * from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
Result:
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 10 |
+-----------+----------------------------+

To view records –
0: jdbc:drill:zk=local> select * from students;
Result:
+------+--------+------+---------+-----------+--------+--------+--------+--------------------+----------+
| ID | name | age | gender | standard | mark1 | mark2 | mark3 | addr | pincode |
+------+--------+------+---------+-----------+--------+--------+--------+--------------------+----------+
| 001 | Adam | 12 | male | six | 70 | 50 | 60 | 23 new street | 111222 |
| 002 | Amit | 12 | male | six | 40 | 50 | 40 | 12 old street | 111222 |
| 003 | Bob | 12 | male | six | 60 | 80 | 70 | 10 cross street | 111222 |

……

The following program shows the query for this function:
0: jdbc:drill:zk=local> create table student_new partition by (gender) as select * from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Result:
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 10 |
+-----------+----------------------------+
To view the records of the table –
0: jdbc:drill:zk=local> select * from student_new;

| 005 | Esha | 12 | female | six | 70 | 60 | 65 | 20 garden street | 111222 |
| 006 | Ganga | 12 | female | six | 100 | 95 | 98 | 25 north street | 111222 |
| 008 | Leena | 12 | female | six | 90 | 85 | 95 | 24 south street | 111222 |
| 009 | Mary | 13 | female | six | 75 | 85 | 90 | 5 west street | 111222 |
| 010 | Peter | 13 | female | six | 80 | 85 | 88 | 16 park avenue | 111222 |
| 001 | Adam | 12 | male | six | 70 | 50 | 60 | 23 new street | 111222 |
| 002 | Amit | 12 | male | six | 40 | 50 | 40 | 12 old street | 111222 |
| 003 | Bob | 12 | male | six | 60 | 80 | 70 | 10 cross street | 111222 |
| 004 | David | 12 | male | six | 50 | 70 | 70 | 15 express avenue | 111222 |
| 007 | Jack | 13 | male | six | 55 | 45 | 45 | 2 park street | 111222 |

Here the table records are partitioned by gender.

Alter Statement
The ALTER SYSTEM command permanently changes a system setting.

Syntax:
ALTER SYSTEM SET `option_name` = value;
To reset the system settings, use the following syntax.
ALTER SYSTEM RESET `option_name`;
ALTER SYSTEM RESET ALL;

Query:
Here is the sample query that enables the Decimal data type –
0: jdbc:drill:zk=local> ALTER SYSTEM SET `planner.enable_decimal_data_type` = true;

To reset all the changes, you will need to key-in the following command –
0: jdbc:drill:zk=local> ALTER SYSTEM RESET all;

Result:
+-------+---------------+
| ok | summary |
+-------+---------------+
| true | ALL updated. |
+-------+---------------+

Create View Statement
The CREATE VIEW command creates a virtual structure for the result set of a stored query. A view can combine data from multiple underlying data sources and provide the illusion that all of the data is from one source.
Syntax:
CREATE [OR REPLACE] VIEW [workspace.]view_name [ (column_name [, ...]) ] AS query;

Where,
 workspace - The location where you want the view to exist. By default, the view can be created in “dfs.tmp”.
 view_name - The name that you give to the view. This view must have a unique name.

Query:
0: jdbc:drill:zk=local> create view student_view as select * from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Result:
+-------+--------------------------------------------------------------+
| ok | summary |
+-------+--------------------------------------------------------------+
| true | View 'student_view' created successfully in 'dfs.tmp' schema |
+-------+--------------------------------------------------------------+
To see the records, you can use the following query.
0: jdbc:drill:zk=local> select * from student_view;

Result:
+------+--------+------+---------+-----------+--------+--------+--------+--------------------+----------+
| ID | name | age | gender | standard | mark1 | mark2 | mark3 | addr | pincode |
+------+--------+------+---------+-----------+--------+--------+--------+--------------------+----------+
| 001 | Adam | 12 | male | six | 70 | 50 | 60 | 23 new street | 111222 |
| 002 | Amit | 12 | male | six | 40 | 50 | 40 | 12 old street | 111222 |
| 003 | Bob | 12 | male | six | 60 | 80 | 70 | 10 cross street | 111222 |
| 004 | David | 12 | male | six | 50 | 70 | 70 | 15 express avenue | 111222 |
| 005 | Esha | 12 | female | six | 70 | 60 | 65 | 20 garden street | 111222 |

……

Drop Table
The drop table statement is used to drop the table from a DFS storage plugin.
Syntax:
DROP TABLE [workspace.]name;
Query:
0: jdbc:drill:zk=local> drop table student_new;
Result:
+-------+------------------------------+
| ok | summary |
+-------+------------------------------+
| true | Table [student_new] dropped |
+-------+------------------------------+

Drop View
Similar to the table, a view can be dropped by using the following command –
0: jdbc:drill:zk=local> drop view student_view;
Result:
+-------+-----------------------------------------------------------------+
| ok | summary |
+-------+-----------------------------------------------------------------+
| true | View [student_view] deleted successfully from schema [dfs.tmp]. |
+-------+-----------------------------------------------------------------+

10. Apache Drill – Querying Data

In this chapter, we will learn about how Apache Drill allows us to query plain text files such as CSV, TSV and PSV.

CSV File
Create a CSV file named “students.csv” as shown in the following program:
001,Adam,23 new street
002,Amit,12 old street
003,Bob,10 cross street
004,David,15 express avenue
005,Esha,20 garden street
006,Ganga,25 north street
007,Jack,2 park street
008,Leena,24 south street
009,Mary,5 west street
010,Peter,16 park avenue

After saving the file, you can return to the terminal again and type the following query to view that CSV file.
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/students.csv`;
Result:
+--------------------------------------+
| columns |
+--------------------------------------+
| ["001","Adam","23 new street"] |
| ["002","Amit","12 old street"] |
| ["003","Bob","10 cross street"] |
| ["004","David","15 express avenue"] |
| ["005","Esha","20 garden street"] |
| ["006","Ganga","25 north street"] |
| ["007","Jack","2 park street"] |
| ["008","Leena","24 south street"] |

……

From this output we can conclude that, Apache Drill recognizes each row as an array of values and returns one column for each row.

Finding Columns[n]
The COLUMNS[n] syntax is used to return CSV rows in a column by the column format, where n starts from 0 to n-1.
Query:
0: jdbc:drill:zk=local>select columns[0],columns[1],columns[2] from dfs.`/Users/../workspace/Drill-samples/students.csv`;

If you want to assign an alias name for columns, use the following query :

0: jdbc:drill:zk=local>select columns[0] as ID,columns[1] as Name,columns[2] as Address from dfs.`/Users/../workspace/Drill-samples/students.csv`;

Create Table
Like in JSON, you can create table for plain text files. Following is a sample query:
0: jdbc:drill:zk=local> create table CSV as select * from dfs.`/Users/../workspace/Drill-samples/students.csv`;
Result:
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 10 |
+-----------+----------------------------+
To view the file contents, type the following query:
0: jdbc:drill:zk=local> select * from CSV;

TSV File
Create a TSV file named “students.tsv” as shown in the following program:

Now we can execute this TSV file in Apache Drill by using the following query:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student.tsv`;
Result:
+--------------------------------------------------------------------------------------------+
| columns |
+--------------------------------------------------------------------------------------------+
| ["ID","Name","Age","Standard","Marks","Addr","pincode"] |

| ["001 id ","name is adam","for the age of 12","studying sxith std" "Having good marks"] |
+--------------------------------------------------------------------------------------------+

Create Table
As shown in the CSV file above, you can also create a table for the TSV file.
Query:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student.tsv`;
Result:
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 2 |
+-----------+----------------------------+
1 row selected (0.347 seconds)
Query:
0: jdbc:drill:zk=local> select * from TSV;

Result:
+--------------------------------------------------------------------------------------------+
| columns |
+--------------------------------------------------------------------------------------------+
| ["ID","Name","Age","Standard","Marks","Addr","pincode"] |
| ["001 id ","name is adam","for the age of 12","studying sxith std" "Having good marks"] |
+--------------------------------------------------------------------------------------------+

PSV (Pipe Separated Value) File
Create a psv file named “sample.tbl” as shown in the following program.
Tutorialspoint|Apache|Drill|article
Now we can execute this PSV file in Drill,
Query:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/sample.tbl`;
Result:
+------------------------------------------------+
| columns |
+------------------------------------------------+
| ["Tutorialspoint","Apache","Drill","article"] |
+------------------------------------------------+
Now, similar to the CSV and TSV files, try for yourself to create a table for PSV file.

11. Apache Drill – Querying Data Using HBase

……

12. Apache Drill – Querying Data Using Hive

……

13. Apache Drill – Querying Parquet Files

Parquet is a columnar storage format. Apache Drill uses Parquet format for easy, fast and efficient access.

Create a Table
Before moving to create a table in parquet, you must change the Drill storage format using the following command.
0: jdbc:drill:zk=local> alter session set `store.format`= 'parquet';
Result:
+-------+------------------------+
| ok | summary |
+-------+------------------------+
| true | store.format updated. |
+———+------------------------+

You can create a table using the following syntax.

0: jdbc:drill:zk=local> create table dfs.tmp.`/Users/../workspace` as select * from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Result:
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 10 |
+-----------+----------------------------+
To see the table contents, type-in the following query –
0: jdbc:drill:zk=local> select * from dfs.tmp.`/Users/../workspace`;

……