Apache Drill – Query Using JSON & Window Functions using JSON

6. Apache Drill – Query Using JSON

Apache Drill supports the JSON format for querying data.

Drill treats a JSON object as a SQL record.

One object equals one row in a Drill table.
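To make the object-to-row mapping concrete, here is a minimal Python sketch, run outside Drill, that reads back-to-back JSON objects (written one after another, with no commas or enclosing array) into one dictionary per row. The function name read_json_records and the sample values are illustrative assumptions and are not part of Drill.

```python
import json


def read_json_records(text):
    """Parse back-to-back JSON objects into a list of dicts (one dict per row)."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)  # one JSON object becomes one row
    return records


# Hypothetical input: two objects written one after another.
sample = '{"id": 1, "label": "x"}\n{"id": 2, "label": "y"}\n'
rows = read_json_records(sample)
columns = sorted({key for row in rows for key in row})  # keys act as column names
print(len(rows), columns)
```

Each parsed object plays the role of a row, and the union of its keys plays the role of the column list.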

Querying JSON File
Let us query the sample file “employee.json”, which ships with Drill. This sample file is Foodmart data packaged as a JAR in Drill's classpath: ./jars/3rdparty/foodmart-data-json.0.4.jar. The sample file can be accessed through the cp namespace.
Start the Drill shell, and select the first row of data from the “employee.json” file installed.
Query:
0: jdbc:drill:zk=local> select * from cp.`employee.json` limit 1;
Result:
+-------------+--------------+------------+-----------+-------------+----------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+
| employee_id | full_name    | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date             | salary  | supervisor_id | education_level | marital_status | gender | management_role   |
+-------------+--------------+------------+-----------+-------------+----------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+
| 1           | Sheri Nowmer | Sheri      | Nowmer    | 1           | President      | 0        | 1             | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0             | Graduate Degree | S              | F      | Senior Management |
+-------------+--------------+------------+-----------+-------------+----------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+

The same result can also be viewed in the Drill Web Console.

Storage Plugin Configuration
You can connect Drill to a file system through a storage plugin. On the Storage tab of the Drill Web Console (http://localhost:8047), you can view and reconfigure a storage plugin.
The Drill installation contains the following default storage plugin configurations.
• cp - Points to the JAR files in the Drill classpath.
• dfs - Points to the local file system, but you can configure this storage plugin to point to any distributed file system, such as a Hadoop or S3 file system.
• hbase - Provides a connection to HBase.
• hive - Integrates Drill with the Hive metadata abstraction of files, HBase, and libraries to read data and operate on SerDes and UDFs.
• mongo - Provides a connection to MongoDB data.

Storage Plugin Configuration Persistence
• Embedded mode: Apache Drill saves the storage plugin configurations in a temporary directory. The temporary directory is cleared when you reboot.
• Distributed mode: Drill saves storage plugin configurations in ZooKeeper.

Workspace
The workspace defines the location of files in subdirectories of a local or distributed file system. One or more workspaces can be defined in a plugin.

Create JSON file
So far, we have queried the prepackaged “employee.json” file. Let us create a new JSON file named “student_list.json” with the following content.

{
"ID" : "001",
"name" : "Adam",
"age" : 12,
"gender" : "male",
"standard" : "six",
"mark1" : 70,
"mark2" : 50,
"mark3" : 60,
"addr" : "23 new street",
"pincode" : 111222
}
{
"ID" : "002",
"name" : "Amit",
……

Now, let us query the file to view its full records.

Query:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

SQL Operators
This section covers the use of SQL operators in queries against JSON data.

AND Operator
The following query uses the AND operator:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student_list.json` where age = 12 and mark3 = 70;

Here, the AND operator returns only the rows that satisfy both conditions, age = 12 and mark3 = 70.

OR Operator
The following query uses the OR operator:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student_list.json` where ID = '007' or mark3 = 70;

BETWEEN Operator
The following query uses the BETWEEN operator:
0: jdbc:drill:zk=local> select name,age,addr from dfs.`/Users/../workspace/Drill-samples/student_list.json` where mark1 between 50 and 70;

LIKE Operator
The LIKE operator is used for pattern matching.
The following query uses the LIKE operator to find names that start with “A”:
0: jdbc:drill:zk=local> select name from dfs.`/Users/../workspace/Drill-samples/student_list.json` where name like 'A%';

NOT Operator
The following query combines NOT with the IN operator to exclude rows whose mark1 is 80, 75, or 70:
0: jdbc:drill:zk=local> select * from dfs.`/Users/../workspace/Drill-samples/student_list.json` where mark1 not in (80,75,70);
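The filtering logic of the five operator queries above can be sketched in Python as plain predicates over rows. This is only an illustration of the conditions, not of how Drill evaluates them. The first row mirrors the “Adam” record shown earlier; the other two rows and the helper name select_names are hypothetical.

```python
# Rows modeled as dictionaries. Only the first row comes from the sample file;
# the remaining rows are made up for illustration.
students = [
    {"ID": "001", "name": "Adam", "age": 12, "mark1": 70, "mark3": 60},
    {"ID": "900", "name": "Bea", "age": 12, "mark1": 45, "mark3": 70},
    {"ID": "901", "name": "Ann", "age": 13, "mark1": 80, "mark3": 70},
]


def select_names(rows, predicate):
    """Return the names of the rows for which the WHERE-style predicate holds."""
    return [row["name"] for row in rows if predicate(row)]


# where age = 12 and mark3 = 70
and_rows = select_names(students, lambda r: r["age"] == 12 and r["mark3"] == 70)
# where ID = '007' or mark3 = 70
or_rows = select_names(students, lambda r: r["ID"] == "007" or r["mark3"] == 70)
# where mark1 between 50 and 70 (both bounds are inclusive)
between_rows = select_names(students, lambda r: 50 <= r["mark1"] <= 70)
# where name like 'A%' (the % wildcard matches any suffix)
like_rows = select_names(students, lambda r: r["name"].startswith("A"))
# where mark1 not in (80, 75, 70)
not_in_rows = select_names(students, lambda r: r["mark1"] not in (80, 75, 70))

print(and_rows, or_rows, between_rows, like_rows, not_in_rows)
```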

Aggregate Functions
The aggregate functions produce a single result from a set of input values.

The following table lists out the functions in further detail.

……

COUNT(DISTINCT(exp))
The following query shows this function in use:
0: jdbc:drill:zk=local> select count(distinct(mark3)) from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

……

Statistical Function
The following query returns the standard deviation of the mark2 column:
0: jdbc:drill:zk=local> select stddev(mark2) from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Result:
EXPR$0
18.020050561034015

Query (variance):
0: jdbc:drill:zk=local> select variance(mark2) from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
Result:
EXPR$0
324.7222222222223
The variance of the mark2 column is returned as the output.
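The two results are related: the variance is the square of the standard deviation. The Python sketch below checks this for the two values reported above and for a small hypothetical list of marks. It uses the sample (n - 1) definitions from Python's statistics module, which is an assumption made for illustration.

```python
import math
import statistics

# Values copied from the two query results shown above.
reported_stddev = 18.020050561034015
reported_variance = 324.7222222222223

# Squaring the standard deviation gives the variance (up to rounding).
reported_match = math.isclose(reported_stddev ** 2, reported_variance, rel_tol=1e-9)

# Hypothetical marks, not taken from student_list.json.
marks = [50, 60, 70, 80]
sample_variance = statistics.variance(marks)  # sum of squared deviations / (n - 1)
sample_stddev = statistics.stdev(marks)       # square root of the sample variance

print(reported_match, sample_variance, sample_stddev)
```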

7. Apache Drill – Window Functions using JSON

Window functions operate on a set of rows and return a single value for each row of the query. The term window refers to the set of rows on which the function operates.
A window function defines its window using the OVER() clause. The OVER() clause has the following capabilities:
• Defines window partitions to form groups of rows (PARTITION BY clause).
• Orders rows within a partition (ORDER BY clause).

Aggregate Window Functions
An aggregate window function can be defined over a PARTITION BY clause, an ORDER BY clause, or both.
Avg()
The following query shows this function in use:
0: jdbc:drill:zk=local> select mark1,gender,avg(mark1) over (partition by gender ) as avgmark1 from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

In this query, the PARTITION BY clause is applied to the gender column, so the average of mark1 is computed separately for each gender.

The average of mark1 for the female partition is 83.0, and this value is repeated on every female row.

The average of mark1 for the male partition is 55.0, and this value is repeated on every male row.
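As a rough illustration of avg(...) over (partition by ...), the following Python sketch computes one average per partition and attaches it to every row of that partition, so no rows are collapsed. The function name avg_over_partition and the four rows are hypothetical; the marks were chosen so that the partition averages come out to 83.0 and 55.0.

```python
from collections import defaultdict


def avg_over_partition(rows, partition_key, value_key):
    """Annotate every row with the average of value_key within its partition."""
    totals = defaultdict(lambda: [0.0, 0])  # partition value -> [sum, count]
    for row in rows:
        acc = totals[row[partition_key]]
        acc[0] += row[value_key]
        acc[1] += 1
    # Unlike GROUP BY, every input row is kept in the output.
    return [
        {**row, "avgmark1": totals[row[partition_key]][0] / totals[row[partition_key]][1]}
        for row in rows
    ]


# Hypothetical rows, not taken from student_list.json.
rows = [
    {"gender": "female", "mark1": 80},
    {"gender": "female", "mark1": 86},
    {"gender": "male", "mark1": 50},
    {"gender": "male", "mark1": 60},
]
result = avg_over_partition(rows, "gender", "mark1")
for row in result:
    print(row)
```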

Count(*)
The following query shows this function in use:
0: jdbc:drill:zk=local> select name, gender, mark1, age, count(*) over(partition by age) as cnt from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Here, there are two age groups, 12 and 13. Seven students are aged 12 and three students are aged 13. Hence, count(*) over (partition by age) returns 7 on every row of the age 12 group and 3 on every row of the age 13 group.

MAX()
The following query shows this function in use:
0: jdbc:drill:zk=local> select name,age,gender,mark3,max(mark3) over (partition by gender) as maximum from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

In the above query, the maximum of mark3 is computed per gender partition. Hence, the female maximum of 98 is shown on every female row, and the male maximum of 70 is shown on every male row.

MIN()
The following query shows this function in use:
0: jdbc:drill:zk=local> select mark2,min(mark2) over (partition by age ) as minimum from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

SUM()
The following query shows this function in use:
0: jdbc:drill:zk=local> select name,age,sum(mark1+mark2) over (order by age ) as summation from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

Here, the window has an ORDER BY clause but no PARTITION BY clause, so the query returns a running total of mark1+mark2 in age order. Rows with the same age are peers and receive the same running total.
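The behaviour of sum(...) over (order by age) can be sketched in Python as a running total accumulated one age group at a time. This assumes the standard SQL default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), under which rows that tie on the ORDER BY key share one result. The function name running_sum_over_order and the three rows are hypothetical.

```python
from itertools import groupby


def running_sum_over_order(rows, order_key, value_fn):
    """Running total in order_key order; tied rows (peers) share the same total."""
    ordered = sorted(rows, key=lambda r: r[order_key])
    out = []
    total = 0
    for _, peers in groupby(ordered, key=lambda r: r[order_key]):
        peers = list(peers)
        # Add the whole peer group first, then assign the same total to each peer.
        total += sum(value_fn(r) for r in peers)
        out.extend({**r, "summation": total} for r in peers)
    return out


# Hypothetical rows, not taken from student_list.json.
rows = [
    {"name": "a", "age": 12, "mark1": 70, "mark2": 50},
    {"name": "b", "age": 13, "mark1": 40, "mark2": 40},
    {"name": "c", "age": 12, "mark1": 60, "mark2": 20},
]
result = running_sum_over_order(rows, "age", lambda r: r["mark1"] + r["mark2"])
for row in result:
    print(row["name"], row["age"], row["summation"])
```

Both rows with age 12 carry the total for that age group, and the row with age 13 carries the grand total.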

Ranking Window Functions
The ranking window functions are described below.

CUME_DIST()
The following query shows this function in use:
0: jdbc:drill:zk=local> select name,age,gender,cume_dist() over (order by age) as relative_rank from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
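As an illustration of what CUME_DIST() computes, the Python sketch below uses the standard SQL definition: the number of rows whose ORDER BY value is less than or equal to the current row's value, divided by the total number of rows. The age distribution (seven students aged 12, three aged 13) follows the counts given in the Count(*) section; the function name cume_dist is chosen here for illustration.

```python
from bisect import bisect_right


def cume_dist(values):
    """Fraction of rows whose value is <= the current row's value."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right gives the count of entries <= v in the sorted list.
    return [bisect_right(ordered, v) / n for v in values]


ages = [12] * 7 + [13] * 3  # seven students aged 12, three aged 13
relative_rank = cume_dist(ages)
print(relative_rank)
```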

Dense_Rank()
The following query shows this function in use:
0: jdbc:drill:zk=local> select mark1,mark2,mark3,dense_rank() over (order by age) as denserank from dfs.`/Users/../workspace/Drill-samples/student_list.json`;

NTILE()
The NTILE window function requires the ORDER BY clause in the OVER clause.
Query:
0: jdbc:drill:zk=local> select name,gender,ntile(3) over (order by gender) as row_partition from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
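A minimal Python sketch of how NTILE(n) distributes ordered rows, assuming the standard SQL rule that bucket sizes differ by at most one and the larger buckets come first. The function name ntile and the row count of 10 (seven plus three students, as counted in the Count(*) section) are used for illustration only.

```python
def ntile(num_rows, buckets):
    """Bucket number (1-based) for each row position of an ordered result set."""
    base, extra = divmod(num_rows, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        assignment.extend([bucket] * size)
    return assignment


row_partition = ntile(10, 3)
print(row_partition)
```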

Percent_rank()
The following query shows this function in use:
0: jdbc:drill:zk=local> select name,age,percent_rank() over (order by age) as percentrank from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
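For reference, PERCENT_RANK() is defined in standard SQL as (rank - 1) / (total rows - 1). The Python sketch below applies that formula to the age distribution described in the Count(*) section (seven students aged 12, three aged 13). The function name percent_rank is an illustrative choice.

```python
def percent_rank(values):
    """(rank - 1) / (n - 1) for each value, where tied values share a rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [0.0]  # a single row is defined to have percent rank 0
    # The index of the first occurrence in sorted order equals rank - 1.
    return [ordered.index(v) / (n - 1) for v in values]


ages = [12] * 7 + [13] * 3
percentrank = percent_rank(ages)
print(percentrank)
```

The seven rows with age 12 get 0.0, and the three rows with age 13 get 7/9.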

Rank()
The ORDER BY expression in the OVER clause determines the value.
Query:
0: jdbc:drill:zk=local> select name,age,rank() over (order by age) as rnk from dfs.`/Users/../workspace/Drill-samples/student_list.json`;
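The difference between RANK() and DENSE_RANK() only shows up when rows tie on the ORDER BY value. The Python sketch below illustrates it under the standard SQL definitions, using the age distribution described in the Count(*) section (seven students aged 12, three aged 13). The function names rank and dense_rank are illustrative.

```python
def rank(values):
    """Ties share a rank; the next rank skips ahead by the number of tied rows."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def dense_rank(values):
    """Ties share a rank; the next rank is always one higher (no gaps)."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]


ages = [12] * 7 + [13] * 3
ranks = rank(ages)
dense_ranks = dense_rank(ages)
print(ranks)
print(dense_ranks)
```

With these ages, rank jumps from 1 to 8 after the seven tied rows, while dense rank moves from 1 to 2.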

Row_number()
The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers non-deterministically.

Query:
0: jdbc:drill:zk=local> select *,row_number() over (order by age) as rownumber from dfs.`/Users/../workspace/Drill-samples/student_list.json`;