Introduction

Apache Hive is an open-source big-data software project that allows SQL-style queries of data stored in the Hadoop file system.  Those SQL-style queries, or HiveQL queries are converted into a collection of MapReduce jobs which get executed in the Hadoop cluster.

There are several options for the Hive Execution Engine that is responsible for running the collection of MapReduce jobs.  The simplest one is the MapReduce engine, which executes the the collection of jobs as a traditional MapReduce. Apache Tez is another execution engine available to for Hive queries.  The difference between these two engines, from a purely practical standpoint, is that Tez speeds up query executions in part by limiting un-necessary disk writes between MapReduce jobs, where as the MapReduce engine always writes to disk after each MapReduce job. There are other differences between these two execution engines, but those details are beyond the topic of practical Hive. More recently, Hive queries can be executed using the Apache Spark engine, however, this requires that Spark be installed on the cluster.  Both Hive on Tez and Hive on Spark execute queries more quickly that the original MapReduce engine.

Basic Queries

select * from database.table limit 100;

This is one of the simplest HiveQL queries, where we have selected all the columns available in database.table and we only want the first 100 rows. Queries like this are great for getting a preview of the data structure and quality. This query should return almost instantaneously, since no MapReduce jobs are created.

select column_a, columns_b from database.table limit 100;

Here we select only specific columns.  Again, this should return almost instantaneously… no MapReduce jobs are created.

select column_a, column_b from database.table where column_a = 'HIVE';

This HiveQL query uses a where clause to sub-set the data and only returns rows where the value in the column_a is equal to the string HIVE.  The where clause initiates the creation of MapReduce jobs, and thus this query will utilize an execution engine.  The execution time for this query will depend on the database.table size, the available hardware resources, and the execution engine.

set hive.execution.engine=tez;
select
     column_one,
     column_two
from
     database.table
where
     (column_one = 'HIVE') and (column_two = 'TEZ');

Where clause conditions can be strung together as shown above.  Again, the where clause initiates the creation of MapReduce jobs. The indented structure and the parenthesis are not necessary, but does provide clarity as queries become increasing complex.  Notice that an initial command was passed telling Hive to use the TEZ execution engine.

select
     a.column_key as new_column_key,
     a.column_one as new_column_one,
     a.column_two as new_column_two,
     b.column_one as new_column_three,
     b.column_two as new_column_four
from
     (select column_one, column_two, column_key from database.table_47828) as a
inner join
     (select column_one, column_two, column_key from database.table_n3232) as b
on a.column_key = b.column_key;

Let’s break down the above query.

(select column_one, column_two, column_key from database.table_47828) as a

First we are selecting a sub-set of the columns available from database.table_47828, and an creating a temporary alias of that result a which we will reference shortly.

(select column_one, column_two, column_key from database.table_n3232) as b

Similarly, we select a sub-set of the columns in database.table_n3232 and alias that result as b.

    (select column_one, column_two, column_key from database.table_47828) as a
inner join
    (select column_one, column_two, column_key from database.table_n3232) as b
on a.column_key = b.column_key;

Next, the we take the results from the individual queries and execute an inner join to combine rows that share a common value in the column_key column.  Since each table has a column named column_key in this example, we make use of the temporary aliases we created earlier.

select
     a.column_key as new_column_key,
     a.column_one as new_column_one,
     a.column_two as new_column_two,
     b.column_one as new_column_three,
     b.column_two as new_column_four
from
  ...

The final piece is requesting which columns to return and how to return them.  In this example, we rename each column using the as new_column_x syntax.  Notice, that we again make use of the temporary table aliases when referencing table names.  This is necessary when tables have columns with the same name.

Left outer join, right outer join, and full outer join (or simply, join) work similarly in Hive.  Further information on the syntax of Hive Joins can be found here.

Other Useful Queries

describe database.table;

This query returns a list of the available columns and the data type of each column.  If you use a graphical interface such as dbvisualizer, then this data is typical already available.

select distinct(column_one) from database.table;
select count(distinct(column_one)) from database.table;

This top query returns the unique values from column_one within database.table. Personally, this query is useful to check date ranges and look for missing data within in a given data set.  Or we count the number of distinct values in a given column.

select column_one, count(*) from database.table group by column_one;

Alternatively, we can group the dataset by the unique values in column_one and return the number of rows or entries having that particular value.  This query shows the group by clause and the corresponding aggregation function, which is a very common query pattern.  There are many build-in aggregation functions in Hive that are useful.  The following functions are an abbreviated list:

sum(column), avg(column), var_pop(column), stddev_pop(column), corr(column_one, column_two), percentile_approx(column, percentile_target).

select
     cast(regexp_replace(column_one, '[A-Z]', '') as int) as new_column
from
     database.table

The above query exhibits both the cast() function and the utilization of a regular expression in HiveQL.  First, the values from column_one are stripped of any alphabet characters and then cast into the data type int.  This type of syntax is useful in a more complex query, such as an inner join where we need to modify a column prior to using it as the join key.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s