Import Tweets into Hive Table

Now that we have Twitter streams going into HDFS, the next step is to get all of the data and import it into a Hive table. This will give us the ability to query the data and export it in a structured fashion.

Building the JSON SerDe and Uploading to HDFS

The first thing we need to do, which is similar to what we did with ‘flume-sources’ earlier, is build the JSON SerDe from source. This will create a JAR file that will contain all the libraries necessary to convert the JSON structured tweets into a Hive table.

First, log back into the Flume-Agent VM if you have not already done so. Next, go into the hive-serdes directory:
$ cd ~/cdh-twitter-example/hive-serdes

Next we will build the JAR file using the following command:
$ mvn package

The JAR file is now placed in the ‘target’ directory. We will then upload this file to HDFS so we can use it in the future within Hue.
$ cd ~/cdh-twitter-example/hive-serdes/target/
$ sudo -u hdfs hadoop fs -copyFromLocal hive-serdes-1.0-SNAPSHOT.jar /user/admin/

NOTE: You may get the following error: copyFromLocal: `hive-serdes-1.0-SNAPSHOT.jar’: No such file or directory

The reason for this is that the user uploading the file doesn’t have permission to access the local and/or remote directories/files. However, there is a work-around: copy the hive-serdes-1.0-SNAPSHOT.jar file to a tmp directory, and upload it to HDFS from there:
$ cp ~/cdh-twitter-example/hive-serdes/target/hive-serdes-1.0-SNAPSHOT.jar /tmp/
$ sudo -u hdfs hadoop fs -copyFromLocal /tmp/hive-serdes-1.0-SNAPSHOT.jar /user/admin/
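The work-around succeeds because on a typical Linux install /tmp is world-readable (mode 1777), so the hdfs user can read a file staged there even when it cannot read your home directory. You can confirm this on the Flume agent VM:

```shell
# /tmp carries mode 1777 (world-writable with the sticky bit), so any
# user -- including hdfs -- can read files copied into it.
stat -c '%a %n' /tmp
```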

Creating a Hive Directory Hierarchy

Depending on your Hive installation, the following directories may or may not already exist. To be sure, I recommend executing the following commands to create the Hive directory structure: (NOTE: Depending on your security requirements, permissions for the directories may vary)
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chown -R hive:hive /user/hive
$ sudo -u hdfs hadoop fs -chmod 750 /user/hive
$ sudo -u hdfs hadoop fs -chmod 770 /user/hive/warehouse
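As a quick illustration of what those two octal modes mean (demonstrated here on the local filesystem purely as a sketch; HDFS uses the same octal permission scheme):

```shell
# 750 = owner rwx, group r-x, other ---   (applied to /user/hive)
# 770 = owner rwx, group rwx, other ---   (applied to /user/hive/warehouse)
mkdir -p /tmp/perm-demo/user_hive /tmp/perm-demo/warehouse
chmod 750 /tmp/perm-demo/user_hive
chmod 770 /tmp/perm-demo/warehouse
stat -c '%a %n' /tmp/perm-demo/user_hive /tmp/perm-demo/warehouse
```

The group bits are the reason the next step adds your Hue user to the hive group: members of that group can read table data under the warehouse directory, while all other users are locked out.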

Also, you will want to add your Hue username to the hive group on the server (in our case, the username is admin):
$ sudo usermod -a -G hive admin

NOTE: If your user exists in Hue but not on your Flume agent VM, you can add the user first before the above command:
$ sudo adduser admin

Creating Tweets Table in Hue

We can now go back to Hue and start creating the tweets table.

From the homepage in Hue, click on the ‘Query editors’ drop-down and select ‘Hive’.
hive-1

To create the table, first we need to add the hive-serdes JAR file from earlier to the path. Click on ‘Settings’ in the top left corner of the screen, and then click the ‘Add’ button underneath ‘File Resources’.
hive-2

Then click on the button next to ‘Path’ to search for the hive-serdes JAR in HDFS, and click on hive-serdes-1.0-SNAPSHOT.jar.
hive-3

Next, go back to the ‘Assist’ tab in the top left corner of the screen, and copy-and-paste the following Hive script into the Query editor text box and then click ‘Execute’.

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

hive-4
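Behind the scenes, each datehour partition simply points the table at one of the hourly directories Flume wrote. Assuming the year/month/day/hour directory layout from the Flume configuration used earlier (an assumption; adjust for your flume.conf), the partition value is just the path’s date components concatenated into an integer. A sketch of deriving it, and of the ALTER TABLE statement the import roughly corresponds to:

```shell
# Hypothetical hourly directory written by the Flume HDFS sink
dir=/user/flume/tweets/2013/09/10/05

# Strip the table root and collapse the remaining path into an integer,
# e.g. 2013/09/10/05 -> 2013091005
datehour=$(echo "${dir#/user/flume/tweets/}" | tr -d '/')

# The statement that registers this directory as a partition of the table:
echo "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour=${datehour}) LOCATION '${dir}';"
```

This is also handy if you later prefer to register partitions from a script rather than clicking through the Metastore Manager for every hour of data.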

Importing HDFS Tweets into Hive Tweets Table

The table structure is now complete. The next step is to import the data into the tweets table from HDFS.

Click on the ‘Metastore Manager’ on the top of the screen.

Next, click on the link for the ‘tweets’ table.
hive-5

Click on ‘Import Data’ on the left side of the screen. Click on the button next to ‘Path’ and navigate to the last folder where the tweets are located. In the ‘datehour’ field, type the digits of the directory path as shown below. Then press ‘Enter’ on your keyboard, or click the ‘Submit’ button.
hive-6

To check that the tweets are actually in the table, click on ‘Metastore Manager’, then ‘tweets’, and then click the ‘Sample’ tab. You should now see a sample of the tweets on the screen.
hive-7

Sample Queries on the Hive Tweets Table

Now that the information is in the Hive table, we can perform queries to gain valuable information from the tweets.

First go to ‘Hive’ under ‘Query Editors’.

Example 1: Let’s say we want to find the usernames, and the number of retweets they have generated, across all the tweets that we have data for. We will use the following query (paste the query into the Query Editor text box and click ‘Execute’):

SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (
  SELECT
    retweeted_status.user.screen_name AS retweeted_screen_name,
    retweeted_status.text,
    max(retweeted_status.retweet_count) AS retweets
  FROM tweets
  GROUP BY retweeted_status.user.screen_name, retweeted_status.text
) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;

hive-8
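Example 2: As a second sketch (my own example, assuming the hashtags field was declared as an array of structs with a text field, as in the table definition above), Hive’s LATERAL VIEW and explode() can unpack the entities.hashtags array to rank the most popular hashtags in the data:

```sql
SELECT
  LOWER(hashtags.text) AS hashtag,
  COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;
```

Each row of the tweets table can contribute several rows to the exploded view (one per hashtag), which is why a plain GROUP BY on the array column would not work here.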