This section will show you how to automate the Hive job that imports Twitter data from HDFS into Hive tables. Most of this information was gathered from the following Cloudera blog post: blog.cloudera.com/blog/2013/01/how-to-schedule-recurring-hadoop-jobs-with-apache-oozie/
In this section, we will also document some issues we ran into while following the Cloudera blog, along with workarounds for them. By referring to the Cloudera blog and this post, you should be able to get Oozie working in no time!
Creating the Hive Query, Oozie Workflow, Oozie Coordinator, and Job Properties Files
In this section, we will create all of the files necessary for an Oozie workflow. There are several ways to create the files and send them to HDFS; the method we will use is to create the files directly in Hue.
First, open Hue and go to the File Browser in the top right corner of the screen. This should take us to the user’s home directory (in our case, admin).
Next, we have to create a directory to hold all of these files. Click on the ‘New’ button on the right side of the screen and click ‘Directory’. Name the directory add-tweet-partitions and then click ‘Create’.
The first file we need to create is called add_partition_hive_script.q. This is the actual Hive code used to import data from HDFS into Hive tables. First, click on the directory we just created; within it, create a file by clicking on the ‘New’ button on the right side of the screen and clicking ‘File’. Name the file add_partition_hive_script.q and click ‘Create’.
Now, click on the file we just created. On the next page, we will edit the file by clicking ‘Edit File’ in the left pane.
On the next page, copy the following content, paste it into the empty box, and click ‘Save’.
ADD JAR ${JSON_SERDE};
ALTER TABLE tweets
ADD IF NOT EXISTS
PARTITION (datehour = ${DATEHOUR})
LOCATION '${PARTITION_PATH}';
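To see what Oozie will actually run, here is the same statement with the parameters filled in for a single hour. The JAR name and partition path are illustrative, assuming the hive-serdes JAR from Cloudera’s Twitter example and the YEAR/MONTH/DAY/HOUR layout that Flume writes under /user/flume/tweets (discussed later in this post):

ADD JAR hdfs://NameNode:8020/user/admin/add-tweet-partitions/lib/hive-serdes-1.0-SNAPSHOT.jar;
ALTER TABLE tweets
ADD IF NOT EXISTS
PARTITION (datehour = 2015052913)
LOCATION '/user/flume/tweets/2015/05/29/13';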
The next file we will create is the Oozie workflow file. First, navigate back to the add-tweet-partitions directory and click ‘New’ and then ‘File’. We will name this add-partition-hive-action.xml and click ‘Create’.
Click on the file we just created and then click ‘Edit File’ on the next page so we can insert the content. In the empty box, copy and paste the content in the file link below (Note: You will have to change some fields depending on the cluster setup. For instance, you will have to change the home directory path depending on the Hue user. The fields are in bold):
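For reference, the workflow is a single Hive action along the lines of the one in the Cloudera blog post. The sketch below is illustrative rather than a drop-in file: the parameter names dateHour and partitionPath must match whatever your coordinator passes in, and the SerDe JAR name comes from Cloudera’s Twitter example:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="add-partition-wf">
    <start to="add-partition"/>
    <action name="add-partition">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Hive Metastore configuration copied into HDFS (see Other Configurations below) -->
            <job-xml>/user/admin/hive-conf.xml</job-xml>
            <script>add_partition_hive_script.q</script>
            <!-- These parameters fill in the ${...} variables in the Hive script -->
            <param>JSON_SERDE=${workflowRoot}/lib/hive-serdes-1.0-SNAPSHOT.jar</param>
            <param>DATEHOUR=${dateHour}</param>
            <param>PARTITION_PATH=${partitionPath}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>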
Repeat the same steps to create the Oozie coordinator file, add-partition-coord-app.xml, using the coordinator content from the Cloudera blog post linked above. Finally, create a file named job.properties in the same directory, click ‘Edit File’, and copy and paste the following content into the empty box (Note: You will have to change some fields depending on the cluster setup. For instance, change the HDFS address to contain the IP Address or Hostname of your NameNode):
nameNode=hdfs://NameNode:8020
jobTracker=JobTracker:8032
workflowRoot=${nameNode}/user/admin/add-tweet-partitions
# jobStart and jobEnd should be in UTC, because Oozie uses UTC for
# processing coordinator jobs by default (and it is not recommended
# to change this)
jobStart=2015-05-29T13:00Z
jobEnd=2015-06-30T23:00Z
# Timezone offset between UTC and the server timezone
tzOffset=0
# This should be set to an hour boundary, and should be set to a time
# of jobStart + tzOffset or earlier.
initialDataset=2015-05-29T13:00Z
# Adds a default system library path to the workflow’s classpath
# Used to allow workflows to access the Oozie Share Lib, which includes
# the Hive action
oozie.use.system.libpath=true
# HDFS path of the coordinator app
oozie.coord.application.path=${nameNode}/user/admin/add-tweet-partitions/add-partition-coord-app.xml
Other Configurations
Create a copy of hive-site.xml in HDFS
$ sudo cp /etc/hive/conf/hive-site.xml /tmp/hive-conf.xml
$ sudo -u hdfs hadoop fs -copyFromLocal /tmp/hive-conf.xml /user/admin/
Create a lib sub-folder and insert the JAR dependencies
$ sudo -u hdfs hadoop fs -mkdir /user/admin/add-tweet-partitions/lib
$ sudo -u hdfs hadoop fs -chown admin:admin /user/admin/add-tweet-partitions/lib
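The Hive script’s ${JSON_SERDE} parameter expects the JSON SerDe JAR to be in this lib directory. Assuming the hive-serdes JAR from Cloudera’s Twitter example is sitting in /tmp (adjust the name and path to match your build), copy it in:
$ sudo -u hdfs hadoop fs -copyFromLocal /tmp/hive-serdes-1.0-SNAPSHOT.jar /user/admin/add-tweet-partitions/lib/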
Copy job.properties from HDFS to the NameNode’s local filesystem
$ sudo -u hdfs hadoop fs -copyToLocal /user/admin/add-tweet-partitions/job.properties /tmp/
$ cp /tmp/job.properties ~/
Add the Linux user to the HDFS supergroup
$ sudo groupadd supergroup
$ sudo usermod -G supergroup robwilson
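HDFS treats members of the group named by dfs.permissions.superusergroup (supergroup by default) as superusers, which lets this Linux user work with the files owned by hdfs above. Replace robwilson with your own username, then confirm the membership took effect:
$ groups robwilson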
Changing the Whitelist for the NameNode and JobTracker
We also ran into another error when running the Oozie job. Oozie refuses to submit work to a NameNode or JobTracker that is not on its whitelist, so we need to modify the Oozie whitelist in Cloudera Manager (this solution was found in the following community link: https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Running-Oozie/td-p/9020).
To make this change, log into Cloudera Manager and click ‘Oozie’ on the left-hand side of the home page. Next, click on the ‘Configuration’ tab near the top of the page.
On the Oozie configuration page, the field we have to edit is called ‘Oozie Server Configuration Safety Valve for oozie-site.xml’. Copy and paste the following content into that field and click ‘Save Changes’:
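The properties to whitelist are the standard oozie.service.HadoopAccessorService entries; NameNode and JobTracker below are placeholders for the same hosts and ports used in job.properties:

<property>
    <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
    <value>JobTracker:8032</value>
</property>
<property>
    <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
    <value>NameNode:8020</value>
</property>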
Finally, restart the Oozie service.
Delete unnecessary directories that will cause errors
Before executing the Oozie job, another change has to be made. The Oozie workflow is very particular about the layout of the tweets data: everything under the /user/flume/tweets/ directory should follow the YEAR/MONTH/DAY/HOUR format. In our case, we had a directory called datehour=20150525 inside /user/flume/tweets, and it caused an error when we ran our Oozie job. To make sure you do not hit this error, go into the /user/flume/tweets directory in Hue and delete any files or directories that do not match this format, like the datehour=20150525 directory we had to delete.
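If you prefer the command line to Hue, the same check and cleanup can be done with hadoop fs (the datehour=20150525 path is the stray directory from our cluster; substitute whatever does not match on yours):
$ sudo -u hdfs hadoop fs -ls /user/flume/tweets/
$ sudo -u hdfs hadoop fs -rm -r /user/flume/tweets/datehour=20150525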
NOTE: If you do not do this step, you may have to delete and re-create your tweets table in the Hive editor.
Run the Oozie Job Command
Now that everything is in place, we can run the Oozie job command that will start the workflow. Type the following command in the home directory of your Linux user:
NOTE: Replace NameNode in the command below with the IP Address or hostname of your Hadoop NameNode. Also, ~/ is the path where my job.properties is located.
$ oozie job -oozie http://NameNode:11000/oozie -config ~/job.properties -run
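If the submission succeeds, the CLI prints the ID of the new coordinator job, which you can pass back to the CLI to check on its progress (the job ID below is illustrative):
job: 0000000-150529120000000-oozie-oozi-C
$ oozie job -oozie http://NameNode:11000/oozie -info 0000000-150529120000000-oozie-oozi-C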
And that’s it! Your Oozie workflow will automatically update your tweets table every hour until the end date you specified. Good luck!