Set Up an Oozie Workflow to Automate the Import

This section will show you how to automate the Hive job that imports Twitter data from HDFS into Hive tables. Most of this information was gathered from the following Cloudera blog post: blog.cloudera.com/blog/2013/01/how-to-schedule-recurring-hadoop-jobs-with-apache-oozie/

In this section, we will also document some issues we had with following the Cloudera blog and some workarounds for these issues. By referring to the Cloudera Blog and this post, you should be able to get Oozie working in no time!

Creating the Hive Query, Oozie Workflow, Oozie Coordinator, and Job Properties Files

In this section, we will create all of the files necessary to get an Oozie workflow created. There are several ways to create the files and send them to HDFS. The method we will use is by creating the files in Hue.

First, open Hue and go to File Browser in the top right corner of the screen. This should take us to the user’s home directory (for our case, admin).

Next, we have to create a directory to hold all of these files. Click on the ‘New’ button on the right side of the screen and click ‘Directory’. Name the directory add-tweet-partitions and then click ‘Create’.
oo-1

The first file we need to create is called add_partition_hive_script.q. This is the actual Hive code used to import data from HDFS into Hive tables. First, click on the directory we just created, and within it, we will create a file by clicking on the ‘New’ button on the right side of the screen and click ‘File’. Name the file add_partition_hive_script.q and click ‘Create’.
oo-2

Now, click on the file we just created. On the next page, we will edit the file by clicking ‘Edit File’ in the left pane.
oo-3

On the next page, copy the following content and paste it into the empty box and click ‘Save’.
ADD JAR ${JSON_SERDE};
ALTER TABLE tweets
ADD IF NOT EXISTS
PARTITION (datehour = ${DATEHOUR})
LOCATION '${PARTITION_PATH}';

oo-4
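For reference, here is roughly what the statement looks like once Oozie substitutes the parameters. The exact values depend on your workflow and coordinator definitions; the example below assumes JSON_SERDE points at the hive-serdes JAR in the workflow’s lib directory and uses the 22nd hour of 2015/05/21 mentioned later in this post.

ADD JAR hdfs://<namenode-host>:8020/user/admin/add-tweet-partitions/lib/hive-serdes-1.0-SNAPSHOT.jar;
ALTER TABLE tweets
ADD IF NOT EXISTS
PARTITION (datehour = 2015052122)
LOCATION '/user/flume/tweets/2015/05/21/22';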

The next file we will create will be the Oozie workflow file. First navigate back to the add-tweet-partitions directory and click on ‘New’ and then ‘File’. We will name this add-partition-hive-action.xml and click ‘Create’.
oo-5

Click on the file we just created and then click ‘Edit File’ on the next page so we can insert the content. In the empty box, copy and paste the content of the workflow file linked below (Note: You will have to change some fields depending on your cluster setup. For instance, the home directory path will depend on your Hue user):

 

oo-6
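We are not reproducing our exact file here, but as a rough sketch (modeled on the hive-action example in the Cloudera blog linked above, with assumed names for the dateHour and partitionPath properties passed in by the coordinator), the workflow looks something like this:

<workflow-app name="add-partition-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="add-partition"/>
  <action name="add-partition">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <job-xml>${nameNode}/user/admin/hive-conf.xml</job-xml>
      <script>add_partition_hive_script.q</script>
      <param>JSON_SERDE=${workflowRoot}/lib/hive-serdes-1.0-SNAPSHOT.jar</param>
      <param>DATEHOUR=${dateHour}</param>
      <param>PARTITION_PATH=${partitionPath}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>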

The next file we will create will be the Oozie coordinator file. First navigate back to the add-tweet-partitions directory and click on ‘New’ and then ‘File’. We will name this add-partition-coord-app.xml and click ‘Create’. (Same process as above)

Click on the file we just created and then click ‘Edit File’ on the next page so we can insert the content. In the empty box, copy and paste the content in the file link below (Note: You will have to change some fields depending on the cluster setup. For instance, change the HDFS address to contain the IP Address or Hostname of your NameNode):

add-partition-coord-app

oo-7
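Again, as a sketch only (based on the coordinator pattern in the Cloudera blog; the datehour formatting, property names, and dataset timezone may differ from the file we actually used, and the NameNode address below is a placeholder), the coordinator ties the hourly dataset under /user/flume/tweets to the workflow like this:

<coordinator-app name="add-tweet-partitions" frequency="${coord:hours(1)}"
                 start="${jobStart}" end="${jobEnd}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="tweets" frequency="${coord:hours(1)}"
             initial-instance="${initialDataset}" timezone="America/Los_Angeles">
      <uri-template>hdfs://<namenode-host>:8020/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="tweets">
      <instance>${coord:current(coord:tzOffset() / 60)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowRoot}/add-partition-hive-action.xml</app-path>
      <configuration>
        <property>
          <name>dateHour</name>
          <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), tzOffset, 'HOUR'), 'yyyyMMddHH')}</value>
        </property>
        <property>
          <name>partitionPath</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>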

Finally, we will create the Oozie job properties file. First navigate back to the add-tweet-partitions directory and click on ‘New’ and then ‘File’. We will name this job.properties and click ‘Create’. (Same process as above)
Click on the file we just created and then click ‘Edit File’ on the next page so we can insert the content. In the empty box, copy and paste the following content.
(Note: You will have to change a few fields to match your cluster. For NameNode, we used the private IP address of our HDFS NameNode. Also, in our case /user/admin is our home directory. Make sure you set the jobStart, jobEnd, tzOffset, and initialDataset values to the appropriate dates and values. For instance, my first folder inside /user/flume/tweets corresponded to the 22nd hour of 2015/05/21, so I set that as my jobStart. Also, older posts will give you 8021 as the jobTracker port, but we have to use 8032, the ResourceManager port used by YARN):

nameNode=hdfs://<namenode-host>:8020
jobTracker=<resourcemanager-host>:8032
workflowRoot=${nameNode}/user/admin/add-tweet-partitions

# jobStart and jobEnd should be in UTC, because Oozie uses UTC for
# processing coordinator jobs by default (and it is not recommended
# to change this)
jobStart=2015-05-29T13:00Z
jobEnd=2015-06-30T23:00Z

# Timezone offset between UTC and the server timezone
tzOffset=0

# This should be set to an hour boundary, and should be set to a time
# of jobStart + tzOffset or earlier.
initialDataset=2015-05-29T13:00Z

# Adds a default system library path to the workflow’s classpath
# Used to allow workflows to access the Oozie Share Lib, which includes
# the Hive action
oozie.use.system.libpath=true

# HDFS path of the coordinator app
oozie.coord.application.path=${nameNode}/user/admin/add-tweet-partitions/add-partition-coord-app.xml

oo-8

Other Configurations

Along with the files above, we need to make a few other configuration changes before we deploy the Oozie job.

Create a copy of hive-site.xml in HDFS

First, as you may have seen in the add-partition-hive-action.xml file, the Oozie workflow will need access to the Hive configuration file. We will copy hive-site.xml to HDFS from the command line.
SSH into the NameNode of the Hadoop cluster. Since this file will be called hive-conf.xml in HDFS, we will rename it now and move it to the /tmp folder on our server first by using the following command:
$ sudo  cp  /etc/hive/conf/hive-site.xml   /tmp/hive-conf.xml

The next command will copy the hive-conf.xml file to our home directory in HDFS (in our case, /user/admin/)
$  sudo  -u  hdfs  hadoop  fs  -copyFromLocal  /tmp/hive-conf.xml   /user/admin/

Create Sub-folder and insert JAR dependencies

Next we will have to create a directory within the add-tweet-partitions directory called lib. Since we are already in the command line, we can use the following command to create a directory.
(NOTE: Replace /user/admin with your home directory and you can also create the directory using Hue)
$  sudo  -u  hdfs  hadoop  fs  -mkdir  /user/admin/add-tweet-partitions/lib

Don’t forget to change the owner of the lib directory from hdfs to admin using the following command:
$  sudo  -u  hdfs  hadoop  fs  -chown  admin:admin  /user/admin/add-tweet-partitions/lib

Then we need to insert the JAR dependencies within this lib directory. The two JAR files we will need are the hive-serdes-1.0-SNAPSHOT.jar and the mysql-connector-java.jar. For hive-serdes-1.0-SNAPSHOT.jar, we can simply copy the one we created earlier in this tutorial that is already placed in our Home directory in HDFS. We can use the following command:
(NOTE: Again, /user/admin is our home directory)
$   sudo  -u  hdfs  hadoop  fs  -cp  /user/admin/hive-serdes-1.0-SNAPSHOT.jar   /user/admin/add-tweet-partitions/lib/

There are a couple of ways to get the mysql-connector-java.jar file into HDFS. First we will need to download the jar file from the following site: http://dev.mysql.com/downloads/connector/j/ . In the drop-down, select Platform Independent and download the ZIP file.
oo-9

Open the ZIP file on your machine and you will see a file called mysql-connector-java-5.1.35-bin.jar. For the sake of simplicity, we will rename the file mysql-connector-java.jar.
oo-10

Finally, we will upload this file into HDFS using the Uploader in Hue. In the add-tweet-partitions/lib/ directory in Hue, click on ‘Upload’ in the top right corner of the screen and select ‘File’. Click on ‘Select File’ and navigate to the mysql-connector-java.jar file on your PC and click Open.
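Alternatively, if you have already copied the renamed JAR onto the server (for example into /tmp), you can load it into the lib directory from the command line, mirroring the hive-serdes copy above:

$ sudo -u hdfs hadoop fs -copyFromLocal /tmp/mysql-connector-java.jar /user/admin/add-tweet-partitions/lib/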

You should now see both of the JAR dependencies in the lib directory.
oo-11
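You can also confirm this from the command line:

$ sudo -u hdfs hadoop fs -ls /user/admin/add-tweet-partitions/lib/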

Copy job.properties from HDFS to NameNode

When the Oozie job command is executed, it will need access to the job.properties file on the local machine. Use the following commands to copy the file from HDFS to the Hadoop NameNode. (NOTE: Run this from the home directory of the Server user)
$ sudo  -u hdfs  hadoop  fs -copyToLocal  /user/admin/add-tweet-partitions/job.properties   /tmp/
$ cp  /tmp/job.properties   ~/

Add Linux user to the SuperGroup

In order for the Oozie job to work, the Linux user running the command must have the correct permissions to access all of the files and directories needed. We did this by adding our Linux user (robwilson) to the HDFS supergroup. Depending on your security measures, you may want to take an alternate route, but this is the one that worked for us. To make this happen, run the following two commands:
$ sudo groupadd supergroup
$ sudo usermod -G supergroup robwilson

Changing the Whitelist for the NameNode and JobTracker

We also ran into another error when running the Oozie job. For the Hadoop system to accept information to and from the Oozie port, 11000, we need to modify the Oozie whitelist in Cloudera Manager (this solution was found in the following community link: https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Running-Oozie/td-p/9020).

To make this change, log into Cloudera Manager and click ‘Oozie’ on the left hand side of the home page. Next, click on the ‘Configuration’ tab near the top of the page.
oo-12

On the Oozie configuration page, the field we have to edit is called Oozie Server Configuration Safety Valve for oozie-site.xml. Copy and paste the whitelist properties (shown in the screenshot below) into that field and click ‘Save Changes’:
oo-13
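The exact entries we used are in the screenshot above; they follow the pattern below (these are the standard Oozie HadoopAccessorService whitelist properties, shown here as a sketch with placeholder hosts rather than a copy of our configuration):

<property>
  <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
  <value><resourcemanager-host>:8032</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
  <value><namenode-host>:8020</value>
</property>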

Finally, restart the Oozie service.

Delete unnecessary directories that will cause errors

Before executing the Oozie job, another change has to be made. The Oozie workflow is very particular about the directory layout of the tweets data. For instance, all of the data under the /user/flume/tweets/ directory should be in the format YEAR/MONTH/DAY/HOUR. In our case, we had a directory called datehour=20150525 within our /user/flume/tweets directory, and this caused an error when we ran our Oozie job. To make sure you do not get this error, go into the /user/flume/tweets directory in Hue and delete any files or directories that do not match this format (such as the datehour=20150525 directory we had to delete).

NOTE: If you do not do this step, you may have to delete and re-create your tweets table in the Hive editor.
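If you prefer to do this from the command line, you can list the directory and remove any stray entries, such as the one we had (adjust the path to whatever non-conforming file or directory you find):

$ sudo -u hdfs hadoop fs -ls /user/flume/tweets/
$ sudo -u hdfs hadoop fs -rm -r /user/flume/tweets/datehour=20150525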

Run the Oozie Job Command

Now that everything is in place, we can run the Oozie job command that will start the workflow. Type the following command in the home directory of your Linux user:
NOTE: Replace <namenode-host> with the IP address or hostname of your Hadoop NameNode. Also, ~/ is the path where my job.properties is located.

$ oozie  job  -oozie  http://<namenode-host>:11000/oozie  -config  ~/job.properties -run

And that’s it! Your Oozie workflow will automatically update your tweets table every hour until the end date you specified. Good luck!
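To confirm the coordinator was submitted and is running, you can query the Oozie server from the same machine (the job ID is printed by the -run command above):

$ oozie jobs -oozie http://<namenode-host>:11000/oozie -jobtype coordinator
$ oozie job -oozie http://<namenode-host>:11000/oozie -info <coordinator-job-id>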

Configuring Hive to work with ODBC Drivers

In general, a Hive table can easily be imported into an application through an ODBC driver. However, in our scenario, we are importing a Twitter streams table that needs access to the hive-serdes-1.0-SNAPSHOT.jar file, so if you try to import the tweets table we created earlier, you will get an error. There is a fix for this, though! A solution to this error was found on the following Tableau Community forum: http://community.tableau.com/message/277174?et=watches.email.thread#277174

In order for the drivers to support JSON data from Hive, they must have access to the JAR file on each of the DataNodes.

To do this, we need to log into the terminal of each server and put the hive-serdes-1.0-SNAPSHOT.jar file in the /opt/ directory. Since we already have the hive-serdes-1.0-SNAPSHOT.jar file stored in HDFS, the easiest way to get this file is by running the following command on each server: (NOTE: In our case the JAR file is located in the /user/admin directory in HDFS)

NOTE: Run on all datanodes

$ sudo  hadoop  fs  -copyToLocal  /user/admin/hive-serdes-1.0-SNAPSHOT.jar  /opt/

Next, we have to go into Cloudera Manager and edit some properties in the Configuration section.

Go to the Cloudera Manager Homepage. Then, click on ‘Hive’ on the left hand side. Once on the Hive page click on ‘Configuration’.
js-1

On the configuration page, we need to change two values. For the field, ‘Hive Auxiliary JARs Directory’, set the value to  /opt/   and click Save Changes.
js-2

The next field we have to change is ‘Auto Create and Upgrade Hive Metastore Database Schema’. Set its value to True (check the box) and click Save Changes.
js-3

Finally, click on the Restart button in the top right corner to complete the configuration.
js-4

Once restarted, you should now be able to import Twitter data into any application using an ODBC driver.

Import Tweets into Hive Table

Now that we have Twitter streams going into HDFS, the next step is to get all of the data and import it into a Hive table. This will give us the ability to query the data and export it in a structured fashion.

Building the JSON SerDe and Uploading to HDFS

The first thing we need to do, which is similar to what we did with ‘flume-sources’ earlier, is build the JSON SerDe from source. This will create a JAR file that will contain all the libraries necessary to convert the JSON structured tweets into a Hive table.

First, log back into the Flume-Agent VM if you have not already done so. Next, go into the hive-serdes directory.
$ cd ~/cdh-twitter-example/hive-serdes

Next we will build the JAR file using the following command:
$ mvn package

The JAR file is now placed in the ‘target’ directory. We will then upload this file to HDFS so we can use it in the future within Hue.
$ cd ~/cdh-twitter-example/hive-serdes/target/
$ sudo -u hdfs hadoop fs -copyFromLocal hive-serdes-1.0-SNAPSHOT.jar /user/admin/

NOTE: You may get the following error: copyFromLocal: `hive-serdes-1.0-SNAPSHOT.jar’: No such file or directory

The reason for this is that the user uploading the file doesn’t have permission to access the local and/or remote directories/files. However, there is a work-around: you can copy the hive-serdes-1.0-SNAPSHOT.jar file to a tmp directory and upload it to HDFS from there:
$ cp ~/cdh-twitter-example/hive-serdes/target/hive-serdes-1.0-SNAPSHOT.jar /tmp/
$ sudo -u hdfs hadoop fs -copyFromLocal /tmp/hive-serdes-1.0-SNAPSHOT.jar /user/admin/

Creating a Hive Directory Hierarchy

Depending on the Hive installation, the following directories may or may not exist. Just to be sure, I would recommend executing the following lines to create the Hive file structure: (NOTE: Depending on your security requirements, permissions for the directories may vary)
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$ sudo -u hdfs hadoop fs -chown -R hive:hive /user/hive
$ sudo -u hdfs hadoop fs -chmod 750 /user/hive
$ sudo -u hdfs hadoop fs -chmod 770 /user/hive/warehouse

Also, you will want to add your Hue username to the Hive Group on the server (in our case, the username is admin):
$ sudo usermod -a -G hive admin

NOTE: If your user exists in Hue but not on your Flume agent VM, you can add the user first before the above command:
$ sudo adduser  admin

Creating Tweets Table in Hue

We can now go back to Hue and start creating the tweets table.

From the homepage in Hue, click on the ‘Query editors’ drop-down and select ‘Hive’
hive-1

To create the table, first we need to add the hive-serdes JAR file from earlier to the path. Click on ‘Settings’ in the top left corner of the screen, and then click the ‘Add’ button underneath ‘File Resources’.
hive-2

Then click on the button next to ‘Path’ to search for the hive-serdes JAR in HDFS and click on hive-serdes-1.0-SNAPSHOT.jar.
hive-3

Next, go back to the ‘Assist’ tab in the top left corner of the screen, and copy-and-paste the following Hive script into the Query editor text box and then click ‘Execute’.

CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT< screen_name:STRING, name:STRING, friends_count:INT, followers_count:INT, statuses_count:INT, verified:BOOLEAN, utc_offset:INT, time_zone:STRING>,
in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

hive-4

Importing HDFS Tweets into Hive Tweets Table

The table structure is now complete. The next step is to import the data into the tweets table from HDFS.

Click on the ‘Metastore Manager’ on the top of the screen.

Next, click on the link for the ‘tweets’ table.
hive-5

Click on ‘Import Data’ on the left side of the screen. Click on the button next to Path and navigate to the last folder where the tweets are located. In the ‘datehour’ field, type the last digits of the directory path as shown below. Then press ‘Enter’ on your keyboard, or click the ‘Submit’ button.
hive-6
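If you prefer not to use the Import Data dialog, the same thing can be done from the Hive query editor with an ALTER TABLE statement (this is essentially what the Oozie workflow described earlier runs every hour; adjust the datehour value and path to match your own first directory):

ALTER TABLE tweets ADD IF NOT EXISTS
PARTITION (datehour = 2015052122)
LOCATION '/user/flume/tweets/2015/05/21/22';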

To check that the tweets are actually in the table, click on ‘Metastore Manager’, then ‘tweets’, and then click the ‘Sample’ tab. You should now see a sample of the tweets on the screen.
hive-7

Sample Queries on the Hive Tweets Table

Now that the information is in the Hive table, we can perform queries to gain valuable information from the tweets.

First go to ‘Hive’ under ‘Query Editors’.

Example 1: Let’s say we want to find the usernames, and the number of retweets they have generated, across all the tweets that we have data for. We will use the following query (paste the query in the Query Editor text box and click ‘Execute’):

SELECT
t.retweeted_screen_name,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
retweeted_status.user.screen_name as retweeted_screen_name,
retweeted_status.text,
max(retweeted_status.retweet_count) as retweets
FROM tweets
GROUP BY retweeted_status.user.screen_name,
retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;

hive-8
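Example 2 (not from the original Cloudera example, just another query worth trying on this schema): count the most common hashtags across all ingested tweets.

SELECT
LOWER(hashtags.text) AS hashtag,
COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;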

Connecting to Hue and Viewing Twitter Streams

Now that we have Twitter streams being ingested through the Flume agents, we can take a look at the content in the HDFS through the Hue UI.

To go to the Hue UI, from the Cloudera Manager home page, click on ‘Hue’ on the left side of the screen.
hue-1

Next, click on ‘Hue Web UI’ under the ‘Quick Links’ section on the left side of the screen.
hue-2

 

This will open a new tab that will direct you to the Hue web interface. If the link does not take you to the Hue login page, you probably have to change the URL to include the Fully Qualified Domain Name (FQDN) or IP address of where your Hue service is running.

Once on the Hue login page, you will need to either login using existing credentials (If you have already accessed Hue before) or you will need to create an account.
hue-3

Once logged in, click on ‘File Browser’ at the top of the page.
hue-4

You are now viewing the file structure within the HDFS.

To view the tweets being ingested, in the directory path, click on ‘user’, then ‘flume’, then ‘tweets’.
Hue-5

 

All of the raw JSON tweets are stored within this file structure. The next step of this process is to create a Hive table to structure all of this data to where we can query and export the tweets. This will be covered in the next section.

Configuring Flume to Ingest Twitter Streams

Adding Flume service to the existing services in Cloudera Manager

In the base install of the Cloudera cluster inside of Azure, Flume is not included as a service. You can do the following to include Flume as a service within Cloudera Manager:

From the Cloudera Manager home page, click on the button next to the cluster name and click ‘Add a Service’.
fl3

Next, click on the radio button next to Flume and click Continue.
fl4

Make sure the radio button next to HDFS is checked and click Continue.
fl5

Next we will add a Flume agent to one of the DataNodes in the cluster. (A Flume agent can be added to multiple nodes for scalability, but we will only use one node for simplicity. NOTE: If done on multiple nodes, the following configuration will have to be done for each node.)

Click on ‘Select hosts’ within the text box, and then check the box for the DataNodes you would like to add a Flume agent to (in our case, we will only install on DN-1); then click OK:
fl6
fl7

Finally, click ‘Finish’ to complete the Flume installation. You should now see the Flume service appear on the Cloudera Manager homepage.
fl8

To turn on the Flume service, click on the drop-down arrow next to Flume and click Start.

Configuring Flume agent to ingest Tweets

Prerequisites:

1. Open Endpoint in Azure to connect to Flume agent via SSH

Within the Azure portal, navigate to the virtual machine (VM) that the Flume agent(s) is installed on (in our case, it’s the VM ending with DN-1). Click on the ‘Settings’ for that VM, and then click ‘Endpoints’. NOTE: This will vary depending on the Azure interface that you see, but the main goal is to get to the Endpoints tab for the server the Flume agent is installed on.
fl9

Next, click on the ‘Add’ button within Endpoints. Name the endpoint ‘SSH’, set both the public and private port to 22, and click OK. (NOTE: It will take a few minutes for the Endpoint to be created)
fl-10
2. Setup SSH connection with Putty using Private key

Once the Endpoint for SSH has been opened in the Azure portal, we can connect to the virtual machine via Putty.

Open putty.exe. Type the hostname or IP address (public virtual IP) of the Flume agent VM in the specified area. Enter a name in the Saved Sessions section so you can recall this session in the future, then click Save. (NOTE: Your hostname or IP will be different)
fl-11

Next, on the left pane, click on  ‘Auth’ under the ‘SSH’ tab. Then click on ‘Browse…’ and find the private key that we saved earlier. (This should be the same private key that was used to create the Cloudera cluster within Azure)
fl-12

Go back to ‘Session’ on the left pane, and click ‘Save’ again
fl-13

Now that the session has been saved with the private key, you can open Putty at any time, select the saved session, and click Open.

Now let’s open the saved session and connect to the VM. First click on the session under ‘Saved Sessions’, then click ‘Load’, and then click ‘Open’.
fl-14

Next, a terminal screen will open asking for a login username (same username that was created in the Azure portal when the Cloudera cluster was initially setup) and a passphrase (same passphrase that was created with PuttyGen). Now you’re logged into the Flume agent VM.
fl-15

3. Install ‘Git’ on the VM

Git needs to be installed in order to download the Flume source code in the next stage. This can be done by typing the following command:

$ sudo yum install git

 

4. Install Maven on the VM

Maven (mvn) needs to be installed in order to compile the flume sources in the next stage. This can be done with the following code: (credit to http://preilly.me/2013/05/10/how-to-install-maven-on-centos/)

$ wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
$ sudo tar xzf apache-maven-3.0.5-bin.tar.gz -C /usr/local
$ cd /usr/local
$ sudo ln -s apache-maven-3.0.5 maven

Next, set up the Maven path system-wide by creating the maven.sh file and adding the following lines:
$ sudo nano  /etc/profile.d/maven.sh

export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Finally, log out and log back in to activate the environment variables. You can double check that Maven has been properly installed by typing  $ mvn -version  after logging back in.
fl-16

 

Building the Custom Flume Source

The Flume source is designed to connect to the Twitter Streaming API and ingest tweets into HDFS. Without creating this JAR file, we will not be able to connect Flume to the Twitter API. We used the following GitHub repository to get the source for both the Flume source and the Hive SerDe, which we will be using later: https://github.com/cloudera/cdh-twitter-example (NOTE: Many of the installation steps were also taken from this site)

First, we will clone a copy of the source code from the GitHub repository (https://github.com/cloudera/cdh-twitter-example) to our VM. On the GitHub site, copy (Ctrl-C) the content within the text box under ‘HTTPS’ on the right side of the screen.
fl-17
Next, in the Flume agent terminal, type the command  $ git clone  and then paste (Ctrl-V) the contents of your clipboard. It should look like the following:
$ git clone https://github.com/cloudera/cdh-twitter-example.git

To build the flume-sources, type the following commands:
$ cd ~/cdh-twitter-example/flume-sources/
$ mvn package

This will create a file called flume-sources-1.0-SNAPSHOT.jar in the directory called target which is located in the flume-sources directory.

To add the flume-sources-1.0-SNAPSHOT.jar to the classpath, the following needs to be done:
1. Create the directory /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
2. Create the directory /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
3. Copy the flume-sources-1.0-SNAPSHOT.jar to both of the directories created.
$  sudo cp ~/cdh-twitter-example/flume-sources/target/flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/

$  sudo cp ~/cdh-twitter-example/flume-sources/target/flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

4. Add one of the above locations to the FLUME_CLASSPATH using the following command:
$ export FLUME_CLASSPATH="/usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar"

Creating the flume folder within HDFS

In order for tweets to be sent to HDFS, there must be a folder created and ready for ingestion. From the Flume Agent Terminal, we can create these folders/directories within HDFS.

First, create the /user/flume and /user/flume/tweets directories:
$ sudo -u hdfs hadoop fs -mkdir /user/flume

$ sudo -u hdfs hadoop fs -mkdir /user/flume/tweets

Next, let’s set the permissions on the flume directory to 777 for simplicity.
NOTE: For security reasons, you may want to change this to another set of permissions.
$  sudo -u hdfs hadoop fs -chmod -R 777 /user/flume

The flume/tweets directory is now ready for ingestion.

Configuring the Flume Agent from Cloudera Manager

Now that we have all the dependencies for the Flume Agent installed, we are ready to configure Flume from the Cloudera Manager Web interface.

From the Cloudera Manager home page, click on Flume on the left side of the screen. Once on the Flume page, click on the ‘Configuration’ tab.
fl-18

From the Flume Configuration page, we need to change two entries and then click Save Changes:
1. Change the entry for ‘Agent Name’ to TwitterAgent
2. Clear all the content within the ‘Configuration File’ field and insert the content of the following page in its place (an abridged sketch of that file appears after the screenshot below): https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/flume.conf

NOTE:
– You must enter the Twitter keys and tokens generated earlier
– After TwitterAgent.sources.Twitter.keywords, insert the keywords you want Twitter to look for when ingesting data
– In the TwitterAgent.sinks.HDFS.hdfs.path field, replace ‘hadoop1’ with the Fully Qualified Domain Name (FQDN) or IP address of the NameNode of your cluster

fl-19
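For orientation, the linked flume.conf follows this general shape (abridged from the cdh-twitter-example repository; treat the linked file as authoritative, and fill in your own keys, keywords, and NameNode address):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode-host>:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100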

Next, click the Refresh button at the top right of the screen to restart and finish the Flume agent configuration.
fl-20

After a few moments, click on ‘Refresh Cluster’ in the top right corner of the screen.
fl-21

Finally, click on ‘Refresh Cluster’ at the bottom right of the screen and then click ‘Finish’ once it is complete.
fl-22

 

The Flume agent should now be ingesting Tweets from the Twitter Streaming API. We will be able to look at the contents in the HDFS through Hue in the next section.