CSCI 3325
Distributed Systems

Bowdoin College
Spring 2019
Instructor: Sean Barker

Hadoop Cluster Setup

Before you are able to run any MapReduce jobs, you will need to configure your assigned machines to run as a Hadoop cluster. Your machines are essentially blank slates, with no software installed or configured beyond the default Linux installation. You have full administrative permissions (aka 'root'), which means you can configure your machines however you like using the sudo command ('superuser do') -- but be careful, since this also lets you modify (or break) arbitrary system files!

Many of these steps need to be repeated on each of your cluster machines. You may wish to open up multiple terminal windows to simultaneously configure all of your machines. Unless otherwise noted, perform each step on each of your cluster machines. The cluster only needs to be set up once for the group; there is no setup that needs to be repeated for each group member (though each group member already has an account on each machine).

You should already have your private key and the set of IP addresses for your machines. If you are getting permissions errors when logging in with your key, remember to set the permissions of your keyfile so it is not world- or group-readable:

$ chmod 600 username-keypair
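For reference, a typical login command then looks something like this (a sketch -- substitute your own keyfile name, username, and machine IP):

$ ssh -i username-keypair username@1.2.3.4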

1. Cluster Pre-Configuration

  1. Most of the commands you'll run on your cluster machines require administrative privileges. One way to get them is to preface every command with sudo, but an easier way (given that we need sudo for almost every command here anyway) is to just spawn a superuser shell:
    $ sudo bash
    #
    Within this shell, every command is implicitly run "sudo". Of course, this also means you should exercise caution with what you run! By convention, we denote a superuser shell prompt by # (as opposed to a non-sudo shell prompt denoted by $).
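    If you're ever unsure which kind of shell you're in, whoami should print root inside a superuser shell, and exit returns you to your normal shell:
    # whoami
    # exit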
  2. Java is not part of the base system, so first you need to install a JDK. Most software on Linux is easily installed via a package manager; on these systems, the package manager is yum:
    # yum install java-1.8.0-openjdk-devel
    Remember to do this on every machine.
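    To confirm that the JDK installed correctly, you can check the runtime and compiler versions (both should report a 1.8.0 version):
    # java -version
    # javac -version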
  3. While we're at it, it's a good idea to make sure all software is up-to-date before you start customizing your machine. Use yum to update everything (this should complete quickly):
    # yum update
  4. We need to configure your machines so that they can SSH among themselves (which Hadoop will use for communication and control). Currently you are able to SSH into any of your machines using your private key, but once logged in you cannot SSH to a different machine (since your private key does not exist on the cluster machines). Thus, you need to copy your private key file from your local machine into the .ssh directory within your home directory on each of your cluster machines. We'll also save it as the default name that SSH expects (id_rsa). Change the file/usernames appropriately for your own key:
    $ scp -i sb-keypair sb-keypair sb@1.2.3.4:~/.ssh/id_rsa
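    If you'd rather not run the scp by hand once per machine, a loop along these lines should also work (a sketch -- substitute your own keyfile, username, and public IPs):
    $ for ip in 1.2.3.4 2.3.4.5 3.4.5.6; do scp -i sb-keypair sb-keypair sb@$ip:~/.ssh/id_rsa; done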
  5. Now, we need to modify the SSH settings to allow root logins. Unfortunately (or perhaps fortunately), Linux makes this rather difficult for security's sake. Be especially careful with these steps, as errors could corrupt your SSH configuration and lock you out of your machine entirely. First, copy your private key file (id_rsa) into the root user's .ssh directory, along with your authorized_keys file (which controls which keys are allowed to log in).
    # cp ~/.ssh/id_rsa /root/.ssh/
    # cp ~/.ssh/authorized_keys /root/.ssh/
    
    Next, open up the sshd (SSH server daemon) configuration file (in any editor):
    # nano /etc/ssh/sshd_config
    Find the PermitRootLogin entry, which should appear commented out like this (this is a line of a file, not a shell command):
    #PermitRootLogin yes
    Uncomment this line by deleting the "#", then save the file and quit. Now, restart the SSH server to make the change take effect:
    # service sshd restart
    Remember to do this on each of your machines.
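    If you prefer a one-liner to editing the file by hand, a sed command along these lines should make the same change (this assumes the entry appears exactly as shown above), followed by the same restart:
    # sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config
    # service sshd restart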
  6. Normally, SSH will prompt you to verify new hosts that you connect to. This behavior can cause headaches for us, so we'll just disable it by editing the ssh config file (note: different from the sshd config file):
    # nano /etc/ssh/ssh_config
    Find the line reading Host * (uncommented - ignore the commented one). Just under it, add the following line:
    StrictHostKeyChecking no
    Save and quit the file.
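    After the edit, that portion of the file should look roughly like this (again, these are lines of the file, not shell commands):
    Host *
        StrictHostKeyChecking no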
  7. Run a quick test on one of the machines to make sure that your SSH configuration is working on all machines. As root, try SSHing to each of your machines (including the machine you're starting from -- i.e., you should be able to SSH into the local machine). If each login succeeds, just run 'exit' to return to the shell you started from and then try the next machine. If you get a "Permission denied" error when connecting to any of your machines, you probably made a mistake during the SSH configuration.
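    To test all machines in one pass, a loop like the following (run as root, with your own IPs substituted) should print each machine's hostname without asking for a password:
    # for ip in 1.2.3.4 2.3.4.5 3.4.5.6; do ssh $ip hostname; done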
  8. Now we need to create a directory for Hadoop to store most of its data. Most importantly, this directory is where Hadoop will store the local data blocks of HDFS (the Hadoop Distributed File System). Since we may be working with a lot of data, we want to put this directory on an external disk instead of the small system drive. Your machines each have 500 GB of external storage attached on a disk separate from the boot drive (which is only 8 GB). This secondary disk (/dev/sdb) is completely unformatted, so first we need to initialize it with a filesystem (we'll use the ext4 filesystem):
    # mkfs -t ext4 /dev/sdb
    Now let's create a directory that we'll use to "mount" (i.e., attach) the new filesystem.
    # mkdir /mnt/data
    Right now this is just a regular directory on the system filesystem - now let's mount the new filesystem and bind it to that directory:
    # mount /dev/sdb /mnt/data
    To check that this worked, run df -h to list all mounted filesystems. In the list, you should see (among other things) the main filesystem mounted on / (size 8 GB) and the newly mounted filesystem on /mnt/data with a size of about 500 GB. Assuming it worked, we can now create a directory on the attached disk for Hadoop to use for HDFS storage:
    # mkdir /mnt/data/hadoop
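    To double-check the new storage, you can list the block devices and the mounted filesystem (the exact sizes reported may differ slightly from the nominal 500 GB):
    # lsblk
    # df -h /mnt/data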
  9. Each of your machines actually has two IP addresses - the public IP (which is what you use to SSH into each machine), and an internal IP (which only works from within the cluster). To find the internal IPs, run ifconfig on each machine - the IP will appear in a line looking like the following (and should start with 172):
    inet 172.15.37.52
    For future reference (and to prevent later confusion), record all your public IPs and their corresponding internal IPs from ifconfig. You will need the internal IPs for configuring Hadoop in the next section.
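    As a quick cross-check, on most Linux systems hostname -I prints all of the machine's IP addresses on one line; the 172.x address it reports should match what ifconfig showed:
    # hostname -I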

2. Hadoop Configuration

  1. First we need to download the Hadoop software. Unfortunately, this isn't quite as easy as running a one-line yum command. On each of your machines, download the Hadoop distribution using wget, unpack the archive, and move it to /usr/local/hadoop:
    # wget https://archive.apache.org/dist/hadoop/core/hadoop-2.9.2/hadoop-2.9.2.tar.gz
    # tar xzvf hadoop-2.9.2.tar.gz
    # mv hadoop-2.9.2 /usr/local/hadoop
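    You can verify the files landed in the right place by listing the new directory, which should contain (among other things) bin/, sbin/, and etc/ subdirectories:
    # ls /usr/local/hadoop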
    
  2. Pick one of your machines to configure as the Hadoop master. We're going to get all of the Hadoop configuration files set up on the master (the next two steps), then just copy them to the rest of your cluster. Make a note of the IP addresses (public and private) of the master. For all of the following configuration files, use the private IPs only - things will break if you use the public IPs!
  3. On your chosen master, open up /usr/local/hadoop/etc/hadoop/slaves in your favorite editor. Change this file to contain the private IP addresses of all your non-master machines (one per line), and remove any other lines. For example:
    1.2.3.4
    2.3.4.5
    3.4.5.6
  4. Now we have to change some configuration files in the same /usr/local/hadoop/etc/hadoop directory. Insert the following XML into each of the files below (leaving the existing XML header intact), replacing the placeholder MASTER-IP with your actual private master IP.

    In core-site.xml:
    <configuration>
       <property>
           <name>fs.default.name</name>
           <value>hdfs://MASTER-IP:9000</value>
       </property>
       <property>
           <name>hadoop.tmp.dir</name>
           <value>/mnt/data/hadoop/tmp</value>
       </property>
    </configuration>
    In hdfs-site.xml:
    <configuration>
       <property>
           <name>dfs.replication</name>
           <value>2</value>
       </property>
    </configuration>
    In yarn-site.xml:
    <configuration>
      <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>MASTER-IP</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>1536</value>
      </property>
      <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>1536</value>
      </property>
      <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>128</value>
      </property>
      <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
      </property>
    </configuration>
    In mapred-site.xml (copy mapred-site.xml.template to create this file, then modify it):
    <configuration>
      <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
      </property>
      <property>
            <name>yarn.app.mapreduce.am.resource.mb</name>
            <value>512</value>
      </property>
      <property>
            <name>mapreduce.map.memory.mb</name>
            <value>300</value>
      </property>
      <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>300</value>
      </property>
      <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx200m</value>
      </property>
      <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx200m</value>
      </property>
    </configuration>
    In hadoop-env.sh, change the export JAVA_HOME=... line to the following:
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk/"
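    A quick way to confirm both the JAVA_HOME setting and the Hadoop install is to ask Hadoop for its version, which should print 2.9.2 along with some build information (if it instead complains about JAVA_HOME, double-check the path above against the contents of /usr/lib/jvm/):
    # /usr/local/hadoop/bin/hadoop version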
  5. Now, copy the configuration files out to the rest of your cluster. Run the following on the master once for each worker node, substituting each worker's IP address:
    # scp -r /usr/local/hadoop/etc/hadoop/* 1.2.3.4:/usr/local/hadoop/etc/hadoop/

3. Launch the Cluster

  1. First we need to start a new distributed file system across our cluster. The basic architecture of HDFS is a single "NameNode", which is a master server that manages the HDFS filesystem, and then any number of "DataNodes", which actually hold the distributed data on attached storage. We will run the NameNode server on your designated master, as well as a DataNode on every machine (including the master). Before we can run any of this, however, we need to format a new distributed filesystem:
    # /usr/local/hadoop/bin/hdfs namenode -format
    If the above command works, it will start the NameNode, run for a few seconds, dump a lot of output, and then exit (having formatted the distributed filesystem).
  2. Now, fire up the HDFS daemon programs (this will start the NameNode as well as all DataNodes on all machines):
    # /usr/local/hadoop/sbin/start-dfs.sh
    This script will start the NameNode locally, then connect to all of the worker machines and start DataNodes there.

    The easiest way to check that this worked is using the jps command, which will list all running Java processes on the local machine. Run jps on the master and you should see a NameNode entry. Also check the other machines - running jps as root on any other machine should now show you that a DataNode is running. If you don't see a DataNode running on every machine, then something went wrong previously. Hadoop logfiles are written to /usr/local/hadoop/logs/, which may tell you something about what went wrong.
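    Another useful check (from any machine) is the dfsadmin report, which summarizes the state of HDFS and should list one live DataNode per machine:
    # /usr/local/hadoop/bin/hdfs dfsadmin -report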
  3. Now, we'll start YARN ("Yet Another Resource Negotiator"), which is the Hadoop framework responsible for resource management and job scheduling (i.e., cluster management). On the master:
    # /usr/local/hadoop/sbin/start-yarn.sh
    As before, use jps to make sure this succeeded. On the master, you should see a new ResourceManager (in addition to the previous NameNode). On each of the other nodes, you should see a NodeManager (but no ResourceManager). If everything seems to be running, proceed to the next section.
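    You can also ask YARN directly which nodes have registered; the following should list one NodeManager per worker machine:
    # /usr/local/hadoop/bin/yarn node -list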

4. Test the Cluster

  1. First we'll test the distributed filesystem. Interaction with HDFS is done exclusively via the hdfs program. To start, let's copy our Hadoop configuration directory (chosen arbitrarily as sample data) into HDFS:
    # /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop /test
    This command copies the Hadoop configuration directory (from our local filesystem) into HDFS (using the 'put' command) as a directory named test in the root HDFS directory. We can use the 'ls' command to view the directory we just copied:
    # /usr/local/hadoop/bin/hdfs dfs -ls /test
    Other DFS commands that work as you'd expect include 'cat' (view a file), 'cp' (copy a file within HDFS), and 'get' (copy a file from HDFS back into the local filesystem). Also note that since this is a distributed filesystem, it doesn't matter which node you run these commands from - they're all accessing the same (distributed) filesystem.
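    For example, to view one of the copied files and then fetch the whole directory back to the local filesystem (the local destination path here is arbitrary):
    # /usr/local/hadoop/bin/hdfs dfs -cat /test/core-site.xml
    # /usr/local/hadoop/bin/hdfs dfs -get /test /tmp/test-copy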
  2. Now let's run an actual MapReduce job using one of the example jobs provided with Hadoop. Here's a distributed grep example that searches for a pattern in the test directory we just stored:
    # /usr/local/hadoop/bin/hadoop jar \
          /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar \
          grep "/test" "/test-output" "dfs[a-z.]+"
    This command will launch the job and print periodic status updates on its progress. You should not see any nasty error messages during the job run if everything is working correctly. Your tiny cluster is unlikely to surprise you with its blazing speed -- expect the job to take a minute or so to execute.
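    One caveat worth knowing: MapReduce will refuse to run a job whose output directory already exists, so if you want to re-run this example, delete the old output first:
    # /usr/local/hadoop/bin/hdfs dfs -rm -r /test-output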
  3. To see the actual output files of the job, we can view them in HDFS (or, alternatively, fetch them to the local filesystem and view them there):
    # /usr/local/hadoop/bin/hdfs dfs -cat /test-output/*
    The output of this example job just includes the matched pieces of text and how many times they appeared in the searched documents. If you'd like to compare to a non-distributed grep (which will also show the entire lines), you can run the following:
    # grep -P "dfs[a-z.]+" /usr/local/hadoop/etc/hadoop/*
  4. To shut down the Hadoop cluster, all you need to do is stop YARN and HDFS, as follows:
    # /usr/local/hadoop/sbin/stop-yarn.sh
    # /usr/local/hadoop/sbin/stop-dfs.sh
    In general, it's fine to leave the cluster running when not running jobs, though if you think you might've broken something, it's a good idea to reboot the cluster by stopping and then restarting the daemons.
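    To bring the cluster back up later, just start the daemons again in the reverse order (HDFS first, then YARN), both from the master:
    # /usr/local/hadoop/sbin/start-dfs.sh
    # /usr/local/hadoop/sbin/start-yarn.sh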
  5. If everything looks good, congratulations, you've built a working Hadoop cluster! The next thing you should do is try running your own Hadoop job.