Before you can run any MapReduce jobs, you will need to configure your assigned machines to run as a Hadoop cluster. Your machines are essentially blank slates, with no software installed or configured beyond the default Linux installation. You have full administration permissions (i.e., 'root'), which means you can configure your machines however you like using the sudo command (superuser-do) -- but be careful, since this also lets you arbitrarily modify system files!
Many of these steps need to be repeated on each of your cluster machines. You may wish to open up multiple terminal windows to simultaneously configure all of your machines. Unless otherwise noted, perform each step on each of your cluster machines. The cluster only needs to be set up once for the group; there is no setup that needs to be repeated for each group member (though each group member already has an account on each machine).
You should already have your private key and the set of IP addresses for your machines. If you are getting permission errors when logging in with your key, remember to set the permissions of your keyfile so it is not world- or group-readable:
$ chmod 600 username-keypair
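For example, to log in to one of your machines (assuming a key file named username-keypair and a public IP of 1.2.3.4 - substitute your own values):
$ ssh -i username-keypair username@1.2.3.4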
You could prefix each of the following commands with sudo, but an easier way (given that we need sudo for almost every command here anyway) is to just spawn a superuser shell:
$ sudo bash
Within this shell, every command is implicitly run as the superuser. Of course, this also means you should exercise caution with what you run! By convention, we denote a superuser shell prompt by # (as opposed to a non-sudo shell prompt, denoted by $).
First, install Java (the OpenJDK 8 development package) using the package manager yum:
# yum install java-1.8.0-openjdk-devel
Remember to do this on every machine.
Then use yum to update everything (this should complete quickly):
# yum update
Next, copy your private key into the .ssh directory within your home directory on each of your cluster machines. We'll also save it as the default name that SSH expects (id_rsa). Change the file/usernames appropriately for your own key:
$ scp -i sb-keypair sb-keypair sb@1.2.3.4:~/.ssh/id_rsa
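Once the key is in place, you should (as an optional sanity check) be able to SSH from one cluster machine to another without naming a key file, since SSH picks up ~/.ssh/id_rsa automatically. For example, with 2.3.4.5 standing in for another machine's IP and sb for your username:
$ ssh sb@2.3.4.5 hostname
If this prints the remote machine's hostname, the key was copied correctly (you may first be prompted to accept the host key).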
On each machine, also copy your private key (id_rsa) to the root user, as well as your authorized_keys file that controls which keys provide access:
# cp ~/.ssh/id_rsa /root/.ssh/
# cp ~/.ssh/authorized_keys /root/.ssh/
Next, open up the sshd (SSH server daemon) configuration file (in any editor):
# nano /etc/ssh/sshd_config
Find the PermitRootLogin entry, which should appear commented out like this (this is a line of the file, not a shell command):
#PermitRootLogin yes
Uncomment this line by deleting the "#", then save the file and quit. Now, restart the SSH server to make the change take effect:
# service sshd restart
Remember to do this on each of your machines.
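Since Hadoop's start scripts will later connect to the other machines over SSH as root, it's worth verifying now that root login works. For example, from the master (with 172.15.37.52 standing in for another machine's internal IP):
# ssh root@172.15.37.52 hostname
This should print the other machine's hostname without asking for a password (you may be asked to confirm the host key the first time).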
Next, open the system-wide ssh client config file (note: this is different from the sshd config file):
# nano /etc/ssh/ssh_config
Find the line reading Host * (uncommented - ignore the commented one). Just under it, add the following line:
StrictHostKeyChecking no
Save and quit the file.
Each machine has a second, larger disk attached as /dev/sdb. This disk is completely unformatted, so first we need to initialize it with a filesystem (we'll use the ext4 filesystem):
# mkfs -t ext4 /dev/sdb
Now let's create a directory that we'll use to "mount" (i.e., attach) the new filesystem:
# mkdir /mnt/data
Right now this is just a regular directory on the system filesystem - now let's mount the new filesystem and bind it to that directory:
# mount /dev/sdb /mnt/data
To check that this worked, run df -h to list all mounted filesystems. In the list, you should see (among other things) the main filesystem mounted on / (size 8 GB) and the newly mounted filesystem on /mnt/data with a size of about 500 GB. Assuming it worked, we can now create a directory on the attached disk for Hadoop to use for HDFS storage:
# mkdir /mnt/data/hadoop
Each machine also has an internal (private) IP address, which you can find by running ifconfig on each machine - the IP will appear in a line looking like the following (and should start with 172):
inet 172.15.37.52
For future reference (and to prevent later confusion), record all your public IPs and their corresponding internal IPs from ifconfig. You will need the internal IPs for configuring Hadoop in the next section.
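If you'd rather not scan the full ifconfig output by hand, a quick filter like the following should pick out the internal address (this assumes your internal IPs do start with 172, as noted above):
# ifconfig | grep 'inet 172'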
Hadoop itself is not available through the yum command. Instead, on each of your machines, download the Hadoop files using wget, unpack the archive, and store them at /usr/local/hadoop:
# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.9.2/hadoop-2.9.2.tar.gz
# tar xzvf hadoop-2.9.2.tar.gz
# mv hadoop-2.9.2 /usr/local/hadoop
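As a quick sanity check that the files are where we expect, you can ask the unpacked installation for its version:
# /usr/local/hadoop/bin/hadoop version
This should report Hadoop 2.9.2. (If it instead complains that JAVA_HOME is not set, don't worry - we set that in hadoop-env.sh below.)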
Next, open /usr/local/hadoop/etc/hadoop/slaves in your favorite editor. Change this file to contain the private IP addresses of all your non-master machines (one per line). For example:
1.2.3.4
2.3.4.5
3.4.5.6
Delete any other contents of the file.
The remaining configuration files also live in the /usr/local/hadoop/etc/hadoop directory. Insert the following XML into each of the files below (leaving the XML header intact), replacing the placeholder MASTER-IP with your actual private master IP.
In core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER-IP:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/data/hadoop/tmp</value>
  </property>
</configuration>
In hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
In yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>MASTER-IP</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
In mapred-site.xml (copy mapred-site.xml.template to create this file, then modify it):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>300</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>300</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx200m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx200m</value>
  </property>
</configuration>
In hadoop-env.sh, change the export JAVA_HOME=... line to the following:
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk/"
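The exact JDK directory name can vary between installs, so it's worth confirming that this path actually exists before moving on:
# ls /usr/lib/jvm/
You should see a java-1.8.0-openjdk entry (possibly alongside a longer, fully versioned directory name); if yours is named differently, point JAVA_HOME at that directory instead.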
Rather than repeating all of these edits on every machine, you can make them once on the master and then copy the configuration directory to each of the other machines (replace 1.2.3.4 with each machine's IP in turn):
# scp -r /usr/local/hadoop/etc/hadoop/* 1.2.3.4:/usr/local/hadoop/etc/hadoop/
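If you'd rather not run that scp once per machine by hand, a small shell loop does the same thing (the IPs here are placeholders for your other machines' addresses):
# for ip in 1.2.3.4 2.3.4.5 3.4.5.6; do scp -r /usr/local/hadoop/etc/hadoop/* $ip:/usr/local/hadoop/etc/hadoop/; done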
With the configuration in place, format the HDFS filesystem by running the following on the master:
# /usr/local/hadoop/bin/hdfs namenode -format
If the above command works, it will start the NameNode, run for a few seconds, dump a lot of output, and then exit (having formatted the distributed filesystem).
Next, start HDFS (again on the master):
# /usr/local/hadoop/sbin/start-dfs.sh
This script will start the NameNode locally, then connect to all of the worker machines and start DataNodes there.
You can check that this worked using the jps command, which lists all running Java processes on the local machine. Run jps on the master and you should see a NameNode entry. Also check the other machines - running jps as root on any other machine should now show that a DataNode is running. If you don't see a DataNode running on every machine, then something went wrong previously. Hadoop logfiles are written to /usr/local/hadoop/logs/, which may tell you something about what went wrong.
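Another way to confirm that every DataNode has joined the filesystem is to ask HDFS itself for a cluster report (run on the master):
# /usr/local/hadoop/bin/hdfs dfsadmin -report
The report should show one live datanode per non-master machine, each with capacity roughly matching its /mnt/data disk.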
Finally, start YARN (also on the master):
# /usr/local/hadoop/sbin/start-yarn.sh
As before, use jps to make sure this succeeded. On the master, you should see a new ResourceManager (in addition to the previous NameNode). On each of the other nodes, you should see a NodeManager (but no ResourceManager). If everything seems to be running, proceed to the next section.
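You can also confirm that each NodeManager has registered with the ResourceManager by listing the cluster's nodes from the master:
# /usr/local/hadoop/bin/yarn node -list
Each non-master machine should appear in the list with a RUNNING state.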
You can interact with the distributed filesystem using the hdfs program. First let's copy our Hadoop configuration directory (arbitrarily chosen) into HDFS:
# /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop /test
This command copies the Hadoop configuration directory (from our local filesystem) into HDFS (using the 'put' command) as a directory named test in the root HDFS directory. We can use the 'ls' command to view the directory we just copied:
# /usr/local/hadoop/bin/hdfs dfs -ls /test
Other DFS commands that work as you'd expect include 'cat' (view a file), 'cp' (copy a file within HDFS), and 'get' (copy a file from HDFS back into the local filesystem). Also note that since this is a distributed filesystem, it doesn't matter which node you run these commands from - they're all accessing the same (distributed) filesystem.
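For example (the particular file name here is just illustrative - any file under /test works), you could view one of the copied config files directly from HDFS, or pull a copy back onto the local disk:
# /usr/local/hadoop/bin/hdfs dfs -cat /test/core-site.xml
# /usr/local/hadoop/bin/hdfs dfs -get /test/core-site.xml /tmp/core-site-copy.xml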
Now let's run one of the example MapReduce jobs that ships with Hadoop - a distributed grep over the files we just copied into HDFS:
# /usr/local/hadoop/bin/hadoop jar \
    /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar \
    grep "/test" "/test-output" "dfs[a-z.]+"
This command will launch the job and print periodic status updates on its progress. You should not see any nasty error messages during the job run if everything is working correctly. Your tiny cluster is unlikely to surprise you with its blazing speed -- expect the job to take a minute or so to execute.
Once the job finishes, you can view its output:
# /usr/local/hadoop/bin/hdfs dfs -cat /test-output/*
The output of this example job just includes the matched pieces of text and how many times they appeared in the searched documents. If you'd like to compare to a non-distributed grep (which will also show the entire lines), you can run the following:
# grep -P "dfs[a-z.]+" /usr/local/hadoop/etc/hadoop/*
To shut the cluster down, run the following (on the master):
# /usr/local/hadoop/sbin/stop-yarn.sh
# /usr/local/hadoop/sbin/stop-dfs.sh
In general, it's fine to leave the cluster running when you're not running jobs, though if you think you might've broken something, it's a good idea to reset the cluster by stopping and then restarting the daemons.