CSCI 3325
Distributed Systems

Bowdoin College
Spring 2019
Instructor: Sean Barker

Project 3 - MapReduce

The goal of this project is to become familiar with MapReduce, a popular model for 'big data' analysis on massive computing clusters that was developed by Google and publicly released by Apache (in the form of Hadoop). To do so, you will write and run several of your own MapReduce jobs on a Hadoop cluster. You will also gain a bit of system administration experience by setting up your cluster from scratch.

This project should be done in teams of two (or three, with prior permission). All team members are expected to work on all parts of the project.

Part 1: Cluster Setup

Each group has been provided with a set of virtual servers hosted in an Amazon datacenter. Your first task is to configure these servers as a small Hadoop cluster by following the steps given here. Setting up your cluster from scratch is likely to take some time.

Once that's done, make sure you can compile and run the sample Hadoop job, which gives you the skeleton of a complete Hadoop program without too much complexity.

Once your cluster is running and you are able to execute MapReduce jobs, your task is to write two such jobs (of which the first is basically a warmup).

Part 2: Inverted Index

An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table that maps words in the documents to some sort of document identifier. For example, given the following two documents:

      Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
      Doc2: Buffalo are mammals. 

We could construct the following inverted file index:

      Buffalo -> Doc1, Doc2
      buffalo -> Doc1
      buffalo. -> Doc1
      are -> Doc2
      mammals. -> Doc2 

Your goal is to build an inverted index of words to the documents which contain them. You can try this on the files in the dataset located here. You will need to copy these files to your cluster.

Your end result should be something of the form: (word, docid[]).

The actual logic of this job is quite straightforward, so your primary task will be learning your way around the basic Hadoop classes. Expect to spend some time reading Javadoc. Apache also has a MapReduce tutorial (linked below) that you may find useful. One specific tip that may help when filtering through web answers to Hadoop questions: stick to the org.apache.hadoop.mapreduce classes and avoid the org.apache.hadoop.mapred classes. The latter provide much of the same functionality but are part of an old (deprecated) Hadoop API.
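To give a sense of the overall shape of the code, a mapper/reducer pair for the inverted index might look something like the following (a sketch, not a complete solution; the class names are placeholders, and you will still need a driver class to configure and submit the job):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (word, document name) for every whitespace-separated word in a line.
class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The name of the file this split came from serves as the document ID.
    String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
    for (String word : line.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new Text(doc));
      }
    }
  }
}

// Collects the distinct documents seen for each word.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text word, Iterable<Text> docs, Context context)
      throws IOException, InterruptedException {
    Set<String> unique = new HashSet<>();
    for (Text doc : docs) {
      unique.add(doc.toString());
    }
    context.write(word, new Text(String.join(", ", unique)));
  }
}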

Part 3: Internet Advertising

Suppose you work for an internet advertising company and want to better target your ads to users based on prior data. In other words, given an advertising context, we would like to predict which of our available advertisements is most likely to result in a click.

The ad serving machines produce two types of log files: impression logs and click logs. Each time we display an advertisement to a customer, we add an entry to the impression log. Each time a customer clicks on an advertisement, we add an entry to the click log. Ideally, every ad impression would result in a click (in reality, of course, the number of impressions will be much greater than the number of clicks).

We are interested in determining which ads should be shown on which pages in order to increase the number of clicks per impression. In particular, given a page URL on which we will be showing an ad (often called the referrer) and a particular ad ID, we wish to determine the click through rate, which is the percentage of impressions with the specified referrer and ad ID that were clicked. Clearly, a higher click through rate suggests that a particular ad is better suited to a particular page.
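For example, if a particular ad was shown 1,000 times on a particular referrer page and 25 of those impressions were clicked, the click through rate for that (referrer, ad) pair would be 25 / 1000 = 2.5%.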

Your goal is to build a summary table of click through rates, which could later be queried by ad serving machines to determine the best ad to display. Logically, this table is a sparse matrix (i.e., a matrix in which most values are 0) with one axis representing the referrer and the other axis representing the ad. The matrix values represent the click through rates.

Test Logs

You will be provided with a test dataset of impression and click logs from the ad serving machines. First, you should download the log archive on a cluster machine:
wget https://bowdoin.edu/~sbarker/teaching/courses/distributed/19spring/files/p3-data.tar.gz

Extracting this archive will produce two directories: impressions and clicks, containing their respective logfiles. Take a look at a few of the logfiles to get a sense of their content. The logfiles are stored in JSON format, with one JSON object per line. In particular, note that every impression is identified by an "impressionId", and each click is likewise associated with an "impressionId" indicating which impression (i.e., which displayed ad) was clicked. You can try grepping for a particular impression ID in both the impression and click logs to see the matching entries in each.

Now, copy the log directories into HDFS (e.g., using hdfs dfs -put). Note that you may see warnings like the following from HDFS during the copy, which you can safely ignore:

19/03/28 19:26:45 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
  at java.lang.Object.wait(Native Method)
  at java.lang.Thread.join(Thread.java:1252)
  at java.lang.Thread.join(Thread.java:1326)
  at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
  at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)

Before you can operate on the logfiles directly, a bit of preprocessing is in order. MapReduce will more efficiently operate on a small number of large files, but here we have a large number of small files instead. Thus, we'll first use MapReduce to merge all of the files together (for impressions and clicks, respectively). Rather than writing a custom job for this, we can just use the "sort" example job included with Hadoop, as in the following (note that you may need to modify the HDFS paths depending on where you copied the files):

/usr/local/hadoop/bin/hadoop jar \
      /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar \
      sort /impressions /impressions_merged \
      -inFormat org.apache.hadoop.mapreduce.lib.input.TextInputFormat \
      -outFormat org.apache.hadoop.mapreduce.lib.output.TextOutputFormat \
      -outKey org.apache.hadoop.io.LongWritable \
      -outValue org.apache.hadoop.io.Text

Expect this job to take roughly 30 minutes to complete on your cluster. Repeat this step to merge the click logs (which should take another 30 minutes or so).

Take a look at the output of the merge jobs. Each line of the merged files begins with a number (which wasn't part of the original files) and then the data line itself (a JSON object). This difference is due to the Hadoop sort outputting numeric keys with the actual file data as the values. One straightforward way to handle the merged files as input to your job is to write your map input as LongWritable, Text and just ignore the numeric key.
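As a concrete (if simplified) sketch, a mapper over the merged files might look like the following; the class name is a placeholder, and the map output types will depend on your design:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads one line of a merged logfile. The LongWritable key is ignored, and
// anything before the first '{' (i.e., the numeric key added by the sort job,
// if it appears in the line text) is stripped off to recover the JSON object.
class MergedLogMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable ignored, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int start = line.indexOf('{');
    if (start < 0) {
      return; // no JSON object on this line; skip it
    }
    String json = line.substring(start);
    // ... parse the JSON and emit whatever key/value pair your design needs ...
  }
}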

ClickRate Job

Your custom MapReduce job should be named ClickRate.java and should operate on the merged data files. The job should be provided three arguments: (1) the merged impression files, (2) the merged click files, and (3) the output path. In other words, your job should be executed as follows:

/usr/local/hadoop/bin/hadoop jar build.jar ClickRate [impressions_merged] [clicks_merged] [out]

The output of your job must be in the following exact format:

[referrer, ad_id] click_rate

This can be achieved by making the output key the string "[referrer, ad_id]" and the output value the fractional value click_rate.
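For instance, a final reducer might format its output along these lines (a sketch only; the "clicks:impressions" value format here is a made-up intermediate representation, not something your job is required to use):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a final reducer. It assumes the incoming key is already the
// string "[referrer, ad_id]" and that each value has the (made-up) form
// "clicks:impressions" produced by an earlier stage of the job.
class ClickRateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long clicks = 0;
    long impressions = 0;
    for (Text value : values) {
      String[] parts = value.toString().split(":");
      clicks += Long.parseLong(parts[0]);
      impressions += Long.parseLong(parts[1]);
    }
    double rate = (impressions == 0) ? 0.0 : (double) clicks / impressions;
    context.write(key, new DoubleWritable(rate));
  }
}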

When processing the data in your job, you'll need to parse the JSON objects. The easiest way to do so is using a JSON library such as JSON.simple. You can tell Hadoop to include libraries (stored in jarfiles) by adding the jarfiles to a folder named lib and including this folder in the jar you create. Here is a Makefile that will do this:

LIBS=/usr/local/hadoop/share/hadoop
NEW_CLASSPATH=lib/*:${LIBS}/mapreduce/*:${LIBS}/common/*:${LIBS}/common/lib/*:${CLASSPATH}

SRC = $(wildcard *.java) 

all: build

build: ${SRC}
  ${JAVA_HOME}/bin/javac -Xlint -cp ${NEW_CLASSPATH} ${SRC}
  ${JAVA_HOME}/bin/jar cvf build.jar *.class lib

Hadoop will distribute the lib folder to all nodes in the cluster and automatically include it in the classpath. You can download the JSON.simple jarfile here.
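As a sketch of what the parsing itself looks like with JSON.simple (the "impressionId" field name comes from the logs; any other field names you need, confirm them by inspecting the logfiles yourself):

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Small helper for pulling a field out of one JSON log line.
class LogLine {
  static String impressionId(String jsonLine) {
    JSONParser parser = new JSONParser();
    try {
      JSONObject record = (JSONObject) parser.parse(jsonLine);
      // Assumes the field is stored as a JSON string.
      return (String) record.get("impressionId");
    } catch (ParseException e) {
      return null; // malformed line; the caller can choose to skip it
    }
  }
}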

Sketching out your job on paper is almost certainly a better approach than trying to develop it on the fly as you write your code. While this job involves more complexity than the inverted index, the amount of actual code needed is still quite modest. Your code should compile without any warnings, except for several "bad path element" warnings that you can safely ignore.

Finally, you may find it helpful to design your job as a sequence of two separate MapReduce operations rather than a single one. While the task is doable as a single MapReduce operation, a two-stage design is often simpler, and you can easily run the stages in sequence by constructing multiple Job objects within your run method.
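For example, chaining two operations might look roughly like this (a sketch with placeholder mapper/reducer class names and an assumed intermediate output path; it presumes the standard Tool/run skeleton):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a run() method that executes two MapReduce operations in sequence.
// The mapper/reducer class names and the intermediate path are placeholders.
public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  Path impressions = new Path(args[0]);
  Path clicks = new Path(args[1]);
  Path output = new Path(args[2]);
  Path intermediate = new Path(args[2] + "_intermediate");

  // Stage 1: for example, join clicks to impressions by impression ID,
  // reading the two input directories with different mappers.
  Job stage1 = Job.getInstance(conf, "clickrate stage 1");
  stage1.setJarByClass(ClickRate.class);
  MultipleInputs.addInputPath(stage1, impressions, TextInputFormat.class, ImpressionMapper.class);
  MultipleInputs.addInputPath(stage1, clicks, TextInputFormat.class, ClickMapper.class);
  stage1.setReducerClass(JoinReducer.class);
  stage1.setOutputKeyClass(Text.class);
  stage1.setOutputValueClass(Text.class);
  FileOutputFormat.setOutputPath(stage1, intermediate);
  if (!stage1.waitForCompletion(true)) {
    return 1;
  }

  // Stage 2: for example, aggregate per (referrer, ad) pair and compute rates.
  Job stage2 = Job.getInstance(conf, "clickrate stage 2");
  stage2.setJarByClass(ClickRate.class);
  stage2.setMapperClass(RateMapper.class);
  stage2.setReducerClass(RateReducer.class);
  stage2.setOutputKeyClass(Text.class);
  stage2.setOutputValueClass(Text.class);
  FileInputFormat.addInputPath(stage2, intermediate);
  FileOutputFormat.setOutputPath(stage2, output);
  return stage2.waitForCompletion(true) ? 0 : 1;
}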

Writeup and Submission

Your writeup can be relatively modest for this assignment -- you should discuss the design of your MapReduce jobs and give an overview of how you implemented them in Hadoop. You should also provide a piece of your final output (i.e., part of the final click rate data). Remember to ensure that your output is in the format specified above!

To submit your assignment, submit a gzipped tarball to Blackboard. Please include the following files in your tarball:

  1. Your writeup (PDF).
  2. All of your source code files (please do not include any executables).
  3. Your Makefile (or compile instructions) and your jar file (build.jar).
  4. A snippet of your final output (either part of or separate from your writeup).

Resources