Background
The goal of this assignment is to learn how to evaluate the performance and limitations of distributed batch processing systems.
Motivation
Batch processing frameworks generally provide an easy-to-use environment for developing and executing tasks in parallel, both on a single computer and across a network of computers. MapReduce is a batch processing framework, and Apache Hadoop implements this framework as an open source project. The motivation for this assignment is to familiarize yourself with the advantages and limitations of batch processing frameworks through practical use of Apache Hadoop MapReduce.
Task
Evaluate the performance of Apache Hadoop MapReduce for the following workloads:
- A workload where each step depends on the output of the previous step, such as Fibonacci Numbers.
- A parallelizable workload with multiple barriers, such as Shearsort.
- A workload that is easy to parallelize, such as the wordcount example from the Apache Hadoop MapReduce getting started guide.
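To make "multiple barriers" concrete for the Shearsort workload, here is a minimal single-process Python sketch. It is not a Hadoop job, and the function names `shearsort` and `snake_flatten` are illustrative assumptions rather than part of the assignment. The rows (or columns) within one phase can be sorted independently, but no phase may start before the previous one has finished everywhere. In a MapReduce setting each phase would plausibly become its own job.

```python
import math


def shearsort(grid):
    """Sort a rectangular grid into snake-like order (single-process sketch)."""
    rows = [list(r) for r in grid]
    n_rows = len(rows)
    # ceil(log2(rows)) + 1 rounds suffice in the standard analysis of Shearsort.
    rounds = math.ceil(math.log2(n_rows)) + 1 if n_rows > 1 else 1
    for _ in range(rounds):
        # Row phase: even-indexed rows ascending, odd-indexed rows descending.
        # Rows are independent of each other, so this phase is parallelizable.
        rows = [sorted(r, reverse=(i % 2 == 1)) for i, r in enumerate(rows)]
        # Barrier: every row must be finished before any column is sorted.
        # Column phase: every column ascending, again independent of each other.
        cols = [sorted(c) for c in zip(*rows)]
        rows = [list(r) for r in zip(*cols)]
        # Barrier: every column must be finished before the next row phase.
    return rows


def snake_flatten(grid):
    """Read the grid in snake order: left to right, then right to left, and so on."""
    out = []
    for i, row in enumerate(grid):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out
```

Reading the result with `snake_flatten` should give the values in ascending order. The number of rounds grows with the logarithm of the number of rows, which is what makes the per-phase overhead of the framework interesting to measure.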
You should test the workloads in a Hadoop configuration on a single machine, and measure how the workloads scale with one to (at least) ten slaves.
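One possible way to summarize how a workload scales is to compute speedup and parallel efficiency relative to the one-slave run. The Python sketch below assumes you have recorded one wall-clock runtime per slave count. The function name `scaling_table` and the input layout are assumptions made for illustration, and other metrics may suit your results better.

```python
def scaling_table(runtimes):
    """Return (slaves, speedup, efficiency) tuples.

    runtimes: dict mapping number of slaves -> wall-clock runtime in seconds.
    The run with 1 slave is used as the baseline (an assumption of this sketch).
    """
    baseline = runtimes[1]
    table = []
    for n in sorted(runtimes):
        speedup = baseline / runtimes[n]   # how many times faster than 1 slave
        efficiency = speedup / n           # 1.0 would mean perfect linear scaling
        table.append((n, speedup, efficiency))
    return table


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT real measurements.
    placeholder = {1: 100.0, 2: 50.0, 4: 40.0}
    for n, s, e in scaling_table(placeholder):
        print(f"slaves={n} speedup={s:.2f} efficiency={e:.2f}")
```

Replace the placeholder dictionary with your own measured runtimes. Repeating each measurement several times before summarizing is likely to make the numbers more trustworthy.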
Getting started with Apache Hadoop MapReduce
Visit http://hadoop.apache.org/mapreduce/ and read their getting started guide.
Those who want to read about MapReduce from a slightly different perspective can take a look at the MapReduce tutorial at Google Code University.
Machines
For this assignment you can use your own machines, or machines at IFI.
Machines at IFI are usually labeled with their names, so you should be able to find machines by visiting the bachelor and master labs. Bachelor machines can also be found using:
Find the names of the bachelor labs:
$ ls /hom/peder/opt/termstue/share/maps/
Show the machines in a lab (<ROMNAVN> is the room name):
$ ~termvakt/bin/termstue <ROMNAVN>
Example:
$ ~termvakt/bin/termstue assembler
Note: The bachelor machines run the Idle Job Killer script. This script kills any processes on machines where you are not logged in. One solution is to be logged in and do something on every machine; another is to set a nice value (this is explained in the emails Idle Job Killer sends you when it kills a process).
Assignment
Solve the task in a group of two (or alone). Present your experiences and results orally in the course INF507x on November 4, 8:15. It is mandatory to write a report in the specified format and deliver it by November 4, 8:15, using Devilry. Keep in mind that (a) it is possible to update the report until November, and (b) you will be asked to choose 4 out of 5 reports for evaluation.
It is mandatory to present your group's results on November 4. You do not have to prepare a formal presentation (like a PowerPoint slide set); however, you must at least show the measurement results that are included in your report and that you discuss in class. The discussions in class are supposed to help you improve your report for final delivery. It is recommended that you have a web page or a PDF document that is web-accessible from an arbitrary computer.
Report
The written report may be up to 4 pages long in ACM format (see right column). It is expected that such a report includes: a description of the assignment, a description of the testbed, an explanation of the metrics that were chosen to present the measurement results visually, graphs showing the results, and an interpretation of the graphs.
The results must be based on your own tests.
The report is evaluated on writing quality and on the trustworthiness and correctness of the results. The evaluation does not consider whether related work (citations of other papers) is included. It is not necessary to cite existing work in this report.