EC2 Performance, Spot Instance ROI and EMR Scalability

Update: Be sure to check out my screencast covering this project.

Note: This is a very long, technical and detailed discussion of Amazon Web Services. You can watch the YouTube video below for a less technical explanation or skip to the conclusion to get the results.

Introduction

In 2006, Amazon introduced Elastic Compute Cloud (EC2) to Amazon Web Services (AWS).  In 2009, Amazon introduced Elastic MapReduce (EMR).  EMR uses Hadoop to create MapReduce jobs using EC2 instances, with Simple Storage Service (S3) as the permanent storage mechanism.  In 2011, Amazon added Spot Instance support for EMR jobs.  Spot Instances allow you to bid on EMR or EC2 instances that are not in use.  The pricing page, under the Spot Instances heading, gives up-to-date EMR and EC2 instance prices.

In 2011, I created the Million Monkeys Project (source code).  It is a good metric for CPU and memory speed in a Hadoop cluster, as its character group testing is very CPU and memory intensive.  This project uses the Million Monkeys code to profile the various EC2 instances and the scalability of EMR and Hadoop.  I will talk about the cost savings of running EMR jobs as Spot Instances (bid price) instead of On Demand instances (full price).  This post will help engineers choose the right EC2 instance types based on the amount of work or computation needed.

When I originally ran the Million Monkeys Project to recreate every work of Shakespeare, I lacked the resources to run it entirely on EMR.  I started the project on an EC2 micro instance, but the instance lacked enough RAM to run everything I needed.  This time, I have the resources to run the entire project and recreate every work of Shakespeare on EMR using a 20 node EMR cluster.

Setting Up An EMR Cluster

To run a Hadoop cluster on EC2, various Hadoop services like the Task Tracker and the DFS service need to be running.  This is in addition to the actual Map and Reduce tasks that will do the actual work.  In an EMR cluster, the various Hadoop services run on a master instance group.  The Map and Reduce tasks run on a core instance group.  The core instance group is made up of 1 or more EC2 instances.  When creating the EMR cluster, you can choose a different instance type for the master and core nodes.  You can use the information in this post to decide which instance type to use for a given task.

An EMR cluster is built on EC2 instances, and these instances run the various parts of the Hadoop cluster.  The data can reside in S3 and be loaded from S3 into the Hadoop Distributed File System (DFS).  The compiled code or JAR and any input files are stored in an S3 bucket.  At the end of a job, all files that are not in S3 at the termination of the master instance group will be lost.  Therefore, you should make sure that the code places any important output in S3.  In the Million Monkeys code, I created a prefix that could be added to a file’s path to place it directly on S3.
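
As a rough illustration of that prefix idea, here is a minimal sketch; the bucket name and helper class are hypothetical, not the actual Million Monkeys code:

import org.apache.hadoop.fs.Path;

public class OutputPaths {
    // Hypothetical S3 prefix; an empty prefix would leave files on the
    // cluster's DFS, where they are lost when the cluster terminates.
    private static final String DURABLE_PREFIX = "s3://my-bucket/output/";

    // Prepending the prefix routes a file directly to S3 instead of DFS.
    public static Path durable(String relativePath) {
        return new Path(DURABLE_PREFIX + relativePath);
    }
}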

Table 1.1 The Breakdown of Various EC2 Instance Specifications

| Instance Name | Memory | EC2 Compute Units and Cores | Platform | I/O Performance |
| Small | 1.7 GB | 1 ECU on 1 core | 32-bit | Moderate |
| Large | 7.5 GB | 4 ECUs on 2 cores | 64-bit | High |
| Extra Large | 15 GB | 8 ECUs on 8 cores | 64-bit | High |
| High-CPU Medium | 1.7 GB | 5 ECUs on 2 cores | 32-bit | Moderate |
| High-CPU Large | 7 GB | 20 ECUs on 8 cores | 64-bit | High |
| Quadruple Extra Large | 23 GB | 33.5 ECUs on 8 cores | 64-bit | Very High |

Source.  Note: The High-Memory Extra Large, High-Memory Double Extra Large, and High-Memory Quadruple Extra Large instances were not tested and are not included in this table.

Instance Testing

EC2 has various instances and performance specifications for those instances.  These EC2 instances are analogous to running a virtual machine in the cloud.  As shown in Table 1.1, each instance type varies in the number of EC2 Compute Units (ECU), the number of virtual cores, the amount of RAM, the 32- or 64-bit platform, the amount of disk space, and network or I/O performance.  Some of these descriptions are quite nebulous.  For example, this is the description from Amazon regarding the definition of an ECU (Source):

EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent
CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

That description of CPU capacity does not really help in making capacity decisions or even in guessing how to scale an application.  To this end, I ran various tests to give a concrete idea of how each instance compares to the others when running the same tests.

For these tests, I ran the Million Monkeys program for 5 continuous hours.  During this time, the Million Monkeys code runs in a loop and the total number of character groups is calculated.  A character group is a group of 9 characters that is randomly generated using a Mersenne Twister; its existence is then checked against every work of Shakespeare.  Each run lasted slightly over 5 hours, so the character group counts are pro-rated to exactly 5 hours.
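
A minimal sketch of what generating and testing one character group could look like, assuming Apache Commons Math’s MersenneTwister and a lowercase a-z alphabet; the real details are in the project source:

import java.util.Set;
import org.apache.commons.math3.random.MersenneTwister;

public class CharacterGroups {
    private static final int GROUP_SIZE = 9;
    private final MersenneTwister rng = new MersenneTwister();

    // Randomly generate one 9 character group.
    public String nextGroup() {
        StringBuilder sb = new StringBuilder(GROUP_SIZE);
        for (int i = 0; i < GROUP_SIZE; i++) {
            sb.append((char) ('a' + rng.nextInt(26))); // assumed alphabet
        }
        return sb.toString();
    }

    // Check the group's existence against the source text's 9 character
    // substrings (shown here as a simple Set; the real code is smarter).
    public boolean existsInShakespeare(String group, Set<String> substrings) {
        return substrings.contains(group);
    }
}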

Chart 1.1 Total Character Groups Checked In a 5 Hour Pro-rated Period

Chart 1.1 does not present any surprises.  The Small instance obviously has the fewest character groups, followed by Hi-CPU Medium.  Large comes in third and Extra Large and Hi-CPU Large are a virtual tie, with Extra Large coming out slightly higher.  Quadruple Extra Large is the obvious winner with the highest total character groups.  Chart 1.1 gives an idea of the raw computing power of each instance.  It is not until we start looking at price per unit that we get a handle on cost efficiency of a particular instance.

In the original Million Monkeys project, I ran the entire Hadoop cluster on my home computer, an Intel Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.  In 5 hours, my home computer checked 50,000,000,000 character groups.  One of the main differences between my home computer and the EC2 instances is that my home computer was not running in a virtualized environment.  I have seen a 10-30% decrease in efficiency when using virtualization.  Also, all processing was done locally, with the Hadoop services and MapReduce tasks running on the same computer.

Chart 1.2 Spot Instance Savings Per Hour When Compared to On Demand

Spot Instances help reduce the cost of running an EMR cluster.  Spot Instance prices fluctuate with the market price.  Chart 1.2 represents the Spot Instance (bid) prices relative to their On Demand (full) prices at the time I ran these tests.  The savings in these tests were very even across all instances, at about 65% off the On Demand prices.  With a little bit of forward planning, an EMR cluster can save a lot of money using Spot Instances.

I should point out that running on a Spot Instance does not require a code change per se.  However, an EMR job flow’s Spot Instances can be taken away because of market price fluctuations.  A MapReduce job flow may need to be changed to accommodate an unplanned stoppage.  This might include saving the job state and adding the ability to start back up where it left off at the last save point.  The Million Monkeys code already did this and could take advantage of the Spot Instances without any code changes.
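
The save-and-resume pattern can be as simple as the following sketch, where the job periodically records its position in S3 and reads it back on startup.  The checkpoint path and format here are illustrative assumptions, not the project’s actual scheme:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Checkpoint {
    // Assumed location; anything in S3 survives losing the Spot instances.
    private static final Path CKPT = new Path("s3://my-bucket/checkpoint");

    public static void save(Configuration conf, long groupsChecked) throws IOException {
        FileSystem fs = CKPT.getFileSystem(conf);
        try (FSDataOutputStream out = fs.create(CKPT, true)) { // overwrite
            out.write(Long.toString(groupsChecked).getBytes(StandardCharsets.UTF_8));
        }
    }

    public static long restore(Configuration conf) throws IOException {
        FileSystem fs = CKPT.getFileSystem(conf);
        if (!fs.exists(CKPT)) {
            return 0L; // no checkpoint yet; start from the beginning
        }
        try (FSDataInputStream in = fs.open(CKPT)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            return Long.parseLong(new String(buf, 0, n, StandardCharsets.UTF_8).trim());
        }
    }
}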

Chart 1.3 Cost Per Hour For On Demand and Spot Instances

Chart 1.3 shows another cost breakdown by hour of usage and total costs.  The total cost for a single node EMR cluster can be calculated using Table 1.2.  For an On Demand cluster, the total cost per hour is the master instance group plus the core instance group, plus the EMR charges for all instances.  For a Spot cluster, the total cost per hour is the master instance plus the core instance group at the Spot price, plus the EMR charges for all instances.

For example, when I ran the Hi-CPU Medium instance testing, I paid a Spot price for the core instance group of $0.06 per hour ($0.17 On Demand).  I also had to pay for the EMR cluster’s master node ($0.17 per hour), which was a Hi-CPU Medium instance as well.  On top of that, I had to pay the EMR price per hour ($0.03) for both the master and core nodes.

To help illustrate the total pricing, Table 1.2 details the breakdown of total price per hour for Spot and On Demand instances.

Table 1.2 Spot Instance and On Demand Price Calculation (using the Hi-CPU Medium prices above)

| Price Description | Spot Instance Price | On Demand Price |
| Master Node | $0.17 | $0.17 |
| Master Node EMR | $0.03 | $0.03 |
| Core Node | $0.06 | $0.17 |
| Core Node EMR | $0.03 | $0.03 |
| Total Price Per Hour | $0.29 | $0.40 |

The master node runs On Demand in both columns (see below).
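
The same arithmetic as a small helper, preloaded in the usage example with the Hi-CPU Medium prices quoted above (prices change, so treat the numbers as illustrations only):

public class EmrHourlyCost {
    // The master stays On Demand; every node (master or core) also pays
    // the EMR fee on top of its instance price.
    public static double totalPerHour(double masterPrice, double corePrice,
                                      double emrPricePerInstance, int coreNodes) {
        return (masterPrice + emrPricePerInstance)
             + coreNodes * (corePrice + emrPricePerInstance);
    }

    public static void main(String[] args) {
        double spot = totalPerHour(0.17, 0.06, 0.03, 1);     // $0.29
        double onDemand = totalPerHour(0.17, 0.17, 0.03, 1); // $0.40
        System.out.printf("Spot: $%.2f/hr  On Demand: $%.2f/hr%n", spot, onDemand);
    }
}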

It is possible to run the master node as a Spot instance instead of an On Demand instance.  However, Amazon recommends running the master node as an On Demand instance to prevent a market price spike from taking out your master node and stopping the entire cluster.

For these tests, I varied the master node instance type.  Table 1.3 lists the core instance group type and the master instance group type used in each test.

Table 1.3 Core Instance Group Type Used With Master Group Type Test

| Core Instance Group Type | Master Instance Group Type |
| Small | Small |
| Large | Large |
| Extra Large | Extra Large |
| Hi-CPU Medium | Hi-CPU Medium |
| Hi-CPU Large | Hi-CPU Medium |
| Quadruple Extra Large | Hi-CPU Medium |

Chart 1.4 Cost Per 100,000,000 Character Groups Checked

Breaking down the data into price per unit gives insight into the most cost efficient means of running a job.  In Chart 1.4, I break down the cost by how much it costs to process 100,000,000 character groups.  For Chart 1.4, the lower the number the better.  This bore out my hunch that the best bang for the monkey buck is a Hi-CPU Medium instance.  I was surprised that the Small instance didn’t come in second best; that position was taken by the Large instance.

Once again, we can see the cost benefits of using a Spot instance.  Across the board, the Spot instances have a much smaller cost range than their On Demand counterparts.  The Spot instances went from $0.00128 to $0.00497, while the On Demand instances went from $0.00364 to $0.0142.

Scalability Testing

The instance testing above led up to the next phase of the project.  In Chart 1.4, we found that the Hi-CPU Medium instances provided the highest cost efficiency per character group.  Now, I will take the most cost efficient instance and see how well it scales by adding more nodes to the cluster.  For these tests, I created EMR clusters of 1, 2, 3, 4, 5, 10 and 20 nodes.  Once again, I ran each cluster size for 5 hours and captured the results.

Chart 2.1 Spot Instance Savings Compared to On Demand Prices

In Chart 2.1, I show the cost savings by comparing Spot and On Demand prices across cluster sizes.  The bars with the “All” designations show the entire cost roll-up for that cluster size.  The savings rate on the core instances is consistent across all cluster sizes; however, having more nodes running at once increases the total savings.

Chart 2.2 Cost Per Hour When Running Various Numbers of Nodes

Chart 2.2 shows another cost breakdown by hour of usage and total costs for Spot and On Demand instances.

To help illustrate the total pricing with a multi-instance core group, Table 2.1 details the breakdown of total price per hour for Spot and On Demand instances for a 10 node cluster.  The calculator sketch above handles this case through its coreNodes parameter.

Table 2.1 Spot and On Demand Instance Price Calculation (10 node cluster)

| Price Description | Spot Instance Price | On Demand Price |
| Master Node | 1 × master On Demand price | 1 × master On Demand price |
| Master Node EMR | 1 × EMR price per hour | 1 × EMR price per hour |
| Core Nodes | 10 × core Spot price | 10 × core On Demand price |
| Core Node EMR | 10 × EMR price per hour | 10 × EMR price per hour |
| Total Price Per Hour | sum of the above | sum of the above |

As you can see, you can calculate the current cost of a cluster and even project future costs.  A new company, Cloudability (now in beta), has made it its mission not only to simplify cluster, EMR, and EC2 price reporting, but also to look for ways to improve it.  Cloudability can even send you a daily or weekly email showing the charges for that period.  You can check out their website and sign up for a free account.  Although I was unable to use Cloudability for this project, I look forward to using it in my next projects.

Chart 2.3 Cost To Run 100,000,000 Character Groups At Various Numbers of Nodes

In Chart 2.3, I break down the cost by how much it costs to process 100,000,000 character groups.  For Chart 2.3, the lower the number the better.  Once again, the Spot instance pricing shines.  In this case, the Spot instance prices stay quite flat across cluster sizes while the On Demand prices vary much more.

Chart 2.4 Total Character Groups At Various Numbers of Nodes Pro-rated to 5 Hours


Chart 2.4 shows the power of creating a multi-node cluster.  With 20 nodes in the cluster, 477,987,913,067 character groups can be run in a 5 hour period.

I want to reiterate that there are no code changes necessary for creating a large cluster like this.  I only needed to make EMR configuration changes when creating the cluster.  Also, cluster configuration changes can be made to a live or running cluster.  You can add or remove core instances to increase or decrease the performance of a cluster.

Chart 2.5 Percent Of Linear Scalability From Actual Growth At Various Numbers of Nodes

Now let’s get into the scalability of EMR and Hadoop.  A 1 node cluster is assumed to be the most efficient EMR cluster possible, so Chart 2.5 shows it at 100% efficiency.  Each subsequent cluster size’s efficiency is calculated as its actual character group total divided by the 1 node total multiplied by the number of nodes in the cluster.  The 2-5 node clusters show a very similar loss of efficiency at about 5%.  The 10 and 20 node clusters show losses of efficiency of 13% and 16%, respectively.
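
In code form, the efficiency behind Chart 2.5 is just actual throughput relative to perfect linear scaling from the 1 node result (a sketch of the formula, not the project’s charting code):

public class ScalingEfficiency {
    // Returns 1.0 for perfect linear scaling; the 20 node cluster
    // discussed below comes out at about 0.84.
    public static double efficiency(long actualGroups, long oneNodeGroups, int nodes) {
        return (double) actualGroups / ((double) oneNodeGroups * nodes);
    }
}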

Anyone who has created a distributed system will recognize 84% as a phenomenal level of scalability.  This really shows that EMR and Hadoop are living up to the hype as revolutionary technologies.  With no code changes and simple configuration changes, you can easily scale an application.

Chart 2.6 Actual Scalability With Projected Linear Growth Pro-rated to 5 Hours

Chart 2.6 presents another breakdown of the scalability, showing the absolute or actual values and the calculated values at 100% efficiency.  Once again, we see a very gradual decline for cluster sizes of 2-5 nodes.  There is a much more obvious decline at 10 and 20 nodes.

Million Monkeys On EMR

In my original run of the Million Monkeys Project, I tried to use an EC2 micro instance to run the project.  The project needed more RAM than was available on the micro instance, and I had to move it to my home computer.  Many reporters and commenters asked me how long the project would take if I ran it to completion on EMR.  This time, thanks to Amazon, I have the resources to run the project on a multi-node EMR cluster.

The instance testing and scalability testing led up to this test.  In the instance testing, I wanted to find the EC2 instance type with the best bang for the buck.  Next, I took that best EC2 instance (Hi-CPU Medium) and wanted to see how much efficiency I would lose when running a 20 node cluster.  From there, I created a 20 node Hi-CPU Medium cluster that ran the Million Monkeys code for a prolonged period of time.  I wanted to see how long it would take a 20 node cluster to recreate the original project.

For a little perspective, the original Million Monkeys project recreated every work of Shakespeare after checking 7.5 trillion character groups over 46 days.  For these prolonged tests, I actually ran the 20 node cluster twice.  The first run checked 12 trillion character groups in 5 days 17 hours.  The second run checked 25.7 trillion character groups in 11 days 15 hours.  Each run averaged about 2.2 trillion character groups per day.  Given the random nature of the problem, we can only extrapolate how long the original project would have taken.  At that rate (7.5 trillion ÷ roughly 2.2 trillion per day ≈ 3.4 days), it would have taken about 3 days 9 hours to complete the original project.

The cluster cost about $45.44 per day to run.  I ran the cluster with the configuration shown in the 20 node scalability testing above, with the master instance group as one Hi-CPU Medium instance running On Demand.  The other 20 nodes were Hi-CPU Medium instances running at a Spot price of $0.09 per hour.  The 5 day run cost $317.96.  The 11 day run cost $528.25.  If I hadn’t used Spot instances, the 11 day run would have cost $1,514.87.  Once again, Spot pricing really shines because I achieved the same goal with almost $1,000 in savings.

Thoughts and Caveats

Previously, I mentioned that the Million Monkeys code is a good metric of CPU and memory performance.  There is less I/O than there might be in other MapReduce tasks.  I spent some time and effort to reduce the amount of I/O in the code.  To reduce the amount of I/O, I used a Bloom Filter in the Map task.  The Bloom Filter is created once and saved in S3.  All future Map tasks simply read the Bloom Filter file and run all processing against it.  Once the Reduce task runs, a 3.5 MB text file is loaded into memory for the final existence checks.  Depending on the MapReduce job, a Map task may need to read in gigabytes or even terabytes of data for processing.  Another key difference between EC2 instances is their I/O performance.  For MapReduce tasks that require high I/O performance, a High-CPU Medium instance with moderate I/O performance may not have the best cost to performance ratio.
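
A sketch of that Bloom Filter pattern using Hadoop’s built-in BloomFilter; the S3 path is an assumption, and the filter’s size and hash settings are fixed when it is originally built:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class ShakespeareFilter {
    // Built once and saved to S3; every Map task reads it back from here.
    private static final Path FILTER_PATH = new Path("s3://my-bucket/bloomfilter"); // assumed

    public static BloomFilter load(Configuration conf) throws IOException {
        FileSystem fs = FILTER_PATH.getFileSystem(conf);
        BloomFilter filter = new BloomFilter();
        try (FSDataInputStream in = fs.open(FILTER_PATH)) {
            filter.readFields(in); // BloomFilter is a Writable
        }
        return filter;
    }

    // No false negatives: a "false" definitely rules the group out, while
    // a "true" is confirmed later against the full text in the Reduce step.
    public static boolean mightBeInShakespeare(BloomFilter filter, String group) {
        return filter.membershipTest(new Key(group.getBytes()));
    }
}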

Earlier in the project, I used the AWS web user interface to create EMR jobs.  It was a bit of a pain to set up the command line interface’s (CLI) various keys.  Once I set up the CLI, it made the testing much easier, and I wish I had used it sooner.  It was much easier to repeat a job.  The EMR API can also be used to spawn your cluster programmatically.  Here is the command line that I used to spawn the job:

./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node"   
--instance-group master --instance-type c1.medium --instance-count 1
--instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08  
--jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1
--arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000
--arg -Dmapred.min.split.size=10000

I would like to break down what this command is doing.  It is creating a new EMR cluster with the job name “Monkeys Scalability 5 Hour Test 20 Node.”  The master instance group will be made up of one High-CPU Medium instance.  The core instance group will be made up of 20 nodes with a bid price of $0.08 per hour per instance.  It will be running a custom jar located in S3 at s3://monkeys2/monkeys.jar.  The rest of the arguments are for the Million Monkeys code itself.
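
The EMR API route can look something like the following sketch with the AWS SDK for Java.  It mirrors the command above, but the exact SDK classes and credential setup vary by version, so treat it as an outline rather than the project’s actual code:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SpawnCluster {
    public static void main(String[] args) {
        // Placeholder credentials; use your own access and secret keys.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        InstanceGroupConfig master = new InstanceGroupConfig()
            .withInstanceRole("MASTER").withInstanceType("c1.medium")
            .withInstanceCount(1).withMarket("ON_DEMAND");
        InstanceGroupConfig core = new InstanceGroupConfig()
            .withInstanceRole("CORE").withInstanceType("c1.medium")
            .withInstanceCount(20).withMarket("SPOT").withBidPrice("0.08");

        StepConfig step = new StepConfig()
            .withName("Monkeys")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3://monkeys2/monkeys.jar")
                .withArgs("timelimit=5h", "iterationsize=1", "memory=-Xmx1024m",
                          "-Dmapred.max.split.size=12000",
                          "-Dmapred.min.split.size=10000"));

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("Monkeys Scalability 5 Hour Test 20 Node")
            .withInstances(new JobFlowInstancesConfig().withInstanceGroups(master, core))
            .withSteps(step);

        emr.runJobFlow(request);
    }
}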

The performance could likely be improved by spending some time on tuning and configuration changes, as all tests used defaults.  For the duration of this project, I did not spend time optimizing the jobs and used only default settings, except for the maximum Java heap memory (-Xmx).

Although I kept my bid prices in nice round cents, you can bid in fractions of a cent like $0.085.

For the curious, I used JFreeChart for the charts and graphs on this page with a customized color scheme.

Problems

Hadoop and EMR jobs are usually geared towards very large input files.  In the case of the Million Monkeys project, the input files are very small, usually a few KB.  This presented an issue when I started running the EMR cluster with multiple nodes.  When I compared the multi-node results to the single node results, there was barely any improvement in total character groups.  In some cases, a multi-node cluster did worse than a single node cluster.  After a LOT of Googling and guessing, I finally found a workaround for small input files: specifying the max and min split sizes for the input file.  Here is the command, to save a future person lots of Googling:

./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node"   
--instance-group master --instance-type c1.medium --instance-count 1
--instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08  
--jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1
--arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000
--arg -Dmapred.min.split.size=10000
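
The same workaround can be applied programmatically for anyone driving Hadoop from code instead of the CLI; the property names and values come straight from the command above:

import org.apache.hadoop.conf.Configuration;

public class SplitSizes {
    public static Configuration smallInputWorkaround() {
        Configuration conf = new Configuration();
        // Force the tiny input file to be cut into many splits so the
        // work fans out across all of the cluster's nodes.
        conf.set("mapred.max.split.size", "12000");
        conf.set("mapred.min.split.size", "10000");
        return conf;
    }
}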

The split workaround stopped working (I never figured out why).  I looked around for a better solution and found the NLineInputFormat class.  Had I known about this class when I first wrote the code, I would have used it.  It is a much better fit for the type of input I am using for the Million Monkeys project.
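
For reference, wiring in NLineInputFormat is a small job configuration change; this sketch uses the newer mapreduce API and an assumed driver class of your own:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class JobSetup {
    public static void useNLineInput(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        // Each map task receives exactly one line of the small input file,
        // so a 20 line input fans out across 20 map tasks.
        NLineInputFormat.setNumLinesPerSplit(job, 1);
    }
}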

When you are budgeting for your AWS project, make sure you bake in some time and money for running down issues.  You may run into some issues with multiple nodes that did not happen on a single development computer.

Conclusion

EMR is a great, cost-effective way to get an enterprise Hadoop cluster going.  It is also much easier than a do-it-yourself approach for getting an enterprise Hadoop cluster up and running.  An EMR cluster solves many of the problems of creating an enterprise cluster, like hardware specs, uptime, and configuration.  Until you have dealt with the pain of redundancy and enterprise hardware requirements, you don’t know how much time and effort EC2 and S3 save.  With EMR, you simply start the cluster and all of these issues are taken care of.

I also showed how EMR and Hadoop make scaling easy.  You do not have to convince your boss to buy $2,000 to $4,000 servers; you can simply add more EC2 instances to the core instance group or change the instance type to one with more ECUs.  This can be done on a temporary basis to accommodate higher usage or as a gradual increase in capacity.  Without changing the code, I was able to scale the cluster to 20 nodes.

EMR clusters can be run in Amazon Web Services’ various regions around the world.  AWS has 3 regions in the United States, 1 in Ireland, 1 in Singapore, 1 in Tokyo and 1 in Brazil.  Separate EMR clusters could be used in conjunction with geographic sharding or simply run in the region nearest the client.

Spot instances also show great promise in further reducing the price per hour of an EMR cluster.  For the 20 node tests, I reduced total cost per hour from $2.20 to $1.30, a 41% decrease.  During one of the 20 node speed runs, I saved $1,000 by using Spot instances.  If you decide to use Spot instances, make sure your code can handle its instances being taken away as the market price increases.

I think this project shows there is true substance to the hype and buzz around Hadoop and EMR.  Anyone who has created their own distributed system knows that achieving 84% efficiency is an impressive feat.  A great number of use cases can make efficient use of Hadoop and EMR.  Paired together, Hadoop and EMR let you easily run a cost-efficient, enterprise-level cluster that can run around the world.

Full Disclosure: Amazon supported this project with AWS credit.  I would like to thank Jeff Barr and Alan Mock from Amazon for their help in making this project possible.

Copyright © Jesse Anderson 2012.  All Rights Reserved.  All text, graphs and charts on this page are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
