AWS Spot Instances: Retention Strategies for Sustained Spot Usage


“How do we optimally bid for Spot Instances to ensure they are retained for the duration of the workload?” Simple question, right? Yes. Simple answer? No, it is the answer that is complex! The more you dig into it, the more you realize how deep the web really is.

In this post, I will give you a peek into Spot Instance retention strategies you can employ to run your workloads with ease. This post is an embodiment of the adage, “A picture is worth a thousand words.”


Spot Instances

A recap – Spot Instances are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users' bids that meet or exceed it gain access to the available Spot Instances.


Retention Strategies

Spot Instances, by design, can be taken away from you at any time. Retaining a Spot Instance isn't just about bidding at an optimum price; many other considerations have to be thought through. Name a few, you ask? Sure, here goes –

  • Choosing an optimum Bid Price
  • Selecting a (similar) instance type
  • Opting for the right Availability Zone


Typical Workload Scenario

Consider the following example: A batch workload, running daily at 12:00AM UTC+0530, picks files from an S3 bucket, transcodes those into HD format, and stores the results back into the same bucket. The entire process takes 8 hours on an m3.large instance.


The above graph shows the Spot Instance Pricing History for m3.large for the last 3 months. There are lots of spikes in us-east-1e. It is definitely not a good Availability Zone to launch our Spot Instances. us-east-1b has two spikes. It is better than us-east-1e, but we can do better. us-east-1a and us-east-1d have no spikes at all. These are the Availability Zones you should prefer as a first step.

Between us-east-1a and us-east-1d, which is the better Availability Zone? Let’s look at their Spot Instance Pricing History separately.

us-east-1a shows five tiny spikes in Spot Price, while us-east-1d shows three. Clearly, us-east-1d is the slightly better Availability Zone.

Setting the bid price at 0.055 USD or more should ensure that m3.large Spot Instances are retained for the duration of the workload.
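As a sketch of this step, here is how one might derive a bid from a window of pricing-history samples. The price data and the 10% headroom below are hypothetical, not pulled from AWS:

```python
# Sketch: derive a bid price from pricing-history samples (hypothetical data).
# Heuristic: bid a fixed percentage above the highest price seen in the window.
def suggest_bid(price_history, headroom=0.10):
    """Return a bid a `headroom` fraction above the observed peak price."""
    peak = max(price_history)
    return round(peak * (1 + headroom), 4)

# Hypothetical Spot Price samples for m3.large in us-east-1d (USD per hour)
history = [0.018, 0.019, 0.020, 0.021, 0.050, 0.020, 0.019]
print(suggest_bid(history))  # 0.055
```

With a 0.050 USD peak and 10% headroom, this heuristic lands on the 0.055 USD figure used above.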

What if you want to run your workload on m4.large?


It is apparent that all Availability Zones have huge spikes. m4.large itself is not a good instance type. We should now start considering similar instances.

m4.large has a hardware specification of 2 vCPU and 8 GiB memory. Similar instances, allowing a small percentage variance in memory, would be m3.large (2 vCPU, 7.5 GiB) and c4.xlarge (4 vCPU, 7.5 GiB).
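One way to mechanize this selection is to filter a catalogue of instance types by vCPU and a memory variance band. The catalogue below is a small hard-coded sample, and the 10% variance is an assumption of this sketch, not an AWS rule:

```python
# Sketch: pick "similar" instance types by matching vCPU and allowing a
# percentage variance in memory. Hard-coded sample catalogue, not a live listing.
CATALOGUE = {
    "m4.large":  {"vcpu": 2, "mem_gib": 8.0},
    "m3.large":  {"vcpu": 2, "mem_gib": 7.5},
    "c4.xlarge": {"vcpu": 4, "mem_gib": 7.5},
    "t2.micro":  {"vcpu": 1, "mem_gib": 1.0},
}

def similar_instances(target, variance=0.10):
    """Instance types with >= target vCPU and memory within +/- variance."""
    spec = CATALOGUE[target]
    lo, hi = spec["mem_gib"] * (1 - variance), spec["mem_gib"] * (1 + variance)
    return sorted(
        name for name, s in CATALOGUE.items()
        if name != target and s["vcpu"] >= spec["vcpu"] and lo <= s["mem_gib"] <= hi
    )

print(similar_instances("m4.large"))  # ['c4.xlarge', 'm3.large']
```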


Again, us-east-1e has a high spike occurrence. The safest Availability Zone would be us-east-1d. The Spot Instance Pricing History of us-east-1d for c4.xlarge is –


The average Spot Price for c4.xlarge hovers around 0.04 USD, while that of m3.large is 0.02 USD. The most similar instance to m4.large, all else being equal, is m3.large; you should choose this instance type and run your workload in us-east-1d at a bid price of 0.055 USD.

Did you notice the complex set of steps involved in choosing the right Availability Zone, zeroing in on the best instance type, and setting the optimum bid price? Now extrapolate it to 14 regions, 38 Availability Zones, and 600+ bidding variations. We haven't even considered when the best time to run your workload is. If your workload start times can be flexible, give or take a few hours, then you might get additional savings of up to 15%. And yes, we are getting our hands dirty in the elusive domains of Pattern Recognition and Machine Learning.

X-Post from blog


Posted in Uncategorized | Leave a comment

AWS Spot Instance Termination Notices: How to make the best use of them


Spot Instances have gained popularity and continue to do so at a rapid pace. However, they come with their own set of complexities and challenges. If mitigation plans are not in place, this leads to application downtime and costs you dearly. Common strategies include –

  • Never launching all Spot Instances of a single instance type
  • Always launching at least two instance types for better availability
  • Never launching all Spot Instances in a single Availability Zone

Since the Spot availability and prices are governed by market volatility, there is still a high probability of the instance being taken away from you. In this post, I will explain how to make the best use of Spot Instance Termination Notices.


Spot Instance Termination Notice

In late 2009, AWS launched Spot Instances, allowing users to bid for spare EC2 capacity at a price they were willing to pay. When those instances were reclaimed, users had no advance warning. This resulted in lost work and data inconsistencies, in turn affecting users' businesses. Hence, in early 2015, AWS introduced the Spot Instance Termination Notice, a two-minute warning with the goal of enabling the user or an application to take appropriate actions. These include, but are not limited to –

  • Saving the application state
  • Uploading final log files
  • Removing itself from an Elastic Load Balancer
  • Pushing SNS notifications




Google Trends shows that Spot Instances are gaining popularity. But what worries us is that people are not aware of the termination notice. How are they managing it then? Short answer: They aren’t!


Strategies for Spot Availability

As mentioned earlier, common strategies are employed to mitigate the risk of Spot Instance availability. A Spot Instance management solution, such as ours, goes further to include –

  • A Spot Availability Predictor, which predicts the likelihood of an instance being taken away
  • Falling back to an appropriate similar instance
  • Usage of Spot Fleet and Spot Block
  • Choosing the best bid price

With all these in place, and given the volatile nature of Spot Instances, sometimes things do get out of control! Suppose a Spot Instance has executed a majority of your workflow and only a tiny bit is pending for successful completion. The instance is now taken away from you. Would you restart the entire workflow?


Usage of Spot Instance Termination Notice

How will an application running on a Spot Instance know that it will be reclaimed? The Termination Notice is accessible to an application running on the instance via the instance metadata at http://169.254.169.254/latest/meta-data/spot/termination-time.

This information will be available when the instance has been marked for termination and will contain the time when a shutdown signal will be sent to the instance’s operating system. AWS recommends that applications poll for the termination notice at five-second intervals. This will give the application almost two full minutes to complete any required processing, such as saving the state and uploading the final logs before it is reclaimed. You can check for this warning using the following query –

$ if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q .*T.*Z; then echo terminated; fi

Here’s a timeline, reproduced from the AWS blog, to help you to understand the termination process (the “+” indicates a time relative to the start of the timeline) –

  • +00:00 – Your Spot instance is marked for termination because the current Spot price has risen above the bid price. The bid status of your Spot Instance Request is set to marked-for-termination and the /spot/termination-time metadata is set to a time precisely two minutes in the future.
  • Between +00:00 and +00:05 – Your instance (assuming that it is polling at five-second intervals) learns that it is scheduled for termination.
  • Between +00:05 and +02:00 – Your application makes all necessary preparation for shutdown. It can checkpoint work in progress, upload final log files, and remove itself from an Elastic Load Balancer.
  • +02:00 – The instance’s operating system will be told to shut down and the bid status will be set to instance-terminated-by-price.
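A minimal polling loop for this timeline might look like the following sketch. The metadata URL is the documented one; the `fetch` callback is an assumption of this sketch (in practice it would be an HTTP GET returning the body on 200 and None on 404), kept injectable so the logic can be exercised off-instance:

```python
import time

# Documented instance-metadata endpoint for the two-minute warning
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def watch_for_termination(fetch, on_notice, interval=5, max_polls=None):
    """Poll the termination-time metadata every `interval` seconds.

    `fetch` returns the metadata body (an ISO-8601 timestamp) once the
    instance is marked for termination, and None before that.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch(TERMINATION_URL)
        if body and "T" in body and body.endswith("Z"):
            on_notice(body)  # checkpoint work, upload logs, leave the ELB...
            return body
        polls += 1
        time.sleep(interval)
    return None
```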

So far you have only heard vague action items, such as “save the state” or “checkpoint the progress”. What do those actually mean, you ask?


Typical Usage Scenario

Consider a sample stateless application, such as a HealthCheck API, running on Spot Instances behind an Elastic Load Balancer. When a request is made to the application, one of the Spot Instances processes it. But before the result is returned, that Spot Instance is reclaimed. The application, not having received any result, after waiting for a pre-configured timeout duration, sends the request again. Another Spot Instance now processes it and returns the result. Easy-peasy here.

The complexities arise when an application is stateful. Let’s consider a sample video encoding application which takes about 45 minutes for an HD conversion. The infrastructure setup is the same as before, i.e., Spot Instances behind an Elastic Load Balancer. The workflow is as follows –

  1. Application sends a video encoding request to the Elastic Load Balancer
  2. The Elastic Load Balancer routes it to a Spot Instance
  3. The Spot Instance starts the HD conversion, taking 45 minutes
  4. The Spot Instance then returns the downloadable link to the application


The problem is with Step 3 above. What if the HD conversion is 40 minutes in and the Spot Instance is then reclaimed? If you did not save the state, you have to restart from scratch. Saving the state simply means storing the current snapshot in persistent storage such as S3. When a new Spot Instance becomes active, it first copies the snapshot from S3 and then resumes the workflow.

As is evident, this demands a few housekeeping activities – the snapshot has to be moved from the local store to a persistent one, and then transferred back from the persistent store to local storage on a new Spot Instance, so that the application can resume the operation. I know you have a few rapid-fire questions for me. Shoot away!

  • Isn’t this enough?
    • No
  • Why?
    • AWS sends these termination notices on a best-effort basis. This basically means that while they make every effort to provide this warning, it is possible that your Spot Instance will be terminated before Amazon EC2 can make the warning available.

Wow, that’s a shocker!

  • Can we do better?
    • Yes, we can!


Making Spot Instance Data Persistent

When a Spot Instance is reclaimed, it takes with it any data present on its local storage. Any attached EBS volumes persist, provided the Delete on Termination flag was unchecked. The architectural design to make the data persistent is as follows –

  • Store the required data in an EBS volume, say ebs1, attached via a mount point, say /mount1
  • Spot Instance gets the termination notice, giving the application a two-minute warning
  • The application detaches ebs1 from the Spot Instance
  • Launch a new Spot Instance, with user-data containing the script to attach ebs1 on /mount1
  • A resumable controller, with intelligence to resume the operation from where it had left off, restarts the application
  • The application runs to completion
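The detach/attach steps above can be sketched as a small controller function. The method names mirror boto3's EC2 client (`detach_volume` / `attach_volume`), but the client object is injected, so this is a sketch of the orchestration rather than a drop-in implementation; the device name is an assumption:

```python
# Sketch of the fail-over steps above with an injected EC2 client. Any object
# exposing the same interface works, so no AWS access is needed to exercise it.
def fail_over_volume(ec2, volume_id, old_instance_id, new_instance_id,
                     device="/dev/xvdf"):  # device name is an assumption
    # Detach the data volume (say ebs1) from the instance being reclaimed
    ec2.detach_volume(VolumeId=volume_id, InstanceId=old_instance_id)
    # Attach it to the replacement Spot Instance on the same device
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id,
                      Device=device)
    # A resumable controller would now restart the application (not shown)
    return {"volume": volume_id, "attached_to": new_instance_id,
            "device": device}
```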


There is also a secret sauce which I have intentionally not delved into. And that’s it – we have accomplished data persistence on Spot Instances, too, just like On-Demand Instances. It is quite an achievement!

There’s also an alternative way through Amazon Elastic File System (EFS). Amazon EC2 instances mount EFS via the NFSv4.1 protocol, using standard operating system mount points. Currently, it is available only in three regions: Northern Virginia (us-east-1), Oregon (us-west-2), and Ireland (eu-west-1). It is still early days, but it holds a lot of promise.

I completely understand if you say, “Spot management is none of my business.” But, quite frankly, it is ours! Register now for a free 14-day trial of Batchly.

Remember: A penny saved is a penny earned!


Vijay Olety is a Founding Engineer and Technical Architect at Batchly. He likes to be called a “Garage Phase Developer” who is passionate about Cloud, Big Data, and Search. He holds a Master's degree in Computer Science from IIIT-B.


Run your Production Web Application on AWS Spot Instances

Spot Instances are great. They are cheap and offer up to 90% savings over On-Demand Instances. By design, they can be taken away at any time. “How would you then run Web / App tier on Spot Instances?” is the million-dollar question that needs an answer.

In this post, I will delve into the details. By the end, I am sure you will appreciate the value that Spot Instances provide and also recognize that they can be used for any kind of workload.


Spot Instances

First, a recap – Spot Instances are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users' bids that meet or exceed it gain access to the available Spot Instances.


Simple Architecture Diagram

A simple architecture below includes the following components:

  • External-facing Amazon Virtual Private Cloud (VPC) containing one subnet within a single Availability Zone (AZ)
  • Auto Scaling Groups (ASG) for the EC2 instances to handle requests
  • An Elastic Load Balancer (ELB) to route requests across these instances
  • Standard Identity and Access Management (IAM) roles and instance policies


Credit: Imgur


Sample Request Workflow

A request is made to your application's URL through a browser. The browser contacts Amazon Route 53, a highly available and scalable cloud Domain Name System (DNS) web service, which resolves the CNAME record associated with the domain to the load balancer's address. The request is then forwarded to the ELB, which subsequently pushes it to an underlying EC2 instance, say m3.large. That instance processes the request, and the response is shown in the user's browser.

Simple, right? Yes, as the implicit assumption is that the instances behind the ELB are On-Demand Instances and always available. What if we switch over to Spot Instances? How would the architecture change?


Complex Architecture Diagram

An advanced architecture below includes, but is not limited to, the following components:

  • External-facing Amazon Virtual Private Cloud (VPC) spread across multiple Availability Zones (AZs) with separate subnets for different applications
  • Auto Scaling Groups (ASG) for the EC2 instances to handle requests
  • Elastic Load Balancers (ELB) to route requests across these instances
  • Standard Identity and Access Management (IAM) roles and instance policies


Credit: AWS Docs

It is a highly available architecture spread across multiple AZs and subnets. This design is common to both On-Demand and Spot Instances but matters more for the latter, as there is the added complication of instances being taken away. As you can see, a Production VPC has multiple Private Subnets spread across us-east-1b and us-east-1c.

A request hits the ELB, which routes it to a Private Subnet in one of the AZs, say us-east-1b, and the underlying EC2 instance, say m3.large, processes it. If the routing policy is Round Robin, then the next request is forwarded to us-east-1c. If one of the Spot Instances is taken away, then the ASG immediately kicks in and provisions another Spot Instance. Let's say there was a huge spike in Spot Price for m3.large in us-east-1b. All of our Spot Instances in us-east-1b would be terminated. Every request is now routed to us-east-1c, which might overwhelm the associated instances. How should you address these issues?


Best Practices

Off the top of my head, I can recall a few:

  • Spread your architecture across multiple AZs
    • Consider us-east-1b and us-east-1c. If an entire AZ goes off-the-grid, say us-east-1c due to a natural calamity, then us-east-1b continues to process requests ensuring application availability.
  • Always have two or more instance types behind an ELB in each AZ
    • Consider m3.large and m4.large in us-east-1b. Even though all m3.large instances are taken away due to a spike in Spot Prices, m4.large continues to process requests in us-east-1b.
  • Have ASGs associated with every instance type
    • Set minimum, desired and maximum counts to ensure that if a few Spot Instances are terminated, new ones can be immediately provisioned.
    • Associate CloudWatch metrics with ASGs so that we have the required capacity when there is a traffic spike.
  • Choose a random bidding strategy
    • Consider m3.large and m4.large in us-east-1b, each with a current Spot Price of 0.1 USD. Set the m3.large bid price to 0.2 USD and the m4.large bid price to 0.4 USD. If the Spot Price changes to 0.3 USD for both instance types due to market volatility, then m3.large will be taken away while m4.large continues to process requests.
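The staggered bidding idea in the last bullet can be sketched as follows; the multipliers are hypothetical and simply need to differ per instance type so one price spike cannot take away every instance at once:

```python
# Sketch: stagger bid prices across instance types so a single market spike
# does not reclaim all instances at once. Multipliers are hypothetical.
def staggered_bids(current_price, instance_types, multipliers=(2, 4, 6)):
    """Assign each instance type a different multiple of the current price."""
    return {
        itype: round(current_price * m, 4)
        for itype, m in zip(instance_types, multipliers)
    }

print(staggered_bids(0.1, ["m3.large", "m4.large"]))
# {'m3.large': 0.2, 'm4.large': 0.4}
```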


Some of the best practices mentioned above can be offloaded through the usage of Spot Fleet, as described in our earlier post. Recently, Spot Fleet announced support for Auto Scaling, which further eases the use of Spot Instances while guaranteeing compute capacity. This should suffice for a vast majority of applications most of the time.

But the real world is the real deal; things can go drastically wrong, such as the non-availability of any Spot Instance, and cause application downtime. This is totally unacceptable, and a Plan B should be in place to ensure business continuity. Batchly follows a Hybrid model, wherein a few On-Demand Instances are launched and the remaining majority are Spot Instances. This ensures that neither application uptime nor cost savings is ever compromised. The customers are extremely happy. So are we.

Register now for a free 14-day trial of Batchly.


Run your AWS Elastic Beanstalk on Spot Instances and Save up to 80% on EC2 costs


AWS Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications. It supports applications developed with Java, .NET, PHP, Node.js, Python, Ruby, and Go on well-known servers such as Apache, Nginx, and IIS. You can simply upload your code, and Elastic Beanstalk automatically handles the deployment – from capacity provisioning, load balancing, and auto scaling to application health monitoring. At the same time, you retain full control over the AWS resources powering your applications and have complete access to the underlying resources.

Spot Instances

Spot Instances are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users' bids that meet or exceed it gain access to the available Spot Instances.


Batchly is a solution that balances AWS workloads to achieve On-Demand availability at spot prices. Batchly’s unique algorithm and tighter integration with Auto Scaling Groups, Elastic Beanstalk, Custom AMI’s and EMR provides a highly reliable way to use spot instances in every layer of your application without compromising on your application’s uptime / availability.

In this post, I will delve a bit deeper into the AWS Elastic Beanstalk service and how you can manage it via Batchly to achieve up to 80% savings on your EC2 costs. By the time you finish reading this article, I hope you will appreciate the value that Spot Instances provide and how Batchly makes it extremely easy and efficient to gain the cost advantage.

AWS Elastic Beanstalk

As mentioned earlier, AWS Elastic Beanstalk makes it easy for developers to deploy and manage applications. It consists of an Environment which can be of two types:

  • Web Server Environment
    • These are standard web-tier applications that listen to HTTP requests, typically over port 80.
  • Worker Environment
    • These are specialized applications that have a background processing module that polls for messages from an Amazon SQS queue.


Web Server Environment

The web server environment is relatively easy to create via the AWS console. Just upload the application bundle, select “load balancing, auto scaling” for a real-world application and AWS Elastic Beanstalk takes care of the rest. By default, an Elastic Load Balancer (ELB) and an Auto Scaling Group (ASG) would get created. Depending on the scaling policies, the cluster size increases or decreases.

Pretty simple and all the instances launched would be On-Demand Instances ensuring high availability but at a price.


Worker Environment

These environments are for those workloads that take a long time to complete. A daemon running on every EC2 instance in the cluster polls for messages from an Amazon SQS queue and POSTs a request to localhost with the contents of the queue message in the body. Once a 200 OK is received, that message is deleted from the queue. Even in this case, after you upload the application bundle, an ASG gets created.

Again, it is simple, and only On-Demand Instances are launched. However, given the nature of the workload, there is scope for downtime tolerance, as the requests are processed asynchronously.


Create an Elastic Beanstalk application via Batchly

You can create an Elastic Beanstalk application via Batchly to automatically start using Spot Instances to maximize your savings.

Log in to the Batchly dashboard, go to “App Store” and select “Elastic Beanstalk”. Batchly consumes your existing applications and the corresponding environments and takes control of managing your application.


When creating the application via AWS console, I had used the following ASG configuration for a skeletal system:

Min = 2, Desired = 2, Max = 4

Now I have changed this setting to reflect the new configuration, so that the same cluster can handle peak traffic:

Min = 4, Desired = 10, Max = 20

“Why do it through Batchly?”, you ask.

The Batchly Advantage

I had previously mentioned that all the instances launched are On-Demand Instances. This is good but expensive. Based on the above example, in order to reduce costs without compromising on high availability, Batchly implements the following procedure:

Step 1: It first changes the configuration of your current ASG by setting all values to the Min value

  • Min = 2, Desired = 2, Max = 2
  • This effectively disables your ASG

Step 2: It launches 4 On-Demand Instances to ensure that the application never faces a downtime

Min = 4

Step 3: It then launches 6 Spot Instances to maintain the Desired count

Desired = Min + 6

Batchly continuously monitors the health of the instances as well as the cluster. If some of the instances have degraded, then those instances are removed from the cluster and additional Spot Instances are launched to maintain Desired capacity.
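The three steps above boil down to a simple capacity split, which can be sketched as:

```python
# Sketch of the split described above: pin availability-critical capacity
# to On-Demand (the Min count) and fill the rest of Desired with Spot.
def plan_capacity(minimum, desired):
    on_demand = minimum                 # Step 2: never face downtime
    spot = max(desired - minimum, 0)    # Step 3: cheap capacity to reach Desired
    return {"on_demand": on_demand, "spot": spot}

print(plan_capacity(minimum=4, desired=10))  # {'on_demand': 4, 'spot': 6}
```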


Elastic Beanstalk Deployments – Automatically handled by Batchly

I will not delve into the deployment details in this post but would just like to scratch the surface. When you want to upgrade to newer versions of your application, you can do so from the AWS console or AWS CLI tools. Once you deploy the new version, the cluster health becomes Degraded. As Batchly is continuously monitoring the cluster health, it sees the new cluster status and understands that the user has made a new deployment. Batchly then replaces the cluster instances and provisions new ones with the latest application. In this fashion, Batchly ensures that all instances run the latest application even though it is deployed via the AWS console or AWS CLI tools.

Batchly uses a potent combination of Reserved Instances, Spot Instances and On-Demand Instances to give you substantial savings as well as ensuring high availability at all times. Our customers have been running Elastic Beanstalk applications via Batchly. They have consistently achieved over 60% cost savings over On-Demand Instances. Don’t believe me? You can start your free trial and check this for yourself.


Lower Your EMR costs by leveraging AWS Spot Instances

We live in the Data Age!

The Web has been growing rapidly in both size and scale over the last 10 years and is showing no signs of slowing down. Statistics show that every passing year more data gets generated than in all the previous years combined. Moore's law holds true not only for hardware but for the data being generated too. Without wasting time coining a new phrase for such vast amounts of data, the computing industry decided to just call it, plain and simple, Big Data.

Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. At its core, it consists of two sub-projects – Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Amazon EMR (Elastic MapReduce) simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances.

In this post, I will explain the components of EMR and when to use Spot Instances to lower AWS costs.


Amazon EMR

As mentioned earlier, Amazon EMR is a managed Hadoop framework. With just a click of a button or through an API call, you can have an EMR cluster up and running in no time. It is best suited for data crunching jobs such as log file analysis, data mining, machine learning and scientific simulation. You no longer need to worry about the cumbersome and time-consuming process of setup, management and fine tuning of Hadoop clusters.




Google Trends shows that the popularity of both Hadoop and EMR is increasing. But look at the graph with a keen eye. What do you observe? Though Hadoop initially showed an immense increase in capturing users' mind share, its popularity has started to plateau. Why, you ask? Clearly because of the emergence of EMR and how it makes provisioning and management of Hadoop clusters super simple.


Instance Groups

Instance Groups are a collection of EC2 instances that perform a set of roles. There are three instance groups in EMR:

  • Master
  • Core
  • Task

Master Instance Group

The Master Instance Group manages the entire Hadoop cluster and runs the YARN ResourceManager service and the HDFS NameNode service, amongst others. It monitors the health of the cluster. It also tracks the status of the jobs submitted to the cluster.

Currently, there can only be one Master Node for an EMR cluster.

Core Instance Group

The Core Instance Group contains all the core nodes of an EMR cluster. The core nodes execute the tasks submitted to the cluster by running the YARN NodeManager daemons. They also store HDFS data by running the DataNode daemon.

The number of core nodes required is decided based on the size of the dataset.

You can resize the Core Instance Group and EMR will attempt to gracefully terminate the instances, for example, when no YARN tasks are present. Since core nodes host HDFS, there is a risk of losing data whenever graceful termination is not possible.

Task Instance Group

The Task Instance Group contains all the task nodes of an EMR cluster. The task nodes only execute the tasks submitted to the cluster by running the YARN NodeManager daemons. They do not run the DataNode daemon or store data in HDFS. Hence, you can add and terminate task nodes at will, as there is absolutely no risk of data loss.

Task Instance Group is optional. You can add up to 48 task groups.


Words of caution

The Master Node is a single node, and AWS itself does not provide high availability for it. If it goes down, the cluster is terminated. Hence, AWS recommends running the Master Node on an On-Demand instance for time-critical workloads.

Core Nodes can be multiple in number. Since they also hold data, downsizing should be done with extreme caution as it risks data loss. AWS recommends having Core Nodes on On-Demand instances.

Core Nodes come with a limitation: an EMR cluster will not be deemed “healthy” unless all the requested core nodes are up and running. Let's say you requested 10 core nodes and only 9 were provisioned. The status of the EMR cluster will be “unhealthy”. Only when the tenth core node becomes active will the status change to “healthy”.


Leveraging Spot Instances

Isn’t it fair to assume that all real-world applications are data-critical workloads? Given this situation, we can conclude that Master Instance Group and Core Instance Group should ideally be On-Demand instances.

For the sake of argument, let’s consider launching Master Instance Group and Core Instance Group on Spot Instances. In the case of Master Instance Group, if the Spot Instance is taken away then the entire EMR cluster will be terminated.

In the case of the Core Instance Group, if a subset of the Core Nodes is taken away, the cluster needs to recover the lost data and rebalance HDFS. If we lose a majority or all of the Core Nodes, then we are bound to lose the entire cluster, as data recovery from the available nodes will be impossible.

The Core Instance Group has another limitation: it can only be of one instance type. You cannot launch a few Spot Instances as, say, m3.large and the rest as, say, c4.large.

There is also the question of “What is the best bid price for a Spot Instance?”. Careful examination of Spot Pricing History and understanding Spot Price variance is a must. It is no child’s play.

Running the Task Instance Group on Spot Instances is a perfect match for time-insensitive workloads. As mentioned earlier, you can have up to 48 Task Instance Groups. This helps us hedge the risk of losing all Spot Instances. For example, you can provision some Spot Instances as m3.large, a couple as m4.large, and the rest as m1.large. Unlike the Core Instance Group, there is no restriction that all requested Spot Instances be up and running. Even if only a subset of the Task Instance Group is up, the EMR status is considered “healthy” and the job execution continues.

Launching Task Instance Groups as Spot Instances is a good way to increase the cluster capacity while keeping costs at a minimum. A typical EMR cluster configuration is to launch Master Instance Group and Core Instance Group as On-Demand instances as they are guaranteed to run continuously. You can then add Task Instance Groups as needed to handle peak traffic and / or speed up data processing.
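Expressed as the `InstanceGroups` request body used by EMR's API (the shape boto3's `run_job_flow` accepts), the typical layout above might look like the following sketch; the instance types, counts, and bid price are illustrative:

```python
# Sketch: the typical EMR layout above as an InstanceGroups request body.
# Master and Core on On-Demand, Task on Spot. All values are illustrative.
def build_instance_groups(task_bid_price="0.10"):
    return [
        {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        {"InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 4},
        {"InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": task_bid_price,
         "InstanceType": "m3.xlarge", "InstanceCount": 10},
    ]
```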

A caveat: You cannot remove a Task Instance Group once it is created. You can, however, decrease the task node count to zero. Since a maximum of 48 Task Instance Groups are allowed, be careful in choosing the instance types. You can neither change the instance type nor its bid price later on.


EMR in action

“Enough talk, show me the numbers,” you demand? Thanks for asking!


A 2TB GDELT dataset was analyzed with custom Hive scripts. The Master Instance Group and the Core Instance Group were on On-Demand instances. The Task Instance Group was entirely on Spot Instances. A total of 5035 Spot Instance hours were required to complete the job. The total cost of running this job entirely on On-Demand instances would have been 689.79 USD. Since 100% of the Task Nodes were launched on Spot Instances, the cost was only 109.76 USD, resulting in massive savings of about 84%. Batchly additionally provides autoscaling of Task Nodes, i.e., you don't have to worry about over-provisioning of instances, as well as the ability to run your Master Node / Core Nodes on Spot Instances (it gives you the choice), making it a compelling and easier option for running EMR workloads. Register now for a free 14-day trial of Batchly.
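For the record, the savings percentage follows directly from the two cost figures:

```python
# The cost comparison above, reproduced as arithmetic.
on_demand_cost = 689.79  # USD if all 5035 instance hours ran On-Demand
spot_cost = 109.76       # USD actually paid with Task Nodes on Spot
savings_pct = (on_demand_cost - spot_cost) / on_demand_cost * 100
print(round(savings_pct, 1))  # 84.1
```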


X-Post from blog

Posted in Uncategorized | Leave a comment

AWS Spot Fleet: A neat feature to help you get started with spot instances


Spot Instances are slowly but surely gaining traction with the enterprise AWS users. Who would want to lose out on a potential savings of up to 90% over On-Demand costs? Not me. Neither should you.

In this post, I will explain the basics of Spot Fleet and how it eases the use of native Spot Instances. I will also delve a bit into the missing pieces of Spot Fleet.

Spot Instances

As you are already aware, Spot Instances are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users’ bids that meet or exceed it gain access to the available Spot Instances.

Provisioning Spot Instances in the hundreds and thousands is easy; AWS gives you a single API call to do that. However, terminating them is a bit of a pain! You need to call the terminate API for each one individually, identified by its Spot Request Id (which is in the format sir-xxxxxxxx).
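A rough sketch of that cleanup using boto3 is below. The batch size of 100 is an assumed conservative chunk size, not a documented limit, and `terminate_spot_requests` is a name of my choosing:

```python
def batch(ids, size=100):
    """Split request ids into chunks; 100 is an assumed safe batch size."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def terminate_spot_requests(request_ids, region="us-east-1"):
    """Cancel each spot request AND terminate its backing instance."""
    import boto3  # lazy import so batch() stays testable offline
    ec2 = boto3.client("ec2", region_name=region)
    for chunk in batch(request_ids):
        # Cancelling a request does not terminate a running instance,
        # so look up the fulfilled instance ids first.
        resp = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=chunk)
        instance_ids = [r["InstanceId"] for r in resp["SpotInstanceRequests"]
                        if "InstanceId" in r]
        ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=chunk)
        if instance_ids:
            ec2.terminate_instances(InstanceIds=instance_ids)
```

Note the two-step dance: cancelling the spot request alone leaves the instance running and billing, which is exactly the pain point described above.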


Spot Fleet

In mid-2015, AWS announced Spot Fleet to make the EC2 Spot Instance model even more useful. With the addition of a new API, Spot Fleet allowed one to launch and manage an entire fleet of Spot Instances with just one request.

AWS defines Spot Fleet as, “…a collection, or fleet, of Spot instances. The Spot fleet attempts to launch the number of Spot instances that are required to meet the target capacity that you specified in the Spot fleet request. The Spot fleet also attempts to maintain its target capacity fleet if your Spot instances are interrupted due to a change in Spot prices or available capacity.”



Google Trends shows that the awareness of both Spot Instances and Spot Fleet go hand-in-hand. What are the missing pieces then? Read on. But first, let’s understand the workings of Spot Fleet and how it is a boon for using Spot Instances.


Behind the scenes

Each Spot Fleet request must contain the following values:

  • Target Capacity
    • The number of Spot Instances you would want to launch
  • Launch Specification
    • The types of Spot Instances that you would want to launch and how you want them to be configured (AMI, VPC, subnets, Availability Zones, user data, security groups, and so on)
  • Allocation Strategy
    • Determines how Spot Fleet fulfills your request from the possible Spot Instance pools represented by its Launch Specifications
      • Lowest price
      • Diversified
  • Maximum Bid Price
    • The maximum bid price that you are willing to pay for all selected instance types.
    • USD is the only accepted currency
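Putting those four values together, a hypothetical Spot Fleet request via boto3 might look like the sketch below. The AMI id, subnet ids, and IAM fleet-role ARN are placeholders, and `build_fleet_config` is an illustrative helper:

```python
def build_fleet_config(target_capacity, bid, role_arn, specs, strategy="lowestPrice"):
    """Assemble the SpotFleetRequestConfig payload from the four required values."""
    return {
        "TargetCapacity": target_capacity,
        "SpotPrice": str(bid),           # maximum bid price; USD only
        "IamFleetRole": role_arn,
        "AllocationStrategy": strategy,  # "lowestPrice" or "diversified"
        "LaunchSpecifications": specs,   # one entry per instance type / subnet pool
    }

def request_fleet(config, region="us-east-1"):
    """Submit the fleet request."""
    import boto3  # lazy import so build_fleet_config() is testable offline
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.request_spot_fleet(SpotFleetRequestConfig=config)

# Two launch specifications = two Spot Instance pools (placeholder ids):
specs = [
    {"ImageId": "ami-12345678", "InstanceType": "m3.large", "SubnetId": "subnet-aaaa1111"},
    {"ImageId": "ami-12345678", "InstanceType": "m4.large", "SubnetId": "subnet-bbbb2222"},
]
config = build_fleet_config(20, 0.10, "arn:aws:iam::123456789012:role/fleet-role",
                            specs, strategy="diversified")
```

Each launch specification defines one pool, which is what the allocation strategy chooses among.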

A single Spot Fleet request gives us the power to launch thousands of Spot Instances and also be able to manage them. If a certain Spot Instance is terminated, then it is the responsibility of Spot Fleet to automatically launch a new one to maintain the Target Capacity.

Let’s go a little deeper into the Allocation Strategy. It provides us with two options: lowestPrice and diversified. If your use-case is load testing for a couple of hours, then the probability of Spot Instances being taken away from you is low, even with all instances in a single Spot Instance pool. The lowestPrice strategy is a good choice here as it provides the lowest cost.

Consider a use-case of an API endpoint running continuously without any tolerance for downtime, with a target capacity of 50. If the diversified allocation strategy spreads the fleet across 5 Spot Instance pools, then 10 Spot Instances will be provisioned in each pool. If the Spot Price for one pool rises above your bid price, only those 10 instances, i.e. 20% of your Spot Fleet, are affected. Using the diversified strategy gives you high availability and makes the Spot Fleet less sensitive to increases in Spot Prices.

Spot Fleet also provides a feature called Instance Weighting, which will not be discussed in this blog post.

To better understand how Spot Instances are provisioned using the specified allocation strategy, I have reproduced a snippet from AWS docs:

  • With the lowestPrice strategy (which is the default strategy), the Spot instances come from the pool with the lowest Spot price per unit at the time of fulfillment. To provide 20 units of capacity, the Spot fleet launches either 20 r3.2xlarge instances (20 divided by 1), 10 r3.4xlarge instances (20 divided by 2), or 5 r3.8xlarge instances (20 divided by 4).
  • With the diversified strategy, the Spot instances would come from all three pools. The Spot fleet would launch 6 r3.2xlarge instances (which provide 6 units), 3 r3.4xlarge instances (which provide 6 units), and 2 r3.8xlarge instances (which provide 8 units), for a total of 20 units.
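The lowestPrice arithmetic above is just a ceiling division of the target capacity by each pool's per-instance capacity units. A tiny sketch, with the pool weights taken from the example (the smallest type is presumably r3.2xlarge):

```python
import math

def instances_for_lowest_price(target_units, weight):
    """lowestPrice fills the whole target from one pool; round up to meet capacity."""
    return math.ceil(target_units / weight)

# The three pools from the example, with capacity units per instance:
pools = {"r3.2xlarge": 1, "r3.4xlarge": 2, "r3.8xlarge": 4}
counts = {t: instances_for_lowest_price(20, w) for t, w in pools.items()}
```

Rounding up matters: with a weight that doesn't divide the target evenly, the fleet slightly over-provisions rather than miss capacity.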

As you can clearly see, Spot Fleet makes the management of Spot Instances super simple. It also gives advanced features to better address issues related to high availability and specified target capacity. It really is a boon for all those using Spot Instances!

The Missing Pieces

Spot Fleet is great, no doubt. You have read all about the good features it provides. You finally decide to take the plunge and use it in your application. Sooner rather than later, you realize that there are some gaps, or situations where Spot Fleet can be difficult to use.

  • Native support to attach Spot Fleet instances to an ELB
    • This is more of a convenience factor.
    • As of now, the only way is via user-data scripts that the user has to write.
    • Spot Fleet should take as input the ELB information to attach the provisioned Spot Instances.
    • Spot Fleet should also take as input the record set information so as to give vanity URLs via Route53, provided the Hosted Zone and zone apex information is already configured.
  • Fallback to on-demand instances
    • This is not a Spot Fleet feature per se.
    • It is more of a comfort factor.
    • If the Spot Fleet request could take as input a percentage distribution between On-Demand and Spot Instances, then application uptime would always be guaranteed. So would the customers’ peace of mind.
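For the ELB gap, the user-data workaround mentioned above is a script that self-registers the instance on boot. A hedged sketch with boto3's classic-ELB client follows; `attach_to_elb` and the helper names are illustrative, and this is the user's script, not a Spot Fleet feature:

```python
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/instance-id"

def instances_param(*instance_ids):
    """Build the Instances argument for register_instances_with_load_balancer."""
    return [{"InstanceId": iid} for iid in instance_ids]

def my_instance_id(timeout=2):
    """Instance-metadata lookup; only resolvable from inside an EC2 instance."""
    with urllib.request.urlopen(METADATA, timeout=timeout) as resp:
        return resp.read().decode()

def attach_to_elb(load_balancer_name):
    """Self-register this instance with a classic ELB (run from user-data)."""
    import boto3  # lazy import so the helpers above are testable offline
    boto3.client("elb").register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=instances_param(my_instance_id()),
    )
```

The instance needs an IAM role permitting `elasticloadbalancing:RegisterInstancesWithLoadBalancer` for this to work, which is one more piece of plumbing native support would remove.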

I completely understand if you say, “Spot management is none of my business.” But, quite frankly, it is ours! Register now for a free 14-day trial of Batchly.

Remember: A penny saved is a penny earned!

X-Post from blog

Posted in Uncategorized | Leave a comment

Application Rightsizing on AWS

CMPUTE.IO is a platform which reduces your AWS EC2 spend across three main areas:

  1. Instance Optimization – usage of Spot, Reserved and On-Demand instances
  2. Time Optimization – schedule start and stop of instances
  3. Performance Optimization – analyse application performance and rightsize instances

Time-based Optimization is about the ability to schedule workloads (instances and applications) based on some policies (most relevant for dev/test workloads). Instance Optimization is the automatic usage of Spot Instances at all layers of your application – be it web / app tier or API tier as well as optimal utilization of purchased Reserved Instances across multiple accounts.

In this post, I will talk about the third critical area of optimization, which we call Performance Optimization or Rightsizing of your application.

Application Infrastructure

One of the reasons why enterprises move to the cloud is the flexibility it offers in terms of infrastructure management and the ease with which it can be changed in an instant. Is your infrastructure over-provisioned? Just do a scale-in. Is it under-provisioned? A simple scale-out should do the trick. What’s the assumption you have made during the scaling process? That the instances are well utilized and being used to their maximum. This is only partly true, though, since you typically look at one metric at a time.

Why, you ask? You perform a scaling operation when a certain threshold is breached. Let’s take an example: you have configured your scale-out policy to add one instance to the cluster when CPU utilization touches 60%. This makes sense as you need more instances to handle traffic spikes. But what about the instance’s memory utilization? How do we ensure that memory is being utilized efficiently? What if memory usage is always at 5%? Scaling does not help, and this is where Rightsizing comes to our rescue.



Rightsizing is the process of analyzing your workloads and recommending the right instance type to minimize wastage. During the planning and provisioning of the underlying infrastructure, we tend to make certain assumptions regarding application performance. Over time, we capture metrics, validate our assumptions, and take necessary actions if needed. But this is a manual and cumbersome DevOps exercise. Wouldn’t it be nice if this could be automated completely? CMPUTE.IO takes rightsizing very seriously and guarantees that application performance is never degraded when we make recommendations.

Recommendation Engine

When customers move their workloads onto our platform, we start orchestration of their infrastructure on RI and Spot Instances to achieve savings of 75% and more. Additionally, we start monitoring the application metrics such as CPUUtilization, memory usage (available via an opt-in during workload migration) and network latency over a two-week window. This data is fed into our proprietary Recommendation Engine.

Our engine first sifts through the data. It plots a time-series graph and calculates the peaks, valleys, and averages. It then matches these numbers against built-in breach thresholds. If the instances are being utilized efficiently, then give yourself a pat on the back! You are a DevOps expert and in complete control.

More often than not, our Recommendation Engine sees that application infrastructure is under-utilized and it then proceeds to recommend an appropriate rightsized instance. This is a complex multi-stage process and broadly involves:

  • Narrow down to a subset of “suitable” instances spanning across families and sizes. This is really a big deal: Consider an m4.large instance where CPU is maxed out and memory is under-utilized. If we restrict recommendations to within the m4 family, then we can only do so much. But if we were to move to a different family, say c4, then we can improve CPU and memory utilization as it is a compute-optimized instance.
  • For each of these “suitable” instances, find out if there is Spot capacity available in the Region and Availability Zones where the application is running. Narrow down the list further.
  • For each of these instances, categorize the Spot Instance likelihood of getting terminated as Low, Medium and High. Narrow down the list further to include instances falling into Low and Medium risk.
  • For each of these instances, calculate the Spot price and find out the cheapest one.
  • Recommend this instance to the customer!
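Our actual engine is proprietary, but the narrowing steps above can be caricatured in a few lines. The dict keys `spot_available`, `termination_risk`, and `spot_price` are assumed names for illustration only:

```python
def recommend(candidates):
    """Pick the cheapest low/medium-risk instance with spot capacity.

    Each candidate dict is assumed to carry: spot_available (bool),
    termination_risk ("Low" / "Medium" / "High"), spot_price (USD/hr),
    plus anything else, such as the instance type name.
    """
    available = [c for c in candidates if c["spot_available"]]     # spot capacity exists
    low_risk = [c for c in available
                if c["termination_risk"] in ("Low", "Medium")]     # drop High risk
    if not low_risk:
        return None  # nothing safe enough to recommend
    return min(low_risk, key=lambda c: c["spot_price"])            # cheapest survivor
```

The real pipeline folds in two weeks of utilization metrics before this stage; the sketch only shows the final filter-and-pick logic.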


Rightsizing in action

We eat our own dog food! We run rightsizing on our own infrastructure and act upon the recommendations.


As you can see from the figure above, one of our internal systems is running on m4.xlarge. Our recommendation engine analyzed the application performance metrics and suggested that the most appropriate instance type is r3.large. It also displays the savings achieved when we make the switch: 95 USD per instance-month, or 23% in additional savings.

Register now for a 14-day free trial of CMPUTE.IO, the only unified platform to achieve all the 3 optimizations and see the results instantly.


Vijay Olety is a Founding Engineer and Technical Architect at CMPUTE.IO. He likes to be called a “Garage Phase Developer” who is passionate about Cloud, Big Data, and Search. He holds a Master’s degree in Computer Science from IIIT-B.

X-Post from blog

Posted in Uncategorized | Leave a comment

5 Little Known Facts About Spot Instances


Spot Instances are deeply discounted compute capacity that can provide up to 90% savings over On-Demand instances. In other words, for the same budget you get 2-10x the compute capacity. This is lucrative for enterprises running compute-intensive workloads such as video transcoding, big data analytics (Hadoop, Spark processing), log analysis, and other batch jobs.

However, enterprises have not gotten around to using Amazon Spot instances effectively, though they have been around for over 7 years. RightScale published the “State of the Cloud” report recently and “cost optimization” was one of the key challenges reported by enterprise AWS users. Not so surprisingly, only 14% of the enterprises are looking at using Spot Instances as a cost optimization model.

Most of you who have explored spot instances would know that the price determination occurs dynamically based on demand-supply economics. We have listed 5 interesting things that you might not know about Spot:

    1. Spot price differs across AZs (Availability Zones) in the same region – Different AZs within a region can have completely different pricing for the same instance type. It is important to launch an instance in a specific AZ by monitoring the price whenever you need an instance. This can result in great savings (or losses if done incorrectly).
    2. Spot Blocks are instances which you can block for a finite duration (1-6 hours). These instances provide more reliability than regular spot instances (you are guaranteed availability for the entire duration) but are pricier (30-45% less than on-demand, as against 80-90% savings using spot). These are really good for defined-duration workloads and, if used in conjunction with spot instances and reserved instances, can serve other workloads as well (which is what Batchly does).


  3. You can run your Big Data/Hadoop jobs (through Amazon EMR) on spot instances for great savings on large workloads. This, however, requires one to bid, manage, and monitor the instances (similar to regular spot instances). One best practice is to run the task nodes on spot while running your filesystem (HDFS, HBase, etc.) on Master & Core nodes on AWS on-demand.
  4. Until recently, spot instances would get taken away the moment you got outbid. This gave workloads no time to perform any evasive action (to store state or transfer processed information). This changed last year: you now get a 2-minute warning, or “Termination Notice”, to help you manage your application and take the necessary preventive action.
  5. Spot Instance limits – Just like many other AWS services, Spot instances also have soft limits (only 20 spot instances to begin with). This can, however, be changed by submitting a request to AWS.
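The first fact, AZ-to-AZ price differences, is easy to act on: the spot price history API lets you compare zones before launching. A sketch with boto3 is below; `describe_spot_price_history` is the real call, while the helper names are mine, and pagination of long histories is ignored for brevity:

```python
def cheapest_zone(prices):
    """prices: {availability_zone: spot price in USD}; pick the cheapest pool."""
    return min(prices, key=prices.get)

def latest_prices_by_zone(instance_type, region="us-east-1"):
    """Latest Linux/UNIX spot price per AZ over the past hour."""
    import boto3  # lazy import so cheapest_zone() is usable offline
    from datetime import datetime, timedelta
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=1),
    )
    latest = {}  # AZ -> (timestamp, price); keep only the newest record per AZ
    for rec in resp["SpotPriceHistory"]:
        az, ts = rec["AvailabilityZone"], rec["Timestamp"]
        if az not in latest or ts > latest[az][0]:
            latest[az] = (ts, float(rec["SpotPrice"]))
    return {az: price for az, (_, price) in latest.items()}
```

Feeding the result into `cheapest_zone` gives the AZ to target for the next launch.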

Bonus point: Interrupt-tolerant workloads such as batch/cron jobs, load testing, video transcoding, rendering, and Hadoop workloads are the best suited for spot. Even if one or more spot instances get taken away, the work can be resumed on other spot/RI/on-demand instances.

X-Post from blog

Posted in Uncategorized | Leave a comment

Spot vs On-Demand: Which is better in a highly variable environment?

Be honest: Hasn’t this question crossed your mind often? I am sure it has because it has crossed mine. What’s the answer then, you ask? I would always say: “Go for Spot!”

In this post, I will explain how to make the best choice between on-demand and spot in a highly variable Amazon Web Services (AWS) environment. I hope that after reading this you will be in a better position to judge and make decisions based on your use case, scenario, and workload.

On-Demand Instances

On-Demand Instances are virtual servers that are purchased at a fixed rate per hour. Each On-Demand Instance is billed from the time it is launched until it is stopped or terminated. Partial instance hours are rounded off to the full hour during billing. Once launched, it will continue to run unless you stop it or in the rare cases of it being Scheduled for Retirement/ Unavailable.

These are the most expensive purchasing options.

Spot Instances

Spot Instances, also virtual servers, are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users’ bids that meet or exceed it gain access to the available Spot Instances.

These are, by far, the cheapest purchasing option providing savings of up to 90% over On-Demand costs.



Google Trends, without any doubt, shows that Spot Instances are gaining popularity at a fast pace. But we have been in this space for close to a decade and know for a fact that businesses are still wary of making the shift to Spot Instances. Why is there less adoption? It’s because of a catch.

The Spot Instances Catch

Like all good things, Spot Instances, too, come with a catch: They can be taken away from you at any time!

Here’s how it works: You place a bid. As long as your bid price is higher than the current Spot Price, Spot Instances will be available to you. The moment the Spot Price goes higher than your bid price, the instances are taken away from you with a 2-minute warning, known quite aptly as Spot Instance Termination Notice. This gives your applications time to prepare for a graceful shutdown.

Most existing applications are designed with the assumption that the servers will never go down. It is thus an additional engineering cost and effort to handle this scenario. This involves snapshotting the application and checkpointing data to persistent storage at regular intervals, so the operation can resume on another server. Because of this, businesses are skeptical about adopting Spot Instances for their use cases.
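If you do make the shift, the two-minute warning is surfaced through the instance metadata service: the `spot/termination-time` endpoint returns 404 until a notice is posted. A polling sketch follows; the `checkpoint` callback stands for whatever state-saving your application needs, and the metadata URL is only reachable from inside the instance:

```python
import time
import urllib.request
import urllib.error

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_scheduled(timeout=2):
    """Return the scheduled termination timestamp, or None if no notice yet."""
    try:
        with urllib.request.urlopen(NOTICE_URL, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.URLError:  # 404 / unreachable => no notice
        return None

def watch(checkpoint, probe=termination_scheduled, poll_seconds=5):
    """Poll for the two-minute warning; run checkpoint() once it appears."""
    while True:
        when = probe()
        if when:
            checkpoint()  # persist state so work can resume elsewhere
            return when
        time.sleep(poll_seconds)
```

Polling every five seconds leaves most of the two-minute window for the checkpoint itself.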


Typical Usage Scenario

A request for a certain number of instances of a particular instance type is placed. Consider the following example: You run a web-tier application and need 100 m3.large instances. The Spot Price on 04 Oct 2016 at 12:12PM UTC+0530 is 0.0285 USD. What should your bid price be? That question is worth another blog post, but say you bid at 0.1 USD. Subject to Spot availability and your account limits, you will get at most 100 m3.large instances.


As you move through the days, you see small Spot Price spikes hovering in the range of 0.08 USD – 0.1 USD. You did good by setting a high bid price! But the real shocker comes on 06 Oct 2016 at 04:57PM UTC+0530. The Spot Price is a massive 1.46 USD. All the 100 m3.large instances are taken away and your web-tier is unreachable. Your application is experiencing an indefinite downtime and you are losing money real quick.


Mitigation Strategies

The fault, which is apparent in hindsight, was relying only on 1 instance type. You now know that at least 2 instance types must be used in a Spot-only environment to achieve better availability.

You request for 50 m3.large and 50 m4.large. You also increase your bid price to 0.2 USD, just to be safe.


The unexpected has occurred. On 06 Oct 2016 at 08:27PM UTC+0530, the Spot Price is 0.66 USD. Your bid price is too low; both m3.large and m4.large are taken away from you, and your application is experiencing downtime again! What did you do wrong?

The mistake, again clearly apparent in hindsight, is that both instance types were launched in the same Availability Zone, us-east-1e.

You now request 50 m3.large in us-east-1e and 50 r3.large in us-east-1c. You increase your bid price to 0.3 USD. Again, just to be safe.


My God, it happened again! On 06 Oct 2016 at 08:27PM UTC+0530, the Spot Price spiked to 1.75 USD. All your Spot Instances are taken away and your application is experiencing indefinite downtime again! What went wrong now?

In summary, mitigation strategies that we looked into are:

  • Never launch all servers of a single instance type
  • At least 2 instance types are a must for better availability
  • Never launch all in a single Availability Zone

Even with these in place, we still saw downtime. It is a never-ending cat-and-mouse game. No matter how safe you think you are, there is always an edge case that causes havoc.

There are several other mitigation approaches that have to be in place to make Spot Instances truly available. Top of my head, I can recall:

  • Choosing the best bid price for a particular instance type
    • It really is an art!
  • A Spot Availability Predictor which predicts the likelihood of an instance being taken away
    • Having access to historical Spot Prices helps build robust Machine Learning prediction models. This helps you answer questions like:
      • Which is a good instance type to launch?
      • Should I launch it in us-east-1a or us-east-1c?
      • Is it ok to launch now or 10 minutes later for better availability?
  • Falling back to an appropriate similar instance
    • Spot Instances are subject to availability as well as your account limits. Would you rather go for On-Demand or a similar instance on Spot?
  • Inclusion of Spot Fleet
    • Spot Fleet is a boon. It vastly simplifies the management of thousands of Spot Instances in one request. You can choose between lowestPrice and diversified as your allocation strategy to meet your requirements.
  • Spot Fleet has limitations though, to name a few:
    • No native support to attach Spot Fleet instances to an ELB.
    • It cannot intelligently decide which instance type to launch if all the Spot Instances are down.
  • Usage of Spot Blocks
    • Spot Blocks are a good fit for Defined-Duration workloads. Just specify the Block Duration parameter when placing a Spot Instance request. But it is twice as costly as a regular Spot Instance. Why should you go for it?
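A Spot Block is requested through the regular `request_spot_instances` call with the `BlockDurationMinutes` parameter. A minimal sketch, with the AMI id and bid as placeholders and `spot_block_request` as an illustrative helper:

```python
def spot_block_request(ami, instance_type, bid, hours):
    """Request payload for a defined-duration (Spot Block) instance.
    BlockDurationMinutes must be a multiple of 60, between 60 and 360."""
    assert 1 <= hours <= 6, "Spot Blocks run for 1-6 hours only"
    return {
        "SpotPrice": str(bid),
        "BlockDurationMinutes": hours * 60,
        "LaunchSpecification": {"ImageId": ami, "InstanceType": instance_type},
    }

def place(req):
    """Submit the request to EC2."""
    import boto3  # lazy import so the payload builder is testable offline
    ec2 = boto3.client("ec2")
    return ec2.request_spot_instances(**req)

# An 8-hour batch job would need to be split, e.g. two 4-hour blocks,
# or a 6-hour block plus fallback capacity:
req = spot_block_request("ami-12345678", "m3.large", 0.10, 6)
```

The hard 6-hour ceiling is why Spot Blocks alone cannot cover the 8-hour transcoding workload from the earlier example.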

As you can see, Spot Instances management is very complex. There are 50+ instance types, 11 regions, 35+ Availability Zones, and 600+ bidding strategies.

Back to the Original Question

It really boils down to, “Do you truly want 100% availability at all times OR 100% availability with a teeny-weeny chance of 99.99% availability sometimes, if at all, at a cool savings of up to 90%?” Think about it: The worst case is around 10 minutes downtime in a year and if your annual On-Demand expenditure is 1M USD, then you get the same performance and functionality and similar uptime with savings of up to 900,000 USD.

I completely understand if you say, “Spot management is none of my business.” But, quite frankly, it is ours! Register now for a free 14-day trial of Batchly.

I know you have one final question that still has you scratching your head: “Can I get the best of both worlds?” Yes you can! We also offer a hybrid model of On-Demand and Spot Instances for 100% availability for the really mission-critical workloads.

Remember: A penny saved is a penny earned!


Vijay Olety is a Technical Architect at 47Line. Fondly known as “Garage Phase Developer”, he is passionate about Cloud, Big Data and Search. He holds an M.Tech in Computer Science from IIIT-B.

X-Post from blog

Posted in Uncategorized | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 7

In Part 6, we discussed handling failures via Hinted Handoff & Replica Synchronization. We also talked about the advantages of using a Sloppy Quorum and Merkle Trees.

In this last & final part of the series, we will look into Membership and Failure Detection.

Ring Membership

In any production environment, node outages are often transient. They rarely signify permanent failure, and hence there is no need to repair or rebalance the partitions. On the other hand, manual errors might result in the unintentional startup of new DynamoDB nodes. A proper mechanism is essential for the addition and removal of nodes from the DynamoDB ring. An administrator uses a tool to issue a membership change command to either add or remove a node. The node that picks up this request writes the membership change and its timestamp into its persistent store. A gossip-based protocol propagates the membership changes and maintains an eventually consistent view of membership across all nodes. Each node contacts a peer chosen at random every second, and the two nodes efficiently reconcile their persisted membership change histories. Partitioning and placement information also propagates via the gossip-based protocol, and each storage node is aware of the token ranges its peers are responsible for. This allows each node to forward a key’s read/write operations to the right set of nodes directly.

Ring Membership
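The per-peer reconciliation step can be pictured as a timestamped merge of two membership-change histories, where the latest change for each node wins. This is a deliberately simplified sketch of the idea, not the paper's actual algorithm:

```python
def reconcile(history_a, history_b):
    """Merge two persisted membership-change histories.

    Each history maps node-id -> (status, timestamp). For every node,
    the change with the larger timestamp wins, so after a gossip round
    both peers converge on the same view.
    """
    merged = dict(history_a)
    for node, (status, ts) in history_b.items():
        if node not in merged or merged[node][1] < ts:
            merged[node] = (status, ts)
    return merged
```

Run in both directions (A adopting B's newer entries and vice versa), the two peers end a gossip exchange with identical histories, which is how the ring view stays consistent without any central coordinator.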

External Discovery

It’s best to explain with an example: An administrator joins node A to the ring and then joins node B to the ring. Nodes A and B each consider themselves part of the ring, yet neither would be immediately aware of the other. To prevent these logical partitions, DynamoDB introduced the concept of seed nodes. Seed nodes are fully functional nodes that are discovered via an external mechanism (static configuration or a configuration service) and are known to all nodes. Since each node communicates with a seed node and the gossip-based protocol transfers the membership changes, logical partitions are highly unlikely.

Failure Detection

Failure detection in DynamoDB is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A’s messages (even if B is responsive to node C‘s messages). In the presence of a steady rate of client requests generating inter-node communication in the DynamoDB ring, a node A quickly discovers that a node B is unresponsive when B fails to respond to a message; Node A then uses alternate nodes to service requests that map to B‘s partitions; A also periodically retries B to check for the latter’s recovery. In the absence of client requests to drive traffic between two nodes, neither node really needs to know whether the other is reachable and responsive.


This exhaustive 7-part series detailing every component is sufficient to understand the design and architecture of any NoSQL system. Phew! What an incredible journey it has been these couple of months delving into the internals of DynamoDB. Having patiently read this far, you are amongst the chosen few who have this sort of deep NoSQL knowledge. You can be extremely proud of yourself!

Let’s eagerly await another expedition soon!

Article authored by Vijay Olety.

X-Post from CloudAcademy.

Posted in Technical | Tagged , , , , | Leave a comment