Be honest: Hasn’t this question crossed your mind often? I am sure it has because it has crossed mine. What’s the answer then, you ask? I would always say: “Go for Spot!”
In this post, I will explain how to make the best choice between on-demand and spot in a highly variable Amazon Web Services (AWS) environment. I hope that after reading this you will be in a better position to judge and make decisions based on your use case, scenario, and workload.
On-Demand Instances
On-Demand Instances are virtual servers purchased at a fixed rate per hour. Each On-Demand Instance is billed from the time it is launched until it is stopped or terminated. Partial instance hours are rounded up to the full hour during billing. Once launched, an instance continues to run until you stop it or, in rare cases, until it is scheduled for retirement or becomes unavailable.
This is the most expensive purchasing option.
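To make the billing rule concrete, here is a minimal sketch of how a partial hour turns into a full billed hour. The function name and the 0.10 USD hourly rate are assumptions made for illustration, not actual AWS prices.

```python
import math


def on_demand_cost(runtime_seconds: float, hourly_rate_usd: float) -> float:
    """Cost of one On-Demand Instance when partial hours are billed as full hours."""
    if runtime_seconds <= 0:
        return 0.0
    # Any started hour counts as a whole hour.
    billed_hours = math.ceil(runtime_seconds / 3600)
    return billed_hours * hourly_rate_usd


# 1 hour 5 minutes of runtime is billed as 2 full hours (hypothetical rate).
cost = on_demand_cost(runtime_seconds=65 * 60, hourly_rate_usd=0.10)
print(f"{cost:.2f} USD")
```

Under this rule, stopping an instance five minutes into a new hour costs the same as running it for the whole hour.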
Spot Instances
Spot Instances, also virtual servers, are spare computing capacity available at deeply discounted prices. AWS allows users to bid on unused EC2 capacity in a region at any given point and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and all users whose bids meet or exceed it gain access to the available Spot Instances.
This is, by far, the cheapest purchasing option, providing savings of up to 90% over On-Demand costs.
Trend
Google Trends makes it clear that Spot Instances are gaining popularity at a fast pace. But we have been in this space for close to a decade and know for a fact that businesses are still wary of making the shift to Spot Instances. Why is adoption so low? It’s because of a catch.
The Spot Instances Catch
Like all good things, Spot Instances, too, come with a catch: They can be taken away from you at any time!
Here’s how it works: You place a bid. As long as your bid price is higher than the current Spot Price, Spot Instances will be available to you. The moment the Spot Price goes higher than your bid price, the instances are taken away from you with a 2-minute warning, known quite aptly as Spot Instance Termination Notice. This gives your applications time to prepare for a graceful shutdown.
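One way an application can notice the 2-minute warning is to poll the EC2 instance metadata service, which exposes a `spot/termination-time` entry once the instance is marked for termination and returns 404 before that. The sketch below is only an illustration: the helper names, the 5-second polling interval, and the assumption that the metadata endpoint is reachable without a session token are mine, not part of the original post.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# Metadata path that is populated once a termination notice has been issued.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url: str = TERMINATION_URL, timeout: float = 1.0) -> Optional[str]:
    """Return the termination timestamp, or None while no notice has been issued."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no termination notice yet
            return None
        raise


def wait_for_notice(
    fetch: Callable[[], Optional[str]],
    on_notice: Callable[[str], None],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll until a notice appears, then call on_notice. Returns True if a notice was seen."""
    polls = 0
    while max_polls is None or polls < max_polls:
        termination_time = fetch()
        if termination_time is not None:
            on_notice(termination_time)  # e.g. drain connections, flush state
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

On a Spot Instance this would be started as `wait_for_notice(fetch_termination_time, my_shutdown_hook)`, where `my_shutdown_hook` is a hypothetical function that performs the graceful shutdown. Passing the fetch function in as a parameter keeps the loop testable without a real instance.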
Most existing applications are designed with the assumption that the servers will never go down. Handling this scenario is therefore an additional engineering cost and effort. It involves snapshotting the application and checkpointing data to persistent storage at regular intervals so that the operation can resume on another server. Because of this, businesses are skeptical about adopting Spot Instances for their use cases.
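As a rough illustration of the checkpointing idea, the sketch below saves progress to a file at regular intervals and resumes from it after a restart. The function names, the JSON format, and the use of a local path as a stand-in for persistent storage (such as an attached volume or object store) are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a termination mid-write never leaves a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default when no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process(items, checkpoint_path, every=100):
    """Sum the items, checkpointing every `every` items and resuming where it left off."""
    state = load_checkpoint(checkpoint_path, {"next_index": 0, "total": 0})
    for index in range(state["next_index"], len(items)):
        state["total"] += items[index]
        state["next_index"] = index + 1
        if state["next_index"] % every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)
    return state["total"]
```

If the instance is taken away partway through, a replacement server running `process` with the same checkpoint path repeats at most `every - 1` items instead of starting over.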
Typical Usage Scenario
You place a request for a certain number of instances of a particular instance type. Consider the following example: You run a web-tier application and need 100 m3.large instances. The Spot Price on 04 Oct 2016 at 12:12PM UTC+0530 is 0.0285 USD. What should your bid price be? That question is worth another blog post, but say you bid 0.1 USD. Subject to Spot availability and your account limits, you will get at most 100 m3.large instances.
As the days go by, you see small Spot Price spikes hovering in the range of 0.08 USD to 0.1 USD. You did well by setting a high bid price! But the real shocker comes on 06 Oct 2016 at 04:57PM UTC+0530. The Spot Price is a massive 1.46 USD. All 100 m3.large instances are taken away and your web tier is unreachable. Your application is experiencing indefinite downtime and you are losing money fast.
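The scenario can be replayed with a few lines of code: walk through a price history and report the first moment the Spot Price goes above the bid. This is a sketch only. The function name is made up, and the two intermediate timestamps are invented to represent the small spikes; the first and last data points and the 0.1 USD bid come from the example above.

```python
from typing import Optional, Sequence, Tuple


def first_interruption(
    price_history: Sequence[Tuple[str, float]], bid_usd: float
) -> Optional[str]:
    """Return the timestamp of the first price above the bid, or None if the bid always holds."""
    for timestamp, price in price_history:
        if price > bid_usd:  # instances are lost once the Spot Price exceeds the bid
            return timestamp
    return None


history = [
    ("2016-10-04 12:12", 0.0285),
    ("2016-10-05 09:00", 0.08),  # illustrative timestamp for a small spike
    ("2016-10-05 18:00", 0.10),  # illustrative timestamp for a small spike
    ("2016-10-06 16:57", 1.46),
]

print(first_interruption(history, bid_usd=0.1))  # 2016-10-06 16:57
```

A bid of 0.1 USD survives the small spikes because the price never goes above it, but nothing short of a bid above 1.46 USD would have survived the final one.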
Mitigation Strategies
The fault, which is apparent in hindsight, was relying only on 1 instance type. You now know that at least 2 instance types must be used in a Spot-only environment to achieve better availability.
You request 50 m3.large and 50 m4.large. You also increase your bid price to 0.2 USD, just to be safe.
The unexpected has occurred. On 06 Oct 2016 at 08:27PM UTC+0530, the Spot Price is 0.66 USD. Your bid price is too low, both m3.large and m4.large are taken away from you, and your application is experiencing downtime again! What did you do wrong?
The mistake, again clearly apparent in hindsight, is that both instance types were launched in the same Availability Zone, us-east-1e.
You now request 50 m3.large in us-east-1e and 50 r3.large in us-east-1c. You increase your bid price to 0.3 USD. Again, just to be safe.
My God, it happened again! On 06 Oct 2016 at 08:27PM UTC+0530, the Spot Price spiked to 1.7500 USD. All your Spot Instances are taken away and your application is facing indefinite downtime again! What went wrong now?
In summary, mitigation strategies that we looked into are:
- Never launch all servers of a single instance type
- At least 2 instance types are a must for better availability
- Never launch all in a single Availability Zone
Even with these in place, we still saw downtime. It is a never-ending cat-and-mouse game. No matter how safe you think you are, there is always an edge case that causes havoc.
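The effect of spreading instances across instance types and Availability Zones can be sketched by treating each (instance type, Availability Zone) pair as its own pool with its own Spot Price. The `Pool` class, the function name, and the price snapshots below are hypothetical; only the pool layout and the 0.3 USD bid are taken from the last attempt above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Pool:
    instance_type: str
    availability_zone: str
    count: int


def surviving_capacity(
    pools: Iterable[Pool],
    spot_prices: Dict[Tuple[str, str], float],
    bid_usd: float,
) -> int:
    """Count instances in pools whose current Spot Price does not exceed the bid."""
    surviving = 0
    for pool in pools:
        price = spot_prices[(pool.instance_type, pool.availability_zone)]
        if price <= bid_usd:
            surviving += pool.count
    return surviving


pools = [Pool("m3.large", "us-east-1e", 50), Pool("r3.large", "us-east-1c", 50)]

# Hypothetical price snapshots: a spike in one pool versus a spike in both.
one_pool_spikes = {("m3.large", "us-east-1e"): 1.75, ("r3.large", "us-east-1c"): 0.03}
both_pools_spike = {("m3.large", "us-east-1e"): 1.75, ("r3.large", "us-east-1c"): 1.75}

print(surviving_capacity(pools, one_pool_spikes, bid_usd=0.3))   # 50
print(surviving_capacity(pools, both_pools_spike, bid_usd=0.3))  # 0
```

Diversification helps when spikes are isolated to one pool, but it offers no protection when every pool you chose spikes at the same time.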
There are several other mitigation approaches that have to be in place to make Spot Instances truly available. Off the top of my head, I can recall:
- Choosing the best bid price for a particular instance type
  - It really is an art!
- A Spot Availability Predictor which predicts the likelihood of an instance being taken away
  - Having access to historical Spot Prices helps build robust Machine Learning prediction models. This helps you answer questions like:
    - Which is a good instance type to launch?
    - Should I launch it in us-east-1a or us-east-1c?
    - Is it ok to launch now or 10 minutes later for better availability?
- Falling back to an appropriate similar instance
  - Spot Instances are subject to availability as well as your account limits. Would you rather go for On-Demand or a similar instance on Spot?
- Inclusion of Spot Fleet
  - Spot Fleet is a boon. It vastly simplifies the management of thousands of Spot Instances in one request. You can choose between lowestPrice and diversified as your allocation strategy to meet your requirements.
  - Spot Fleet has limitations though, to name a few:
    - No native support for attaching Spot Fleet instances to an ELB.
    - It cannot intelligently decide which instance type to launch if all the Spot Instances are down.
- Usage of Spot Blocks
  - Spot Blocks are a good fit for Defined-Duration workloads. Just specify the Block Duration parameter when placing a Spot Instance request. But they are twice as costly as regular Spot Instances. Why should you go for them?
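To give a flavor of how historical Spot Prices can feed the predictor and fallback ideas in the list above, here is a deliberately naive sketch. It is a plain frequency count rather than a Machine Learning model: for each pool it measures how often the price went above the bid, picks the pool with the lowest rate, and falls back to On-Demand when even the best pool looks too risky. The function names, the 5% threshold, and the sample prices are all assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple

PoolKey = Tuple[str, str]  # (instance type, Availability Zone)


def exceed_rate(prices: Sequence[float], bid_usd: float) -> float:
    """Fraction of historical price samples that were above the bid."""
    if not prices:
        raise ValueError("price history must not be empty")
    return sum(1 for price in prices if price > bid_usd) / len(prices)


def choose_pool(
    history: Dict[PoolKey, Sequence[float]],
    bid_usd: float,
    max_exceed_rate: float = 0.05,
) -> Tuple[str, Optional[PoolKey]]:
    """Return ("spot", pool) for the historically safest pool, or ("on-demand", None)."""
    best_pool: Optional[PoolKey] = None
    best_rate: Optional[float] = None
    for pool, prices in history.items():
        rate = exceed_rate(prices, bid_usd)
        if best_rate is None or rate < best_rate:
            best_pool, best_rate = pool, rate
    if best_rate is None or best_rate > max_exceed_rate:
        return ("on-demand", None)  # no pool is safe enough at this bid
    return ("spot", best_pool)


# Hypothetical price samples in USD for two pools.
sample_history = {
    ("m3.large", "us-east-1a"): [0.03, 0.03, 0.50, 0.03],
    ("m3.large", "us-east-1c"): [0.03, 0.04, 0.03, 0.05],
}

print(choose_pool(sample_history, bid_usd=0.1))
```

A real predictor would weigh recency, time of day, and correlation between pools; the point here is only that price history turns "which pool, and Spot or On-Demand?" into a question that can be computed.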
As you can see, Spot Instance management is very complex. There are 50+ instance types, 11 regions, 35+ Availability Zones, and 600+ bidding strategies.
Back to the Original Question
It really boils down to, “Do you truly want 100% availability at all times OR 100% availability with a teeny-weeny chance of 99.99% availability sometimes, if at all, at a cool savings of up to 90%?” Think about it: The worst case is around 10 minutes of downtime in a year, and if your annual On-Demand expenditure is 1M USD, you get the same performance and functionality and similar uptime with savings of up to 900,000 USD.
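The trade-off is easy to check with a few lines of arithmetic. The sketch below converts between an availability percentage and minutes of downtime per year, and computes the savings for a given discount. The function names are made up for illustration, and a 365-day year is assumed. For reference, 99.99% availability allows about 52.6 minutes of downtime per year, while 10 minutes per year corresponds to roughly 99.998%.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def availability_percent(downtime_minutes: float) -> float:
    """Availability percentage implied by a yearly downtime in minutes."""
    return (1 - downtime_minutes / MINUTES_PER_YEAR) * 100


def spot_savings_usd(on_demand_spend_usd: float, discount_percent: float) -> float:
    """Savings when the same workload runs at the given discount."""
    return on_demand_spend_usd * discount_percent / 100


print(round(downtime_minutes_per_year(99.99), 2))
print(round(availability_percent(10), 4))
print(spot_savings_usd(1_000_000, 90))  # 900000.0
```

The 90% figure is the upper bound quoted in this post, so the savings number is a best case rather than a guarantee.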
I completely understand if you say, “Spot management is none of my business.” But, quite frankly, it is ours! Register now for a free 14-day trial of Batchly.
I know you have one final question that still has you scratching your head: “Can I get the best of both worlds?” Yes you can! We also offer a hybrid model of On-Demand and Spot Instances for 100% availability for the really mission-critical workloads.
Remember: A penny saved is a penny earned!
Vijay Olety is a Technical Architect at 47Line. Fondly known as “Garage Phase Developer”, he is passionate about Cloud, Big Data and Search. He holds an M.Tech in Computer Science from IIIT-B.
X-Post from cmpute.io blog