Spot Instances have gained popularity and continue to do so at a rapid pace. However, they come with their own set of complexities and challenges. Without a mitigation plan in place, these can lead to application downtime and cost you dearly. Common strategies include –
- Never launching all Spot Instances of a single instance type
- Launching at least two instance types for better availability
- Never launching all Spot Instances in a single Availability Zone
Since the Spot availability and prices are governed by market volatility, there is still a high probability of the instance being taken away from you. In this post, I will explain how to make the best use of Spot Instance Termination Notices.
Spot Instance Termination Notice
In late 2009, AWS launched Spot Instances, allowing users to bid for spare EC2 capacity at a price they were willing to pay. Users had no advance warning when those instances were reclaimed. This resulted in lost work and data inconsistencies, which in turn affected users’ businesses. Hence, in early 2015, AWS introduced the Spot Instance Termination Notice, a two-minute warning intended to let the user or an application take appropriate actions. These include, but are not limited to –
- Saving the application state
- Uploading final log files
- Removing itself from an Elastic Load Balancer
- Pushing SNS notifications
Google Trends shows that Spot Instances are gaining popularity. But what worries us is that people are not aware of the termination notice. How are they managing it then? Short answer: They aren’t!
Strategies for Spot Availability
As mentioned earlier, common strategies are employed to mitigate the risk of Spot Instance availability. A Spot Instance management solution, such as ours, goes further to include –
- A Spot Availability Predictor, which predicts the likelihood of an instance being taken away
- Falling back to an appropriate similar instance
- Usage of Spot Fleet and Spot Block
- Choosing the best bid price
With all these in place and given the volatile nature of Spot Instances, sometimes things do get out of control! A certain Spot Instance has executed a majority of your workflow and only a tiny bit is pending for successful completion. The instance is now taken away from you. Would you restart the entire workflow again?
Usage of Spot Instance Termination Notice
How will an application running on a Spot Instance know that it will be reclaimed? The Termination Notice is accessible to an application running on the instance via the instance metadata at http://169.254.169.254/latest/meta-data/spot/termination-time
This information will be available when the instance has been marked for termination and will contain the time when a shutdown signal will be sent to the instance’s operating system. AWS recommends that applications poll for the termination notice at five-second intervals. This will give the application almost two full minutes to complete any required processing, such as saving the state and uploading the final logs before it is reclaimed. You can check for this warning using the following query –
$ if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then echo terminated; fi
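If you would rather poll from a script than run the one-liner by hand, the same check could be wrapped in a small loop. Here is a minimal sketch in POSIX-style shell, not a definitive implementation. The function names, FETCH_CMD, and POLL_INTERVAL are my own illustrative choices, and it assumes the metadata endpoint answers unauthenticated requests, as in the command above.

```shell
# Sketch: poll the Spot termination-time metadata every five seconds.
METADATA_URL="http://169.254.169.254/latest/meta-data/spot/termination-time"

# Succeeds if the argument looks like a UTC timestamp such as
# 2015-01-05T18:02:00Z, the form the endpoint serves once the
# instance has been marked for termination.
is_termination_time() {
  printf '%s' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}

# Prints the endpoint body, or nothing on an HTTP error
# (the endpoint returns an error until the instance is marked).
fetch_termination_time() {
  curl -s -f --max-time 2 "$METADATA_URL" || true
}

# Polls until a termination time appears, then calls the handler with it.
# FETCH_CMD and POLL_INTERVAL are hypothetical knobs, added so the loop
# can be exercised without a real metadata service.
wait_for_termination() {
  local handler fetch interval body
  handler="$1"
  fetch="${FETCH_CMD:-fetch_termination_time}"
  interval="${POLL_INTERVAL:-5}"
  while true; do
    body="$($fetch)"
    if is_termination_time "$body"; then
      "$handler" "$body"
      return 0
    fi
    sleep "$interval"
  done
}
```

A caller would define its own handler function and run `wait_for_termination my_handler` in the background alongside the application. The stricter pattern is deliberate: it avoids treating an HTML error page that happens to contain a "T" and a "Z" as a termination notice.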
Here’s a timeline, reproduced from the AWS blog, to help you to understand the termination process (the “+” indicates a time relative to the start of the timeline) –
- +00:00 – Your Spot instance is marked for termination because the current Spot price has risen above the bid price. The bid status of your Spot Instance Request is set to marked-for-termination and the /spot/termination-time metadata is set to a time precisely two minutes in the future.
- Between +00:00 and +00:05 – Your instance (assuming that it is polling at five-second intervals) learns that it is scheduled for termination.
- Between +00:05 and +02:00 – Your application makes all necessary preparation for shutdown. It can checkpoint work in progress, upload final log files, and remove itself from an Elastic Load Balancer.
- +02:00 – The instance’s operating system will be told to shut down and the bid status will be set to instance-terminated-by-price.
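To make the window between +00:05 and +02:00 concrete, here is a rough sketch of what a shutdown handler might do with the AWS CLI. Treat it as an illustration under assumptions: the load balancer name, bucket, log path, and topic ARN are placeholders I made up, and the deregistration call targets a Classic Load Balancer.

```shell
# Hypothetical names; substitute your own load balancer, bucket,
# log path, and SNS topic.
ELB_NAME="${ELB_NAME:-my-load-balancer}"
LOG_BUCKET="${LOG_BUCKET:-my-log-bucket}"
LOG_FILE="${LOG_FILE:-/var/log/app/app.log}"
SNS_TOPIC_ARN="${SNS_TOPIC_ARN:-arn:aws:sns:us-east-1:123456789012:spot-events}"

# Runs the shutdown preparation for one instance. Every step is attempted
# even if an earlier one fails, because the two-minute window cannot be
# extended. Returns non-zero if any step failed.
on_termination_notice() {
  local instance_id termination_time failed
  instance_id="$1"
  termination_time="$2"
  failed=0

  # Stop receiving new requests.
  aws elb deregister-instances-from-load-balancer \
    --load-balancer-name "$ELB_NAME" --instances "$instance_id" || failed=1

  # Upload the final log file.
  aws s3 cp "$LOG_FILE" "s3://$LOG_BUCKET/$instance_id/final.log" || failed=1

  # Notify whoever is subscribed to the topic.
  aws sns publish --topic-arn "$SNS_TOPIC_ARN" \
    --message "Spot instance $instance_id terminates at $termination_time" ||
    failed=1

  return "$failed"
}
```

The steps run in order of urgency: leaving the load balancer first stops new requests from arriving while the rest of the cleanup proceeds.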
You have just been hearing vague action items, such as save the state or checkpoint the progress. What does that actually mean, you ask?
Typical Usage Scenario
Consider a sample stateless application, such as a HealthCheck API, running on Spot Instances behind an Elastic Load Balancer. When a request is made to the application, one of the Spot Instances processes it. But before the result is returned, that Spot Instance is reclaimed. The calling application, having received no result within a pre-configured timeout, sends the request again. Another Spot Instance now processes it and returns the result. Easy-peasy here.
The complexities arise when an application is stateful. Let’s consider a sample video encoding application which takes about 45 minutes for an HD conversion. The infrastructure setup is the same as before, i.e., Spot Instances behind an Elastic Load Balancer. The workflow is as follows –
- Application sends a video encoding request to the Elastic Load Balancer
- The Elastic Load Balancer routes it to a Spot Instance
- The Spot Instance starts the HD conversion, taking 45 minutes
- The Spot Instance then returns the downloadable link to the application
The problem is with the third step above. What if the HD conversion is 40 minutes deep and then the Spot Instance is reclaimed? If you did not save the state, you have to restart from scratch. Saving the state simply means storing the current snapshot in persistent storage such as S3. When a new Spot Instance becomes active, it first copies the snapshot from S3 and then resumes the workflow.
Evidently, this demands a few housekeeping activities – the snapshot has to be moved from the local store to one that is persistent. It then has to be transferred back from the persistent store to the local store on a new Spot Instance, so that the application can resume the operation. I know you have a few rapid-fire questions for me. Shoot away!
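For the curious, that round trip through S3 might look something like the following. This is only a sketch under assumptions: the bucket name, key layout, and checkpoint directory are hypothetical, and it presumes your encoder can write its progress into a directory and resume from it.

```shell
# Hypothetical names: replace with your own bucket and checkpoint path.
CHECKPOINT_BUCKET="${CHECKPOINT_BUCKET:-my-checkpoint-bucket}"
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/var/lib/encoder/checkpoint}"

# Prints the S3 location of a job's checkpoint archive.
checkpoint_uri() {
  printf 's3://%s/checkpoints/%s.tar.gz\n' "$CHECKPOINT_BUCKET" "$1"
}

# Archives the local checkpoint directory and copies it to S3.
save_checkpoint() {
  local job_id archive status
  job_id="$1"
  archive="$(mktemp)" || return 1
  tar -czf "$archive" -C "$CHECKPOINT_DIR" . &&
    aws s3 cp "$archive" "$(checkpoint_uri "$job_id")"
  status=$?
  rm -f "$archive"
  return "$status"
}

# Downloads and unpacks a job's checkpoint on a fresh instance.
# Fails if no checkpoint has been saved for the job.
restore_checkpoint() {
  local job_id archive status
  job_id="$1"
  archive="$(mktemp)" || return 1
  mkdir -p "$CHECKPOINT_DIR" &&
    aws s3 cp "$(checkpoint_uri "$job_id")" "$archive" &&
    tar -xzf "$archive" -C "$CHECKPOINT_DIR"
  status=$?
  rm -f "$archive"
  return "$status"
}
```

The old instance would call `save_checkpoint job42` when the termination notice arrives, and the replacement would call `restore_checkpoint job42` before resuming, falling back to a fresh start if the restore fails. Whether a snapshot can be uploaded inside two minutes depends on its size, so saving checkpoints periodically rather than only at the notice is worth considering.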
- Isn’t this enough?
- AWS sends these termination notices on a best-effort basis. This basically means that while they make every effort to provide this warning, it is possible that your Spot Instance will be terminated before Amazon EC2 can make the warning available.
Wow, that’s a shocker!
- Can we do better?
- Yes, we can!
Making Spot Instance Data Persistent
When a Spot Instance is reclaimed, it takes with it the data present on its local storage. Any attached EBS volumes persist, provided Delete on Termination is unchecked. The architectural design to make the data persistent is as follows –
- Store the required data in an EBS volume, say ebs1, attached via a mount point, say /mount1
- Spot Instance gets the termination notice, giving the application a two-minute warning
- The application detaches ebs1 from the Spot Instance
- Launch a new Spot Instance, with user-data containing the script to attach ebs1 on /mount1
- A resumable controller, with intelligence to resume the operation from where it had left off, restarts the application
- The application runs to completion
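The detach and re-attach steps above could be sketched with the AWS CLI roughly as follows. This is an illustration under assumptions, not the actual implementation: the volume ID and device name are placeholders, and on some instance types the attached volume shows up under a different device name (for example /dev/xvdf or an NVMe device), so the mount line may need adjusting.

```shell
# Hypothetical values standing in for ebs1 and /mount1 from the steps above.
VOLUME_ID="${VOLUME_ID:-vol-0123456789abcdef0}"
MOUNT_POINT="${MOUNT_POINT:-/mount1}"
DEVICE="${DEVICE:-/dev/sdf}"

# Old instance, on termination notice: flush writes, unmount, then detach.
release_volume() {
  sync
  umount "$MOUNT_POINT" &&
    aws ec2 detach-volume --volume-id "$VOLUME_ID"
}

# New instance, from user-data: wait until the volume is free,
# attach it to this instance, and mount it at the same mount point.
claim_volume() {
  local instance_id
  instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
  aws ec2 wait volume-available --volume-ids "$VOLUME_ID" &&
    aws ec2 attach-volume --volume-id "$VOLUME_ID" \
      --instance-id "$instance_id" --device "$DEVICE" &&
    aws ec2 wait volume-in-use --volume-ids "$VOLUME_ID" &&
    mkdir -p "$MOUNT_POINT" &&
    mount "$DEVICE" "$MOUNT_POINT"
}
```

Note that an EBS volume can only be attached to an instance in the same Availability Zone, so the replacement Spot Instance has to be launched in the zone where the volume lives.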
There is also a secret sauce which I have intentionally not delved into. And that’s it – we have accomplished data persistence on Spot Instances, too, just like On-Demand Instances. It is quite an achievement!
There’s also an alternate way through Amazon Elastic File System (EFS). Amazon EC2 instances mount EFS via the NFSv4.1 protocol, using standard operating system mount points. Currently, it is available only in three regions: Northern Virginia (us-east-1), Oregon (us-west-2), and Ireland (eu-west-1). It is still early days but holds a lot of promise.
I completely understand if you say, “Spot management is none of my business.” But, quite frankly, it is ours! Register now for a free 14-day trial of Batchly.
Remember: A penny saved is a penny earned!
Vijay Olety is a Founding Engineer and Technical Architect at Batch.ly. He likes to be called a “Garage Phase Developer” who is passionate about Cloud, Big Data, and Search. He holds a Masters Degree in Computer Science from IIIT-B.
X-Post from cmpute.io blog