feat: Prioritize unused Reserved Instances in instance selection #8717
Description
This is a fairly straightforward but important feature to make Karpenter more cost-aware for anyone with existing AWS commitments.
The problem is simple: Karpenter has been blind to Reserved Instances (RIs) unless they are explicitly tied to a Capacity Reservation. This means that if you've bought Standard or Convertible RIs to save money, Karpenter would happily ignore them and go off to launch new, more expensive On-Demand or Spot instances. This is just leaving money on the table and makes Karpenter less effective for anyone trying to optimize their AWS bill.
This change fixes that by making Karpenter RI-aware.
The core of the change is a new, dedicated `ReservedInstanceProvider`. I've put this in its own `pkg/providers/reservedinstance` package because RIs are a distinct concept from Capacity Reservations, and mixing the two would have been a mess. The new provider is responsible for one thing: figuring out which RIs are actually available for use right now.
Getting this right is non-trivial. You can't just use the billing or Cost Explorer APIs (`GetReservationUtilization`, etc.) because that data can be up to 24 hours stale. A scheduler can't work with stale data. The only reliable way to get a real-time view is to:
1. List all `active` Reserved Instances (`DescribeReservedInstances`).
2. List all `running` instances that could be consuming those RIs (`DescribeInstances`).
This is exactly what the new provider does: any reserved capacity not matched by a running instance is treated as unused. It's the only sane approach for real-time accuracy. To avoid hammering the EC2 API on every single provisioning loop, the results are cached for 10 minutes, which is a reasonable trade-off between data freshness and API load.
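To make the matching step concrete, here is a minimal sketch of the "active RIs minus running instances" calculation. It is not the code from this PR: the `poolKey`, `reservation`, and `instance` types and the `unusedReservations` function are hypothetical names, and the sketch assumes RIs match instances purely on instance type and availability zone. Real matching also has to consider regional scope, instance size flexibility, platform, and tenancy.

```go
package main

import "fmt"

// poolKey groups capacity that is assumed to be interchangeable.
// Hypothetical type for illustration; not taken from the PR.
type poolKey struct {
	InstanceType string
	Zone         string
}

// reservation is a simplified view of one active Reserved Instance purchase.
type reservation struct {
	InstanceType string
	Zone         string
	Count        int
}

// instance is a simplified view of one running EC2 instance.
type instance struct {
	InstanceType string
	Zone         string
}

// unusedReservations returns, per pool, how many reserved slots are not
// matched by a running instance. Pools with nothing left are omitted.
func unusedReservations(reserved []reservation, running []instance) map[poolKey]int {
	unused := make(map[poolKey]int)
	// Start from the total reserved count for each pool.
	for _, r := range reserved {
		unused[poolKey{InstanceType: r.InstanceType, Zone: r.Zone}] += r.Count
	}
	// Each running instance consumes one slot in its pool, if any remain.
	for _, i := range running {
		k := poolKey{InstanceType: i.InstanceType, Zone: i.Zone}
		if unused[k] > 0 {
			unused[k]--
		}
	}
	// Drop pools that are fully consumed.
	for k, n := range unused {
		if n == 0 {
			delete(unused, k)
		}
	}
	return unused
}

func main() {
	reserved := []reservation{
		{InstanceType: "m5.large", Zone: "us-east-1a", Count: 3},
		{InstanceType: "c5.xlarge", Zone: "us-east-1b", Count: 1},
	}
	running := []instance{
		{InstanceType: "m5.large", Zone: "us-east-1a"},
		{InstanceType: "c5.xlarge", Zone: "us-east-1b"},
		{InstanceType: "c5.xlarge", Zone: "us-east-1b"},
		{InstanceType: "t3.micro", Zone: "us-east-1a"},
	}
	for k, n := range unusedReservations(reserved, running) {
		fmt.Printf("%s in %s: %d unused\n", k.InstanceType, k.Zone, n)
	}
}
```

In this sketch a pool never goes negative: instances beyond the reserved count are simply billed outside the reservation, so they are ignored.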
With the new provider in place, the rest is simple:
- The new provider is wired up in `pkg/operator/operator.go`.
- It is passed to the `offering.Provider` in `pkg/providers/instancetype/offering/offering.go`.
- The `offering.Provider` now uses this data to create new `reserved` offerings.
These new RI-based offerings are priced using the same trick we already use for Capacity Reservations: the on-demand price is divided by a ridiculously large number. This makes them effectively "free" in the eyes of the scheduler, ensuring they are always picked first if they match the requirements.
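The pricing trick can be illustrated with a small sketch. Everything here is an assumption for illustration: the `offering` type, the `reservedOffering` and `cheapestFirst` helpers, the divisor value, and the example prices are hypothetical and not taken from Karpenter's code. The PR only states that the on-demand price is divided by a very large number.

```go
package main

import (
	"fmt"
	"sort"
)

// reservedPriceDivisor is a hypothetical value; the PR only says
// "a ridiculously large number".
const reservedPriceDivisor = 10000000.0

// offering is a simplified stand-in for a priced launch option.
type offering struct {
	CapacityType string
	InstanceType string
	Price        float64 // hourly price in USD
}

// reservedOffering derives a near-zero price from the on-demand price so
// that a price-ordered scheduler prefers it over any other capacity type.
func reservedOffering(instanceType string, onDemandPrice float64) offering {
	return offering{
		CapacityType: "reserved",
		InstanceType: instanceType,
		Price:        onDemandPrice / reservedPriceDivisor,
	}
}

// cheapestFirst returns a copy of the offerings sorted by ascending price.
func cheapestFirst(offerings []offering) []offering {
	sorted := append([]offering(nil), offerings...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Price < sorted[j].Price
	})
	return sorted
}

func main() {
	// Example prices are made up for the sketch.
	offerings := []offering{
		{CapacityType: "spot", InstanceType: "m5.large", Price: 0.035},
		{CapacityType: "on-demand", InstanceType: "m5.large", Price: 0.096},
		reservedOffering("m5.large", 0.096),
	}
	for _, o := range cheapestFirst(offerings) {
		fmt.Printf("%-10s %s %.10f\n", o.CapacityType, o.InstanceType, o.Price)
	}
}
```

Because the derived price stays proportional to the on-demand price, reserved offerings still sort sensibly against each other while undercutting every Spot and On-Demand offering.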
This closes a significant gap in Karpenter's cost-optimization strategy and makes it behave the way users would expect.
How was this change tested?
Ran `make ci-test` without issues.
End-to-end testing of Reservation features on AWS isn't practical here because the costs are prohibitive. If you know a sane way to do real-world testing without racking up a giant bill, let me know.
Does this change impact docs?