Skip to content

Conversation

@moko-poi
Copy link
Contributor

@moko-poi moko-poi commented Dec 8, 2025

fix: exponential backoff for EC2NodeClass auth validation retries

Fixes #8114

Description

This PR addresses the issue where EC2NodeClass remains in NotReady state for 30 minutes after IAM permission attachment, even though IAM propagation typically completes within seconds to minutes.

The current implementation uses a fixed 30-minute retry interval for all validation failures, including authorization errors. This was increased from 10 minutes in PR #8439 to reduce API rate limiting. However, this creates a poor UX for the common case of attaching IAM permissions immediately before creating an EC2NodeClass.

Changes:

  • Add AuthRetryInitialDelay (30s) and AuthRetryMaxDelay (5m) constants
  • Implement getAuthRetryDelay() for exponential backoff calculation (30s -> 60s -> 120s -> 240s -> 5m)
  • Apply exponential backoff specifically to authorization failures in:
    • validateCreateLaunchTemplate
    • validateCreateFleet
    • validateRunInstances
  • Clear retry count on successful validation
  • Preserve the existing annotation-based cache invalidation workaround for urgent manual intervention

Impact:

  • Typical case: EC2NodeClass becomes Ready in ~30 seconds instead of 30 minutes
  • API load: For 20 EC2NodeClasses failing simultaneously, increases from 40 requests/30min to ~100 requests/10min before settling at 5-minute intervals
  • Backward compatible: Non-auth failures still use 30-minute interval

How was this change tested?

  • Added unit test should use exponential backoff for auth failures to verify:
    • Retry delays follow exponential backoff pattern
    • Maximum delay caps at 5 minutes
    • Retry count resets on successful validation
  • Existing validation tests continue to pass

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Implements exponential backoff (30s -> 60s -> 120s -> 240s -> 5m) for
authorization failures during EC2NodeClass validation, reducing ready
time from 30 minutes to ~30 seconds for typical IAM propagation.

Fixes aws#8114
@moko-poi moko-poi requested a review from a team as a code owner December 8, 2025 23:44
@moko-poi moko-poi requested a review from bwagner5 December 8, 2025 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Validate RunInstancesAuthCheck Fails For ~10 Minutes In Ec2NodeClaim

1 participant