Skip to content

Conversation

@normj
Copy link
Member

@normj normj commented Dec 9, 2025

Draft Mode: I put it in draft mode because I want to do some testing on an actual EC2 instance to confirm the behavior of the unit tests.

Description

The DefaultInstanceProfileAWSCredentials used as the underlying "singleton" across service clients to fetch AWS credentials from EC2 IMDS used a ReaderWriterLockSlim to control the number of threads accessing IMDS. In V3 this was okay because only the sync version was implemented but in V4 we added the async version and ReaderWriterLockSlim is not valid to use across async/await call contexts. I also noticed in the previous logic within the scope of the reader lock it would also initiate a fetch credentials fetch essential the "write" logic that should have only been done when a Write lock was obtained.

I reworked the logic to rely on a SemaphoreSlim to control access and fixed it so the fetch credentials was only done when a thread had acquired the semaphore.

The reason more users haven't run into the issue since V4 was GA last spring is generally the GetCredentials and GetCredentialsAsync should only be returning credentials that were previously fetched during the background call to RenewCredentials. The RenewCredentials method uses the sync calls. When @msab-john ran into the issue there must have been some other issue between the EC2 instance and IMDS that caused the background RenewCredentials to fail to acquire credentials and then when GetCredentialsAsync was called it reverted to it's failsafe of directly trying to get credentials that triggered the SynchronizationLockException

I also did some refactoring of the DefaultInstanceProfileAWSCredentials to make it more friendly for unit testing.

Motivation and Context

#4199

Testing

Add new unit tests
Dry Run: Success (DRY_RUN-6ef5e975-cba6-4794-939b-83e6620f3b50)
Direct EC2 Testing: Pending

if (!_imdsRefreshFailed &&
_lastRetrievedCredentials.IsExpiredWithin(TimeSpan.Zero))
logger.DebugFormat("[Background Timer] Waiting on lock to refresh ECS IMDS");
if (_credentialsSemaphore.Wait(_credentialsLockTimeout))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could _credentialsSemaphore already be disposed here before it tries to wait on it? Yes the exception handler would eat the exception but if we had a check here for _isDisposed we could may be able to avoid an unnecessary exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

{
try
logger.DebugFormat("Waiting on lock to refresh ECS IMDS");
if (_credentialsSemaphore.Wait(_credentialsLockTimeout))
Copy link
Contributor

@boblodgett boblodgett Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this try to wait on a disposed semaphore? Dispose and GetCredentials/GetCredentialsAsync could be called at the same time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the caller try to call a method on a disposed object which is invalid. Same as a user calling read operations on a disposed Stream. The extra protection done for dispose in the RenewCredentials was done because that is called via the timer and there could be timing issues shutting down the timer and disposing the semaphore.

// lower refresh rate to speed test
var provider = DefaultInstanceProfileAWSCredentials.CreateTestDefaultInstanceProfileAWSCredentials(stub, TimeSpan.FromSeconds(5));

// fix semaphore (production code's semaphore is created with initialCount:0 in this branch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment. The code does initialCount: 1. Also what is "this branch"? Note there are a few places where this comment exists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the EnsureSemaphoreReleased and ResetPreviousRefreshFailedFlag as they were unnecessary artifacts when I use the AI agent to do the first pass at creating unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants