Skip to content

Conversation

@skypher
Copy link

@skypher skypher commented Dec 8, 2025

Summary

When a function build process crashes or is killed (e.g., by SIGINT/Ctrl+C, SIGTERM, SIGHUP), the .build-lock directory can be left behind. Previously, this caused subsequent builds to hang for an extended period (with 20 retries using exponential backoff) before failing, making development frustrating. This was particularly confusing because the build would appear to hang indefinitely with no clear error message.

This PR implements a two-pronged approach to handle stale locks:

1. Signal Handlers for Clean Shutdown

Registers signal handlers (SIGINT, SIGTERM, SIGHUP) after acquiring the lock to ensure the lock is released even when the process is interrupted:

  • Handlers release the lock before re-emitting the signal for normal termination
  • Handlers are cleaned up after normal build completion to avoid memory leaks

2. Proactive Stale Lock Detection on Startup

Uses lockfile.check() to verify if a lock is actively held before attempting acquisition:

  • Removes orphaned lock directories from crashed builds automatically
  • Reduces retries from 20 to 3 with shorter timeouts (100ms-1000ms) for faster failure when lock is truly contested
  • Sets explicit 10-second stale threshold for lock files
  • Improves error message to show the actual lockfile path so users can manually clean up if needed

The fix handles three startup scenarios:

  1. Stale lock exists (no active process) - auto-removes and proceeds normally
  2. Active lock held by another process - waits and retries as before
  3. Corrupted lock (check fails with ENOENT or similar) - attempts cleanup and proceeds

Test plan

  • Added 5 new unit tests covering all stale lock scenarios and signal handler registration
  • All 15 extension build tests pass
  • Manually verified fix by reproducing the original issue (crashed build leaving orphan lock) and confirming auto-cleanup works

When a function build process crashes or is killed, the `.build-lock`
directory can be left behind. Previously, this caused subsequent builds
to hang for an extended period (with 20 retries using exponential backoff)
before failing, making development frustrating.

This change:
- Adds proactive stale lock detection before attempting to acquire a lock
- Uses `lockfile.check()` to verify if a lock is actively held
- Removes orphaned lock directories from crashed builds automatically
- Reduces retries from 20 to 3 with shorter timeouts (faster failure)
- Sets explicit 10-second stale threshold for lock files
- Improves error message to show the actual lockfile path

The fix handles three scenarios:
1. Stale lock exists (no active process) - auto-removes and proceeds
2. Active lock held by another process - waits normally
3. Corrupted lock (check fails) - attempts cleanup and proceeds
@skypher skypher requested a review from a team as a code owner December 8, 2025 03:54
This prevents stale lockfiles from being left behind when the build
process is interrupted by a signal (Ctrl+C, kill, terminal close).

The signal handlers:
- Register after the lock is acquired
- Release the lock before re-emitting the signal to allow normal termination
- Are cleaned up after normal build completion to avoid memory leaks

Also adds a test to verify signal handlers are properly registered
and removed during the build lifecycle.
The @shopify/cli-kit rmdir function uses the 'del' library which handles
recursion automatically. The RmDirOptions interface only supports 'force',
not 'recursive'.
@skypher
Copy link
Author

skypher commented Dec 8, 2025

CLA signed.

@skypher skypher requested a review from a team as a code owner December 8, 2025 04:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant