Member

embano1 commented Dec 6, 2025

Issue #, if available: n/a

Description of changes:
Provide a language SDK specification that developers can use to build their own SDKs and that establishes a basis for conformance testing. This is a first draft to iterate on and give builders guidance, given the strong interest in additional SDKs (Go, Rust, Java, Swift, .NET). The file should later be extracted into its own repository, where conformance tests for officially supported SDKs can be created.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Invocation 1:
- Load state: []
- Start STEP(id="step1")
- Checkpoint: START step1
Member Author

I guess we need more guidance here around:

  • AT-LEAST-ONCE / AT-MOST-ONCE semantics
  • batching/optimizations

Member

Yeah - that would be a good idea.

- Be checkpointed and resumed
- Maintain execution state across interruptions

The two core durable operation primitives are:
Member

There are 5 primitives (if you ignore the EXECUTION operation which is only used to complete the execution):

  • CALLBACK
  • CHAINED_INVOKE
  • CONTEXT
  • STEP
  • WAIT
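
A minimal sketch of how an SDK might model the operation types listed above. The names `OperationType` and `primitives` are hypothetical and chosen only for illustration; the member values are the type names from the comment.

```python
from enum import Enum


class OperationType(Enum):
    """Operation types named in the review comment (enum name is hypothetical)."""

    CALLBACK = "CALLBACK"
    CHAINED_INVOKE = "CHAINED_INVOKE"
    CONTEXT = "CONTEXT"
    STEP = "STEP"
    WAIT = "WAIT"
    # Only used to complete the execution, so it is not counted as a primitive.
    EXECUTION = "EXECUTION"


def primitives() -> list:
    """Return the operation types a user can invoke, excluding EXECUTION."""
    return [op for op in OperationType if op is not OperationType.EXECUTION]
```

Calling `primitives()` yields the five user-facing types in declaration order.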


For correct replay behavior, **user code MUST be deterministic**:

1. Non-durable code (code outside operations) MUST execute identically on each replay
Member

One thing we should note is that this may require re-implementing/providing alternatives for certain language constructs that are inherently nondeterministic.

For example, in Java the iteration order of a HashMap is not guaranteed to be the same across multiple creations of the same map (you need a LinkedHashMap for that), and in Go map iteration order is purposefully randomized.
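
The same concern has a Python analog, sketched here as an illustration: a `dict` preserves insertion order, but the iteration order of a `set` of strings can differ between processes because of hash randomization. The helper names below are hypothetical; the idea is simply to impose an explicit order before iterating in non-durable code.

```python
def deterministic_members(values):
    """Return the members of a set (or any iterable) in sorted order.

    Iterating a set of strings directly may yield a different order in
    each process, which would break replay if the order affects which
    durable operations run next.
    """
    return sorted(values)


def deterministic_items(mapping):
    """Return (key, value) pairs ordered by key instead of construction order."""
    return sorted(mapping.items(), key=lambda kv: kv[0])
```

For example, `deterministic_members({"step-b", "step-a"})` always returns `["step-a", "step-b"]`, regardless of how the set happens to be laid out in memory.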

"CheckpointToken": "string",
"InitialExecutionState": {
"Operations": [
/* Operation objects */
Member

Should we link to the Lambda API docs sections where appropriate in this doc? E.g. https://docs.aws.amazon.com/lambda/latest/api/API_Operation.html


The SDK CANNOT:

- Prevent users from writing non-deterministic code
Member

I mean if it is somehow able to, it should 😄 - the spec shouldn't prevent it from doing so

Maybe this should say "The SDK is not responsible for:"

"Error": {
"ErrorType": "string",
"ErrorMessage": "string",
"StackTrace": ["string"] // OPTIONAL
Member

All the fields are actually optional.

(There's also a fourth field, ErrorData, for additional machine-readable error data.)
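
A sketch of how an SDK could represent that error object with every field optional. The class name `ErrorObject`, the method `to_wire`, and the assumption that `ErrorData` is a string are mine, not taken from the spec.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorObject:
    """Error payload in which every field is optional (hypothetical class)."""

    error_type: Optional[str] = None
    error_message: Optional[str] = None
    stack_trace: Optional[List[str]] = None
    error_data: Optional[str] = None  # assumption: ErrorData is a string

    def to_wire(self) -> dict:
        """Serialize to the wire field names, omitting unset fields."""
        fields = {
            "ErrorType": self.error_type,
            "ErrorMessage": self.error_message,
            "StackTrace": self.stack_trace,
            "ErrorData": self.error_data,
        }
        return {key: value for key, value in fields.items() if value is not None}
```

With this shape, an error that only carries a type serializes to a single-key object, and an empty error serializes to `{}`.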


- Maximum execution duration: 1 year
- Maximum response payload: 6MB
- Maximum history size: Limited by service quotas
Member

Maximum number of durable operations (including retries)? The limit is not directly on history.

Member Author

But we also have a history limit (100 MB); I added both.

- Load state: [step1: SUCCEEDED, step2: STARTED]
- Replay STEP(id="step1") - return cached "result1"
- Resume STEP(id="step2")
- Checkpoint: START step2 (same ID, continues)
Member

You wouldn't checkpoint START again, since the step is already started. Depending on the semantics, the SDK can either run the step again and then checkpoint success/failure/retry, or decide to immediately checkpoint failure or retry, etc.
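
A rough sketch of that decision, under stated assumptions: `state` is the loaded operation state as a dict of `step_id -> (status, result)`, `log` collects the checkpoints written in this invocation, and the names `Semantics`, `run_step`, and the `"SUCCEED"`/`"FAIL"` action labels are hypothetical. Error handling for a failing `fn` is left out to keep it short.

```python
from enum import Enum


class Semantics(Enum):
    AT_LEAST_ONCE = "AT_LEAST_ONCE"
    AT_MOST_ONCE = "AT_MOST_ONCE"


def run_step(state, log, step_id, fn, semantics):
    """Run or replay one step without checkpointing START twice."""
    status, result = state.get(step_id, (None, None))
    if status == "SUCCEEDED":
        return result  # replay: return the cached result, write nothing
    if status is None:
        log.append(("START", step_id))  # first attempt only
    elif status == "STARTED" and semantics is Semantics.AT_MOST_ONCE:
        # The previous attempt may have run; do not run it a second time.
        log.append(("FAIL", step_id))
        raise RuntimeError(f"step {step_id} was interrupted")
    # New step, or STARTED under at-least-once: (re-)run the user code.
    result = fn()
    log.append(("SUCCEED", step_id))
    state[step_id] = ("SUCCEEDED", result)
    return result
```

With `state = {"step1": ("SUCCEEDED", "result1"), "step2": ("STARTED", None)}` and at-least-once semantics, replaying `step1` writes nothing and resuming `step2` writes only a `SUCCEED` record.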


```
[callback_promise, callback_id] = await context.create_callback("approval")
await send_approval_email(callback_id)
```
Member

probably want to put this in a context.step
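
To illustrate the suggestion, here is a self-contained sketch with an in-memory stand-in for the durable context. `FakeContext`, its `step(name, fn)` signature, and the `cb-` callback ID format are all hypothetical; only `create_callback` and `send_approval_email` appear in the snippet under review. The point is that the email side effect runs once and is skipped on replay.

```python
import asyncio


class FakeContext:
    """Hypothetical in-memory stand-in for a durable context."""

    def __init__(self):
        self.completed = {}  # step name -> cached result

    async def create_callback(self, name):
        callback_id = f"cb-{name}"  # made-up ID format
        promise = asyncio.get_running_loop().create_future()
        return promise, callback_id

    async def step(self, name, fn):
        if name in self.completed:
            return self.completed[name]  # replay: skip the side effect
        result = await fn()
        self.completed[name] = result
        return result


sent = []  # records emails "sent" by the fake sender below


async def send_approval_email(callback_id):
    sent.append(callback_id)


async def handler(context):
    callback_promise, callback_id = await context.create_callback("approval")
    # Wrapping the side effect in a step keeps replays from re-sending it.
    await context.step("send-approval-email", lambda: send_approval_email(callback_id))
    return callback_id
```

Running `handler` twice against the same `FakeContext` simulates an invocation followed by a replay; the email is recorded only once.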

│ START action
┌─────────┐
│ STARTED │◄──────┐
Member

the arrow here should be coming from READY

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>
Member Author

embano1 commented Dec 9, 2025

@jriecken thx for the detailed feedback. Incorporated (diff for commit: ea5d479)

@embano1 embano1 changed the title from "Add language SDK specification" to "docs: add language sdk specification" Dec 9, 2025
@embano1 embano1 closed this Dec 9, 2025
@embano1 embano1 reopened this Dec 9, 2025
@embano1 embano1 marked this pull request as ready for review December 9, 2025 12:49
Member Author

embano1 commented Dec 9, 2025

@jriecken shall I add a section on testing (in-memory local executor)?

Member Author

embano1 commented Dec 10, 2025

I just connected with @maschnetwork and noticed that the spec currently has no guidance on how to handle concurrent durable operations when waits/suspension are involved (simple waits, durable invokes, callbacks, including timeouts). For example, suppose you use context.parallel() with a step taking 5 seconds and a wait (1s), or two concurrent waits (5s and 5s), and you expect not to wait 10s in total. How should an SDK implement those suspension decisions?
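
To make the question concrete, here is one possible policy, offered purely as an assumption to seed discussion and not something the spec or any existing SDK is confirmed to do: keep the invocation alive while any branch is executing code in-process, and otherwise suspend until the earliest timer. `Branch` and `decide` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Branch:
    """One concurrent branch of a parallel block (hypothetical model)."""

    name: str
    running_in_process: bool         # e.g. a step whose code is executing now
    wake_at: Optional[float] = None  # seconds until a wait/timeout fires, if any


def decide(branches: List[Branch]) -> Tuple[str, Optional[float]]:
    """Return ("continue", None) or ("suspend", earliest wake time)."""
    if any(branch.running_in_process for branch in branches):
        # Timers keep elapsing while the step runs, so nothing is gained by suspending.
        return ("continue", None)
    wake_times = [b.wake_at for b in branches if b.wake_at is not None]
    # Concurrent waits overlap: wake at the earliest one, not after their sum.
    return ("suspend", min(wake_times) if wake_times else None)
```

Under this policy, a 5s step next to a 1s wait does not suspend, and two concurrent 5s waits suspend once with a wake time of 5s rather than 10s.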

cc/ @ParidelPooya
