@qjia7 qjia7 commented Sep 3, 2025

This pull request adds graph capture support for the WebGPU EP. It adds a GPU allocator and updates attention_mask/position_ids on the GPU so that graph capture can be applied during the generation phase.

This yields a >20% improvement in generation speed on an NVIDIA RTX 5080.

Steps to test this PR

  1. Build onnxruntime with PR [Don't review][webgpu] Make graph capture work on LLM onnxruntime#25868 and install it to a folder, for example `C:\xxx\ort_home`.
  2. Build onnxruntime-genai against step 1's ORT: `build --config RelWithDebInfo --parallel --build_dir build\WGPU --ort_home C:\xxx\ort_home`
  3. Prepare a model that supports graph capture. Use `builder.py` with PR 'Add enable_webgpu_graph in extra_options #1788', then run: `python builder.py -p int4 -i D:\jiajia\models\Phi-4-mini-instruct -o D:\jiajia\models\test-phi4 -e webgpu --extra_options int4_accuracy_level=4 int4_algo_config=k_quant_last int4_tied_embeddings=true enable_webgpu_graph=true`
  4. Run the model generated in step 3 with step 1's onnxruntime and step 2's onnxruntime-genai, with the provider option `enableGraphCapture` = `1`.

This pull request is intended to facilitate discussion and provide a comprehensive overview of the overall changes. The final solution, especially how to integrate Dawn or share the same Dawn libraries with ORT (internally or externally), still needs to be discussed.

qjia7 added 14 commits September 1, 2025 13:02
pass device and instance to webgpu provider

initialize dawn

CopyDeviceToCpu/CopyCpuToDevice support

fix

fix errors

fix nullptr error

add dxc libraries

use device memory for inputs only if graph capture is enabled

Delete useless files
It seems that for Cast, the buffer is dynamically created and destroyed, so the bind group can't be reused. Can this be improved?
qjia7 commented Oct 15, 2025

This PR supports graph capture for the WebGPU EP. The draft shows over 20% faster generation on an RTX 5080 with graph capture enabled. However, the biggest challenge is how to process tensor data on the GPU in the onnxruntime-genai repo, since all inputs/outputs reside on the GPU when graph capture is used.
We have two primary objectives:

  1. Implement the CopyDeviceToCpu, CopyCpuToDevice, and CopyFrom functions.
  2. Support preprocessing and postprocessing tasks on the GPU, such as UpdateAttentionMask, UpdatePositionIds, and Cast.
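As a CPU reference for objective 2, here is a minimal numpy sketch of what the per-decode-step input updates compute. The function names, shapes, and mask semantics are assumptions for illustration, not the actual genai implementation; a GPU version would perform the same updates in a compute shader so the buffers never leave the device.

```python
import numpy as np

def update_attention_mask(mask: np.ndarray, current_length: int) -> np.ndarray:
    """Mark the newly generated token as attendable for the next decode step
    (CPU reference for the hypothetical GPU shader)."""
    mask = mask.copy()
    mask[:, current_length] = 1
    return mask

def update_position_ids(position_ids: np.ndarray) -> np.ndarray:
    """Each generation step processes one token, so the position id simply
    increments (a GPU version would do this in-place on the device buffer)."""
    return position_ids + 1

# Example: 3 prompt tokens already processed, max length 8.
mask = np.zeros((1, 8), dtype=np.int32)
mask[:, :3] = 1
mask = update_attention_mask(mask, 3)          # token at index 3 now visible
pos = np.array([[2]], dtype=np.int64)
pos = update_position_ids(pos)                 # next position is 3
```

Because these updates depend only on data already on the GPU plus a scalar step counter, they are good candidates for staying inside the captured graph's command stream.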

Solution one (used in this PR): build Dawn in genai and pass it into ort-webgpu, or use ort-webgpu's default Dawn.

This approach allows genai to process ort-webgpu's buffers since the proc table/device/instance are shared. We use Dawn APIs and compute shaders to achieve the two primary objectives, similar to how the CUDA and DML EPs leverage their corresponding GPU APIs.

However, it requires that onnxruntime-genai and onnxruntime use the same Dawn version to keep the proc table compatible. Even if we use ort-webgpu's Dawn, we must still include Dawn's headers and the dawn_proc/dawn_common libraries. Ensuring compatibility across two different repositories remains a challenge.

Solution two: use small models to accomplish the two primary objectives. For instance, a model containing only a Cast operator could provide the Cast functionality in genai. This method allows all EPs to share a common approach, eliminating EP-specific differences in genai. However, it is less flexible and potentially less efficient, and for functionality that cannot be built from basic operators we would need to introduce specific contrib ops.

Solution three: expose C APIs in onnxruntime for these functionalities and call them directly from onnxruntime-genai. This approach also lets all EPs share a common method; however, it requires onnxruntime to expose additional interfaces, and the onnxruntime-genai version must be compatible with the onnxruntime version since it depends on the required APIs being available. A potential issue arises if more functionality, such as SelectTop in BeamSearch_Cuda, is needed in the future: would we keep exposing it in onnxruntime? This method therefore lacks flexibility and is challenging to maintain.

Besides the above solutions, exposing DataTransfer in ORT would be a bonus, since ort-webgpu already supports download/upload/copy, which are similar to genai's CopyDeviceToCpu/CopyCpuToDevice/CopyFrom.

PS: This PR is meant to facilitate discussion and collect perf data; it is not up for official review.

@guschmue @fs-eire Please help add the relevant people here and guide me on how to move this work forward. Thanks.
cc @sushraja-msft

@qjia7 qjia7 changed the title [Don't review] Add graph capture for webgpu [Need discussion] Add graph capture for webgpu Oct 15, 2025
qjia7 added a commit that referenced this pull request Nov 25, 2025
This PR enables graph capture for webgpu. It implements the
CopyDeviceToCpu/CopyCpuToDevice/CopyFrom/Zero functions using the new
`CopyTensors` API.

The ort part needs to apply this PR
[#26450](microsoft/onnxruntime#26450) to make it
work for webgpu.

The following will be implemented in follow-up PRs to get the full
performance gain for graph capture (the original PR is
#1720):
1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the
whole pipeline on the GPU.
2. Optimize CopyFrom with offsets.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025