@qjia7 qjia7 commented Sep 3, 2025

This pull request adds graph capture support for the WebGPU EP. It adds a GPU allocator and updates attention_mask/position_ids on the GPU so that graph capture can be applied during the generation phase.

This yields a >20% improvement in generation speed on an NVIDIA RTX 5080.

Steps to test this PR

  1. Build onnxruntime with PR [Don't review][webgpu] Make graph capture work on LLM onnxruntime#25868 and install it to a folder, for example `C:\xxx\ort_home`.
  2. Build onnxruntime-genai against step 1's ORT: `build --config RelWithDebInfo --parallel --build_dir build\WGPU --ort_home C:\xxx\ort_home`
  3. Prepare a model that supports graph capture. Use `builder.py` with PR 'Add enable_webgpu_graph in extra_options #1788', then run: `python builder.py -p int4 -i D:\jiajia\models\Phi-4-mini-instruct -o D:\jiajia\models\test-phi4 -e webgpu --extra_options int4_accuracy_level=4 int4_algo_config=k_quant_last int4_tied_embeddings=true enable_webgpu_graph=true`
  4. Run the model generated in step 3 with step 1's onnxruntime and step 2's onnxruntime-genai, with the provider option `enableGraphCapture` = `1`.

This pull request is intended to facilitate discussion and provide a comprehensive overview of the overall changes. The final solution, especially how to integrate Dawn or share the same Dawn libraries with ORT (internally or externally), still needs to be discussed.

qjia7 added 14 commits September 1, 2025 13:02
pass device and instance to webgpu provider

initialize dawn

CopyDeviceToCpu/CopyCpuToDevice support

fix

fix errors

fix nullptr error

add dxc libraries

use device memory for inputs only if graph capture is enabled

Delete useless files
It seems that for Cast, the buffer is dynamically created and destroyed, so the bind group can't be reused. Can this be improved?
qjia7 commented Oct 15, 2025

This PR supports graph capture for the WebGPU EP. The draft shows over 20% faster generation on an RTX 5080 with graph capture enabled. However, the biggest challenge is how to process tensor data on the GPU in the onnxruntime-genai repo, since all inputs/outputs reside on the GPU when graph capture is used.
We have two primary objectives:

  1. Implement the CopyDeviceToCpu, CopyCpuToDevice, and CopyFrom functions.
  2. Support preprocessing and postprocessing tasks on the GPU, such as UpdateAttentionMask, UpdatePositionIds, and Cast.
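As a CPU reference for objective 2, here is a minimal numpy sketch of what the per-decode-step input updates compute. The function names, shapes, and mask semantics are assumptions for illustration, not the actual genai implementation; a GPU version would perform the same updates in a compute shader so the buffers never leave the device.

```python
import numpy as np

def update_attention_mask(mask: np.ndarray, current_length: int) -> np.ndarray:
    """Mark the newly generated token as attendable for the next decode step
    (CPU reference for the hypothetical GPU shader)."""
    mask = mask.copy()
    mask[:, current_length] = 1
    return mask

def update_position_ids(position_ids: np.ndarray) -> np.ndarray:
    """Each generation step processes one token, so the position id simply
    increments (a GPU version would do this in-place on the device buffer)."""
    return position_ids + 1

# Example: 3 prompt tokens already processed, max length 8.
mask = np.zeros((1, 8), dtype=np.int32)
mask[:, :3] = 1
mask = update_attention_mask(mask, 3)          # token at index 3 now visible
pos = np.array([[2]], dtype=np.int64)
pos = update_position_ids(pos)                 # next position is 3
```

Because these updates depend only on data already on the GPU plus a scalar step counter, they are good candidates for staying inside the captured graph's command stream.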

Solution one (used in this PR): build Dawn in genai and pass it into ort-webgpu, or use ort-webgpu's default Dawn.

This approach allows genai to process ort-webgpu's buffers since the proc table/device/instance are shared. We use Dawn APIs and compute shaders to achieve the two primary objectives, similar to how the CUDA and DML EPs leverage their corresponding GPU APIs.

However, it requires that onnxruntime-genai and onnxruntime use the same Dawn version to keep the proc table compatible. Even if we use ort-webgpu's Dawn, we must still include Dawn's headers and the dawn_proc/dawn_common libraries. Ensuring compatibility across two different repositories remains a challenge.

Solution two: use small models to accomplish the two primary objectives. For instance, a model containing only a Cast operator could provide the Cast functionality in genai. This method allows all EPs to share a common approach, eliminating EP-specific differences in genai. However, it is less flexible and potentially less efficient, and for functionality that cannot be built from basic operators we would need to introduce specific contrib ops.

Solution three: expose C APIs in onnxruntime for these functionalities and call them directly from onnxruntime-genai. This approach also lets all EPs share a common method; however, it requires onnxruntime to expose additional interfaces, and the onnxruntime-genai version must be compatible with the onnxruntime version since it depends on the required APIs being available. A potential issue arises if more functionality, such as SelectTop in BeamSearch_Cuda, is needed in the future: would we keep exposing it in onnxruntime? This method therefore lacks flexibility and is challenging to maintain.

Besides the above solutions, exposing DataTransfer in ORT would be a bonus, since ort-webgpu already supports download/upload/copy, which are similar to genai's CopyDeviceToCpu/CopyCpuToDevice/CopyFrom.

PS: This PR is meant to facilitate discussion and collect perf data; it is not up for official review.

@guschmue @fs-eire Please help add the relevant people here and guide me on how to move this work forward. Thanks.
cc @sushraja-msft

@qjia7 qjia7 changed the title [Don't review] Add graph capture for webgpu [Need discussion] Add graph capture for webgpu Oct 15, 2025
qjia7 added a commit that referenced this pull request Nov 25, 2025
This PR enables graph capture for webgpu. It implements the
CopyDeviceToCpu/CopyCpuToDevice/CopyFrom/Zero functions using the new
`CopyTensors` API.

The ort part needs to apply this PR
[#26450](microsoft/onnxruntime#26450) to make it
work for webgpu.

The following will be implemented in follow-up PRs to get the full
performance gain for graph capture (the original PR is
#1720):
1. Support UpdateAttentionMask, UpdatePositionIds, and Cast to keep the
whole pipeline on the GPU.
2. Optimize CopyFrom with offsets.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025