Modify Model Builder to build paged attention models #1605
base: main
Conversation
This reverts commit fb3fd84.
The new test file, test/python/test_paged_model.py, opens with:

```diff
@@ -0,0 +1,107 @@
+from onnxruntime import InferenceSession, OrtValue, SessionOptions, get_available_providers
```
Check notice
Code scanning / CodeQL: Unused import (test)

Copilot Autofix (AI, 3 months ago):
To fix the unused import error for get_available_providers, simply remove it from the import statement on line 1 of test/python/test_paged_model.py. Alter the import so that only the used symbols (InferenceSession, OrtValue, SessionOptions) are listed. No changes to functionality are required, and no additional dependencies or definitions are needed.
```diff
@@ -1,4 +1,4 @@
-from onnxruntime import InferenceSession, OrtValue, SessionOptions, get_available_providers
+from onnxruntime import InferenceSession, OrtValue, SessionOptions
 import numpy as np
 import torch
```
Later in the file, the test defines its shape constants:

```python
total_sequence_length = 276
sequence_length = 1
num_tokens = 1
max_num_blocks = 2
```
Check notice
Code scanning / CodeQL: Unused local variable (test)

Copilot Autofix (AI, 3 months ago):
To fix the problem, remove the assignment to max_num_blocks on line 17. Its right-hand side is a simple constant (2) with no side effects or function calls, so it is safe to delete the whole line. The variable is not referenced anywhere else in the function, so the deletion does not affect functionality. The affected region is within test_paged_model() in test/python/test_paged_model.py. No imports, definitions, or method changes are needed; simply delete the line.
```diff
@@ -14,7 +14,6 @@
 total_sequence_length = 276
 sequence_length = 1
 num_tokens = 1
-max_num_blocks = 2
 num_blocks = 2
 block_size = 256
 num_heads = 32
```
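For context on these constants: in a paged KV cache, storage is allocated as num_blocks fixed-size blocks of block_size token slots each, so total capacity is their product. A quick sanity check under that (standard, but here assumed) layout, using the values from the test:

```python
# Constants taken from test/python/test_paged_model.py
num_blocks = 2
block_size = 256
total_sequence_length = 276

# In a paged KV cache, capacity is blocks times token slots per block
# (a common layout assumption, not something stated in this PR).
capacity = num_blocks * block_size
blocks_needed = -(-total_sequence_length // block_size)  # ceiling division

print(capacity)       # 512 token slots available
print(blocks_needed)  # 276 tokens occupy 2 blocks
```

So the two allocated blocks comfortably hold the 276-token sequence the test exercises.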
The warning below is raised on this section of the test:

```python
io_binding.bind_ortvalue_output(f"present.{i}.value", values[i])

# Run inference
outputs = session.run_with_iobinding(io_binding)
```
Check warning
Code scanning / CodeQL: Variable defined multiple times (test, redefined)

Copilot Autofix (AI, 3 months ago):
To fix the problem, drop the assignment to outputs on line 81 while keeping the session.run_with_iobinding(io_binding) call, since the call itself performs the inference and must still execute. Only the assignment is redundant: replace line 81 with a bare call to session.run_with_iobinding(io_binding), removing "outputs =" from that line. The subsequent use of outputs (assigned from io_binding.copy_outputs_to_cpu() on line 82) is unchanged, and no imports or additional definitions are needed.
```diff
@@ -78,7 +78,7 @@
 io_binding.bind_ortvalue_output(f"present.{i}.value", values[i])

 # Run inference
-outputs = session.run_with_iobinding(io_binding)
+session.run_with_iobinding(io_binding)
 outputs = io_binding.copy_outputs_to_cpu()

 # Check output shape
```
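The reason the flagged line is a dead store: with I/O binding, onnxruntime's run_with_iobinding writes results into the bound output buffers and returns None, and results are then fetched via copy_outputs_to_cpu(). A minimal illustration of that calling pattern using hypothetical stub classes (StubSession and StubIOBinding are stand-ins that only mimic the API shape, not the real onnxruntime classes):

```python
# Stubs mimicking the onnxruntime I/O-binding calling pattern:
# run_with_iobinding mutates bound buffers and returns None, so
# capturing its return value yields a dead store.
class StubIOBinding:
    def __init__(self):
        self._outputs = []

    def bind_ortvalue_output(self, name, value):
        self._outputs.append(value)

    def copy_outputs_to_cpu(self):
        return list(self._outputs)

class StubSession:
    def run_with_iobinding(self, io_binding):
        return None  # like onnxruntime, results stay in the binding

session, io_binding = StubSession(), StubIOBinding()
io_binding.bind_ortvalue_output("present.0.key", [1.0, 2.0])

dead = session.run_with_iobinding(io_binding)  # dead store: always None
outputs = io_binding.copy_outputs_to_cpu()     # the real results
print(dead, outputs)  # None [[1.0, 2.0]]
```

This is why the autofix keeps the call but deletes the assignment: the first binding of outputs is immediately and unconditionally overwritten.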
The same warning is raised again where the GQA reference model is run:

```python
io_binding.bind_input(f"past_key_values.{i}.value", "cuda", 0, np.float16, values_gqa[i].shape(), values_gqa[i].data_ptr())
io_binding.bind_ortvalue_output(f"present.{i}.key", keys_gqa[i])
io_binding.bind_ortvalue_output(f"present.{i}.value", values_gqa[i])
outputs = session.run_with_iobinding(io_binding)
```
Check warning
Code scanning / CodeQL: Variable defined multiple times (test, redefined)

Copilot Autofix (AI, 3 months ago):
To fix the problem, remove the unnecessary assignment to outputs on line 109 and call session.run_with_iobinding(io_binding) as a standalone statement. Any side effects of the call still occur, but the redundant variable assignment is avoided. Only test/python/test_paged_model.py needs to be edited, specifically line 109.
```diff
@@ -106,7 +106,7 @@
 io_binding.bind_input(f"past_key_values.{i}.value", "cuda", 0, np.float16, values_gqa[i].shape(), values_gqa[i].data_ptr())
 io_binding.bind_ortvalue_output(f"present.{i}.key", keys_gqa[i])
 io_binding.bind_ortvalue_output(f"present.{i}.value", values_gqa[i])
-outputs = session.run_with_iobinding(io_binding)
+session.run_with_iobinding(io_binding)
 outputs = io_binding.copy_outputs_to_cpu()

 logits_gqa = outputs[0]
```
The final notice is raised on the output-comparison code near the end of the test:

```python
assert np.allclose(final_norm_output_page, final_norm_output_gqa[0], atol=1e-3), "Final norm output from paged model and gqa model do not match."

# Compare first present key between paged model and gqa model
present_key_gqa = outputs[1]
```
Check notice
Code scanning / CodeQL: Unused local variable (test)

Copilot Autofix (AI, 3 months ago):
To fix the issue, remove the assignment to present_key_gqa on line 117, as the variable is never used. In general, take care not to delete a right-hand side that has side effects, but here outputs[1] is a plain indexing operation and has none. Removing the line does not affect the behavior or output of the test, and no additional imports, methods, or definitions are required.
```diff
@@ -114,7 +114,6 @@
 print("GQA model passed successfully.")

 # Compare first present key between paged model and gqa model
-present_key_gqa = outputs[1]
 # print(f"Present key paged shape: {present_key_paged.shape}, Present key gqa shape: {present_key_gqa.shape}")
 # print(f"Difference between paged and gqa present key: {present_key_paged[1, total_sequence_length-block_size-1, 4, :50] - present_key_gqa[0, 4, total_sequence_length-1, :50]}")
```
This PR should allow us to replace a typical attention operator such as GQA with PagedAttention, for use with the GenAI Engine API.
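The key difference from contiguous GQA caching is addressing: paged attention stores the KV cache in fixed-size blocks and locates each token through a block table rather than a contiguous tensor. A minimal sketch of that logical-to-physical mapping, reusing block_size = 256 from the test above (locate_token and the example block table are illustrative, not part of this PR):

```python
# Sketch of paged KV-cache addressing: a block table maps each logical
# block of a sequence to a physical cache block, so the cache for one
# sequence need not be contiguous in memory.
BLOCK_SIZE = 256  # matches block_size in test_paged_model.py

def locate_token(block_table, token_pos, block_size=BLOCK_SIZE):
    """Map a logical token position to (physical_block, offset)."""
    logical_block = token_pos // block_size
    offset = token_pos % block_size
    return block_table[logical_block], offset

# A 276-token sequence (total_sequence_length in the test) spans two
# logical blocks; suppose they were allocated out of order physically.
block_table = [3, 0]  # logical block 0 -> physical 3, logical 1 -> physical 0

print(locate_token(block_table, 10))   # (3, 10): token 10 in physical block 3
print(locate_token(block_table, 275))  # (0, 19): last token in physical block 0
```

An attention kernel consuming this layout gathers keys and values block by block via the table, which is what lets the engine grow or reuse cache blocks without reallocating a contiguous buffer per sequence.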