Modify Model Builder to build paged attention models #1605
base: main
Conversation
This reverts commit fb3fd84.
The new test file, test/python/test_paged_model.py, opens with:

```diff
@@ -0,0 +1,107 @@
+from onnxruntime import InferenceSession, OrtValue, SessionOptions, get_available_providers
```
Check notice
Code scanning / CodeQL: Unused import (test)

Copilot Autofix (AI, 3 months ago):
To fix the unused import error for get_available_providers, simply remove it from the import statement on line 1 of test/python/test_paged_model.py. Alter the import so that only the used symbols (InferenceSession, OrtValue, SessionOptions) are listed. No changes to functionality are required, and no additional dependencies or definitions are needed.
```diff
@@ -1,4 +1,4 @@
-from onnxruntime import InferenceSession, OrtValue, SessionOptions, get_available_providers
+from onnxruntime import InferenceSession, OrtValue, SessionOptions
 import numpy as np
 import torch
```
Later in the file, the test defines its shape constants:

```python
total_sequence_length = 276
sequence_length = 1
num_tokens = 1
max_num_blocks = 2
```
Check notice
Code scanning / CodeQL: Unused local variable (test)

Copilot Autofix (AI, 3 months ago):
To fix the problem, remove the assignment to max_num_blocks on line 17. Its right-hand side is a simple constant (2) with no side effects or function calls, so it is safe to delete the whole line. The variable is not referenced anywhere else in the function, so the deletion does not affect functionality. The affected region is within test_paged_model() in test/python/test_paged_model.py. No imports, definitions, or method changes are needed; simply delete the line.
```diff
@@ -14,7 +14,6 @@
 total_sequence_length = 276
 sequence_length = 1
 num_tokens = 1
-max_num_blocks = 2
 num_blocks = 2
 block_size = 256
 num_heads = 32
```
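For context on these constants: in a paged KV cache, storage is allocated as num_blocks fixed-size blocks of block_size token slots each, so total capacity is their product. A quick sanity check under that (standard, but here assumed) layout, using the values from the test:

```python
# Constants taken from test/python/test_paged_model.py
num_blocks = 2
block_size = 256
total_sequence_length = 276

# In a paged KV cache, capacity is blocks times token slots per block
# (a common layout assumption, not something stated in this PR).
capacity = num_blocks * block_size
blocks_needed = -(-total_sequence_length // block_size)  # ceiling division

print(capacity)       # 512 token slots available
print(blocks_needed)  # 276 tokens occupy 2 blocks
```

So the two allocated blocks comfortably hold the 276-token sequence the test exercises.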
The warning below is raised on this section of the test:

```python
io_binding.bind_ortvalue_output(f"present.{i}.value", values[i])

# Run inference
outputs = session.run_with_iobinding(io_binding)
```
Check warning
Code scanning / CodeQL: Variable defined multiple times (test, redefined)

Copilot Autofix (AI, 3 months ago):
To fix the problem, drop the assignment to outputs on line 81 while keeping the session.run_with_iobinding(io_binding) call, since the call itself performs the inference and must still execute. Only the assignment is redundant: replace line 81 with a bare call to session.run_with_iobinding(io_binding), removing "outputs =" from that line. The subsequent use of outputs (assigned from io_binding.copy_outputs_to_cpu() on line 82) is unchanged, and no imports or additional definitions are needed.
```diff
@@ -78,7 +78,7 @@
 io_binding.bind_ortvalue_output(f"present.{i}.value", values[i])

 # Run inference
-outputs = session.run_with_iobinding(io_binding)
+session.run_with_iobinding(io_binding)
 outputs = io_binding.copy_outputs_to_cpu()

 # Check output shape
```
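The reason the flagged line is a dead store: with I/O binding, onnxruntime's run_with_iobinding writes results into the bound output buffers and returns None, and results are then fetched via copy_outputs_to_cpu(). A minimal illustration of that calling pattern using hypothetical stub classes (StubSession and StubIOBinding are stand-ins that only mimic the API shape, not the real onnxruntime classes):

```python
# Stubs mimicking the onnxruntime I/O-binding calling pattern:
# run_with_iobinding mutates bound buffers and returns None, so
# capturing its return value yields a dead store.
class StubIOBinding:
    def __init__(self):
        self._outputs = []

    def bind_ortvalue_output(self, name, value):
        self._outputs.append(value)

    def copy_outputs_to_cpu(self):
        return list(self._outputs)

class StubSession:
    def run_with_iobinding(self, io_binding):
        return None  # like onnxruntime, results stay in the binding

session, io_binding = StubSession(), StubIOBinding()
io_binding.bind_ortvalue_output("present.0.key", [1.0, 2.0])

dead = session.run_with_iobinding(io_binding)  # dead store: always None
outputs = io_binding.copy_outputs_to_cpu()     # the real results
print(dead, outputs)  # None [[1.0, 2.0]]
```

This is why the autofix keeps the call but deletes the assignment: the first binding of outputs is immediately and unconditionally overwritten.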
The same warning is raised again where the GQA reference model is run:

```python
io_binding.bind_input(f"past_key_values.{i}.value", "cuda", 0, np.float16, values_gqa[i].shape(), values_gqa[i].data_ptr())
io_binding.bind_ortvalue_output(f"present.{i}.key", keys_gqa[i])
io_binding.bind_ortvalue_output(f"present.{i}.value", values_gqa[i])
outputs = session.run_with_iobinding(io_binding)
```
Check warning
Code scanning / CodeQL: Variable defined multiple times (test, redefined)

Copilot Autofix (AI, 3 months ago):
To fix the problem, remove the unnecessary assignment to outputs on line 109 and call session.run_with_iobinding(io_binding) as a standalone statement. Any side effects of the call still occur, but the redundant variable assignment is avoided. Only test/python/test_paged_model.py needs to be edited, specifically line 109.
```diff
@@ -106,7 +106,7 @@
 io_binding.bind_input(f"past_key_values.{i}.value", "cuda", 0, np.float16, values_gqa[i].shape(), values_gqa[i].data_ptr())
 io_binding.bind_ortvalue_output(f"present.{i}.key", keys_gqa[i])
 io_binding.bind_ortvalue_output(f"present.{i}.value", values_gqa[i])
-outputs = session.run_with_iobinding(io_binding)
+session.run_with_iobinding(io_binding)
 outputs = io_binding.copy_outputs_to_cpu()

 logits_gqa = outputs[0]
```
The final notice is raised on the output-comparison code near the end of the test:

```python
assert np.allclose(final_norm_output_page, final_norm_output_gqa[0], atol=1e-3), "Final norm output from paged model and gqa model do not match."

# Compare first present key between paged model and gqa model
present_key_gqa = outputs[1]
```
Check notice
Code scanning / CodeQL: Unused local variable (test)

Copilot Autofix (AI, 3 months ago):
To fix the issue, remove the assignment to present_key_gqa on line 117, as the variable is never used. In general, take care not to delete a right-hand side that has side effects, but here outputs[1] is a plain indexing operation and has none. Removing the line does not affect the behavior or output of the test, and no additional imports, methods, or definitions are required.
```diff
@@ -114,7 +114,6 @@
 print("GQA model passed successfully.")

 # Compare first present key between paged model and gqa model
-present_key_gqa = outputs[1]
 # print(f"Present key paged shape: {present_key_paged.shape}, Present key gqa shape: {present_key_gqa.shape}")
 # print(f"Difference between paged and gqa present key: {present_key_paged[1, total_sequence_length-block_size-1, 4, :50] - present_key_gqa[0, 4, total_sequence_length-1, :50]}")
```
This PR should allow us to replace a typical attention operator such as GQA with PagedAttention, for use with the GenAI Engine API.
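The key difference from contiguous GQA caching is addressing: paged attention stores the KV cache in fixed-size blocks and locates each token through a block table rather than a contiguous tensor. A minimal sketch of that logical-to-physical mapping, reusing block_size = 256 from the test above (locate_token and the example block table are illustrative, not part of this PR):

```python
# Sketch of paged KV-cache addressing: a block table maps each logical
# block of a sequence to a physical cache block, so the cache for one
# sequence need not be contiguous in memory.
BLOCK_SIZE = 256  # matches block_size in test_paged_model.py

def locate_token(block_table, token_pos, block_size=BLOCK_SIZE):
    """Map a logical token position to (physical_block, offset)."""
    logical_block = token_pos // block_size
    offset = token_pos % block_size
    return block_table[logical_block], offset

# A 276-token sequence (total_sequence_length in the test) spans two
# logical blocks; suppose they were allocated out of order physically.
block_table = [3, 0]  # logical block 0 -> physical 3, logical 1 -> physical 0

print(locate_token(block_table, 10))   # (3, 10): token 10 in physical block 3
print(locate_token(block_table, 275))  # (0, 19): last token in physical block 0
```

An attention kernel consuming this layout gathers keys and values block by block via the table, which is what lets the engine grow or reuse cache blocks without reallocating a contiguous buffer per sequence.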