
Long context (token length >= 4090) hits a sharp perf regression with benchmark_e2e.py #1910

Description

@feich-ms

When using benchmark_e2e.py to run against the Phi-3 model on a Mac device, we encountered a sharp perf regression once the token length (prompt length + generation length) is greater than 4090, e.g., prompt length 3836, generation length 254. On a Windows device, the critical value is 4091.
Devices: Mac with an Apple M2 Pro chip, and Windows with an NVIDIA 5080 GPU
Repro steps:

  1. Download the Phi-3 model
  2. Run "python benchmark/python/benchmark_e2e.py -i model_path -l 3836 -g 254"
  3. Check the token generation tps (it was 35 on my Mac machine)
  4. Run "python benchmark/python/benchmark_e2e.py -i model_path -l 3836 -g 255"
  5. Check the token generation tps, which should be much lower than in the first run (it was 14 on my Mac machine)
  6. On the Windows device, the generation tps was 120 (token length 4091) vs 52 (token length 4092); a standalone per-token timing sketch is shown after these steps
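
For reference, the same boundary can also be reproduced with a minimal per-token timing loop instead of benchmark_e2e.py. This is only a sketch: the model path and prompt construction are placeholders, and the API calls follow recent onnxruntime-genai Python releases, so they may need adjusting for your installed version.

import time
import onnxruntime_genai as og

MODEL_PATH = "path/to/phi3"   # placeholder: local Phi-3 model directory
PROMPT_LEN = 3836
GEN_LEN = 255                 # crosses the ~4090 total-token boundary

model = og.Model(MODEL_PATH)
tokenizer = og.Tokenizer(model)

# Build a prompt of exactly PROMPT_LEN tokens by repeating a word and truncating.
tokens = list(tokenizer.encode("hello " * PROMPT_LEN))[:PROMPT_LEN]

params = og.GeneratorParams(model)
params.set_search_options(max_length=PROMPT_LEN + GEN_LEN, do_sample=False)
generator = og.Generator(model, params)
generator.append_tokens(tokens)

for i in range(GEN_LEN):
    start = time.perf_counter()
    generator.generate_next_token()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Per-token latency; a spike should show up when the total length reaches ~4090.
    print(f"token {PROMPT_LEN + i + 1}: {elapsed_ms:.2f} ms")
    if generator.is_done():
        break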

I dug deeper on the Mac device and found that token 4090 took much longer to generate: token 4089 took 0.04 ms, token 4090 took 11.34 ms, and token 4091 recovered to 0.88 ms. I also debugged into ORT and printed the parameters variable in the ApplyFlashAttention function when generating tokens 4089 to 4091, as shown below. The sequence_length_ of 4097 when generating token 4090 is confusing; the expected value is 1, as for tokens 4089 and 4091.

It looks like the generation of token 4090 goes through the prefill process again? I wonder if there is a potential bug in onnxruntime or onnxruntime-genai.

3836+253=4089:
(lldb) p parameters
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b3034b0: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 1
kv_sequence_length_ = 1
past_sequence_length_ = 4099
total_sequence_length_ = 4096
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = false
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}

3836+254=4090:
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b303360: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 4097
kv_sequence_length_ = 4097
past_sequence_length_ = 4099
total_sequence_length_ = 4097
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = true
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}

3836+255=4091:
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b3034b0: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 1
kv_sequence_length_ = 1
past_sequence_length_ = 4099
total_sequence_length_ = 4098
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = false
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}
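
For quick reference, here is a short sketch that prints the fields which change between the token-4089 decode call and the token-4090 call (values copied from the lldb dumps above):

# Values copied from the dumps above; only the differing fields are listed.
params_4089 = {"sequence_length_": 1, "kv_sequence_length_": 1,
               "total_sequence_length_": 4096, "is_first_prompt_": False}
params_4090 = {"sequence_length_": 4097, "kv_sequence_length_": 4097,
               "total_sequence_length_": 4097, "is_first_prompt_": True}

for key, old in params_4089.items():
    new = params_4090[key]
    if old != new:
        print(f"{key}: {old} -> {new}")

Note that is_first_prompt_ flips back to true for token 4090, which matches the prefill-like sequence_length_ of 4097.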

Profiling Analyzer for token 4089:
[profiler screenshot]

Profiling Analyzer for token 4090:
[profiler screenshot]
