
Long context (token length >= 4090) hits a sharp perf regression with benchmark_e2e.py #1910

Description

@feich-ms

When using benchmark_e2e.py to run against the Phi-3 model on a Mac device, we encountered a sharp perf regression once the token length (prompt length + generation length) is greater than 4090, e.g., prompt length 3836, generation length 254. On a Windows device, the critical value is 4091.
Devices: Mac with an Apple M2 Pro chip, and Windows with an NVIDIA 5080 GPU
Repro steps:

  1. Download the Phi-3 model
  2. Run "python benchmark/python/benchmark_e2e.py -i model_path -l 3836 -g 254"
  3. Check the token generation tps (it was 35 on my Mac machine)
  4. Run "python benchmark/python/benchmark_e2e.py -i model_path -l 3836 -g 255"
  5. Check the token generation tps, which should be much lower than in the first run (it was 14 on my Mac machine)
  6. On the Windows device, the generation tps was 120 (token length 4091) vs 52 (token length 4092); a standalone per-token timing sketch is shown after these steps
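
For reference, the same boundary can also be reproduced with a minimal per-token timing loop instead of benchmark_e2e.py. This is only a sketch: the model path and prompt construction are placeholders, and the API calls follow recent onnxruntime-genai Python releases, so they may need adjusting for your installed version.

import time
import onnxruntime_genai as og

MODEL_PATH = "path/to/phi3"   # placeholder: local Phi-3 model directory
PROMPT_LEN = 3836
GEN_LEN = 255                 # crosses the ~4090 total-token boundary

model = og.Model(MODEL_PATH)
tokenizer = og.Tokenizer(model)

# Build a prompt of exactly PROMPT_LEN tokens by repeating a word and truncating.
tokens = list(tokenizer.encode("hello " * PROMPT_LEN))[:PROMPT_LEN]

params = og.GeneratorParams(model)
params.set_search_options(max_length=PROMPT_LEN + GEN_LEN, do_sample=False)
generator = og.Generator(model, params)
generator.append_tokens(tokens)

for i in range(GEN_LEN):
    start = time.perf_counter()
    generator.generate_next_token()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Per-token latency; a spike should show up when the total length reaches ~4090.
    print(f"token {PROMPT_LEN + i + 1}: {elapsed_ms:.2f} ms")
    if generator.is_done():
        break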

I dug deeper on the Mac device and found that token 4090 took much longer to generate: token 4089 took 0.04 ms, token 4090 took 11.34 ms, and token 4091 recovered to 0.88 ms. I also debugged into ORT and printed the parameters variable in the ApplyFlashAttention function when generating tokens 4089 to 4091, as shown below. The sequence_length_ of 4097 when generating token 4090 is confusing; the expected value is 1, as for tokens 4089 and 4091.

It looks like the generation of token 4090 goes through the prefill process again? I wonder if there is a potential bug in onnxruntime or onnxruntime-genai.

3836+253=4089:
(lldb) p parameters
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b3034b0: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 1
kv_sequence_length_ = 1
past_sequence_length_ = 4099
total_sequence_length_ = 4096
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = false
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}

3836+254=4090:
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b303360: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 4097
kv_sequence_length_ = 4097
past_sequence_length_ = 4099
total_sequence_length_ = 4097
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = true
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}

3836+255=4091:
(const onnxruntime::contrib::webgpu::WebgpuAttentionParameters &) 0x000000016b3034b0: {
is_gqa_ = true
batch_size_ = 1
sequence_length_ = 1
kv_sequence_length_ = 1
past_sequence_length_ = 4099
total_sequence_length_ = 4098
max_sequence_length_ = 0
input_hidden_size_ = 0
hidden_size_ = 3072
head_size_ = 128
v_hidden_size_ = 1024
v_head_size_ = 128
num_heads_ = 24
rotary_embedding_ = 0
is_unidirectional_ = true
past_present_share_buffer_ = true
do_rotary_ = false
broadcast_attn_bias_dim_0_ = false
broadcast_attn_bias_dim_1_ = false
mask_filter_value_ = -10000
scale_ = 0.0883883461
use_tf32_ = false
seqlen_past_kv_cache_ = 4099
seqlen_present_kv_cache_ = 4099
kv_hidden_size_ = 1024
kv_num_heads_ = 8
num_splits_ = 0
rotary_dim_ = 0
local_window_size_ = 0
kv_share_buffer_ = false
is_packed_qkv_ = false
is_subsequent_prompt_ = false
is_first_prompt_ = false
rotary_interleaved_ = false
use_smooth_softmax_ = false
softcap_ = 0
zeros_count_ = 0
zero_ptr_ = 0x0000000000000000
n_reps = 3
mask_type_ = MASK_NONE
qkv_format_ = Q_K_V_BSNH
}
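
For quick reference, here is a short sketch that prints the fields which change between the token-4089 decode call and the token-4090 call (values copied from the lldb dumps above):

# Values copied from the dumps above; only the differing fields are listed.
params_4089 = {"sequence_length_": 1, "kv_sequence_length_": 1,
               "total_sequence_length_": 4096, "is_first_prompt_": False}
params_4090 = {"sequence_length_": 4097, "kv_sequence_length_": 4097,
               "total_sequence_length_": 4097, "is_first_prompt_": True}

for key, old in params_4089.items():
    new = params_4090[key]
    if old != new:
        print(f"{key}: {old} -> {new}")

Note that is_first_prompt_ flips back to true for token 4090, which matches the prefill-like sequence_length_ of 4097.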

Profiling Analyzer for token 4089:
[profiler screenshot]

Profiling Analyzer for token 4090:
[profiler screenshot]
