Add batching support for Marian language models #1897

@mustjab

Description

Describe the bug
Currently, Marian models are limited to single-sequence inference. Enabling batching would offer a significant throughput improvement and is critical for high-performance translation scenarios (e.g., Edge Translate).

Other Models (Llama, Phi, GPT, etc.)

  • Status: Fully Supported.
  • Details:
    • Models support both continuous and static batching (a usage sketch follows this list).
    • State and Model implementations handle [batch_size, seq_len] inputs natively.
    • KV caching is managed per-sequence.
    • Search strategies (Greedy, Beam) operate efficiently on the full batch.
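
For reference, batched generation with these already-supported models can be driven from the onnxruntime-genai Python API roughly as follows. This is a minimal sketch, not taken from the repo's examples: the model path and prompts are placeholders, and the exact calls vary between releases (older versions pass the token batch via GeneratorParams.input_ids rather than Generator.append_tokens).

```python
# Minimal sketch: batched generation via the onnxruntime-genai Python API.
# Model path and prompts are placeholders; calls may differ by release.
import onnxruntime_genai as og

model = og.Model("path/to/model")      # any model that already supports batching
tokenizer = og.Tokenizer(model)

prompts = [
    "First input sentence.",
    "A second, somewhat longer input sentence.",
]

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
# encode_batch pads the prompts into a single [batch_size, seq_len] batch.
generator.append_tokens(tokenizer.encode_batch(prompts))

while not generator.is_done():
    generator.generate_next_token()

# One generated sequence per prompt.
for i in range(len(prompts)):
    print(tokenizer.decode(generator.get_sequence(i)))
```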

Marian Models (Current Main Branch)

  • Status: Single Sequence Only (batch_size = 1).
  • Limitations:
    • MarianState::Run assumes single-sequence inputs.
    • Encoder/Decoder logic does not account for padding or batch dimensions (see the padding illustration after this list).
    • Attempting to run with batch_size > 1 results in shape mismatches or incorrect processing of padding tokens.
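
For context on the padding point above: a batch of variable-length sequences is normally right-padded into a rectangular [batch_size, seq_len] tensor together with an attention mask that marks real vs. padded positions, and the encoder/decoder has to skip the padded ones. A minimal illustration (plain NumPy; the token ids and pad id are hypothetical):

```python
# Plain-NumPy illustration of a padded batch; token ids and pad id are hypothetical.
import numpy as np

PAD_ID = 0
sequences = [[101, 57, 982, 12], [101, 7, 12]]   # two inputs of different lengths

seq_len = max(len(s) for s in sequences)
input_ids = np.full((len(sequences), seq_len), PAD_ID, dtype=np.int64)
attention_mask = np.zeros((len(sequences), seq_len), dtype=np.int64)

for row, seq in enumerate(sequences):
    input_ids[row, : len(seq)] = seq
    attention_mask[row, : len(seq)] = 1          # 1 = real token, 0 = padding

print(input_ids)        # [[101  57 982  12]
                        #  [101   7  12   0]]
print(attention_mask)   # [[1 1 1 1]
                        #  [1 1 1 0]]
```

A batched Marian path would need to consume a mask like this so that padded positions do not leak into the encoder states or the cross-attention.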

Desktop (please complete the following information):

  • OS: Win11
  • Browser: Edge
  • Version: 144

Additional context
Testing batching performance with a Python script showed a significant performance improvement for batch processing vs. sequential translation; an illustrative sketch of this kind of comparison appears below the results.

Windows Results

| Metric | Batch | Sequential | Difference |
|---|---|---|---|
| Total Time | 1.09s | 4.76s | 4.38x slower |
| Avg per Text | 0.011s | 0.049s | - |
| Time Wasted | - | 3.68s | 77.2% overhead |

Linux (WSL) Results

| Metric | Batch | Sequential | Difference |
|---|---|---|---|
| Total Time | 1.06s | 5.39s | 5.08x slower |
| Avg per Text | 0.011s | 0.056s | - |
| Time Wasted | - | 4.33s | 80.3% overhead |

ort_model_timing.py
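
The attached ort_model_timing.py is not reproduced here; a minimal, self-contained sketch of this kind of batch-vs-sequential comparison (again assuming the onnxruntime-genai Python API, with a hypothetical model path and workload) could look like:

```python
# Illustrative only - this is NOT the attached ort_model_timing.py.
# Assumes the onnxruntime-genai Python API; model path and workload are
# hypothetical, and the exact calls may differ between releases.
import time
import onnxruntime_genai as og

MODEL_PATH = "path/to/model"
TEXTS = ["Example sentence to translate."] * 100

model = og.Model(MODEL_PATH)
tokenizer = og.Tokenizer(model)

def generate(prompts, max_length=128):
    """One generation call over a (possibly batched) list of prompts."""
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode_batch(prompts))
    while not generator.is_done():
        generator.generate_next_token()
    return [tokenizer.decode(generator.get_sequence(i)) for i in range(len(prompts))]

# Sequential: one call per text.
t0 = time.perf_counter()
for text in TEXTS:
    generate([text])
sequential_s = time.perf_counter() - t0

# Batched: all texts in a single call.
t0 = time.perf_counter()
generate(TEXTS)
batch_s = time.perf_counter() - t0

print(f"Batch {batch_s:.2f}s | Sequential {sequential_s:.2f}s | "
      f"{sequential_s / batch_s:.2f}x slower sequentially")
```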
