Describe the bug
Currently, Marian models are limited to single-sequence inference. Enabling batching is critical for high-performance translation scenarios (e.g., Edge Translate) and would offer significant throughput improvements.
Other Models (Llama, Phi, GPT, etc.)
- Status: Fully Supported.
- Details:
  - Models support both continuous and static batching.
  - `State` and `Model` implementations handle `[batch_size, seq_len]` inputs natively.
  - KV caching is managed per-sequence.
  - Search strategies (Greedy, Beam) operate efficiently on the full batch (a usage sketch follows below).
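For comparison, batched decoding with an already-supported model looks roughly like the sketch below. The model path is a placeholder, and the exact call names (`encode_batch`, `append_tokens`, `get_sequence`) vary between onnxruntime-genai releases, so treat this as an assumption-laden sketch rather than the definitive API.

```python
# Rough sketch of batched greedy decoding for an already-supported model.
# Assumptions: placeholder model path; API names follow recent
# onnxruntime-genai releases and may differ in yours.
import onnxruntime_genai as og

model = og.Model("path/to/supported-model")   # e.g. a Llama/Phi export (placeholder)
tokenizer = og.Tokenizer(model)

prompts = ["First input sentence.", "Second, longer input sentence to translate."]

params = og.GeneratorParams(model)
params.set_search_options(max_length=64)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode_batch(prompts))  # [batch_size, seq_len] input

# One decode loop drives the whole batch; KV cache state is tracked per sequence.
while not generator.is_done():
    generator.generate_next_token()

for i in range(len(prompts)):
    print(tokenizer.decode(generator.get_sequence(i)))
```

The ask in this issue is for the Marian encoder/decoder path to accept the same `[batch_size, seq_len]`-shaped inputs.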
Marian Models (Current Main Branch)
- Status: Single Sequence Only (`batch_size = 1`).
- Limitations:
  - `MarianState::Run` assumes single-sequence inputs.
  - Encoder/Decoder logic does not account for padding or batch dimensions.
  - Attempting to run with `batch_size > 1` results in shape mismatches or incorrect processing of padding tokens (see the padding sketch below this list).
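To make the padding gap concrete: a batched Marian path has to right-pad shorter sequences to a common length and carry an attention mask so padded positions are ignored. The helper below is a hypothetical illustration of that preprocessing, not code from this repository.

```python
# Hypothetical helper (not from this repo): pad variable-length token id lists
# into a [batch_size, max_len] batch plus an attention mask for the encoder.
import numpy as np

def pad_batch(token_ids_batch, pad_id):
    max_len = max(len(ids) for ids in token_ids_batch)
    input_ids = np.full((len(token_ids_batch), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_ids_batch), max_len), dtype=np.int64)
    for i, ids in enumerate(token_ids_batch):
        input_ids[i, :len(ids)] = ids       # real tokens on the left
        attention_mask[i, :len(ids)] = 1    # 1 = attend, 0 = padding
    return input_ids, attention_mask

# Two sentences of different lengths become one [2, 5] batch.
ids, mask = pad_batch([[12, 87, 5], [12, 44, 91, 7, 5]], pad_id=0)
```

Without a mask like this, single-sequence assumptions either break on shape or let attention read padding tokens, matching the behaviour described above.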
Desktop (please complete the following information):
- OS: Win11
- Browser: Edge
- Version: 144
Additional context
Testing batching performance with a Python script showed a significant performance improvement for batch processing versus sequential translation (a sketch of the benchmark harness follows the results tables).
Windows Results
| Metric | Batch | Sequential | Difference |
|---|---|---|---|
| Total Time | 1.09s | 4.76s | 4.38x slower |
| Avg per Text | 0.011s | 0.049s | - |
| Time Wasted | - | 3.68s | 77.2% overhead |
Linux (WSL) Results
| Metric | Batch | Sequential | Difference |
|---|---|---|---|
| Total Time | 1.06s | 5.39s | 5.08x slower |
| Avg per Text | 0.011s | 0.056s | - |
| Time Wasted | - | 4.33s | 80.3% overhead |
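For reproducibility, the benchmark logic behind the tables can be sketched as follows; `translate_batch` and `translate_one` are hypothetical stand-ins for the actual translation calls, and only the timing and derived metrics ("x slower", "Time Wasted", overhead %) are shown.

```python
# Sketch of the benchmark shape behind the tables above. translate_batch and
# translate_one are hypothetical stand-ins for the real translation calls.
import time

def benchmark(texts, translate_batch, translate_one):
    t0 = time.perf_counter()
    translate_batch(texts)                      # one batched call
    batch_total = time.perf_counter() - t0

    t0 = time.perf_counter()
    for text in texts:                          # one call per text
        translate_one(text)
    seq_total = time.perf_counter() - t0

    n = len(texts)
    return {
        "batch_total_s": batch_total,
        "sequential_total_s": seq_total,
        "avg_per_text_batch_s": batch_total / n,
        "avg_per_text_sequential_s": seq_total / n,
        "slowdown_x": seq_total / batch_total,            # "x slower" column
        "time_wasted_s": seq_total - batch_total,         # "Time Wasted" row
        "overhead_pct": 100 * (seq_total - batch_total) / seq_total,
    }
```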