DoorDash Interview — Project Write-up
3. Technical Considerations
Architecture
Three input modalities fused: (1) document image, (2) OCR-extracted text tokens, (3) bounding box coordinates of each word
No existing open-source model supported all three, so a custom multimodal Transformer was built from scratch.
Pretraining: masked prediction across all three modalities simultaneously
Fine-tuning: supervised extraction heads for 6 fields
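A minimal sketch of how the three modalities could be fused at the input layer, assuming a LayoutLM-style scheme of summing per-token embeddings; all dimensions, the 0–1000 coordinate normalisation, and the per-token image features are illustrative, not the actual architecture:

```python
import torch
import torch.nn as nn

class MultimodalInputEmbedding(nn.Module):
    """Fuse the three input modalities into one per-token embedding:
    OCR text tokens, their bounding boxes, and per-token features
    cropped from the document image. Sizes are placeholders."""

    def __init__(self, vocab_size=30522, d_model=256, image_feat_dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # One embedding table per axis; coordinates normalised to 0..1000.
        self.x_emb = nn.Embedding(1001, d_model)
        self.y_emb = nn.Embedding(1001, d_model)
        # Project image-region features into the model dimension.
        self.image_proj = nn.Linear(image_feat_dim, d_model)

    def forward(self, token_ids, bboxes, image_feats):
        # token_ids: (B, T); bboxes: (B, T, 4) as (x0, y0, x1, y1);
        # image_feats: (B, T, image_feat_dim)
        text = self.token_emb(token_ids)
        layout = (self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
                  + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3]))
        vision = self.image_proj(image_feats)
        # Summed embedding feeds a standard Transformer encoder, which is
        # pretrained with masked prediction over all three modalities.
        return text + layout + vision  # (B, T, d_model)
```

During pretraining, a random subset of tokens would have their text, bbox, or image channel masked and the encoder trained to reconstruct them; the fine-tuned extraction heads for the 6 fields sit on top of the same encoder output.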
Scale challenge — training on ~10 million (1 crore) documents
- Pulling all data in bulk was infeasible (terabytes, storage limits)
- Built a pull-based streaming pipeline: documents fetched from Azure Blob Storage on demand, one batch at a time
- Local disk footprint during training: <100MB at any moment regardless of dataset size
- Enabled 10x–100x scale-up without infrastructure changes
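The pull-based pipeline can be sketched as a generator that fetches one batch at a time, so only the current batch ever touches local storage; `fetch_blob` here is a stand-in for the actual Azure Blob Storage download call, and the document IDs and batch size are illustrative:

```python
from typing import Callable, Iterator, List

def stream_batches(doc_ids: List[str],
                   fetch_blob: Callable[[str], bytes],
                   batch_size: int = 32) -> Iterator[List[bytes]]:
    """Pull documents on demand, one batch at a time. Local footprint
    is bounded by batch_size, independent of total dataset size."""
    batch: List[bytes] = []
    for doc_id in doc_ids:
        batch.append(fetch_blob(doc_id))   # network fetch, only when needed
        if len(batch) == batch_size:
            yield batch                    # training step consumes this batch
            batch = []                     # then it is released, keeping disk/RAM flat
    if batch:
        yield batch                        # final partial batch
```

Because the iterator only ever sees document IDs up front, growing the corpus 10x–100x changes nothing but the length of `doc_ids`.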
Deployment stack
- Seldon DAG for pipeline orchestration (OCR node → extraction model node → output routing)
- Triton Inference Server for model serving (versioned, multi-model, supports V2/V3/V4 rollouts)
- Kubernetes autoscaling for GPU provisioning
SLA constraint
- 20,000 invoices arrive as a daily burst (early morning). Must process within 2 hours.
- Capacity math: 4 GPUs clear ~6,000 docs in 2 hours → ~14 GPUs needed for the 20,000-doc burst → 4 large (4-GPU) machines provisioned in the off-peak early-morning window
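The capacity calculation above, written out (the 4-GPUs-per-machine figure is an assumption consistent with "4 large GPU machines"):

```python
import math

# Measured baseline: 4 GPUs clear ~6,000 documents within the 2-hour window.
docs_per_gpu_per_window = 6000 / 4   # 1,500 docs per GPU per 2 h
daily_burst = 20000                  # invoices arriving in the morning burst

gpus_needed = math.ceil(daily_burst / docs_per_gpu_per_window)      # 14
machines_needed = math.ceil(gpus_needed / 4)                        # 4 four-GPU machines
```

Rounding up at both steps leaves a small headroom margin (16 provisioned GPUs vs. 14 required) against OCR or network stragglers inside the SLA window.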
Invoice Processing Pipeline
```mermaid
sequenceDiagram
    autonumber
    participant AP as Accounts Payable
    participant OCR as OCR Node
    participant Model as Extraction Model<br/>(Triton)
    participant DB as Client-Declared DB
    participant HQ as Manual Review Queue
    AP->>OCR: Invoice arrives (PDF/image)
    OCR->>Model: Layout tokens + text + bounding boxes
    Model->>DB: Compare 6 extracted fields vs declared values
    alt All 6 match
        DB-->>AP: Auto-cleared ✓
    else Mismatch or high-risk vendor
        DB->>HQ: Route to manual reviewer
        HQ-->>AP: Manual validation
    end
```
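The routing decision in the diagram reduces to a small pure function. A sketch, with hypothetical field names standing in for the actual 6 extracted fields:

```python
# Placeholder names for the 6 extracted fields; the real schema differs.
FIELDS = ["invoice_number", "invoice_date", "vendor", "subtotal", "tax", "total"]

def route_invoice(extracted: dict, declared: dict,
                  vendor: str, always_manual_vendors: set) -> str:
    """Auto-clear only when every extracted field matches the
    client-declared value AND the vendor is not on the ~20-vendor
    always-manual list; everything else goes to manual review."""
    if vendor in always_manual_vendors:
        return "manual_review"          # bad-vendor flag overrides matching
    if all(extracted.get(f) == declared.get(f) for f in FIELDS):
        return "auto_cleared"
    return "manual_review"              # any field mismatch → human in the loop
```

Because client-declared values serve as ground truth, this same function defines both the production behaviour and the evaluation metric: auto-clearance rate is the business outcome being measured.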
Key trade-offs
| Dimension | Decision | Rationale |
|---|---|---|
| Availability/throughput vs. latency | Single-pass inference with manual fallback for low-confidence/high-risk cases | Preserved 2-hour burst SLA for overall pipeline without adding multi-pass inference latency |
| Generalization vs. labelling cost | Self-supervised (masked) pretraining on unlabelled documents | Zero per-vendor labelling required |
| Synthetic vs. real data | Real unlabelled corpus | 2-month synthetic experiment failed; real data statistics can't be replicated |
| Infra simplicity vs. scale | Seldon + K8s over Flask | 20k bursty daily load required autoscaled GPU provisioning |
| Coverage vs. risk | Bad-vendor always-manual flag | ~20 low-volume, high-noise vendors explicitly routed to manual; flagging used to push vendors to improve |
| Metric definition | Client-declared values as ground truth | Eliminated annotation cost; auto-clearance decision is the business outcome |