DoorDash Interview — Project Write-up

3. Technical Considerations

Architecture

  • Three input modalities fused: (1) the document image, (2) OCR-extracted text tokens, (3) bounding-box coordinates of each word
  • No existing open-source model supported all three, so a custom multimodal Transformer architecture was built from scratch (sketched after this list)
  • Pretraining: masked prediction across all three modalities simultaneously
  • Fine-tuning: supervised extraction heads for the 6 target fields
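A minimal sketch of the three-way fusion idea, assuming PyTorch; every dimension, table size, and tensor name here is an illustrative assumption, not the production architecture:

```python
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Fuse OCR text tokens, per-word bounding boxes, and visual patch features."""

    def __init__(self, vocab_size=30_000, d_model=256, n_coord_bins=1_000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        # One embedding table per bbox coordinate (x0, y0, x1, y1), quantized into bins
        self.bbox_emb = nn.ModuleList(nn.Embedding(n_coord_bins, d_model) for _ in range(4))
        # Project pre-extracted visual features for each word crop into the shared space
        self.img_proj = nn.Linear(768, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Pretraining head: recover masked token ids from the fused sequence
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, bboxes, patch_feats):
        # token_ids: (B, T) ints; bboxes: (B, T, 4) bin indices; patch_feats: (B, T, 768)
        h = self.text_emb(token_ids)
        for i, emb in enumerate(self.bbox_emb):
            h = h + emb(bboxes[..., i])         # layout signal, one coordinate at a time
        h = h + self.img_proj(patch_feats)      # visual signal
        return self.mlm_head(self.encoder(h))   # (B, T, vocab_size) logits
```

Summing per-modality embeddings into one token sequence is what lets a single masked-prediction objective pretrain all three modalities simultaneously; fine-tuning would swap `mlm_head` for the supervised extraction heads.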

Scale challenge: training on ~10 million (1 crore) documents

  • Pulling all data in bulk was infeasible (terabytes, storage limits)
  • Built a pull-based streaming pipeline: documents fetched from Azure Blob Storage on demand, one batch at a time (see the sketch after this list)
  • Local disk footprint during training: <100MB at any moment regardless of dataset size
  • Enabled 10x–100x scale-up without infrastructure changes
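The pipeline's internals aren't given in the write-up, but a minimal sketch of the pull-based pattern, assuming PyTorch plus the `azure-storage-blob` SDK, looks like this; the connection string, container name, and `preprocess` helper are placeholders:

```python
from azure.storage.blob import ContainerClient
from torch.utils.data import DataLoader, IterableDataset

class BlobStreamDataset(IterableDataset):
    """Fetch documents one at a time; the corpus is never staged on local disk."""

    def __init__(self, conn_str: str, container: str, prefix: str = ""):
        self.client = ContainerClient.from_connection_string(conn_str, container)
        self.prefix = prefix

    def __iter__(self):
        # list_blobs returns a lazy pager, so enumerating millions of blobs is cheap
        for blob in self.client.list_blobs(name_starts_with=self.prefix):
            payload = self.client.download_blob(blob.name).readall()
            yield preprocess(payload)  # hypothetical helper: raw bytes -> model tensors

loader = DataLoader(BlobStreamDataset("<connection-string>", "invoices"), batch_size=32)
```

Because each document is downloaded, batched, and discarded, the local footprint is bounded by the in-flight batch rather than the corpus size, which is what makes the 10x-100x scale-up free of infrastructure changes.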

Deployment stack

  • Seldon DAG for pipeline orchestration (OCR node → extraction model node → output routing)
  • Triton Inference Server for model serving (versioned, multi-model; supports V2/V3/V4 rollouts, illustrated in the sketch after this list)
  • Kubernetes autoscaling for GPU provisioning
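The serving call path isn't shown in the write-up; a hedged sketch of how a client might hit a versioned model on Triton follows, where the endpoint URL, model name, and tensor names are all assumptions that would have to match the deployed `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient

# Endpoint is an assumption; 8000 is Triton's default HTTP port
client = httpclient.InferenceServerClient(url="localhost:8000")

# Input name, shape, and dtype must match the model's config.pbtxt
tokens = httpclient.InferInput("token_ids", [1, 512], "INT64")
tokens.set_data_from_numpy(np.zeros((1, 512), dtype=np.int64))

# Pinning model_version selects a specific rollout (e.g. V3);
# omitting it defers to the model repository's version policy
result = client.infer(model_name="invoice_extractor", inputs=[tokens], model_version="3")
logits = result.as_numpy("field_logits")  # hypothetical output tensor name
```

Versioned serving is what makes the V2/V3/V4 rollouts safe: old and new versions load side by side, and traffic can be shifted per request.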

SLA constraint

  • 20,000 invoices arrive as a single daily burst in the early morning and must be processed within 2 hours.
  • Capacity calculation (worked through below): 4 GPUs handled ~6,000 docs per 2-hour window, so ~14 GPUs were needed for 20,000; 4 large GPU machines were provisioned in the off-peak morning window
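The arithmetic behind the GPU count, spelled out with the numbers above:

```python
import math

burst_size = 20_000              # invoices in the early-morning burst
docs_per_4_gpus = 6_000          # measured: 4 GPUs over the 2-hour window

docs_per_gpu = docs_per_4_gpus / 4                  # 1,500 docs per GPU per window
gpus_needed = math.ceil(burst_size / docs_per_gpu)  # ceil(13.33) = 14
print(gpus_needed)  # 14, covered by the 4 large GPU machines provisioned
```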

Invoice Processing Pipeline

```mermaid
sequenceDiagram
    autonumber
    participant AP as Accounts Payable
    participant OCR as OCR Node
    participant Model as Extraction Model<br/>(Triton)
    participant DB as Client-Declared DB
    participant HQ as Manual Review Queue

    AP->>OCR: Invoice arrives (PDF/image)
    OCR->>Model: Layout tokens + text + bounding boxes
    Model->>DB: Compare 6 extracted fields vs declared values
    alt All 6 match
        DB-->>AP: Auto-cleared ✓
    else Mismatch or high-risk vendor
        DB->>HQ: Route to manual reviewer
        HQ-->>AP: Manual validation
    end
```
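The auto-clearance branch in the diagram reduces to a small routing rule. A sketch with hypothetical field names and vendor flags (the actual 6 fields and the ~20 flagged vendors are not listed in this write-up):

```python
# Hypothetical field names; the write-up only says "6 fields"
FIELDS = ("invoice_no", "vendor", "date", "subtotal", "tax", "total")
ALWAYS_MANUAL_VENDORS = {"vendor_17", "vendor_42"}  # placeholder for the ~20 flagged vendors

def route(extracted: dict, declared: dict) -> str:
    # High-risk vendors never auto-clear, regardless of field agreement
    if extracted.get("vendor") in ALWAYS_MANUAL_VENDORS:
        return "manual_review"
    # Auto-clear only when every extracted field matches the client-declared value
    if all(extracted.get(f) == declared.get(f) for f in FIELDS):
        return "auto_cleared"
    return "manual_review"
```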

Key trade-offs

| Dimension | Decision | Rationale |
|---|---|---|
| Availability/throughput vs. latency | Single-pass inference with manual fallback for low-confidence/high-risk cases | Preserved the 2-hour burst SLA for the overall pipeline without adding multi-pass inference latency |
| Generalization vs. labelling cost | Unsupervised pretraining | Zero per-vendor labelling required |
| Synthetic vs. real data | Real unlabelled corpus | 2-month synthetic experiment failed; real-data statistics couldn't be replicated |
| Infra simplicity vs. scale | Seldon + K8s over Flask | 20k-document daily burst required autoscaled GPU provisioning |
| Coverage vs. risk | Bad-vendor always-manual flag | ~20 low-volume, high-noise vendors explicitly routed to manual; flagging used to push vendors to improve |
| Metric definition | Client-declared values as ground truth | Eliminated annotation cost; auto-clearance is the business outcome |