DoorDash Interview — Project Write-up

3. Technical Considerations

Architecture

  • Three input modalities fused: (1) the document image, (2) OCR-extracted text tokens, (3) bounding-box coordinates of each word
  • No existing open-source model supported all three, so a custom multimodal Transformer architecture was built from scratch (sketched after this list)
  • Pretraining: masked prediction across all three modalities simultaneously
  • Fine-tuning: supervised extraction heads for the 6 target fields
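A minimal sketch of the three-way fusion idea, assuming PyTorch; every dimension, table size, and tensor name here is an illustrative assumption, not the production architecture:

```python
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Fuse OCR text tokens, per-word bounding boxes, and visual patch features."""

    def __init__(self, vocab_size=30_000, d_model=256, n_coord_bins=1_000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        # One embedding table per bbox coordinate (x0, y0, x1, y1), quantized into bins
        self.bbox_emb = nn.ModuleList(nn.Embedding(n_coord_bins, d_model) for _ in range(4))
        # Project pre-extracted visual features for each word crop into the shared space
        self.img_proj = nn.Linear(768, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Pretraining head: recover masked token ids from the fused sequence
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, bboxes, patch_feats):
        # token_ids: (B, T) ints; bboxes: (B, T, 4) bin indices; patch_feats: (B, T, 768)
        h = self.text_emb(token_ids)
        for i, emb in enumerate(self.bbox_emb):
            h = h + emb(bboxes[..., i])         # layout signal, one coordinate at a time
        h = h + self.img_proj(patch_feats)      # visual signal
        return self.mlm_head(self.encoder(h))   # (B, T, vocab_size) logits
```

Summing per-modality embeddings into one token sequence is what lets a single masked-prediction objective pretrain all three modalities simultaneously; fine-tuning would swap `mlm_head` for the supervised extraction heads.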

Scale challenge: training on ~10 million (1 crore) documents

  • Pulling all data in bulk was infeasible (terabytes, storage limits)
  • Built a pull-based streaming pipeline: documents fetched from Azure Blob Storage on demand, one batch at a time (see the sketch after this list)
  • Local disk footprint during training: <100MB at any moment regardless of dataset size
  • Enabled 10x–100x scale-up without infrastructure changes
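The pipeline's internals aren't given in the write-up, but a minimal sketch of the pull-based pattern, assuming PyTorch plus the `azure-storage-blob` SDK, looks like this; the connection string, container name, and `preprocess` helper are placeholders:

```python
from azure.storage.blob import ContainerClient
from torch.utils.data import DataLoader, IterableDataset

class BlobStreamDataset(IterableDataset):
    """Fetch documents one at a time; the corpus is never staged on local disk."""

    def __init__(self, conn_str: str, container: str, prefix: str = ""):
        self.client = ContainerClient.from_connection_string(conn_str, container)
        self.prefix = prefix

    def __iter__(self):
        # list_blobs returns a lazy pager, so enumerating millions of blobs is cheap
        for blob in self.client.list_blobs(name_starts_with=self.prefix):
            payload = self.client.download_blob(blob.name).readall()
            yield preprocess(payload)  # hypothetical helper: raw bytes -> model tensors

loader = DataLoader(BlobStreamDataset("<connection-string>", "invoices"), batch_size=32)
```

Because each document is downloaded, batched, and discarded, the local footprint is bounded by the in-flight batch rather than the corpus size, which is what makes the 10x-100x scale-up free of infrastructure changes.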

Deployment stack

  • Seldon DAG for pipeline orchestration (OCR node → extraction model node → output routing)
  • Triton Inference Server for model serving (versioned, multi-model; supports V2/V3/V4 rollouts, illustrated in the sketch after this list)
  • Kubernetes autoscaling for GPU provisioning
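The serving call path isn't shown in the write-up; a hedged sketch of how a client might hit a versioned model on Triton follows, where the endpoint URL, model name, and tensor names are all assumptions that would have to match the deployed `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient

# Endpoint is an assumption; 8000 is Triton's default HTTP port
client = httpclient.InferenceServerClient(url="localhost:8000")

# Input name, shape, and dtype must match the model's config.pbtxt
tokens = httpclient.InferInput("token_ids", [1, 512], "INT64")
tokens.set_data_from_numpy(np.zeros((1, 512), dtype=np.int64))

# Pinning model_version selects a specific rollout (e.g. V3);
# omitting it defers to the model repository's version policy
result = client.infer(model_name="invoice_extractor", inputs=[tokens], model_version="3")
logits = result.as_numpy("field_logits")  # hypothetical output tensor name
```

Versioned serving is what makes the V2/V3/V4 rollouts safe: old and new versions load side by side, and traffic can be shifted per request.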

SLA constraint

  • 20,000 invoices arrive as a single daily burst in the early morning and must be processed within 2 hours.
  • Capacity calculation (worked through below): 4 GPUs handled ~6,000 docs per 2-hour window, so ~14 GPUs were needed for 20,000; 4 large GPU machines were provisioned in the off-peak morning window
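The arithmetic behind the GPU count, spelled out with the numbers above:

```python
import math

burst_size = 20_000              # invoices in the early-morning burst
docs_per_4_gpus = 6_000          # measured: 4 GPUs over the 2-hour window

docs_per_gpu = docs_per_4_gpus / 4                  # 1,500 docs per GPU per window
gpus_needed = math.ceil(burst_size / docs_per_gpu)  # ceil(13.33) = 14
print(gpus_needed)  # 14, covered by the 4 large GPU machines provisioned
```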

Invoice Processing Pipeline

```mermaid
sequenceDiagram
    autonumber
    participant AP as Accounts Payable
    participant OCR as OCR Node
    participant Model as Extraction Model<br/>(Triton)
    participant DB as Client-Declared DB
    participant HQ as Manual Review Queue

    AP->>OCR: Invoice arrives (PDF/image)
    OCR->>Model: Layout tokens + text + bounding boxes
    Model->>DB: Compare 6 extracted fields vs declared values
    alt All 6 match
        DB-->>AP: Auto-cleared ✓
    else Mismatch or high-risk vendor
        DB->>HQ: Route to manual reviewer
        HQ-->>AP: Manual validation
    end
```
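The auto-clearance branch in the diagram reduces to a small routing rule. A sketch with hypothetical field names and vendor flags (the actual 6 fields and the ~20 flagged vendors are not listed in this write-up):

```python
# Hypothetical field names; the write-up only says "6 fields"
FIELDS = ("invoice_no", "vendor", "date", "subtotal", "tax", "total")
ALWAYS_MANUAL_VENDORS = {"vendor_17", "vendor_42"}  # placeholder for the ~20 flagged vendors

def route(extracted: dict, declared: dict) -> str:
    # High-risk vendors never auto-clear, regardless of field agreement
    if extracted.get("vendor") in ALWAYS_MANUAL_VENDORS:
        return "manual_review"
    # Auto-clear only when every extracted field matches the client-declared value
    if all(extracted.get(f) == declared.get(f) for f in FIELDS):
        return "auto_cleared"
    return "manual_review"
```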

Key trade-offs

| Dimension | Decision | Rationale |
|---|---|---|
| Availability/throughput vs. latency | Single-pass inference with manual fallback for low-confidence/high-risk cases | Preserved the 2-hour burst SLA for the overall pipeline without adding multi-pass inference latency |
| Generalization vs. labelling cost | Unsupervised pretraining | Zero per-vendor labelling required |
| Synthetic vs. real data | Real unlabelled corpus | 2-month synthetic experiment failed; real-data statistics couldn't be replicated |
| Infra simplicity vs. scale | Seldon + K8s over Flask | 20k-document daily burst required autoscaled GPU provisioning |
| Coverage vs. risk | Bad-vendor always-manual flag | ~20 low-volume, high-noise vendors explicitly routed to manual; flagging used to push vendors to improve |
| Metric definition | Client-declared values as ground truth | Eliminated annotation cost; auto-clearance is the business outcome |