DoorDash Interview — Project Write-up

2. Solutions Explored

Evaluation criteria: generalization to unseen vendors, labelling cost, inference latency, maintenance overhead

Option A — Extend the regex system (status quo++)

  • More rules, more conditions per vendor
  • Rejected: O(vendors) maintenance. Does not scale. The root problem is architectural.

Option B — Per-vendor rule segregation

  • Assign each vendor its own extraction rule framework
  • Rejected: Same problem at a different level of abstraction. Still requires per-vendor engineering work.

Option C — RNN-based Named Entity Recognition on OCR output (tried)

  • The DL team's first proposal when the project started: take the OCR bounding-box output (text + coordinates), flatten all boxes into a single linear text sequence as a document proxy, and train an RNN to do Named Entity Recognition over that sequence, tagging the 6 fields
  • Labels were available in this formulation (text sequence + entity spans)
  • Result: accuracy roughly matched the rule-based system (~5% all-6 exact-match) but never exceeded it
  • Root cause: formulation mismatch. Invoice understanding is layout-sensitive, but flattening OCR turns the task into text-only NER and discards geometric context. Even slight rotations, OCR reading-order changes, or template shifts jumble token order and break entity consistency across vendors (a sketch of the flattening step follows this list).
  • Rejected: parity with the baseline is not a viable outcome. This failure showed that improving labels alone would not solve it; the representation itself had to be multimodal, which motivated the next paths.
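
To make the formulation concrete, here is a minimal sketch of that flattening step, assuming OCR output arrives as (text, box) tuples. The function name, the reading-order heuristic, and the BIO tag names are illustrative assumptions, not the actual pipeline code.

```python
from typing import List, Tuple

Box = Tuple[str, float, float, float, float]  # (text, x0, y0, x1, y1)

def flatten_ocr(boxes: List[Box]) -> List[str]:
    """Sort boxes into reading order (top-to-bottom, left-to-right),
    then drop the coordinates.

    This is the step that discards layout: a slight rotation or a
    template shift perturbs the sort keys, reordering the sequence
    the NER tagger sees.
    """
    ordered = sorted(boxes, key=lambda b: (round(b[2], 1), b[1]))  # (y0, x0)
    return [b[0] for b in ordered]

# The flattened tokens would then be BIO-tagged by the RNN for the
# 6 fields, e.g. B-INVOICE_NO, B-TOTAL, ... with no geometry left.
```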

Option D — Synthetic data + supervised model

  • Built a JSON-based synthetic layout framework: recursive rectilinear page partitioning, with each region filled with table/form/text data (see the partitioning sketch after this list). Targeted 10 known vendors.
  • Spent ~2 months on this path. Training accuracy: high. Real-world validation accuracy: poor. Root cause: synthetic data statistics did not match real data — bounding box distributions, field density, and spatial correlations were all off. No statistical metric we tried reliably explained the gap.
  • Rejected as primary path. Synthetic generation library later reused for KYC/ID card extraction (simpler domain, zero real labels, 60–100% accuracy).
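
For flavour, here is a minimal sketch of the recursive rectilinear partitioning idea; the split probabilities, margins, and depth cap are invented for illustration and do not reflect the real JSON-driven generator.

```python
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def partition(rect: Rect, depth: int = 0, max_depth: int = 4) -> List[Rect]:
    """Recursively split a page region into rectilinear sub-regions.

    Each leaf region would then be filled with synthetic table, form,
    or free-text content styled after one of the target vendors.
    """
    x0, y0, x1, y1 = rect
    if depth >= max_depth or random.random() < 0.3:
        return [rect]  # leaf: becomes a content region
    if random.random() < 0.5:  # vertical cut
        cut = random.uniform(x0 + 0.2 * (x1 - x0), x1 - 0.2 * (x1 - x0))
        halves = [(x0, y0, cut, y1), (cut, y0, x1, y1)]
    else:  # horizontal cut
        cut = random.uniform(y0 + 0.2 * (y1 - y0), y1 - 0.2 * (y1 - y0))
        halves = [(x0, y0, x1, cut), (x0, cut, x1, y1)]
    return [leaf for half in halves
            for leaf in partition(half, depth + 1, max_depth)]

regions = partition((0.0, 0.0, 1.0, 1.0))  # unit page -> content regions
```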

Option E ✅ — Unsupervised multimodal pretraining on historical invoices

  • Pretrain on the company's ~1 crore (~10 million) historical invoice corpus (unlabelled)
  • Masking paradigm: randomly mask text tokens, bounding box coordinates, or both → the model predicts the masked elements. Progressively increased the masking ratio until convergence held even at sparse input (see the sketch after this list).
  • Fine-tune on a small set of labelled samples to attach extraction heads
  • This was the winning approach. Fine-tuning accuracy jumped immediately.
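
Below is a minimal sketch of that masking step, assuming LayoutLM-style paired inputs of token ids and bounding boxes. The mask token id, the text/box/both split, and the ratio schedule in the closing comment are assumptions, not the production recipe.

```python
import torch

MASK_TOKEN_ID = 103          # assumed [MASK] id
BBOX_MASK = torch.zeros(4)   # sentinel coordinates for a masked box

def mask_batch(token_ids: torch.Tensor, bboxes: torch.Tensor, ratio: float):
    """Randomly mask text tokens, box coordinates, or both.

    token_ids: (B, L) int tensor; bboxes: (B, L, 4) float tensor.
    Returns corrupted inputs plus the masks needed to compute the
    reconstruction losses at exactly the hidden positions.
    """
    choice = torch.rand(token_ids.shape)                  # one draw per position
    text_mask = choice < ratio / 2                        # text hidden
    box_mask = (choice >= ratio / 2) & (choice < ratio)   # box hidden
    both = choice < ratio * 0.1                           # small slice: both hidden
    text_mask, box_mask = text_mask | both, box_mask | both

    corrupted_ids = token_ids.masked_fill(text_mask, MASK_TOKEN_ID)
    corrupted_boxes = torch.where(box_mask.unsqueeze(-1), BBOX_MASK, bboxes)
    return corrupted_ids, corrupted_boxes, text_mask, box_mask

# Training idea: cross-entropy over the vocab at text-masked positions,
# coordinate regression at box-masked positions, with `ratio` raised on
# a schedule for as long as the loss still converges at sparse input.
```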

What was contentious in the proposal

  • Direction debate (internal): whether to keep pursuing direct supervised extraction only, or invest in unsupervised pretraining first. Alignment was reached using a stop-gate: run on poor-quality vendors first and continue only if all-6 exact-match moved materially above baseline.
  • Build vs buy (cross-team): Finance preferred external vendor purchase for lower delivery risk. We secured six months for in-house delivery by committing to measurable business gates: >=50% validation pass-through within 2 hours and lower burst-window operating cost (~$120/day for ~3 hours compute).

Training Pipeline

```mermaid
flowchart LR
    A[~1 Crore Unlabelled<br/>Historical Invoices] --> B[Unsupervised Pretraining<br/>Mask text + bounding boxes]
    B --> C[General-purpose<br/>Multimodal Transformer]
    C --> D[Fine-tune on<br/>small labelled set]
    D --> E[Invoice Extraction<br/>Model — all vendors]
    E --> F{Client-declared<br/>values match?}
    F -->|All 6 match| G[Auto-cleared ✓<br/>No manual review]
    F -->|Mismatch| H[Manual Review<br/>Queue]
```
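
The decision node F above is, in essence, a six-field exact-match gate. A minimal sketch follows; the field names and the normalization rule are assumed for illustration, and the real matching rules would differ.

```python
FIELDS = ["invoice_number", "invoice_date", "vendor_name",
          "subtotal", "tax", "total"]

def route(extracted: dict, declared: dict) -> str:
    """Auto-clear only on an all-6 exact match against client-declared
    values; any mismatch or missing field goes to manual review."""
    def norm(v):  # naive normalization, assumed for illustration
        return str(v).strip().lower()

    all_match = all(
        f in extracted and f in declared
        and norm(extracted[f]) == norm(declared[f])
        for f in FIELDS
    )
    return "auto_cleared" if all_match else "manual_review"
```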