DoorDash Interview — Project Write-up

2. Solutions Explored

Evaluation criteria: generalization to unseen vendors, labelling cost, inference latency, maintenance overhead

Option A — Extend the regex system (status quo++)

  • More rules, more conditions per vendor
  • Rejected: O(vendors) maintenance. Does not scale. The root problem is architectural.

Option B — Per-vendor rule segregation

  • Assign each vendor its own extraction rule framework
  • Rejected: Same problem at a different level of abstraction. Still requires per-vendor engineering work.

Option C — RNN-based Named Entity Recognition on OCR output (tried)

  • The DL team's first proposal when the project started: take the OCR bounding-box output (text + coordinates), flatten all boxes into a single linear text sequence as a document proxy, and train an RNN to do Named Entity Recognition over that sequence, tagging the 6 fields
  • Labels were available in this formulation (text sequence + entity spans)
  • Result: accuracy roughly matched the rule-based system (~5% all-6 exact-match) but never exceeded it
  • Root cause: formulation mismatch. Invoice understanding is layout-sensitive, but flattening OCR turns the task into text-only NER and discards geometric context. Even slight rotations, OCR reading-order changes, or template shifts jumble token order and break entity consistency across vendors (a sketch of the flattening step follows this list).
  • Rejected: parity with the baseline is not a viable outcome. This failure showed that improving labels alone would not solve it; the representation itself had to be multimodal, which motivated the next paths.
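
To make the formulation concrete, here is a minimal sketch of that flattening step, assuming OCR output arrives as (text, box) tuples. The function name, the reading-order heuristic, and the BIO tag names are illustrative assumptions, not the actual pipeline code.

```python
from typing import List, Tuple

Box = Tuple[str, float, float, float, float]  # (text, x0, y0, x1, y1)

def flatten_ocr(boxes: List[Box]) -> List[str]:
    """Sort boxes into reading order (top-to-bottom, left-to-right),
    then drop the coordinates.

    This is the step that discards layout: a slight rotation or a
    template shift perturbs the sort keys, reordering the sequence
    the NER tagger sees.
    """
    ordered = sorted(boxes, key=lambda b: (round(b[2], 1), b[1]))  # (y0, x0)
    return [b[0] for b in ordered]

# The flattened tokens would then be BIO-tagged by the RNN for the
# 6 fields, e.g. B-INVOICE_NO, B-TOTAL, ... with no geometry left.
```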

Option D — Synthetic data + supervised model

  • Built a JSON-based synthetic layout framework: recursive rectilinear page partitioning, with each region filled with table/form/text data (see the partitioning sketch after this list). Targeted 10 known vendors.
  • Spent ~2 months on this path. Training accuracy: high. Real-world validation accuracy: poor. Root cause: synthetic data statistics did not match real data — bounding box distributions, field density, and spatial correlations were all off. No statistical metric we tried reliably explained the gap.
  • Rejected as primary path. Synthetic generation library later reused for KYC/ID card extraction (simpler domain, zero real labels, 60–100% accuracy).
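
For flavour, here is a minimal sketch of the recursive rectilinear partitioning idea; the split probabilities, margins, and depth cap are invented for illustration and do not reflect the real JSON-driven generator.

```python
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def partition(rect: Rect, depth: int = 0, max_depth: int = 4) -> List[Rect]:
    """Recursively split a page region into rectilinear sub-regions.

    Each leaf region would then be filled with synthetic table, form,
    or free-text content styled after one of the target vendors.
    """
    x0, y0, x1, y1 = rect
    if depth >= max_depth or random.random() < 0.3:
        return [rect]  # leaf: becomes a content region
    if random.random() < 0.5:  # vertical cut
        cut = random.uniform(x0 + 0.2 * (x1 - x0), x1 - 0.2 * (x1 - x0))
        halves = [(x0, y0, cut, y1), (cut, y0, x1, y1)]
    else:  # horizontal cut
        cut = random.uniform(y0 + 0.2 * (y1 - y0), y1 - 0.2 * (y1 - y0))
        halves = [(x0, y0, x1, cut), (x0, cut, x1, y1)]
    return [leaf for half in halves
            for leaf in partition(half, depth + 1, max_depth)]

regions = partition((0.0, 0.0, 1.0, 1.0))  # unit page -> content regions
```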

Option E ✅ — Unsupervised multimodal pretraining on historical invoices

  • Pretrain on the company's ~1 crore (~10 million) historical invoice corpus (unlabelled)
  • Masking paradigm: randomly mask text tokens, bounding box coordinates, or both → the model predicts the masked elements. Progressively increased the masking ratio until convergence held even at sparse input (see the sketch after this list).
  • Fine-tune on a small set of labelled samples to attach extraction heads
  • This was the winning approach. Fine-tuning accuracy jumped immediately.
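
Below is a minimal sketch of that masking step, assuming LayoutLM-style paired inputs of token ids and bounding boxes. The mask token id, the text/box/both split, and the ratio schedule in the closing comment are assumptions, not the production recipe.

```python
import torch

MASK_TOKEN_ID = 103          # assumed [MASK] id
BBOX_MASK = torch.zeros(4)   # sentinel coordinates for a masked box

def mask_batch(token_ids: torch.Tensor, bboxes: torch.Tensor, ratio: float):
    """Randomly mask text tokens, box coordinates, or both.

    token_ids: (B, L) int tensor; bboxes: (B, L, 4) float tensor.
    Returns corrupted inputs plus the masks needed to compute the
    reconstruction losses at exactly the hidden positions.
    """
    choice = torch.rand(token_ids.shape)                  # one draw per position
    text_mask = choice < ratio / 2                        # text hidden
    box_mask = (choice >= ratio / 2) & (choice < ratio)   # box hidden
    both = choice < ratio * 0.1                           # small slice: both hidden
    text_mask, box_mask = text_mask | both, box_mask | both

    corrupted_ids = token_ids.masked_fill(text_mask, MASK_TOKEN_ID)
    corrupted_boxes = torch.where(box_mask.unsqueeze(-1), BBOX_MASK, bboxes)
    return corrupted_ids, corrupted_boxes, text_mask, box_mask

# Training idea: cross-entropy over the vocab at text-masked positions,
# coordinate regression at box-masked positions, with `ratio` raised on
# a schedule for as long as the loss still converges at sparse input.
```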

What was contentious in the proposal

  • Direction debate (internal): whether to keep pursuing direct supervised extraction only, or invest in unsupervised pretraining first. Alignment was reached using a stop-gate: run on poor-quality vendors first and continue only if all-6 exact-match moved materially above baseline.
  • Build vs buy (cross-team): Finance preferred external vendor purchase for lower delivery risk. We secured six months for in-house delivery by committing to measurable business gates: >=50% validation pass-through within 2 hours and lower burst-window operating cost (~$120/day for ~3 hours compute).

Training Pipeline

```mermaid
flowchart LR
    A[~1 Crore Unlabelled<br/>Historical Invoices] --> B[Unsupervised Pretraining<br/>Mask text + bounding boxes]
    B --> C[General-purpose<br/>Multimodal Transformer]
    C --> D[Fine-tune on<br/>small labelled set]
    D --> E[Invoice Extraction<br/>Model — all vendors]
    E --> F{Client-declared<br/>values match?}
    F -->|All 6 match| G[Auto-cleared ✓<br/>No manual review]
    F -->|Mismatch| H[Manual Review<br/>Queue]
```
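
The decision node F above is, in essence, a six-field exact-match gate. A minimal sketch follows; the field names and the normalization rule are assumed for illustration, and the real matching rules would differ.

```python
FIELDS = ["invoice_number", "invoice_date", "vendor_name",
          "subtotal", "tax", "total"]

def route(extracted: dict, declared: dict) -> str:
    """Auto-clear only on an all-6 exact match against client-declared
    values; any mismatch or missing field goes to manual review."""
    def norm(v):  # naive normalization, assumed for illustration
        return str(v).strip().lower()

    all_match = all(
        f in extracted and f in declared
        and norm(extracted[f]) == norm(declared[f])
        for f in FIELDS
    )
    return "auto_cleared" if all_match else "manual_review"
```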