DoorDash Interview — Project Write-up
2. Solutions Explored
Evaluation criteria: generalization to unseen vendors, labelling cost, inference latency, maintenance overhead
Option A — Extend the regex system (status quo++)
- More rules, more conditions per vendor
- Rejected: maintenance cost grows linearly with vendor count (O(vendors)) and does not scale. The root problem is architectural.
Option B — Per-vendor rule segregation
- Assign each vendor its own extraction rule framework
- Rejected: Same problem at a different level of abstraction. Still requires per-vendor engineering work.
Option C — RNN-based Named Entity Recognition on OCR output (tried)
- The DL team's first proposal when the project started: take OCR bounding-box output (text + coordinates), flatten all boxes into a single linear text sequence as a document proxy, and train an RNN to tag the 6 fields via Named Entity Recognition over that sequence
- Labels were available in this formulation (text sequence + entity spans)
- Result: accuracy roughly matched the rule-based system (~5% all-6 exact-match). Did not exceed it.
- Root cause: formulation mismatch. Invoice understanding is layout-sensitive, but flattening OCR turns it into text-only NER and discards geometric context. Even slight rotations, OCR reading-order changes, or template shifts jumble token order and break entity consistency across vendors.
- Rejected: parity with the baseline is not a viable outcome. This failure showed that improving labels alone would not solve it; the representation itself had to be multimodal, which motivated the next paths.
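The fragility described above is easiest to see in the flattening step itself. A minimal sketch, under assumed names (`OcrBox`, `flatten_ocr` are illustrative, not the actual pipeline): boxes are sorted into reading order by coordinates and collapsed into one token sequence, so any rotation or template shift that perturbs coordinates reorders the sequence the NER model sees.

```python
# Hypothetical sketch of the Option C flattening step: OCR boxes
# (text + coordinates) are collapsed into a linear token sequence,
# discarding all geometric context.
from dataclasses import dataclass

@dataclass
class OcrBox:
    text: str
    x: float  # left edge of the box
    y: float  # top edge of the box

def flatten_ocr(boxes, line_tol=10.0):
    """Group boxes into lines by quantized y, then read left-to-right.
    A slight rotation or template shift changes the quantized keys and
    scrambles the token order -- the failure mode noted above."""
    ordered = sorted(boxes, key=lambda b: (round(b.y / line_tol), b.x))
    return [b.text for b in ordered]

invoice = [
    OcrBox("Total:", 300, 400), OcrBox("$42.00", 360, 401),
    OcrBox("Invoice", 20, 10), OcrBox("#123", 90, 12),
]
print(flatten_ocr(invoice))  # ['Invoice', '#123', 'Total:', '$42.00']
```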
Option D — Synthetic data + supervised model
- Built a JSON-based synthetic layout framework: recursive rectilinear page partitioning, each region filled with table/form/text data. Targeted 10 known vendors.
- Spent ~2 months on this path. Training accuracy: high. Real-world validation accuracy: poor. Root cause: synthetic data statistics did not match real data — bounding box distributions, field density, and spatial correlations were all off. No statistical metric we tried reliably explained the gap.
- Rejected as primary path. Synthetic generation library later reused for KYC/ID card extraction (simpler domain, zero real labels, 60–100% accuracy).
Option E ✅ — Unsupervised multimodal pretraining on historical invoices
- Pretrain on the company's ~1 crore (~10 million) historical invoice corpus (unlabelled)
- Masking paradigm: randomly mask text tokens, bounding box coordinates, or both → model predicts the masked elements. Progressively increased masking ratio until convergence held even at sparse input.
- Fine-tune on a small set of labelled samples to attach extraction heads
- This was the winning approach. Fine-tuning accuracy jumped immediately.
What was contentious in the proposal
- Direction debate (internal): whether to keep pursuing direct supervised extraction only, or invest in unsupervised pretraining first. Alignment was reached using a stop-gate: run on poor-quality vendors first and continue only if all-6 exact-match moved materially above baseline.
- Build vs buy (cross-team): Finance preferred external vendor purchase for lower delivery risk. We secured six months for in-house delivery by committing to measurable business gates: >=50% validation pass-through within 2 hours and lower burst-window operating cost (~$120/day for ~3 hours compute).
Training Pipeline
flowchart LR
A[~1 Crore Unlabelled<br/>Historical Invoices] --> B[Unsupervised Pretraining<br/>Mask text + bounding boxes]
B --> C[General-purpose<br/>Multimodal Transformer]
C --> D[Fine-tune on<br/>small labelled set]
D --> E[Invoice Extraction<br/>Model — all vendors]
E --> F{Client-declared<br/>values match?}
F -->|All 6 match| G[Auto-cleared ✓<br/>No manual review]
F -->|Mismatch| H[Manual Review<br/>Queue]