DoorDash Interview — Project Write-up

1. Problem Statement

  • Customer: Accounts Payable and Finance Ops teams at Jio Infocomm, processing ~20,000 vendor invoices per day
  • Scale: ~120 employees dedicated solely to manual invoice validation
  • Existing system: OCR → text file → regex rules + if/else conditions extracting 6 fields per invoice:
      • Invoice Number
      • PO Number
      • Invoice Date
      • PO Date
      • Invoice Amount (numeric)
      • Invoice Amount in Words
  • Why it was broken: The regex file was a growing monolith: every new vendor required new rules and new conditions, which left the system rigid and unmaintainable. Accuracy across all vendors, counting an invoice as correct only when all 6 fields matched exactly, was ~5%
  • Why critical: Invoice volume was growing. The options were to hire more people indefinitely, buy an external solution, or replace the internal system with a better one. The Finance team preferred the buy option for its lower delivery risk, but we proposed an in-house build with a six-month stop-gate to demonstrate improved accuracy and lower operating cost.
  • My role: Lead. Proposed the architecture, set staged stop-gates for delivery risk, coordinated across Finance, Infra/DevOps, and PM. Team of 4. Timeline: 6 months to demonstrate improved accuracy
  • What made it uniquely hard:
      • Early labelled data existed for a text-only NER prototype, but that approach underperformed because it flattened the OCR output and discarded the spatial layout signal; scaling up labels for that wrong formulation would not have fixed the core issue
      • No off-the-shelf open-source model handled all 3 modalities (image + text + bounding boxes). Microsoft LayoutLM was not publicly licensed at the time
      • ~1 crore (10 million) historical invoices were available but entirely unlabelled
  • Cross-team leadership moments:
      • Engineering leadership initially pushed for direct supervised modeling with immediate business proof. I proposed a staged stop-gated plan (pilot on poor-quality vendors first, then scale only if it beat baseline).
      • Finance stakeholders pushed to buy an external solution. We aligned on a six-month in-house gate: >=50% pass-through in a 2-hour burst window with lower operating cost.
  • Intended outcome: A generalizable extraction system that works across all vendors, including unseen ones, with zero per-vendor configuration
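
To make the "all 6 fields must match exactly" accuracy figure concrete, here is a minimal sketch of how such a metric could be computed. The field keys and function names are hypothetical, chosen for illustration; the write-up does not describe the actual evaluation code.

```python
from typing import Dict, List

# The six fields named in the write-up (the snake_case keys are an assumption).
FIELDS = [
    "invoice_number",
    "po_number",
    "invoice_date",
    "po_date",
    "invoice_amount",
    "invoice_amount_in_words",
]


def all_fields_match(predicted: Dict[str, str], expected: Dict[str, str]) -> bool:
    """Return True only if every one of the six fields matches exactly."""
    return all(predicted.get(field) == expected[field] for field in FIELDS)


def exact_match_accuracy(
    predictions: List[Dict[str, str]], labels: List[Dict[str, str]]
) -> float:
    """Fraction of invoices for which all six fields are correct."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(all_fields_match(p, e) for p, e in zip(predictions, labels))
    return hits / len(labels)
```

Under this definition a single wrong field makes the whole invoice count as a miss, so the metric can be very low even when individual fields are often extracted correctly.
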
Solution Summary
  • Unsupervised multimodal pretraining on the company's historical invoice corpus.
  • Followed by supervised fine-tuning on a small labelled set.
  • Deployed on Seldon + Triton with Kubernetes autoscaling to meet bursty load.
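
To illustrate what keeping layout alongside text could look like at the input level, here is a minimal sketch. It assumes the OCR step yields each word with a pixel bounding box. All names are hypothetical, and the 0-1000 coordinate grid is an assumption borrowed from common practice in layout-aware models, not a detail confirmed by this write-up.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class LayoutToken:
    text: str
    box: Box  # coordinates rescaled to a page-size-independent grid


def normalize_box(box: Box, page_width: int, page_height: int, grid: int = 1000) -> Box:
    """Rescale pixel coordinates so pages of different sizes are comparable."""
    x0, y0, x1, y1 = box
    return (
        int(grid * x0 / page_width),
        int(grid * y0 / page_height),
        int(grid * x1 / page_width),
        int(grid * y1 / page_height),
    )


def to_layout_tokens(
    ocr_words: List[Tuple[str, Box]], page_width: int, page_height: int
) -> List[LayoutToken]:
    """Keep each word paired with its normalized position on the page."""
    return [
        LayoutToken(text, normalize_box(box, page_width, page_height))
        for text, box in ocr_words
    ]


def flatten(ocr_words: List[Tuple[str, Box]]) -> str:
    """The text-only view: word positions are thrown away."""
    return " ".join(text for text, _ in ocr_words)
```

Two pages with the same words in different positions produce identical output from flatten but different LayoutToken lists, which is the signal the text-only formulation lost.
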