DoorDash Interview — Project Write-up

Yeshwanth Reddy · y.yesh.r@gmail.com
Project: Transformer-based Invoice Extraction Platform · Jio Infocomm (Reliance)
Role: Manager, Data Science · Dec 2021 – May 2024

1. Problem Statement

  • Customer: Accounts Payable and Finance Ops teams at Jio Infocomm, processing ~20,000 vendor invoices per day
  • Scale: ~120 employees dedicated solely to manual invoice validation
  • Existing system: OCR → text file → regex rules + if/else conditions extracting 6 fields per invoice
    • Invoice Number
    • PO Number
    • Invoice Date
    • PO Date
    • Invoice Amount (numeric)
    • Invoice Amount in Words
  • Why it was broken: The regex file was a growing monolith: every new vendor required new rules and new conditions. The system was rigid and unmaintainable. All-6 exact-match accuracy across all vendors: ~5%
  • Why critical: Invoice volume was growing. The only options were to hire more people indefinitely, buy an external solution, or replace the internal system with a better one. The Finance team preferred the buy option for lower delivery risk, but we proposed an in-house build with a six-month stop-gate to demonstrate improved accuracy and lower operating cost.
  • My role: Lead. Proposed the architecture, set staged stop-gates for delivery risk, coordinated across Finance, Infra/DevOps, and PM. Team of 4. Timeline: 6 months to demonstrate improved accuracy
  • What made it uniquely hard:
    • Early labelled data existed for a text-only NER prototype, but that approach underperformed because it flattened OCR output and discarded spatial layout signal; scaling labels on that wrong formulation would not fix the core issue
    • No off-the-shelf open-source model handled all 3 modalities (image + text + bounding boxes). Microsoft LayoutLM was not publicly licensed at the time
    • ~1 crore (~10 million) historical invoices available but entirely unlabelled
  • Cross-team leadership moments:
    • Engineering leadership initially pushed for direct supervised modeling with immediate business proof. I proposed a staged stop-gated plan (pilot on poor-quality vendors first, then scale only if it beat baseline).
    • Finance stakeholders pushed to buy an external solution. We aligned on a six-month in-house gate: >=50% pass-through in a 2-hour burst window with lower operating cost.
  • Intended outcome: A generalizable extraction system that works across all vendors, including unseen ones, with zero per-vendor configuration

Solution Summary

  • Unsupervised multimodal pretraining on the company's historical invoice corpus.
  • Followed by supervised fine-tuning on a small labelled set.
  • Deployed on Seldon + Triton with Kubernetes autoscaling to meet bursty load.

2. Solutions Explored

Evaluation criteria: generalization to unseen vendors, labelling cost, inference latency, maintenance overhead

Option A — Extend the regex system (status quo++)

  • More rules, more conditions per vendor
  • Rejected: O(vendors) maintenance. Does not scale. The root problem is architectural.

Option B — Per-vendor rule segregation

  • Assign each vendor its own extraction rule framework
  • Rejected: Same problem at a different level of abstraction. Still requires per-vendor engineering work.

Option C — RNN-based Named Entity Recognition on OCR output (tried)

  • First DL team proposal when the project started: take OCR bounding-box output (text + coordinates), flatten all boxes into a single linear text sequence as a document proxy, and train an RNN to do Named Entity Recognition over that sequence to tag the 6 fields (a toy sketch of this flattening follows this option's list)
  • Labels were available in this formulation (text sequence + entity spans)
  • Result: accuracy roughly matched the rule-based system (~5% all-6 exact-match). Did not exceed it.
  • Root cause: formulation mismatch. Invoice understanding is layout-sensitive, but flattening OCR turns it into text-only NER and discards geometric context. Even slight rotations, OCR reading-order changes, or template shifts jumble token order and break entity consistency across vendors.
  • Rejected: parity with the baseline is not a viable outcome. This failure showed that improving labels alone would not solve it; the representation itself had to be multimodal, which motivated the next paths.
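A toy illustration of the failure mode, with hypothetical values rather than project code: flattening boxes into a "reading order" sequence makes the input order sensitive to small geometric shifts, which is exactly what varies across scans and vendors.

```python
# Toy illustration (not the original code): flattening OCR boxes into one text
# sequence by "reading order". A small vertical shift from scan skew moves a
# token into a different row bucket and jumbles the sequence the NER model sees.

def flatten_reading_order(boxes):
    """boxes: list of (text, x_left, y_top). Returns the flattened text sequence."""
    ordered = sorted(boxes, key=lambda b: (round(b[2], -1), b[1]))  # bucket y into rows, then sort by x
    return " ".join(text for text, _, _ in ordered)

clean_scan = [("Invoice", 40, 100), ("No:", 110, 100), ("INV-993", 180, 100),
              ("Date:", 40, 140), ("12-Mar-2023", 120, 140)]

# Same invoice, slightly skewed: "INV-993" drops 36px and lands in the next row bucket.
skewed_scan = [("Invoice", 40, 100), ("No:", 110, 100), ("INV-993", 180, 136),
               ("Date:", 40, 140), ("12-Mar-2023", 120, 140)]

print(flatten_reading_order(clean_scan))   # Invoice No: INV-993 Date: 12-Mar-2023
print(flatten_reading_order(skewed_scan))  # Invoice No: Date: 12-Mar-2023 INV-993
```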

Option D — Synthetic data + supervised model

  • Built a JSON-based synthetic layout framework: recursive rectilinear page partitioning, each region filled with table/form/text data (sketched below). Targeted 10 known vendors.
  • Spent ~2 months on this path. Training accuracy: high. Real-world validation accuracy: poor. Root cause: synthetic data statistics did not match real data — bounding box distributions, field density, and spatial correlations were all off. No statistical metric we tried reliably explained the gap.
  • Rejected as primary path. Synthetic generation library later reused for KYC/ID card extraction (simpler domain, zero real labels, 60–100% accuracy).
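A hedged sketch of the recursive rectilinear partitioning idea; the function names, size thresholds, and content types here are illustrative assumptions, not the original framework.

```python
# Hypothetical sketch: split a page rectangle into nested regions with random
# vertical/horizontal cuts, and tag each leaf region with a content type.
import random

CONTENT_TYPES = ["table", "form", "text", "logo"]

def partition(rect, depth=0, max_depth=3, rng=random):
    """rect = (x0, y0, x1, y1) in page coordinates. Returns a list of leaf regions."""
    x0, y0, x1, y1 = rect
    too_small = (x1 - x0) < 200 or (y1 - y0) < 150
    if depth >= max_depth or too_small or rng.random() < 0.3:
        return [{"bbox": rect, "content": rng.choice(CONTENT_TYPES)}]
    if rng.random() < 0.5:                      # vertical cut
        cut = rng.uniform(x0 + 100, x1 - 100)
        halves = [(x0, y0, cut, y1), (cut, y0, x1, y1)]
    else:                                       # horizontal cut
        cut = rng.uniform(y0 + 75, y1 - 75)
        halves = [(x0, y0, x1, cut), (x0, cut, x1, y1)]
    return [leaf for half in halves for leaf in partition(half, depth + 1, max_depth, rng)]

layout = partition((0, 0, 2480, 3508))          # A4 page at 300 DPI
print(len(layout), layout[0])
```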

Option E ✅ — Unsupervised multimodal pretraining on historical invoices

  • Pretrain on the company's ~1 crore historical invoice corpus (unlabelled)
  • Masking paradigm: randomly mask text tokens, bounding box coordinates, or both → model predicts the masked elements. Progressively increased masking ratio until convergence held even at sparse input (a minimal sketch of the masking scheme follows this list).
  • Fine-tune on a small set of labelled samples to attach extraction heads
  • This was the winning approach. Fine-tuning accuracy jumped immediately.
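A minimal sketch of the masking scheme under stated assumptions: a reserved mask token id, a placeholder masked box, per-word masking, and a simple ratio schedule. The production tokenization and loss heads are not shown.

```python
# Sketch (assumptions, not the production code) of multimodal masking: each OCR
# word has a token id and a 4-number box; we randomly mask the token, the box,
# or both, and the pretraining targets are the masked-out values.
import random

MASK_TOKEN = 0            # assumed id reserved for [MASK]
MASK_BOX = [0, 0, 0, 0]   # assumed placeholder for a masked bounding box

def mask_example(token_ids, boxes, mask_ratio=0.15, rng=random):
    inputs, targets = [], []
    for tok, box in zip(token_ids, boxes):
        if rng.random() < mask_ratio:
            mode = rng.choice(["text", "box", "both"])
            masked_tok = MASK_TOKEN if mode in ("text", "both") else tok
            masked_box = MASK_BOX if mode in ("box", "both") else box
            inputs.append((masked_tok, masked_box))
            targets.append((tok, box, mode))      # what the model must reconstruct
        else:
            inputs.append((tok, box))
            targets.append(None)                  # no loss on unmasked positions
    return inputs, targets

# The masking ratio was raised progressively; e.g. a schedule like:
for ratio in (0.15, 0.30, 0.50):
    inp, tgt = mask_example([101, 2243, 7592, 102], [[5, 5, 40, 20]] * 4, ratio)
```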

What was contentious in the proposal

  • Direction debate (internal): whether to keep pursuing direct supervised extraction only, or invest in unsupervised pretraining first. Alignment was reached using a stop-gate: run on poor-quality vendors first and continue only if all-6 exact-match moved materially above baseline.
  • Build vs buy (cross-team): Finance preferred external vendor purchase for lower delivery risk. We secured six months for in-house delivery by committing to measurable business gates: >=50% validation pass-through within 2 hours and lower burst-window operating cost (~$120/day for ~3 hours compute).

Training Pipeline

flowchart LR
    A[~1 Crore Unlabelled<br/>Historical Invoices] --> B[Unsupervised Pretraining<br/>Mask text + bounding boxes]
    B --> C[General-purpose<br/>Multimodal Transformer]
    C --> D[Fine-tune on<br/>small labelled set]
    D --> E[Invoice Extraction<br/>Model — all vendors]
    E --> F{Client-declared<br/>values match?}
    F -->|All 6 match| G[Auto-cleared ✓<br/>No manual review]
    F -->|Mismatch| H[Manual Review<br/>Queue]

3. Technical Considerations

Architecture

  • Three input modalities fused: (1) document image, (2) OCR-extracted text tokens, (3) bounding box coordinates of each word
  • No existing open-source model supported all three, so we built a custom multimodal Transformer architecture from scratch (a minimal fusion sketch follows this list)
  • Pretraining: masked prediction across all three modalities simultaneously
  • Fine-tuning: supervised extraction heads for the 6 fields
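A hedged sketch of the fusion idea in PyTorch, with illustrative layer sizes and 0-to-1000 coordinate bucketing: each OCR word gets one embedding built from its token id, its box corners, and a visual feature assumed to be pooled from the image region under the box.

```python
# Illustrative only: how three modalities can be fused into one per-word embedding
# that a standard Transformer encoder then consumes. Dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalWordEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, hidden=512, coord_buckets=1001, img_feat_dim=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, hidden)
        self.x_emb = nn.Embedding(coord_buckets, hidden)   # shared for x0/x1
        self.y_emb = nn.Embedding(coord_buckets, hidden)   # shared for y0/y1
        self.img_proj = nn.Linear(img_feat_dim, hidden)    # per-word visual feature -> hidden
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, boxes, roi_feats):
        # token_ids: (B, T), boxes: (B, T, 4) ints in [0, 1000], roi_feats: (B, T, img_feat_dim)
        x0, y0, x1, y1 = boxes.unbind(-1)
        fused = (self.text_emb(token_ids)
                 + self.x_emb(x0) + self.x_emb(x1)
                 + self.y_emb(y0) + self.y_emb(y1)
                 + self.img_proj(roi_feats))
        return self.norm(fused)   # fed into a standard Transformer encoder

emb = MultimodalWordEmbedding()
out = emb(torch.randint(0, 30000, (2, 128)),
          torch.randint(0, 1001, (2, 128, 4)),
          torch.randn(2, 128, 256))
print(out.shape)  # torch.Size([2, 128, 512])
```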

Scale challenge — training on ~1 crore documents

  • Pulling all data in bulk was infeasible (terabytes, storage limits)
  • Built a pull-based streaming pipeline: documents fetched from Azure Blob Storage on demand, one batch at a time (sketched below)
  • Local disk footprint during training: <100MB at any moment regardless of dataset size
  • Enabled 10x–100x scale-up without infrastructure changes
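A minimal sketch of the pull-based pattern, assuming the azure-storage-blob SDK and hypothetical container and connection names; the real pipeline's prefetching and retry logic is omitted.

```python
# Illustrative sketch: list blob names once, then fetch and yield one batch at a
# time so local disk/memory stays bounded regardless of corpus size.
import itertools
from azure.storage.blob import BlobServiceClient

def stream_batches(conn_str, container_name, prefix="invoices/", batch_size=32):
    container = BlobServiceClient.from_connection_string(conn_str) \
        .get_container_client(container_name)
    names = (b.name for b in container.list_blobs(name_starts_with=prefix))
    while True:
        batch_names = list(itertools.islice(names, batch_size))
        if not batch_names:
            break
        # Download just this batch; the bytes go to preprocessing and are then dropped.
        yield [container.get_blob_client(n).download_blob().readall() for n in batch_names]

# for batch in stream_batches(CONN_STR, "historical-invoices"):   # hypothetical names
#     train_step(preprocess(batch))                               # hypothetical training hook
```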

Deployment stack

  • Seldon DAG for pipeline orchestration (OCR node → extraction model node → output routing)
  • Triton Inference Server for model serving (versioned, multi-model, supports V2/V3/V4 rollouts)
  • Kubernetes autoscaling for GPU provisioning

SLA constraint

  • 20,000 invoices arrive as a daily burst (early morning). Must process within 2 hours.
  • Capacity math: 4 GPUs processed ~6,000 docs per 2 hours in testing, so ~14 GPUs were needed for the 20,000-doc burst; provisioned as 4 large GPU machines in the off-peak morning window (worked through below)
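The sizing written out as a quick check; the 4-GPUs-per-machine packing is an assumption for illustration.

```python
# Back-of-envelope GPU sizing for the 2-hour burst window.
measured_docs_per_2h = 6000      # observed on 4 GPUs
measured_gpus = 4
target_docs = 20000              # daily burst

docs_per_gpu_per_2h = measured_docs_per_2h / measured_gpus      # 1500
gpus_needed = target_docs / docs_per_gpu_per_2h                 # ~13.3 -> ~14 GPUs
machines = -(-gpus_needed // 4)  # ceiling division, assuming 4 GPUs per machine
print(round(gpus_needed, 1), int(machines))                     # 13.3 4
```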

Invoice Processing Pipeline

sequenceDiagram
    autonumber
    participant AP as Accounts Payable
    participant OCR as OCR Node
    participant Model as Extraction Model<br/>(Triton)
    participant DB as Client-Declared DB
    participant HQ as Manual Review Queue

    AP->>OCR: Invoice arrives (PDF/image)
    OCR->>Model: Layout tokens + text + bounding boxes
    Model->>DB: Compare 6 extracted fields vs declared values
    alt All 6 match
        DB-->>AP: Auto-cleared ✓
    else Mismatch or high-risk vendor
        DB->>HQ: Route to manual reviewer
        HQ-->>AP: Manual validation
    end
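The routing rule the diagram encodes, as a small sketch; the field keys and string normalization are illustrative, and the production comparison is likely stricter about dates and amounts.

```python
# Hypothetical sketch of the auto-clearance decision: all six extracted fields
# must match the client-declared values, and the vendor must not be high-risk.
FIELDS = ["invoice_number", "po_number", "invoice_date", "po_date",
          "invoice_amount", "invoice_amount_in_words"]

def route(extracted, declared, high_risk_vendor=False):
    def norm(value):
        return str(value).strip().lower()
    all_six_match = all(norm(extracted.get(f)) == norm(declared.get(f)) for f in FIELDS)
    return "auto_cleared" if all_six_match and not high_risk_vendor else "manual_review"

declared = {f: "sample" for f in FIELDS}
mismatched = dict(declared, invoice_amount="different")      # one field off
print(route(declared, declared))                             # auto_cleared
print(route(mismatched, declared))                           # manual_review
print(route(declared, declared, high_risk_vendor=True))      # manual_review
```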

Key trade-offs

| Dimension | Decision | Rationale |
| --- | --- | --- |
| Availability/throughput vs. latency | Single-pass inference with manual fallback for low-confidence/high-risk cases | Preserved 2-hour burst SLA for the overall pipeline without adding multi-pass inference latency |
| Generalization vs. labelling cost | Unsupervised pretraining | Zero per-vendor labelling required |
| Synthetic vs. real data | Real unlabelled corpus | 2-month synthetic experiment failed; real data statistics can't be replicated |
| Infra simplicity vs. scale | Seldon + K8s over Flask | 20k bursty daily load required autoscaled GPU provisioning |
| Coverage vs. risk | Bad-vendor always-manual flag | ~20 low-volume, high-noise vendors explicitly routed to manual; flagging used to push vendors to improve |
| Metric definition | Client-declared values as ground truth | Eliminated annotation cost; auto-clearance decision is the business outcome |

4. Measuring Success

Metric definition: All-6 exact-match rate — all six fields correct on the same invoice, verified against client-declared values. This directly maps to the auto-clearance business outcome.

Source: Finance team weekly Excel sheets (invoice-level disposition: auto-cleared vs. manual)

| Metric | Baseline | Result |
| --- | --- | --- |
| All-6 exact-match rate (all vendors) | ~5% | ~50% at launch → ~65% at handover |
| Day-1 auto-clearance rate | 0% (fully manual) | ~50% |
| Manual validation team burden | 120 people | ~50% reduction (est.) |
| Burst-window operating cost | n/a | ~$120/day for ~3-hour compute window |

How the number moved

  • Day 1 post-deployment: ~50% auto-clearance
  • Each subsequent model release (V2 → V3 → V4) increased the rate
  • Reached ~65% by project handover
  • Trajectory was sustained — not a one-day spike
  • Iteration trigger each cycle: after 1-2 days of vendor traffic, we reviewed low-accuracy vendor slices, manually identified missing augmentation patterns, added those augmentations, and retrained. This repeated loop drove V2/V3/V4 gains.

Guardrails (controls outside the model)

  • High-value invoices above an amount threshold: always routed to manual regardless of model output
  • ~200 auto-cleared invoices sampled daily by Finance for random audit
  • Bad-vendor list (~20 vendors, identified via letterhead autoencoder detection): always flagged as high-risk (one possible reading of this check is sketched after this list)
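One plausible reading of the letterhead-autoencoder check, sketched with assumed crop size, latent dimension, and threshold; the actual detector and how it was trained are not detailed above.

```python
# Hypothetical sketch: encode the letterhead crop with an autoencoder's encoder,
# compare the embedding to stored embeddings of the ~20 known-bad vendors, and
# flag the invoice as high-risk if it is close to any of them.
import torch
import torch.nn as nn

class LetterheadAE(nn.Module):
    def __init__(self, latent=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 256, 512), nn.ReLU(),
                                 nn.Linear(512, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(),
                                 nn.Linear(512, 64 * 256))

    def forward(self, x):                 # x: (B, 64, 256) grayscale letterhead crop
        z = self.enc(x)
        return z, self.dec(z).view_as(x)  # reconstruction is used only during training

def is_high_risk(crop, model, bad_vendor_embeddings, threshold=5.0):
    with torch.no_grad():
        z, _ = model(crop.unsqueeze(0))                        # (1, latent)
        dists = torch.cdist(z, bad_vendor_embeddings)          # (1, n_bad_vendors)
        return bool((dists.min() < threshold).item())

model = LetterheadAE()
bad_embs = torch.randn(20, 64)            # placeholder embeddings of bad-vendor letterheads
print(is_high_risk(torch.rand(64, 256), model, bad_embs))
```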

5. Key Learnings

Learning 1 — Synthetic data is a last resort, not a starting point

Spent ~2 months on synthetic-supervised training before pivoting. The failure signal was persistent: high training accuracy, poor real validation, no statistical metric that reliably explained the gap. We kept building more complex analysis instead of changing strategy. The correct trigger to pivot was earlier: if N training cycles show no stable improvement on real data, the approach is wrong — not the measurement. Now: in data-scarce document domains, I default to self-supervised pretraining on real unlabelled data first.

Learning 2 — De-risk research bets with explicit business stop-gates

When leadership is skeptical of longer-horizon ML directions, architecture quality alone is not persuasive. What worked was converting the approach into a staged business commitment: limited pilot scope, pre-agreed threshold, and kill criteria. I now use this pattern by default for contentious initiatives: define the gate first, then ask for runway.