---
title: "Extract Structured Data from Messy Text with AI (Invoices, Leads)"
description: "Extract invoices, leads, and dates from messy text into clean JSON using n8n AI nodes on AgentRoost. No API key needed — LLM credits included."
canonical: https://agentroost.app/en/blog/ai-data-extraction-invoices-leads-n8n
date: 2026-05-15T20:00:00Z
---

If you've ever tried to parse an invoice email with regex, you already know the pain: one vendor adds a newline before the total, another spells "Invoice #" as "INV-", and suddenly you have seventeen edge-case branches and a suppressed scream.

A schema-constrained LLM prompt handles that variance far better. You tell the model *exactly* which fields you need by putting a JSON schema in the prompt, and it returns the same typed structure for nearly every input, even when the text looks like it was written by three different people in two different languages on a bad day. The validation step later in this guide catches the occasional miss.

This guide walks through the complete n8n workflow: trigger, extraction prompt, validation, and downstream routing. By the end you'll have a reusable pattern you can point at invoices, contact-form submissions, lead emails, or any other free-form text.

---

## Why Schema-Constrained Extraction Beats Regex

Regex works when the format is fully predictable. Text written by humans, or extracted from PDFs, rarely is.
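To make the brittleness concrete, here is a minimal sketch. The pattern and both sample strings are invented for illustration and are not taken from any real vendor.

```javascript
// Hypothetical regex tuned to one vendor's layout: "Total: $1,234.50"
const TOTAL_PATTERN = /^Total:\s*\$([\d,]+\.\d{2})$/m;

function extractTotalWithRegex(text) {
  const match = text.match(TOTAL_PATTERN);
  if (!match) return null;
  // Drop thousands separators before converting to a number
  return Number(match[1].replace(/,/g, ''));
}

const vendorA = 'Invoice #1001\nTotal: $1,234.50';
const vendorB = 'INV-1002\nTotal due\n1.234,50 EUR';

console.log(extractTotalWithRegex(vendorA)); // 1234.5
console.log(extractTotalWithRegex(vendorB)); // null
```

The second sample carries the same information, but a line break and a European number format are enough to defeat the pattern.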

An LLM with a JSON schema constraint:

- Handles synonyms and spelling variants automatically ("Due date", "Payment due", "Fälligkeitsdatum")
- Returns `null` for missing fields instead of silently skipping them
- Can infer a field from context ("Net 30" → a due date 30 days from the invoice date)
- Produces the same key names every time, making downstream routing trivial

The schema constraint is the important part. Without it, the model might rename keys, omit fields, or wrap values in prose. With it, you get a contract you can validate against.
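Inferred values are worth double-checking with plain code. As a rough sketch, the "Net 30" case from the list above could be recomputed like this; `dueDateFromTerms` is a hypothetical helper, and it assumes the invoice date is already an ISO 8601 string.

```javascript
// Sketch: recompute a due date from payment terms such as "Net 30".
// Assumes invoiceDate is an ISO 8601 date string (YYYY-MM-DD).
function dueDateFromTerms(invoiceDate, terms) {
  const match = /^net\s*(\d+)$/i.exec(String(terms).trim());
  if (!match) return null; // terms we do not understand, e.g. "Due on receipt"

  const date = new Date(invoiceDate + 'T00:00:00Z');
  if (Number.isNaN(date.getTime())) return null; // unparseable invoice date

  // Work in UTC so the local timezone cannot shift the calendar day
  date.setUTCDate(date.getUTCDate() + Number(match[1]));
  return date.toISOString().slice(0, 10);
}

console.log(dueDateFromTerms('2026-03-15', 'Net 30')); // 2026-04-14
```

If the model's `due_date` disagrees with the recomputed value, that is a good reason to send the record to review.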

---

## The n8n Workflow — Step by Step

### 1. Trigger: Receive the Raw Text

Pick your entry point. Common choices:

- **Webhook node** — receives a POST from your form tool, CRM, or email parser (e.g., Postmark inbound, Cloudmailin)
- **Email Trigger (IMAP)** — polls a mailbox and fires on new messages
- **Schedule Trigger + HTTP Request** — polls an API endpoint for new submissions

For invoices forwarded by email, the Email Trigger works well. Set the **Format** to `Text` to get a clean string. For a contact form, a Webhook node with `Body Content Type: JSON` and a `text` field is the simplest setup.

Either way, you'll have a node that outputs something like:

```
{{ $json.text }}
// or
{{ $json.body }}
```

Name the node **"Raw Input"** so it is easy to find in the execution log and to reference by name from later nodes.

---

### 2. Set Node: Normalize the Input

Before hitting the AI node, normalize with a **Set** node:

| Field | Value |
|---|---|
| `rawText` | `{{ $json.text.trim() }}` |
| `sourceType` | `invoice` (or `lead`, `contact`, etc.) |

This keeps the AI prompt clean and lets you branch later on `sourceType` if you handle multiple document types in one workflow.
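If trimming alone is not enough, the same step can be done in a Code node. The sketch below is one possible version: the `text` and `body` field names follow the examples above, while the whitespace rules and the `normalizeInput` name are assumptions.

```javascript
// Sketch of the normalization step as a plain function.
function normalizeInput(json, sourceType = 'invoice') {
  // Prefer `text`, fall back to `body` (field names from the trigger examples)
  const candidate = typeof json.text === 'string' ? json.text : json.body;
  if (typeof candidate !== 'string' || candidate.trim() === '') {
    throw new Error('No text found in input item');
  }
  const rawText = candidate
    .replace(/\r\n?/g, '\n')     // unify line endings
    .replace(/[ \t]+/g, ' ')     // collapse runs of spaces and tabs
    .replace(/\n{3,}/g, '\n\n')  // keep at most one blank line in a row
    .trim();
  return { rawText, sourceType };
}

const sample = normalizeInput({ text: '  Invoice #1001\r\n\r\n\r\n\r\nTotal:\t\t$50.00  ' });
console.log(JSON.stringify(sample.rawText)); // "Invoice #1001\n\nTotal: $50.00"
```

Inside an n8n Code node you would call it on `$input.first().json` and return the result wrapped as `[{ json: ... }]`.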

---

### 3. AI / LLM Node: Schema-Constrained Extraction

Add an **AI Agent** node (or the **OpenAI** / **Anthropic** node if you prefer direct calls). On AgentRoost your own n8n instance has AI credits already wired in — no API key to paste.

**System prompt:**

```
You are a data extraction assistant. Extract the requested fields from the input text and return ONLY a valid JSON object matching the schema below. If a field is not present in the text, set it to null. Do not add commentary, markdown fences, or extra keys.

Schema:
{
  "invoice_number": "string or null",
  "vendor_name": "string or null",
  "invoice_date": "ISO 8601 date string or null",
  "due_date": "ISO 8601 date string or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "number or null",
      "unit_price": "number or null",
      "total": "number or null"
    }
  ],
  "subtotal": "number or null",
  "tax": "number or null",
  "total_due": "number or null",
  "currency": "3-letter ISO 4217 code or null"
}
```

**User message:**

```
{{ $json.rawText }}
```

A few practical notes:
- Listing "or null" in the schema description gives the model an explicit way out, which makes it less likely to invent a value for a missing field.
- Asking for ISO 8601 dates means you get `2026-03-15`, not "March 15th" — directly comparable in downstream logic.
- Asking for numeric `unit_price` (not a string) means you can do math on it in later nodes without a parse step.
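The model usually honors these type requests, but it is cheap to confirm. The following is a minimal sketch of such a check, assuming `invoice` is the already-parsed object with the field names from the schema above; `isIsoDate` and `checkInvoiceTypes` are hypothetical helpers.

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function isIsoDate(value) {
  if (typeof value !== 'string' || !ISO_DATE.test(value)) return false;
  const parsed = new Date(value + 'T00:00:00Z');
  // Reject impossible dates such as 2026-02-30 (invalid or rolled over)
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

function checkInvoiceTypes(invoice) {
  const problems = [];
  for (const key of ['invoice_date', 'due_date']) {
    if (invoice[key] !== null && !isIsoDate(invoice[key])) {
      problems.push(`${key} is not an ISO 8601 date`);
    }
  }
  for (const key of ['subtotal', 'tax', 'total_due']) {
    if (invoice[key] !== null && !Number.isFinite(invoice[key])) {
      problems.push(`${key} is not a number`);
    }
  }
  // ISO dates sort lexicographically, so plain string comparison works
  if (isIsoDate(invoice.invoice_date) && isIsoDate(invoice.due_date)
      && invoice.due_date < invoice.invoice_date) {
    problems.push('due_date is before invoice_date');
  }
  return problems;
}
```

An empty array means the types look right; anything else is a list of reasons you can attach to the record before routing it.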

For **lead extraction** the schema looks different:

```json
{
  "first_name": "string or null",
  "last_name": "string or null",
  "email": "email string or null",
  "company": "string or null",
  "phone": "string or null",
  "interest": "string or null",
  "urgency": "high | medium | low | null"
}
```

Swap the schema in the system prompt. Same workflow, same pattern.
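Lead fields benefit from a light cleanup pass as well. Here is a sketch under stated assumptions: the email pattern is deliberately loose rather than a full RFC 5322 validator, and `cleanLead` is a hypothetical helper name.

```javascript
const URGENCY_LEVELS = ['high', 'medium', 'low'];
const LOOSE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function cleanLead(lead) {
  const email = typeof lead.email === 'string' ? lead.email.trim().toLowerCase() : null;
  return {
    ...lead,
    // Anything that does not look like an address becomes null
    email: email && LOOSE_EMAIL.test(email) ? email : null,
    // Anything outside the allowed set becomes null
    urgency: URGENCY_LEVELS.includes(lead.urgency) ? lead.urgency : null,
  };
}

console.log(cleanLead({ first_name: 'Ana', email: ' Ana@Example.com ', urgency: 'URGENT!!' }));
// { first_name: 'Ana', email: 'ana@example.com', urgency: null }
```

Coercing unexpected values to `null` keeps the downstream contract intact: every key is present, and every value is either valid or explicitly missing.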

---

### 4. Code Node: Parse and Validate

The AI node returns a string. Parse it and validate the required fields before routing:

```javascript
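// The reply field depends on the node you used: adjust `text` / `output` to match.
const item = $input.first().json;
const raw = String(item.text ?? item.output ?? '');

let parsed;
try {
  // Strip accidental markdown fences (with or without a language tag) if the model slips
  const cleaned = raw.trim().replace(/^```[a-z]*\s*/i, '').replace(/\s*```$/, '');
  parsed = JSON.parse(cleaned);
} catch (e) {
  throw new Error('AI returned non-JSON: ' + raw.substring(0, 200));
}

// JSON.parse also accepts arrays, strings, and null, so check the shape
if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
  throw new Error('AI returned JSON that is not an object');
}

// Require at least one identifying field. A total_due of 0 counts as present.
const has = (value) => value != null && value !== '';
if (!has(parsed.invoice_number) && !has(parsed.vendor_name) && !has(parsed.total_due)) {
  throw new Error('Extraction returned no usable fields');
}

return [{ json: parsed }];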
```

Throwing an error stops the execution, and n8n surfaces it in the execution log so you can inspect the raw input that caused the failure.
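Parsing confirms the shape of the output, not the arithmetic. One possible extra check, sketched below, compares the extracted amounts against each other. The 0.01 tolerance is an assumed rounding allowance and `checkInvoiceMath` is a hypothetical helper.

```javascript
function checkInvoiceMath(invoice, tolerance = 0.01) {
  const issues = [];
  const close = (a, b) => Math.abs(a - b) <= tolerance;
  const items = Array.isArray(invoice.line_items) ? invoice.line_items : [];

  // Each line: quantity x unit_price should match the line total
  items.forEach((item, index) => {
    const { quantity, unit_price, total } = item ?? {};
    if (quantity != null && unit_price != null && total != null
        && !close(quantity * unit_price, total)) {
      issues.push(`line ${index + 1}: quantity x unit_price != total`);
    }
  });

  // Line totals should add up to the subtotal (only if all are known)
  const totals = items.map((item) => item?.total);
  if (invoice.subtotal != null && items.length > 0
      && totals.every((t) => t != null)
      && !close(totals.reduce((sum, t) => sum + t, 0), invoice.subtotal)) {
    issues.push('line item totals do not add up to subtotal');
  }

  // Subtotal plus tax should match the amount due
  if (invoice.subtotal != null && invoice.tax != null && invoice.total_due != null
      && !close(invoice.subtotal + invoice.tax, invoice.total_due)) {
    issues.push('subtotal + tax != total_due');
  }
  return issues;
}
```

Checks are skipped whenever one of the operands is `null`, so a partially extracted invoice is not flagged for arithmetic it cannot be held to.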

---

### 5. IF Node: Route by Extraction Quality

Add an **IF** node to branch on how complete the extraction is:

- **Branch A (clean):** `total_due` is not empty AND `invoice_date` is not empty
- **Branch B (partial):** everything else — route to a "needs review" Google Sheet or send yourself a Slack/email alert

This is the pattern that makes AI extraction production-safe: you don't blindly trust the output, you route uncertain results to a human review queue.
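If you would rather keep the condition in code, the same branching can be sketched as a small function. "Not empty" is read here as neither `null`, `undefined`, nor an empty string, which is an assumption about how you want to treat those values; `routeInvoice` is a hypothetical name.

```javascript
function isPresent(value) {
  return value !== null && value !== undefined && value !== '';
}

function routeInvoice(invoice) {
  // Fields that Branch A requires
  const missing = ['total_due', 'invoice_date'].filter((key) => !isPresent(invoice[key]));
  return missing.length === 0
    ? { branch: 'clean', missing }
    : { branch: 'review', missing };
}

console.log(routeInvoice({ total_due: 0, invoice_date: '2026-03-15' }));
// { branch: 'clean', missing: [] }
console.log(routeInvoice({ total_due: null, invoice_date: '2026-03-15' }));
// { branch: 'review', missing: [ 'total_due' ] }
```

Returning the list of missing fields alongside the branch gives the reviewer a reason, not just a flag. Note that a `total_due` of `0` counts as present.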

---

### 6. Downstream: Write to Sheets, CRM, or Database

From Branch A, add your sink:

- **Google Sheets node** — append a row. Map `{{ $json.vendor_name }}`, `{{ $json.invoice_date }}`, `{{ $json.total_due }}`, `{{ $json.currency }}` to columns.
- **HubSpot / Pipedrive node** — create or update a contact using the extracted lead fields.
- **HTTP Request node** — POST the clean JSON to your internal API or Airtable.
- **Postgres / MySQL node** — insert a row directly if you're feeding a finance system.
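Most of these sinks want a flat record rather than nested JSON. A minimal sketch of that flattening step follows; the column list is hypothetical and should be matched to your own sheet or table.

```javascript
// Hypothetical column order for the target sheet or table
const COLUMNS = ['vendor_name', 'invoice_number', 'invoice_date', 'due_date', 'total_due', 'currency'];

function toRow(invoice) {
  const row = {};
  for (const column of COLUMNS) {
    // Empty string instead of null so spreadsheet cells stay blank
    row[column] = invoice[column] ?? '';
  }
  // Nested line items are summarized as a count for the flat row
  row.line_item_count = Array.isArray(invoice.line_items) ? invoice.line_items.length : 0;
  return row;
}

console.log(toRow({ vendor_name: 'Acme GmbH', total_due: 119, currency: 'EUR', line_items: [{}, {}] }));
```

The line items themselves could go to a second sheet or table keyed by `invoice_number` if you need them downstream.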

If you chose the Webhook trigger in step 1, the workflow is also reachable at a **public HTTPS URL** on your n8n instance, so external tools (your email parser, your form tool, your Shopify webhook) can push directly to it without polling.

---

## Tips and Common Pitfalls

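**Chunk long documents.** If an invoice PDF converts to 4,000 words, the extraction prompt gets expensive and noisy. Use a Code node to split the text by page and merge the results, or, if the fields you need all sit at the top, send only the first 2,000 characters. Keep in mind that invoice totals usually appear near the end, so blind truncation can cut them off.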
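One way to split without cutting sentences in half is to break on blank lines. The sketch below assumes paragraphs are separated by blank lines; the 2,000-character default mirrors the figure above, and `chunkText` is a hypothetical helper.

```javascript
function chunkText(text, maxChars = 2000) {
  const paragraphs = text.split(/\n{2,}/);
  const chunks = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? current + '\n\n' + paragraph : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate; // still fits, keep growing this chunk
      continue;
    }
    if (current) chunks.push(current);
    // A single oversized paragraph is hard-split at the limit
    let rest = paragraph;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText('aaaa\n\nbbbb\n\ncccccccccc', 10));
// [ 'aaaa\n\nbbbb', 'cccccccccc' ]
```

Each chunk can then go through the extraction prompt separately, with the non-null fields merged afterwards.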

**Test with adversarial samples first.** Run the workflow on your five messiest real examples before wiring it to production data. The failure modes are always in the edge cases.

**Don't trust currency symbols alone.** Ask for ISO 4217 codes ("USD", "EUR") in the schema. "$" is ambiguous (USD? CAD? AUD?). The model will usually infer correctly if you ask explicitly.
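A small allow-list makes this enforceable after extraction. In the sketch below the list is an illustrative subset of ISO 4217, not the full standard, and `normalizeCurrency` is a hypothetical helper.

```javascript
// Illustrative subset of ISO 4217; extend it with the currencies you bill in
const KNOWN_CURRENCIES = new Set(['USD', 'EUR', 'GBP', 'CAD', 'AUD', 'CHF', 'JPY']);

function normalizeCurrency(value) {
  if (typeof value !== 'string') return null;
  const code = value.trim().toUpperCase();
  return KNOWN_CURRENCIES.has(code) ? code : null;
}

console.log(normalizeCurrency('eur')); // EUR
console.log(normalizeCurrency('$'));   // null
```

A bare symbol that slips through becomes `null`, which the review branch can then pick up.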

**Version your system prompt.** Keep the prompt text in a **Set** node at the top of the workflow, not hardcoded in the AI node. That way you can update the extraction schema without hunting through node config.
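Taking this one step further, the schema can live as data and the prompt can be generated from it. This is only a sketch: `PROMPT_VERSION` and `buildSystemPrompt` are hypothetical names, and the instruction text is the one from step 3.

```javascript
const PROMPT_VERSION = 'lead-v1'; // hypothetical tag, bump it when the schema changes

const LEAD_SCHEMA = {
  first_name: 'string or null',
  last_name: 'string or null',
  email: 'email string or null',
  company: 'string or null',
  phone: 'string or null',
  interest: 'string or null',
  urgency: 'high | medium | low | null',
};

function buildSystemPrompt(schema) {
  return [
    'You are a data extraction assistant. Extract the requested fields from the input text '
      + 'and return ONLY a valid JSON object matching the schema below. If a field is not '
      + 'present in the text, set it to null. Do not add commentary, markdown fences, or extra keys.',
    '',
    'Schema:',
    JSON.stringify(schema, null, 2),
  ].join('\n');
}

console.log(PROMPT_VERSION);
console.log(buildSystemPrompt(LEAD_SCHEMA));
```

Storing the version tag next to each extracted record makes it possible to tell later which prompt produced which output.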

**Use the IF node liberally.** A `null` in `total_due` when processing invoices is a signal, not a failure. Route it to review rather than silently dropping the record.

---

## How to Run This on AgentRoost

You get your own private n8n instance — your login, your workflows, your data — at `https://<your-id>.agentroost.app`. The AI nodes come with credits already included. No OpenAI key to manage, no separate billing account to set up.

**To get started:**

1. Sign up at [agentroost.app](/en/agents/n8n)
2. Pick the **n8n** framework, name your instance
3. Your private n8n editor opens in about two minutes
4. Import the workflow above (or build it node by node)
5. The AI node already has credits — paste your system prompt, test with a real invoice

Pricing starts at **$19.99/mo all-in** — that's compute, AI credits, SSL, public webhook URL, and no DevOps. The 14-day money-back guarantee means you can test the full extraction workflow on real data before committing.

[Compare plans and see what's included](/en/pricing)

---

The extraction pattern in this guide — normalize input, schema-constrain the prompt, parse and validate, route by quality, write downstream — works for invoices, leads, support tickets, product feedback, and any other free-form text your business generates. Build it once, point it at the next messy text source, swap the schema.
