---
title: "Automate Invoice PDF Extraction with n8n and AI"
description: "n8n workflow that auto-extracts invoice fields from PDFs with an AI node, writes results to Google Sheets. AI credits included, no API key needed."
canonical: https://agentroost.app/en/blog/invoice-pdf-data-extraction-n8n-ai
date: 2026-06-11T04:00:00Z
---

[Canonical URL](https://agentroost.app/en/blog/invoice-pdf-data-extraction-n8n-ai)

Manual invoice processing is one of those tasks that looks simple until you're staring at your 40th PDF of the month. The amounts are there, the vendor names are there, the line items are there — but they're locked inside a file. Getting them into a spreadsheet means copy-paste, typos, and wasted time.

n8n solves this cleanly. You can wire together a workflow that monitors a Gmail label or an email inbox, pulls the PDF attachment, runs it through an AI node that understands the layout of the invoice, and pushes each field into a Google Sheet row — fully automated, running 24/7.

This guide walks you through building exactly that workflow, node by node.

---

## What the Finished Workflow Does

1. Triggers when a new email arrives (or on a schedule, scanning a folder).
2. Downloads the PDF attachment.
3. Extracts the raw text from the PDF.
4. Sends that text to an AI/LLM node with a structured extraction prompt.
5. Parses the JSON output.
6. Appends a new row to a Google Sheet with vendor, date, invoice number, line items, subtotal, tax, and total.

No vision API subscription. No Python script. No cron job on a server you have to maintain. Just your own n8n instance and a workflow.

---

## Building the Workflow, Node by Node

### 1. Gmail Trigger (or IMAP Email node)

Use the **Gmail Trigger** node if you're on Google Workspace. Set it to watch a specific label — for example, `invoices` — so only relevant emails fire the workflow.

```json
{
  "event": "messageReceived",
  "filters": {
    "label": "invoices"
  }
}
```

If your team uses a shared Outlook inbox, swap in the **IMAP Email** node instead. Set poll interval to something reasonable: every 5–15 minutes works for most accounting workflows.

### 2. Extract the Attachment

After the trigger, add a **Set** node (or use the built-in `$json.attachments` path) to isolate the attachment. The Gmail Trigger surfaces `attachments` as an array. Use an **IF** node to confirm the attachment MIME type is `application/pdf` before proceeding — this prevents the workflow from choking on emails that have images or no attachments at all.

```
IF  {{ $json.attachments[0].mimeType === "application/pdf" }}
  → true branch: continue
  → false branch: stop and log a note
```

### 3. Extract Text from the PDF

n8n has an **Extract from File** node that handles PDF text extraction natively — no external service needed. Connect it to the binary attachment data from the email node.

Set:
- **Operation**: Extract Text
- **Input Binary Field**: `data` (the default field name for email attachments)

The node outputs a `text` field containing everything the PDF parser could read. For most machine-generated invoices (i.e., invoices created by accounting software, not scanned paper), this is completely reliable. For scanned/image PDFs you'd need an OCR step — but the majority of vendor invoices today are digitally generated and parse cleanly.

### 4. The AI Node — the Core of the Workflow

This is where the real work happens. Add an **AI / LLM** node (in n8n's node palette it appears under "AI" — the exact label depends on your n8n version, but look for the OpenAI, Anthropic, or "Basic LLM Chain" node).

Set the **System Prompt** to something like:

```
You are an invoice parser. Extract the following fields from the invoice text and return them as valid JSON. Return ONLY the JSON object, no explanation.

Fields to extract:
- vendor_name (string)
- invoice_number (string)
- invoice_date (ISO 8601 date string)
- due_date (ISO 8601 date string, or null if not present)
- line_items (array of objects with: description, quantity, unit_price, total)
- subtotal (number)
- tax (number)
- total (number)
- currency (3-letter ISO code, e.g. "USD")

If a field is not present, use null.
```

Set the **User Prompt** to pass in the extracted text:
```
{{ $json.text }}
```

Choose your model. GPT-4o-mini is a solid default for this task — it handles structured extraction well and is fast. On AgentRoost, this model is available from your included credits; you don't need an OpenAI API key.

### 5. Parse the JSON Output

The AI node returns a string. Add a **Code** node (JavaScript) to parse it:

```javascript
const raw = $input.first().json.text;
// Strip potential markdown code fences
const cleaned = raw.replace(/```json|```/g, '').trim();
return [{ json: JSON.parse(cleaned) }];
```

This handles cases where the model wraps the JSON in a code block, which some models do by habit.

### 6. Write to Google Sheets

Add the **Google Sheets** node.

- **Operation**: Append Row
- **Spreadsheet**: Your invoice log spreadsheet
- **Sheet**: Sheet1 (or whatever you named it)

Map the fields from the parsed JSON:

| Sheet Column | n8n Expression |
|---|---|
| Vendor | `{{ $json.vendor_name }}` |
| Invoice # | `{{ $json.invoice_number }}` |
| Date | `{{ $json.invoice_date }}` |
| Due Date | `{{ $json.due_date }}` |
| Subtotal | `{{ $json.subtotal }}` |
| Tax | `{{ $json.tax }}` |
| Total | `{{ $json.total }}` |
| Currency | `{{ $json.currency }}` |

For line items, you have two options: write them as a JSON string into a single "Line Items" column, or fan out with a **Loop Over Items** node to write one row per line item. The single-column approach is simpler and easier to expand later.

---

## Common Pitfalls

**The AI returns slightly different keys on each run.** Solve this by being explicit in your prompt: list every key name exactly, and add `Return ONLY the JSON object` to suppress surrounding text. Testing with 10–15 real invoices before going live will reveal any edge cases.

**The PDF parser gets blank output.** This usually means the PDF is image-based (a scan). You can route those to a different branch with an IF node: if `text.length < 50`, send yourself a Slack or email notification to handle manually.

**Date format inconsistencies.** Vendors send dates as `Jan 15, 2025`, `15/01/2025`, `2025-01-15`, and countless other formats. Tell the AI to normalize to ISO 8601 in the prompt. Still validate on the Google Sheets side by formatting the column as plain text, then parsing it with `=DATEVALUE()` separately.

**Duplicate processing.** If you use a polling trigger, the same email can fire more than once. Add a label to processed emails in a final **Gmail** node step (`addLabel: "invoice-processed"`) and filter those out at the trigger.

---

## Running This on AgentRoost

This workflow is a perfect fit for your own n8n instance on AgentRoost. Here's why the economics work out:

Every plan includes AI/LLM credits. The AI node that does the extraction — the core of this workflow — is already paid for. On every other n8n option (n8n Cloud, self-hosting on a VPS, Elestio, Sliplane) you bring your own OpenAI key and pay per-call on top of your hosting. On AgentRoost, that AI usage is bundled into the $19.99/mo all-in price.

**How to get started:**

1. Go to [agentroost.app](/en/agents/n8n) and sign up with email, Google, Microsoft, or Discord.
2. Pick the **n8n** framework, give your instance a name.
3. Your private n8n editor opens at `https://<your-id>.agentroost.app` — this is yours. Your workflows, your credentials, your data.
4. Import the workflow above (or build it from scratch — the nodes are all in the default palette).
5. Connect your Gmail credential and Google Sheets credential in n8n's credential manager.
6. Activate the workflow. It runs 24/7.

The AI nodes are pre-wired to the included credits. You pick the model, set the prompt, and it works. No API key dialog, no usage dashboard in a separate tab, no surprise bill.

14-day money-back guarantee. Monthly billing, cancel anytime.

[Compare plans and get started →](/en/pricing)

---

## What to Build Next

Once your invoice data is landing cleanly in Google Sheets, the natural extension is to add a downstream step: send a Slack or Teams message when a high-value invoice arrives, create a draft reply in Gmail confirming receipt, or push the data into your accounting software via its API. Each of those is just one more node in the same workflow.

The AI extraction node you built here also generalises. Receipts, purchase orders, delivery notes — any structured document that arrives as a PDF can be processed the same way, with a different system prompt.
