How to Build an AI-Powered Form That Extracts Data from PDFs

Tuesday 24/02/2026

·11 min read
Share:

You have a stack of invoices, receipts, or contracts, and you need the data inside them in a structured format. Copy-pasting fields manually is soul-crushing. OCR tools give you raw text with no structure. What you actually want is to upload a PDF and get back clean, typed JSON that you can drop straight into a form.

Claude can read PDFs natively — no OCR layer, no text extraction step. You send the file, describe the structure you want, and get back JSON. Here's how to build the full pipeline in Next.js: file upload, PDF processing with Claude, and a dynamic form that renders the extracted fields for review and editing.

How Claude handles PDFs

Claude accepts PDFs as base64-encoded content in the document source type. Unlike vision-based approaches that screenshot each page, Claude processes the actual document content — so it handles multi-page PDFs, tables, and small text reliably. There's a 32MB file size limit and a maximum of 100 pages per request.

The API call looks like this:

// src/lib/extract-pdf.ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

export async function extractFromPDF<T>(
    pdfBase64: string,
    schema: SchemaField[],
    documentType: string
): Promise<T> {
    const schemaDescription = schema
        .map((f) => `- "${f.name}" (${f.type}): ${f.description}`)
        .join('\n')

    const response = await client.messages.create({
        model: 'claude-sonnet-4-5-20250929',
        max_tokens: 4096,
        messages: [
            {
                role: 'user',
                content: [
                    {
                        type: 'document',
                        source: {
                            type: 'base64',
                            media_type: 'application/pdf',
                            data: pdfBase64,
                        },
                    },
                    {
                        type: 'text',
                        text: `Extract data from this ${documentType}. Return ONLY valid JSON matching this schema:\n\n${schemaDescription}\n\nIf a field is not found in the document, use null. Do not include any text outside the JSON object.`,
                    },
                ],
            },
        ],
    })

    const text = response.content[0]
    if (text.type !== 'text') {
        throw new Error('Unexpected response type from Claude')
    }

    // Strip markdown code fences if Claude wraps the JSON
    const cleaned = text.text.replace(/^```json\n?/, '').replace(/\n?```$/, '')
    return JSON.parse(cleaned) as T
}

The document content block is the key piece. You pass type: 'document' with a base64 source, and Claude reads the PDF directly. No need to convert pages to images first.

Defining extraction schemas

You need a way to describe what fields to extract. This schema drives both the Claude prompt and the form rendering:

// src/lib/schemas.ts
export interface SchemaField {
    name: string
    type: 'string' | 'number' | 'date' | 'boolean' | 'string[]'
    description: string
    required: boolean
}

export const invoiceSchema: SchemaField[] = [
    {
        name: 'vendor_name',
        type: 'string',
        description: 'Company or person who issued the invoice',
        required: true,
    },
    {
        name: 'invoice_number',
        type: 'string',
        description: 'Invoice or receipt number',
        required: true,
    },
    {
        name: 'date',
        type: 'date',
        description: 'Invoice date in YYYY-MM-DD format',
        required: true,
    },
    {
        name: 'due_date',
        type: 'date',
        description: 'Payment due date in YYYY-MM-DD format',
        required: false,
    },
    {
        name: 'total_amount',
        type: 'number',
        description: 'Total amount due including tax',
        required: true,
    },
    {
        name: 'currency',
        type: 'string',
        description: 'Currency code (e.g., USD, EUR, ILS)',
        required: true,
    },
    {
        name: 'line_items',
        type: 'string[]',
        description: 'List of item descriptions on the invoice',
        required: false,
    },
    {
        name: 'tax_amount',
        type: 'number',
        description: 'Tax amount if listed separately',
        required: false,
    },
]

You can create different schemas for different document types — contracts, receipts, applications, whatever. The same extraction function works for all of them because Claude adapts to the prompt.

The API route

The Next.js API route handles the file upload, converts it to base64, and calls Claude:

// src/pages/api/extract.ts
import type { NextApiRequest, NextApiResponse } from 'next'
import formidable from 'formidable'
import fs from 'fs'
import { extractFromPDF } from '@/src/lib/extract-pdf'
import { invoiceSchema, SchemaField } from '@/src/lib/schemas'

export const config = {
    api: { bodyParser: false },
}

const schemas: Record<string, SchemaField[]> = {
    invoice: invoiceSchema,
}

export default async function handler(
    req: NextApiRequest,
    res: NextApiResponse
) {
    if (req.method !== 'POST') {
        return res.status(405).json({ error: 'Method not allowed' })
    }

    const form = formidable({ maxFileSize: 32 * 1024 * 1024 })

    try {
        const [fields, files] = await form.parse(req)
        const file = files.pdf?.[0]
        const documentType = fields.documentType?.[0] || 'invoice'

        if (!file) {
            return res.status(400).json({ error: 'No PDF file uploaded' })
        }

        if (file.mimetype !== 'application/pdf') {
            return res.status(400).json({ error: 'File must be a PDF' })
        }

        const fileBuffer = fs.readFileSync(file.filepath)
        const pdfBase64 = fileBuffer.toString('base64')

        const schema = schemas[documentType] || invoiceSchema
        const extracted = await extractFromPDF(
            pdfBase64,
            schema,
            documentType
        )

        // Clean up temp file
        fs.unlinkSync(file.filepath)

        return res.status(200).json({
            data: extracted,
            schema,
        })
    } catch (error) {
        console.error('Extraction error:', error)
        const message =
            error instanceof Error
                ? error.message
                : 'Failed to extract data'
        return res.status(500).json({ error: message })
    }
}

Gotcha: You must disable Next.js body parsing with export const config = { api: { bodyParser: false } } for file uploads to work with formidable. Without this, Next.js tries to parse the multipart body itself and formidable gets an empty request.

Install the dependencies:

pnpm add @anthropic-ai/sdk formidable
pnpm add -D @types/formidable

The upload and form component

Now the frontend. The component handles three states: upload, processing, and review. When Claude returns the extracted data, it renders a form with editable fields so the user can correct anything Claude got wrong.

// src/components/PDFExtractor.tsx
import { useState, FormEvent, ChangeEvent } from 'react'

interface SchemaField {
    name: string
    type: 'string' | 'number' | 'date' | 'boolean' | 'string[]'
    description: string
    required: boolean
}

interface ExtractionResult {
    data: Record<string, unknown>
    schema: SchemaField[]
}

export default function PDFExtractor() {
    const [file, setFile] = useState<File | null>(null)
    const [loading, setLoading] = useState(false)
    const [error, setError] = useState<string | null>(null)
    const [result, setResult] = useState<ExtractionResult | null>(null)
    const [formData, setFormData] = useState<Record<string, unknown>>({})

    async function handleUpload(e: FormEvent) {
        e.preventDefault()
        if (!file) return

        setLoading(true)
        setError(null)

        const body = new FormData()
        body.append('pdf', file)
        body.append('documentType', 'invoice')

        try {
            const res = await fetch('/api/extract', {
                method: 'POST',
                body,
            })

            if (!res.ok) {
                const err = await res.json()
                throw new Error(err.error || 'Extraction failed')
            }

            const extraction: ExtractionResult = await res.json()
            setResult(extraction)
            setFormData(extraction.data)
        } catch (err) {
            setError(
                err instanceof Error
                    ? err.message
                    : 'Something went wrong'
            )
        } finally {
            setLoading(false)
        }
    }

    function handleFieldChange(name: string, value: unknown) {
        setFormData((prev) => ({ ...prev, [name]: value }))
    }

    function handleSubmit() {
        // Send formData to your backend, database, etc.
        console.log('Submitting:', formData)
    }

    if (result) {
        return (
            <div>
                <h2>Review Extracted Data</h2>
                <p>Check the fields below and correct anything that looks wrong.</p>
                <form onSubmit={(e) => { e.preventDefault(); handleSubmit() }}>
                    {result.schema.map((field) => (
                        <FieldInput
                            key={field.name}
                            field={field}
                            value={formData[field.name]}
                            onChange={(val) =>
                                handleFieldChange(field.name, val)
                            }
                        />
                    ))}
                    <button type="submit">
                        Confirm & Save
                    </button>
                    <button
                        type="button"
                        onClick={() => {
                            setResult(null)
                            setFile(null)
                        }}
                    >
                        Upload Different PDF
                    </button>
                </form>
            </div>
        )
    }

    return (
        <div>
            <form onSubmit={handleUpload}>
                <label htmlFor="pdf-upload">
                    Upload a PDF invoice
                </label>
                <input
                    id="pdf-upload"
                    type="file"
                    accept=".pdf"
                    onChange={(e: ChangeEvent<HTMLInputElement>) =>
                        setFile(e.target.files?.[0] || null)
                    }
                />
                <button type="submit" disabled={!file || loading}>
                    {loading ? 'Extracting...' : 'Extract Data'}
                </button>
            </form>
            {error && <p style={{ color: 'red' }}>{error}</p>}
        </div>
    )
}

function FieldInput({
    field,
    value,
    onChange,
}: {
    field: SchemaField
    value: unknown
    onChange: (val: unknown) => void
}) {
    const label = field.name.replace(/_/g, ' ')
    const displayValue = value ?? ''

    switch (field.type) {
        case 'number':
            return (
                <div>
                    <label>{label}</label>
                    <input
                        type="number"
                        step="0.01"
                        value={displayValue as number}
                        onChange={(e) =>
                            onChange(parseFloat(e.target.value) || 0)
                        }
                        required={field.required}
                    />
                </div>
            )
        case 'date':
            return (
                <div>
                    <label>{label}</label>
                    <input
                        type="date"
                        value={displayValue as string}
                        onChange={(e) => onChange(e.target.value)}
                        required={field.required}
                    />
                </div>
            )
        case 'boolean':
            return (
                <div>
                    <label>
                        <input
                            type="checkbox"
                            checked={!!value}
                            onChange={(e) => onChange(e.target.checked)}
                        />
                        {label}
                    </label>
                </div>
            )
        case 'string[]':
            return (
                <div>
                    <label>{label}</label>
                    <textarea
                        value={
                            Array.isArray(value)
                                ? value.join('\n')
                                : (displayValue as string)
                        }
                        onChange={(e) =>
                            onChange(e.target.value.split('\n'))
                        }
                        rows={4}
                        placeholder="One item per line"
                    />
                </div>
            )
        default:
            return (
                <div>
                    <label>{label}</label>
                    <input
                        type="text"
                        value={displayValue as string}
                        onChange={(e) => onChange(e.target.value)}
                        required={field.required}
                    />
                </div>
            )
    }
}

The FieldInput component renders different input types based on the schema. Dates get date pickers, numbers get numeric inputs, arrays get textareas. This is driven entirely by the schema you defined earlier — add a new field to the schema and it shows up in the form automatically.

Handling extraction errors gracefully

Claude is good at extracting data from PDFs, but it's not perfect. Scanned documents with poor quality, handwritten forms, or unusual layouts can trip it up. Here's how to make the extraction more robust:

// src/lib/extract-pdf.ts — improved version with validation
export async function extractFromPDF<T>(
    pdfBase64: string,
    schema: SchemaField[],
    documentType: string
): Promise<{ data: T; confidence: Record<string, 'high' | 'medium' | 'low'> }> {
    const schemaDescription = schema
        .map((f) => `- "${f.name}" (${f.type}): ${f.description}`)
        .join('\n')

    const response = await client.messages.create({
        model: 'claude-sonnet-4-5-20250929',
        max_tokens: 4096,
        messages: [
            {
                role: 'user',
                content: [
                    {
                        type: 'document',
                        source: {
                            type: 'base64',
                            media_type: 'application/pdf',
                            data: pdfBase64,
                        },
                    },
                    {
                        type: 'text',
                        text: `Extract data from this ${documentType}. Return a JSON object with two keys:
1. "data": an object matching this schema:
${schemaDescription}

2. "confidence": an object with the same keys, each set to "high", "medium", or "low" based on how confident you are in the extracted value.

Use null for fields not found. Return ONLY the JSON.`,
                    },
                ],
            },
        ],
    })

    const text = response.content[0]
    if (text.type !== 'text') {
        throw new Error('Unexpected response type from Claude')
    }

    const cleaned = text.text.replace(/^```json\n?/, '').replace(/\n?```$/, '')
    return JSON.parse(cleaned) as {
        data: T
        confidence: Record<string, 'high' | 'medium' | 'low'>
    }
}

Now Claude returns a confidence level for each field. You can use this in the UI to highlight fields that might need manual review — show a yellow border for "medium" confidence and red for "low." This saves the user from checking every field when most of them are correct.

Validating the extracted data

Don't trust Claude's output blindly. Validate the parsed JSON against your schema before rendering the form:

// src/lib/validate.ts
import { SchemaField } from './schemas'

interface ValidationError {
    field: string
    message: string
}

export function validateExtraction(
    data: Record<string, unknown>,
    schema: SchemaField[]
): ValidationError[] {
    const errors: ValidationError[] = []

    for (const field of schema) {
        const value = data[field.name]

        if (field.required && (value === null || value === undefined)) {
            errors.push({
                field: field.name,
                message: `Required field "${field.name}" is missing`,
            })
            continue
        }

        if (value === null || value === undefined) continue

        switch (field.type) {
            case 'number':
                if (typeof value !== 'number' || isNaN(value)) {
                    errors.push({
                        field: field.name,
                        message: `"${field.name}" should be a number, got ${typeof value}`,
                    })
                }
                break
            case 'date':
                if (
                    typeof value !== 'string' ||
                    isNaN(Date.parse(value))
                ) {
                    errors.push({
                        field: field.name,
                        message: `"${field.name}" is not a valid date`,
                    })
                }
                break
            case 'string[]':
                if (!Array.isArray(value)) {
                    errors.push({
                        field: field.name,
                        message: `"${field.name}" should be an array`,
                    })
                }
                break
        }
    }

    return errors
}

Run this before setting the form data. If there are validation errors, you can either show them to the user or attempt a second extraction with a more specific prompt.

Cost and performance

A single invoice extraction with claude-sonnet-4-5-20250929 costs roughly $0.01-0.03 depending on the PDF size. That's cheap enough for most use cases, but if you're processing hundreds of documents, consider:

  • Batch processing: Collect PDFs and process them in bulk during off-peak hours using the rate limiting patterns from the previous post
  • Caching: If users upload the same document twice, hash the file and return the cached result
  • Model selection: Use claude-haiku-4-5-20251001 for simple documents (receipts, basic invoices) and save Sonnet for complex multi-page contracts

Processing time is typically 3-8 seconds per document. Show a progress indicator — users are surprisingly patient when they can see something happening.

What's next

This pattern works for any document type — contracts, applications, medical forms, shipping labels. Define a schema, point Claude at the PDF, and you get structured data back. The form component adapts automatically because it's driven by the schema.

If you're processing lots of documents and worried about costs, the next post covers the real cost of running an AI feature in production — with actual math and optimization strategies.

For the extraction API itself, make sure you've got proper error handling and retries in place. The rate limits and errors post has copy-pasteable utilities for exponential backoff and request queuing that work well with document processing pipelines.

Share:
VA

Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.

Related Posts