# Data Extraction Prompt
A data extraction prompt instructs an LLM to parse unstructured text and extract structured data (entities, relationships, attributes, and patterns), returning clean, machine-readable output such as JSON, CSV, or structured tables.
## Why This Matters
Over 80% of enterprise data is unstructured: emails, contracts, reports, support tickets, social media posts. Manually extracting structured information from these sources is slow, expensive, and error-prone. A well-engineered extraction prompt turns an LLM into a highly flexible data parser that can handle messy, real-world text without custom code for each format.
## The Production Prompt
You are an expert data extraction system. Your job is to parse unstructured text and return perfectly structured data.
**Core Rules:**
1. Extract ONLY information that is explicitly stated in the text; never infer, guess, or fabricate data
2. If a requested field is not present in the text, return null for that field; never leave it out
3. Maintain the exact values from the source; do not paraphrase names, numbers, or dates
4. Normalize dates to ISO 8601 format (YYYY-MM-DD) unless otherwise specified
5. Normalize currency to numeric values with the currency code (e.g., {"amount": 1500.00, "currency": "USD"})
6. For ambiguous values, include a "confidence" field: "high", "medium", or "low"
**Entity Extraction Schema:**
When extracting entities, return this structure:
```json
{
  "entities": [
    {
      "text": "exact text from source",
      "type": "PERSON | ORGANIZATION | LOCATION | DATE | MONEY | PRODUCT | EMAIL | PHONE",
      "normalized": "standardized form",
      "confidence": "high | medium | low"
    }
  ]
}
```

**Relationship Extraction:** When relationships between entities exist, capture them:

```json
{
  "relationships": [
    {
      "subject": "entity_1",
      "predicate": "relationship type",
      "object": "entity_2",
      "context": "supporting text from source"
    }
  ]
}
```
**Output Rules:**
- Return valid JSON only; no markdown, no explanations, no commentary
- Preserve all extracted fields even if empty (null values)
- If the input contains multiple records, return a JSON array
- Escape special characters properly in all string values
**Error Handling:**
- If the text is too ambiguous to extract reliably, return:
  `{"error": "ambiguous_input", "details": "description of the ambiguity"}`
- If the requested data type doesn't exist in the text, return the schema with all null values
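Downstream code should never trust the model's reply blindly. A minimal sketch of a validator for the entity schema above (the function and variable names are illustrative, and `sample` is a hand-written stand-in for a real model reply):

```python
import json

# Field names mirror the entity extraction schema in the prompt above.
REQUIRED_ENTITY_FIELDS = {"text", "type", "normalized", "confidence"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_entities(raw: str) -> list[dict]:
    """Parse model output and enforce the entity extraction schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    entities = data["entities"]
    for ent in entities:
        missing = REQUIRED_ENTITY_FIELDS - ent.keys()
        if missing:
            raise ValueError(f"entity missing fields: {missing}")
        if ent["confidence"] not in ALLOWED_CONFIDENCE:
            raise ValueError(f"unexpected confidence value: {ent['confidence']!r}")
    return entities

# Illustrative model reply, not real output:
sample = ('{"entities": [{"text": "Acme Corp", "type": "ORGANIZATION", '
          '"normalized": "Acme Corporation", "confidence": "high"}]}')
ents = validate_entities(sample)
```

Rejecting malformed replies early (rather than letting bad data flow into storage) is what makes rule 2 of the prompt ("return null, never omit") enforceable in practice.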
## Bad vs. Improved Prompts
### ❌ Bad Prompt
```text
Extract the important information from this email:

"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? – Dave"
```

**Why it fails:** No schema definition, no output format, and "important information" is subjective. The model will return prose instead of structured data.
### ✅ Improved Prompt
```text
You are a data extraction system. Parse the following email and return a JSON object with this exact schema:

{
  "sender": "string or null",
  "recipient": "string or null",
  "project_name": "string or null",
  "budget": {"current": number or null, "previous": number or null, "currency": "string"},
  "key_date": {"event": "string", "date": "YYYY-MM-DD or null"},
  "people_mentioned": [{"name": "string", "role_or_department": "string or null"}],
  "action_items": [{"description": "string", "due": "string or null"}]
}

Rules:
- Extract ONLY what is explicitly stated; do not infer
- Normalize dates to ISO 8601 format
- Return valid JSON only, no explanations

Email:
"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? – Dave"
```
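A quick way to verify the "null, never omit" rule against this schema is to check that every top-level key is present in the reply. A minimal sketch (the helper name is illustrative, and `reply` is a hypothetical model response; note that no year is stated for March 15th, so a faithful model would leave the date null):

```python
import json

# Top-level fields from the improved prompt's schema above.
EXPECTED_KEYS = {"sender", "recipient", "project_name", "budget",
                 "key_date", "people_mentioned", "action_items"}

def check_schema_keys(raw: str) -> dict:
    """Reject replies that omit fields instead of returning null."""
    record = json.loads(raw)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"fields omitted instead of null: {missing}")
    return record

# Hypothetical reply for the email above:
reply = json.dumps({
    "sender": "Dave", "recipient": "John", "project_name": "Project Atlas",
    "budget": {"current": 2300000, "previous": 2800000, "currency": "USD"},
    "key_date": {"event": "launch", "date": None},
    "people_mentioned": [{"name": "Sarah", "role_or_department": "marketing"}],
    "action_items": [{"description": "sync tomorrow at 3pm", "due": None}],
})
record = check_schema_keys(reply)
```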
## Tips for Customization
| Customization | How to Modify the Prompt |
|---|---|
| Invoice parsing | Define schema: invoice_number, vendor, line_items[], total, tax, due_date, payment_terms |
| Resume parsing | Define schema: name, email, phone, experience[], education[], skills[], certifications[] |
| Contract analysis | Define schema: parties[], effective_date, termination_clause, obligations[], payment_terms |
| Support tickets | Define schema: issue_category, severity, product, steps_to_reproduce, customer_sentiment |
| Batch processing | Add: "The input contains multiple records separated by '---'. Return a JSON array with one object per record." |
| Output format | Switch from JSON to CSV: "Return results as CSV with these column headers: ..." |
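The batch-processing row above implies a small pre-processing step: split the input on the separator and run the extraction prompt once per chunk. A minimal sketch (the helper name is illustrative):

```python
def split_records(batch_text: str) -> list[str]:
    """Split batch input on the '---' separator, dropping empty chunks."""
    return [chunk.strip() for chunk in batch_text.split("---") if chunk.strip()]

batch = "Invoice #1 from Acme\n---\nInvoice #2 from Globex\n---\n"
chunks = split_records(batch)  # each chunk then gets its own extraction call
```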
## Practice Challenge
Find a real email, job posting, or product review. Write an extraction prompt with a specific JSON schema that captures all the meaningful data. Run it and check:
- Did the model extract only stated facts (no hallucinated data)?
- Are null values present where data is missing (not omitted)?
- Is the output valid JSON (paste it into a JSON validator)?
- Try the same prompt on 3 different texts: is the output consistent?
## Real-World Scenario
Scenario: A legal tech company needs to extract key terms from thousands of contracts to build a searchable database.
Implementation approach:
- PDF-to-text conversion: extract raw text from contract PDFs using OCR if needed
- Chunking: split long contracts into sections (parties, terms, obligations, payment, termination) using heading detection
- Per-section extraction: run each section through a tailored extraction prompt with a section-specific schema
- Aggregation: combine per-section results into a single contract JSON document
- Validation pipeline:
- Schema validation: ensure output matches the expected JSON schema
- Cross-reference check: verify extracted party names appear in the original text
- Confidence filtering: flag any "low" confidence extractions for human review
- Database insertion: load validated JSON into a structured database for querying
- Temperature setting: 0.0, because extraction must be deterministic and faithful to the source text
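The per-section extraction and aggregation steps above can be sketched as follows. This is an outline under stated assumptions, not the company's implementation: `SECTION_PROMPTS`, `run_llm`, and `extract_contract` are hypothetical names, and `run_llm` is a stub standing in for a real LLM API call at temperature 0.0:

```python
import json

# Hypothetical section-specific prompts; in practice each would be a full
# extraction prompt with its own schema, as described above.
SECTION_PROMPTS = {
    "parties": "Extract parties[] ...",
    "payment": "Extract payment_terms ...",
    "termination": "Extract termination_clause ...",
}

def run_llm(prompt: str, text: str) -> str:
    """Stub for the model call; a real system would call an LLM API
    at temperature 0.0 and return its JSON reply."""
    return json.dumps({"source_chars": len(text)})

def extract_contract(sections: dict[str, str]) -> dict:
    """Per-section extraction, then aggregation into one contract document."""
    contract = {}
    for name, text in sections.items():
        prompt = SECTION_PROMPTS.get(name, "generic extraction prompt")
        contract[name] = json.loads(run_llm(prompt, text))
    return contract

result = extract_contract({"parties": "Acme Corp and Globex Inc agree ..."})
```

Keeping one tailored prompt per section keeps each schema small, which tends to reduce omitted fields compared with one giant whole-contract schema.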
This system processes 500+ contracts per hour with 94% accuracy, requiring human review for only 6% of extractions.
## Interview Question
Q: How do you handle data extraction when the input text is messy, inconsistent, or contains conflicting information?
A: I address this at four levels:
- **Prompt-level robustness:** include explicit instructions for edge cases: "If dates appear in multiple formats (MM/DD/YYYY, Month DD YYYY, DD-MM-YYYY), normalize all to ISO 8601. If conflicting values exist for the same field, extract both and add a 'conflict' flag."
- **Schema design:** design the schema to accommodate ambiguity: include confidence scores, allow array values where you might expect a single value (e.g., if two different phone numbers appear for the same person), and always include null as a valid option.
- **Pre-processing:** clean the text before sending it to the LLM: strip excessive whitespace, fix OCR artifacts (common: 'l' misread as '1', 'O' misread as '0'), and normalize Unicode characters. This reduces the noise the model has to handle.
- **Post-processing validation:** validate the JSON output against the schema, check that extracted values actually appear in the source text (faithfulness check), and run type validation on dates/numbers/emails.
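The pre-processing and faithfulness-check steps above can be sketched with the standard library alone. The function names are illustrative; context-aware OCR fixes like 'l' vs '1' need dedicated rules and are deliberately not attempted here:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Pre-processing: normalize Unicode (e.g. non-breaking spaces become
    regular spaces under NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip()

def appears_in_source(value: str, source: str) -> bool:
    """Faithfulness check: an extracted value should appear literally
    in the cleaned source text."""
    return value.lower() in clean_text(source).lower()

source = "Revised   budget for Project\u00a0Atlas is $2.3M."
```

A value that fails `appears_in_source` is a strong hallucination signal and should be routed to human review rather than silently stored.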
## Summary
- Data extraction prompts must define an explicit output schema: the model should know the exact structure expected
- Always handle missing data by requiring null values rather than omission
- Instruct the model to extract only stated facts; the #1 risk is the model inferring data that isn't there
- Use `temperature: 0.0` for extraction tasks; you want deterministic, faithful output
- Include confidence scores for ambiguous extractions so downstream systems can flag uncertain data for human review