# 📊 Data Extraction Prompt

A data extraction prompt instructs an LLM to parse unstructured text and extract structured data (entities, relationships, attributes, and patterns), returning clean, machine-readable output like JSON, CSV, or structured tables.

## Why This Matters

Over 80% of enterprise data is unstructured: emails, contracts, reports, support tickets, social media posts. Manually extracting structured information from these sources is slow, expensive, and error-prone. A well-engineered extraction prompt turns an LLM into a highly flexible data parser that can handle messy, real-world text without custom code for each format.

## The Production Prompt

**Data Extraction: Full System Prompt**
You are an expert data extraction system. Your job is to parse unstructured text and return perfectly structured data.

**Core Rules:**
1. Extract ONLY information that is explicitly stated in the text; never infer, guess, or fabricate data
2. If a requested field is not present in the text, return null for that field; never leave it out
3. Maintain the exact values from the source; do not paraphrase names, numbers, or dates
4. Normalize dates to ISO 8601 format (YYYY-MM-DD) unless otherwise specified
5. Normalize currency to numeric values with the currency code (e.g., {"amount": 1500.00, "currency": "USD"})
6. For ambiguous values, include a "confidence" field: "high", "medium", or "low"

**Entity Extraction Schema:**
When extracting entities, return this structure:
```json
{
  "entities": [
    {
      "text": "exact text from source",
      "type": "PERSON | ORGANIZATION | LOCATION | DATE | MONEY | PRODUCT | EMAIL | PHONE",
      "normalized": "standardized form",
      "confidence": "high | medium | low"
    }
  ]
}
```

**Relationship Extraction:** When relationships between entities exist, capture them:

```json
{
  "relationships": [
    {
      "subject": "entity_1",
      "predicate": "relationship type",
      "object": "entity_2",
      "context": "supporting text from source"
    }
  ]
}
```

**Output Rules:**

  - Return valid JSON only: no markdown, no explanations, no commentary
  - Preserve all extracted fields even if empty (null values)
  - If the input contains multiple records, return a JSON array
  - Escape special characters properly in all string values

**Error Handling:**

  - If the text is too ambiguous to extract reliably, return: {"error": "ambiguous_input", "details": "description of the ambiguity"}
  - If the requested data type doesn't exist in the text, return the schema with all null values
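A system that consumes this prompt's output should not trust it blindly. A minimal validation sketch, assuming the model's reply arrives as a raw string (the `validate_entities` helper and its rules are illustrative, not part of any library):

```python
import json

# Allowed values taken from the entity schema above.
ALLOWED_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE",
                 "MONEY", "PRODUCT", "EMAIL", "PHONE"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
REQUIRED_KEYS = {"text", "type", "normalized", "confidence"}

def validate_entities(raw: str) -> list[str]:
    """Return a list of problems in the model output; an empty list means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    for i, ent in enumerate(data.get("entities", [])):
        missing = REQUIRED_KEYS - ent.keys()
        if missing:
            problems.append(f"entity {i}: missing keys {sorted(missing)}")
        if ent.get("type") not in ALLOWED_TYPES:
            problems.append(f"entity {i}: unknown type {ent.get('type')!r}")
        if ent.get("confidence") not in ALLOWED_CONFIDENCE:
            problems.append(f"entity {i}: bad confidence {ent.get('confidence')!r}")
    return problems
```

Rejecting bad output here, before it reaches storage, is what makes rule 2 ("null, never omitted") enforceable rather than aspirational.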

## Bad vs. Improved Prompts

### โŒ Bad Prompt

```text
Extract the important information from this email:

"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? โ€” Dave"

Why it fails: No schema definition, no output format, "important information" is subjective. The model will return prose instead of structured data.

### ✅ Improved Prompt

```text
You are a data extraction system. Parse the following email and return a JSON object with this exact schema:

{
  "sender": "string or null",
  "recipient": "string or null",
  "project_name": "string or null",
  "budget": {"current": number or null, "previous": number or null, "currency": "string"},
  "key_date": {"event": "string", "date": "YYYY-MM-DD or null"},
  "people_mentioned": [{"name": "string", "role_or_department": "string or null"}],
  "action_items": [{"description": "string", "due": "string or null"}]
}

Rules:
- Extract ONLY what is explicitly stated; do not infer
- Normalize dates to ISO 8601 format
- Return valid JSON only, no explanations

Email:
"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? - Dave"
```


## Tips for Customization

| Customization | How to Modify the Prompt |
| --- | --- |
| Invoice parsing | Define schema: invoice_number, vendor, line_items[], total, tax, due_date, payment_terms |
| Resume parsing | Define schema: name, email, phone, experience[], education[], skills[], certifications[] |
| Contract analysis | Define schema: parties[], effective_date, termination_clause, obligations[], payment_terms |
| Support tickets | Define schema: issue_category, severity, product, steps_to_reproduce, customer_sentiment |
| Batch processing | Add: "The input contains multiple records separated by '---'. Return a JSON array with one object per record." |
| Output format | Switch from JSON to CSV: "Return results as CSV with these column headers: ..." |
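The batch-processing customization assumes the caller splits the input on the separator before (or after) the model sees it. A minimal sketch of that split, with a hypothetical `split_records` helper:

```python
def split_records(text: str) -> list[str]:
    """Split a multi-record input on the '---' separator, dropping empties."""
    return [chunk.strip() for chunk in text.split("---") if chunk.strip()]

records = split_records("First record\n---\nSecond record\n---\n")
# records == ["First record", "Second record"]
```

Splitting client-side and sending one record per call is often more reliable than asking the model to handle very long multi-record inputs in a single pass.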

## Practice Challenge


Find a real email, job posting, or product review. Write an extraction prompt with a specific JSON schema that captures all the meaningful data. Run it and check:

  1. Did the model extract only stated facts (no hallucinated data)?
  2. Are null values present where data is missing (not omitted)?
  3. Is the output valid JSON (paste it into a JSON validator)?
  4. Try the same prompt on 3 different texts โ€” is the output consistent?
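Check 3 can be automated locally instead of pasting output into a web validator. A quick sketch:

```python
import json

def is_valid_json(output: str) -> bool:
    """True only for strict JSON; rejects single quotes, commentary, etc."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```

Note that a reply like `Here is the JSON: {"a": 1}` fails this check, which is why the prompt's "no explanations" rule matters for downstream parsing.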

## Real-World Scenario

Scenario: A legal tech company needs to extract key terms from thousands of contracts to build a searchable database.

Implementation approach:

  1. PDF-to-text conversion: extract raw text from contract PDFs using OCR if needed
  2. Chunking: split long contracts into sections (parties, terms, obligations, payment, termination) using heading detection
  3. Per-section extraction: run each section through a tailored extraction prompt with a section-specific schema
  4. Aggregation: combine per-section results into a single contract JSON document
  5. Validation pipeline:
    - Schema validation: ensure output matches the expected JSON schema
    - Cross-reference check: verify extracted party names appear in the original text
    - Confidence filtering: flag any "low" confidence extractions for human review
  6. Database insertion: load validated JSON into a structured database for querying
  7. Temperature setting: 0.0, because extraction must be deterministic and faithful to the source text

This system processes 500+ contracts per hour with 94% accuracy, requiring human review for only 6% of extractions.
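Step 5's cross-reference check can be sketched in a few lines. This assumes exact substring matching is sufficient; real contracts with OCR noise may need fuzzy matching (the `unfaithful_parties` helper is illustrative):

```python
def unfaithful_parties(extracted: dict, source_text: str) -> list[str]:
    """Return extracted party names that never appear verbatim in the source."""
    return [p for p in extracted.get("parties", []) if p not in source_text]
```

Any name this returns is a likely hallucination and gets routed to human review alongside the low-confidence extractions.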

## Interview Question


Q: How do you handle data extraction when the input text is messy, inconsistent, or contains conflicting information?

A: I address this at four levels:

  1. Prompt-level robustness: include explicit instructions for edge cases: "If dates appear in multiple formats (MM/DD/YYYY, Month DD YYYY, DD-MM-YYYY), normalize all to ISO 8601. If conflicting values exist for the same field, extract both and add a 'conflict' flag."
  2. Schema design: design the schema to accommodate ambiguity: include confidence scores, allow array values where you might expect a single value (e.g., if two different phone numbers appear for the same person), and always include null as a valid option.
  3. Pre-processing: clean the text before sending it to the LLM: strip excessive whitespace, fix OCR artifacts (common confusions: 'l' ↔ '1', 'O' ↔ '0'), and normalize Unicode characters. This reduces the noise the model has to handle.
  4. Post-processing validation: validate the JSON output against the schema, check that extracted values actually appear in the source text (faithfulness check), and run type validation on dates/numbers/emails.
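The pre-processing level can be sketched with the standard library. This keeps only the safe, context-free cleanups (Unicode normalization, whitespace collapsing); character swaps like 'l' ↔ '1' are context-dependent and are left to targeted rules rather than blanket replacement:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Normalize Unicode and collapse excess whitespace before extraction."""
    text = unicodedata.normalize("NFKC", text)  # folds ligatures, fullwidth chars
    return re.sub(r"[ \t]+", " ", text).strip()
```

For example, NFKC normalization turns the OCR-common ligature "ﬁ" into plain "fi", so string matching against the cleaned source behaves predictably.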

## Summary

  - Data extraction prompts must define an explicit output schema: the model should know the exact structure expected
  - Always handle missing data by requiring null values rather than omission
  - Instruct the model to extract only stated facts: the #1 risk is the model inferring data that isn't there
  - Use temperature 0.0 for extraction tasks: you want deterministic, faithful output
  - Include confidence scores for ambiguous extractions so downstream systems can flag uncertain data for human review