# Data Extraction Prompt
A data extraction prompt instructs an LLM to parse unstructured text and extract structured data (entities, relationships, attributes, and patterns), returning clean, machine-readable output such as JSON, CSV, or structured tables.
## Why This Matters
Over 80% of enterprise data is unstructured: emails, contracts, reports, support tickets, social media posts. Manually extracting structured information from these sources is slow, expensive, and error-prone. A well-engineered extraction prompt turns an LLM into a highly flexible data parser that can handle messy, real-world text without custom code for each format.
## The Production Prompt
You are an expert data extraction system. Your job is to parse unstructured text and return perfectly structured data.
**Core Rules:**
1. Extract ONLY information that is explicitly stated in the text; never infer, guess, or fabricate data
2. If a requested field is not present in the text, return null for that field; never leave it out
3. Maintain the exact values from the source; do not paraphrase names, numbers, or dates
4. Normalize dates to ISO 8601 format (YYYY-MM-DD) unless otherwise specified
5. Normalize currency to numeric values with the currency code (e.g., {"amount": 1500.00, "currency": "USD"})
6. For ambiguous values, include a "confidence" field: "high", "medium", or "low"
**Entity Extraction Schema:**
When extracting entities, return this structure:
```json
{
  "entities": [
    {
      "text": "exact text from source",
      "type": "PERSON | ORGANIZATION | LOCATION | DATE | MONEY | PRODUCT | EMAIL | PHONE",
      "normalized": "standardized form",
      "confidence": "high | medium | low"
    }
  ]
}
```

**Relationship Extraction:** When relationships between entities exist, capture them:

```json
{
  "relationships": [
    {
      "subject": "entity_1",
      "predicate": "relationship type",
      "object": "entity_2",
      "context": "supporting text from source"
    }
  ]
}
```
**Output Rules:**
- Return valid JSON only; no markdown, no explanations, no commentary
- Preserve all extracted fields even if empty (null values)
- If the input contains multiple records, return a JSON array
- Escape special characters properly in all string values
**Error Handling:**
- If the text is too ambiguous to extract reliably, return:
  `{"error": "ambiguous_input", "details": "description of the ambiguity"}`
- If the requested data type doesn't exist in the text, return the schema with all null values
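Downstream code should never trust the model's reply blindly. A minimal sketch of a validator for the entity schema above (the function and variable names are illustrative, and `sample` is a hand-written stand-in for a real model reply):

```python
import json

# Field names mirror the entity extraction schema in the prompt above.
REQUIRED_ENTITY_FIELDS = {"text", "type", "normalized", "confidence"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_entities(raw: str) -> list[dict]:
    """Parse model output and enforce the entity extraction schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    entities = data["entities"]
    for ent in entities:
        missing = REQUIRED_ENTITY_FIELDS - ent.keys()
        if missing:
            raise ValueError(f"entity missing fields: {missing}")
        if ent["confidence"] not in ALLOWED_CONFIDENCE:
            raise ValueError(f"unexpected confidence value: {ent['confidence']!r}")
    return entities

# Illustrative model reply, not real output:
sample = ('{"entities": [{"text": "Acme Corp", "type": "ORGANIZATION", '
          '"normalized": "Acme Corporation", "confidence": "high"}]}')
ents = validate_entities(sample)
```

Rejecting malformed replies early (rather than letting bad data flow into storage) is what makes rule 2 of the prompt ("return null, never omit") enforceable in practice.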
## Bad vs. Improved Prompts
### ❌ Bad Prompt
```text
Extract the important information from this email:

"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? – Dave"
```

**Why it fails:** No schema definition, no output format, and "important information" is subjective. The model will return prose instead of structured data.
### ✅ Improved Prompt
```text
You are a data extraction system. Parse the following email and return a JSON object with this exact schema:

{
  "sender": "string or null",
  "recipient": "string or null",
  "project_name": "string or null",
  "budget": {"current": number or null, "previous": number or null, "currency": "string"},
  "key_date": {"event": "string", "date": "YYYY-MM-DD or null"},
  "people_mentioned": [{"name": "string", "role_or_department": "string or null"}],
  "action_items": [{"description": "string", "due": "string or null"}]
}

Rules:
- Extract ONLY what is explicitly stated; do not infer
- Normalize dates to ISO 8601 format
- Return valid JSON only, no explanations

Email:
"Hi John, I wanted to follow up on our meeting last Tuesday. The revised budget for Project Atlas is $2.3M, down from the original $2.8M estimate. Sarah from marketing confirmed the launch date is March 15th. Can we sync tomorrow at 3pm? – Dave"
```
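A quick way to verify the "null, never omit" rule against this schema is to check that every top-level key is present in the reply. A minimal sketch (the helper name is illustrative, and `reply` is a hypothetical model response; note that no year is stated for March 15th, so a faithful model would leave the date null):

```python
import json

# Top-level fields from the improved prompt's schema above.
EXPECTED_KEYS = {"sender", "recipient", "project_name", "budget",
                 "key_date", "people_mentioned", "action_items"}

def check_schema_keys(raw: str) -> dict:
    """Reject replies that omit fields instead of returning null."""
    record = json.loads(raw)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"fields omitted instead of null: {missing}")
    return record

# Hypothetical reply for the email above:
reply = json.dumps({
    "sender": "Dave", "recipient": "John", "project_name": "Project Atlas",
    "budget": {"current": 2300000, "previous": 2800000, "currency": "USD"},
    "key_date": {"event": "launch", "date": None},
    "people_mentioned": [{"name": "Sarah", "role_or_department": "marketing"}],
    "action_items": [{"description": "sync tomorrow at 3pm", "due": None}],
})
record = check_schema_keys(reply)
```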
## Tips for Customization
| Customization | How to Modify the Prompt |
|---|---|
| Invoice parsing | Define schema: invoice_number, vendor, line_items[], total, tax, due_date, payment_terms |
| Resume parsing | Define schema: name, email, phone, experience[], education[], skills[], certifications[] |
| Contract analysis | Define schema: parties[], effective_date, termination_clause, obligations[], payment_terms |
| Support tickets | Define schema: issue_category, severity, product, steps_to_reproduce, customer_sentiment |
| Batch processing | Add: "The input contains multiple records separated by '---'. Return a JSON array with one object per record." |
| Output format | Switch from JSON to CSV: "Return results as CSV with these column headers: ..." |
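The batch-processing row above implies a small pre-processing step: split the input on the separator and run the extraction prompt once per chunk. A minimal sketch (the helper name is illustrative):

```python
def split_records(batch_text: str) -> list[str]:
    """Split batch input on the '---' separator, dropping empty chunks."""
    return [chunk.strip() for chunk in batch_text.split("---") if chunk.strip()]

batch = "Invoice #1 from Acme\n---\nInvoice #2 from Globex\n---\n"
chunks = split_records(batch)  # each chunk then gets its own extraction call
```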
## Practice Challenge
Find a real email, job posting, or product review. Write an extraction prompt with a specific JSON schema that captures all the meaningful data. Run it and check:
- Did the model extract only stated facts (no hallucinated data)?
- Are null values present where data is missing (not omitted)?
- Is the output valid JSON (paste it into a JSON validator)?
- Try the same prompt on 3 different texts: is the output consistent?
## Real-World Scenario
Scenario: A legal tech company needs to extract key terms from thousands of contracts to build a searchable database.
Implementation approach:
- PDF-to-text conversion: extract raw text from contract PDFs using OCR if needed
- Chunking: split long contracts into sections (parties, terms, obligations, payment, termination) using heading detection
- Per-section extraction: run each section through a tailored extraction prompt with a section-specific schema
- Aggregation: combine per-section results into a single contract JSON document
- Validation pipeline:
- Schema validation: ensure output matches the expected JSON schema
- Cross-reference check: verify extracted party names appear in the original text
- Confidence filtering: flag any "low" confidence extractions for human review
- Database insertion: load validated JSON into a structured database for querying
- Temperature setting: 0.0, because extraction must be deterministic and faithful to the source text
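The per-section extraction and aggregation steps above can be sketched as follows. This is an outline under stated assumptions, not the company's implementation: `SECTION_PROMPTS`, `run_llm`, and `extract_contract` are hypothetical names, and `run_llm` is a stub standing in for a real LLM API call at temperature 0.0:

```python
import json

# Hypothetical section-specific prompts; in practice each would be a full
# extraction prompt with its own schema, as described above.
SECTION_PROMPTS = {
    "parties": "Extract parties[] ...",
    "payment": "Extract payment_terms ...",
    "termination": "Extract termination_clause ...",
}

def run_llm(prompt: str, text: str) -> str:
    """Stub for the model call; a real system would call an LLM API
    at temperature 0.0 and return its JSON reply."""
    return json.dumps({"source_chars": len(text)})

def extract_contract(sections: dict[str, str]) -> dict:
    """Per-section extraction, then aggregation into one contract document."""
    contract = {}
    for name, text in sections.items():
        prompt = SECTION_PROMPTS.get(name, "generic extraction prompt")
        contract[name] = json.loads(run_llm(prompt, text))
    return contract

result = extract_contract({"parties": "Acme Corp and Globex Inc agree ..."})
```

Keeping one tailored prompt per section keeps each schema small, which tends to reduce omitted fields compared with one giant whole-contract schema.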
This system processes 500+ contracts per hour with 94% accuracy, requiring human review for only 6% of extractions.
## Interview Question
Q: How do you handle data extraction when the input text is messy, inconsistent, or contains conflicting information?
A: I address this at four levels:
- **Prompt-level robustness:** include explicit instructions for edge cases: "If dates appear in multiple formats (MM/DD/YYYY, Month DD YYYY, DD-MM-YYYY), normalize all to ISO 8601. If conflicting values exist for the same field, extract both and add a 'conflict' flag."
- **Schema design:** design the schema to accommodate ambiguity: include confidence scores, allow array values where you might expect a single value (e.g., if two different phone numbers appear for the same person), and always include null as a valid option.
- **Pre-processing:** clean the text before sending it to the LLM: strip excessive whitespace, fix OCR artifacts (common: 'l' misread as '1', 'O' misread as '0'), and normalize Unicode characters. This reduces the noise the model has to handle.
- **Post-processing validation:** validate the JSON output against the schema, check that extracted values actually appear in the source text (faithfulness check), and run type validation on dates/numbers/emails.
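The pre-processing and faithfulness-check steps above can be sketched with the standard library alone. The function names are illustrative; context-aware OCR fixes like 'l' vs '1' need dedicated rules and are deliberately not attempted here:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Pre-processing: normalize Unicode (e.g. non-breaking spaces become
    regular spaces under NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip()

def appears_in_source(value: str, source: str) -> bool:
    """Faithfulness check: an extracted value should appear literally
    in the cleaned source text."""
    return value.lower() in clean_text(source).lower()

source = "Revised   budget for Project\u00a0Atlas is $2.3M."
```

A value that fails `appears_in_source` is a strong hallucination signal and should be routed to human review rather than silently stored.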
## Summary
- Data extraction prompts must define an explicit output schema: the model should know the exact structure expected
- Always handle missing data by requiring null values rather than omission
- Instruct the model to extract only stated facts; the #1 risk is the model inferring data that isn't there
- Use `temperature: 0.0` for extraction tasks; you want deterministic, faithful output
- Include confidence scores for ambiguous extractions so downstream systems can flag uncertain data for human review