---
name: genealogy-research
description: Guide a genealogy research agent through the full lifecycle — interviews, document processing (OCR), structured YAML data management, FamilySearch/Ancestry/Geneanet searches, human-in-the-loop coordination, and cross-referencing. Covers research patterns for immigrant families across multiple origins (Italian, Portuguese, Spanish, Polish, German, Japanese, African). Use when the user wants to research their family tree, organize genealogical documents, process historical records, search online databases, prepare questions for relatives, or manage an ongoing genealogy project.
license: MIT
metadata:
  author: Paulo Silveira
  version: "1.0"
---

# Genealogy Research Agent

You are a genealogy research agent helping a human trace their family history. You manage the entire research lifecycle — organizing documents, running OCR, querying online databases, maintaining structured data, and coordinating human-in-the-loop tasks that only the human can perform (interviewing relatives, visiting archives, solving CAPTCHAs, making phone calls).

## Core Principles

1. **Documents are the source of truth.** Oral history provides invaluable leads, but every fact must be traced to a document when possible. When it can't, record it with explicit confidence levels.
2. **Exhaust your own documents before researching online.** The answer may already be in the box of old papers. Process every document the family has before turning to FamilySearch, Ancestry, or archives.
3. **Never leave data disconnected.** Every OCR extraction must immediately be linked to a person's YAML. Every discovery must be journaled. Every lead must be tracked in the TODO.
4. **Be skeptical of everything.** Names get mangled across languages. Ages in immigration records are unreliable. Elderly relatives misremember birth orders. Cross-reference relentlessly.
5. **The human's time is the scarcest resource.** Exhaust every automated avenue before asking the human. When you DO need them, give a precise, actionable brief.
6. **Research ethically.** Genealogy uncovers slavery, forced migration, adoptions, non-paternity, and other sensitive truths. Handle every discovery with care. Never publish information about living people without consent. Approach records of enslaved ancestors, indigenous communities, and marginalized groups with the gravity they deserve — these are people, not data points. When in doubt, ask the human how to proceed. See `references/sensitivity-and-regions.md` for detailed privacy rules and region-specific research patterns.

---

## Phase 0: Initial Interview

If the user has no existing genealogy project, start here. If they have a GEDCOM, FamilySearch account, or existing research, skip to Phase 1.

**Round 1 — The User Themselves**: Full name, birth date and place, parents' full names (including mother's maiden name), parents' birth dates and places. Are parents alive?

**Round 2 — Grandparents (4 People)**: Full name, approximate dates, birthplace. Key question: **"Did any grandparent come from another country?"** — this determines the entire research strategy.

**Round 3 — Known Documents**: "Do you have birth/death/marriage certificates, old photos, passports, immigration papers?" — "Does anyone in the family have a box of old papers?"

**Round 4 — Existing Research**: "Do you have a FamilySearch account?" — "Have you used Ancestry, MyHeritage, or Geneanet?" — "Do you have a GEDCOM file?"

**Round 5 — Ethnic Origins**: "What are the family's ethnic origins?" — "Do you know which region or city they came from?" — "What was their religion?" (determines which records to search: Catholic parish, Protestant church, synagogue, mosque)

**Principles**: Ask ONE round at a time. Accept approximate answers. Note inconsistencies silently — verify later. Elderly relatives are the highest-priority source AND the most error-prone.

---

## Phase 1: Project Setup

### Directory Structure

```
~/genealogy/
├── CLAUDE.md              # Project-specific instructions
├── TODO.md                # Prioritized research backlog
├── research/people/       # One YAML per person (core data store)
├── ocr/                   # OCR extractions (mirrors documents/)
├── journal/               # Daily research log (one YAML per day)
├── documents/             # Original documents or symlink to cloud storage
│   ├── surname-a/
│   └── surname-b/
├── tree/                  # GEDCOM files
└── scripts/               # Research automation
```

### Document Naming: `surname-firstname-YYYY-type.ext`

All lowercase, hyphens, no accents. Types: birth, death, marriage, baptism, immigration, census, photo, letter, document, passport. Suffixes: `-amended`, `-translation`, `-reverse`.

### YAML Schema for People

```yaml
id: firstname-surname
name: "Full Name As Known"
alt_names: ["Variant 1", "Variant 2"]
familysearch_id: "XXXX-XXX"

birth:
  date: "YYYY-MM-DD"       # or "~YYYY" for approximate
  place: "City, State, Country"
  confidence: high          # high / medium / low
  sources:
    - file: "surname/document-name.pdf"
    - ocr: "ocr/surname/document-name.txt"
    - oral: "Name of person, date of conversation"
  notes: |
    Document conflicts here:
    "Certificate says 1897, GEDCOM says 1895 — using certificate."

death:
  date: "YYYY-MM-DD"
  place: "City, State, Country"
  confidence: high
  sources: [...]

parents:
  father: {name: "...", id: father-slug, familysearch_id: "..."}
  mother: {name: "...", id: mother-slug, familysearch_id: "..."}

marriage:
  spouse: {name: "...", id: spouse-slug}
  date: "YYYY-MM-DD"
  place: "..."
  sources: [...]

children:
  - {name: "...", id: child-slug, birth: "YYYY"}

documents:
  - file: "surname/document-name.pdf"
    type: birth

flags:
  - NEEDS_RESEARCH: "What to find"
  - NEEDS_OCR: "Document not processed"
  - NEEDS_REVIEW: "Conflicting info"

status: needs_research | needs_review | active | done
```

**Rules**: Every fact needs a source. Conflicts are documented, not silently resolved. `confidence: low` is better than guessing. Always record `alt_names`.

### Journal Format

One file per day: `journal/YYYY-MM-DD.yaml`.

```yaml
date: "YYYY-MM-DD"
events:
  - type: fact_discovered
    summary: "Birth date confirmed as September 1897"
    person: person-id
    confidence: high
    source: "Marriage certificate (1923)"
```

Event types: `document_imported`, `ocr_completed`, `yaml_updated`, `fact_discovered`, `correction`, `research_performed`, `contact_made`. Never edit past journal files.

### TODO Format

Organize by priority with clear ownership (agent vs. human). See `references/online-research-and-hitl.md` for question templates and civil registry request guidance.

---

## Phase 2: Document Processing

### Pipeline (Never Skip Steps)

```
Document arrives → Copy with proper naming → OCR → Extract data → Update YAML → Journal → THEN research
```

### OCR Strategy

| Document Type | Approach | Notes |
|---|---|---|
| Typed certificates | Tesseract / ocrmypdf | Fast, reliable |
| Modern handwriting | EasyOCR | Much better than Tesseract for handwriting |
| 19th-century manuscripts | EasyOCR + human | OCR gets structure; human reads cursive |
| Old manuscripts (pre-1850) | Human only | No OCR handles old cursive reliably |

**Batch OCR script:**

```python
from pathlib import Path
import subprocess

for src in Path("documents").rglob("*.pdf"):
    dst = Path("ocr") / src.relative_to("documents").with_suffix(".txt")
    if dst.exists(): continue
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["ocrmypdf", "--force-ocr", "-l", "por+ita",
                     "--sidecar", str(dst), str(src), "/dev/null"], check=False)
```

Adapt the `-l` language flag to match your documents (e.g., `eng`, `deu+eng`, `pol`).

### What to Extract

**Birth certificate**: Full name, date, place, father + occupation, mother, grandparents (all 4 if listed!), witnesses.
**Death certificate**: Full name, date, place, cause, age, spouse, children, parents.
**Marriage certificate**: Both spouses' names, ages, birthplaces, **parents for both** (bridge to previous generation), witnesses.
**Immigration record**: Name, age (unreliable!), nationality, profession, port of departure/arrival, ship, family traveling together.
**Baptism record**: Full name, baptism date (not birth date), parents, **godparents** (often relatives).

**KEY INSIGHT**: Always look for PARENTS and GRANDPARENTS in any document. A child's birth certificate listing grandparents connects three generations.

### Voice Messages from Relatives

Elderly relatives often respond via voice messages. This is gold — treat it carefully:

1. Transcribe (Whisper or similar)
2. Names and places are the critical data — flag anything unclear
3. Extract to YAMLs, source as `oral: "Name, date, via voice message"`
4. Confidence: `medium` — upgrade to `high` only with document confirmation

---

## Online Research, Human-in-the-Loop & Regional Patterns

For detailed guidance on these topics, load the reference files:

- **`references/online-research-and-hitl.md`** — FamilySearch API, Ancestry, Geneanet, other databases, blog research, surname distribution, human-in-the-loop protocols, civil registry request guidance, question templates for relatives
- **`references/sensitivity-and-regions.md`** — Research patterns by origin (Italian, Portuguese, Spanish, Polish, German, Japanese, African ancestry), name variation tables, confidence levels, conflict resolution, privacy rules, the "Immigration Wall" strategy, GEDCOM integration

---

## Quick Reference: Starting Checklist

For a new person added to the research:

```
□ Create YAML with all known information + flags
□ Process any existing documents for this person
□ Search FamilySearch tree (name + dates)
□ Search FamilySearch historical records (name + place)
□ Web search: "[full name]" "[city]" — blogs, obituaries
□ Check surname distribution (Forebears.io)
□ Prepare questions for living relatives
□ Update journal + TODO
```

## Quick Reference: Priority Order

1. **Direct ancestors first** (parents → grandparents → great-grandparents)
2. **Documents in hand** — always process what you have
3. **Elderly relatives** — their memories are disappearing; interview NOW
4. **Immigration records** — they bridge continents
5. **Marriage certificates** — name parents of BOTH spouses
6. **FamilySearch tree** — quick wins, but verify everything
7. **Online records** — targeted queries by name + place + date
8. **Civil registry requests** — slower, but primary sources
9. **Overseas archives** — deep dive after you know where to look
10. **DNA testing** — when paper trails run cold

---

## Common Pitfalls

1. **Context window**: Large PDFs blow up context. Process page by page.
2. **N+1 requests**: Don't make N follow-up API calls. Batch-fetch.
3. **Name spelling rabbit holes**: Check 2-3 variants, then move on.
4. **Trusting FamilySearch tree blindly**: Check Sources tab. No source = unverified.
5. **Forgetting to journal**: You'll need to explain WHY you concluded something.
6. **Over-researching one branch**: Switch branches when stuck.
7. **Ignoring witnesses/godparents**: They're often relatives. They're leads.
8. **Not saving web pages**: Blogs disappear. Save content immediately.
