diff --git a/.gitignore b/.gitignore index a54a883..031c8f9 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ src/ package.json package-lock.json tsconfig.json +harness-events.jsonl diff --git a/README.md b/README.md index 4535457..f7df802 100644 --- a/README.md +++ b/README.md @@ -1,241 +1,200 @@ -# Build Multimodal Search with Claude Code +# Multimodal RAG with Gemini Embedding 2 and Claude Code -Search across your PDFs, images, and documents using plain English. -No coding required. Claude Code builds everything for you. +Search across PDFs, images, and documents using plain English. +No coding required. Claude Code builds everything from prompts. -## What you will build +![Search for "What is the largest planet?" returns both the Jupiter photograph and the PDF fact sheet](docs/demo-screenshot.png) -A local search app that lets you ask questions like: +> **Gemini Embedding 2** converts text, images, and video into the same +> searchable space. **Claude Code** builds the app. **Pinecone** stores the +> vectors. You just copy four prompts. -- "What is the largest planet in our solar system?" -- "Show me photos from the first Moon landing" -- "Which moon has active volcanoes?" +## Table of Contents -The app searches through your PDFs and images simultaneously and -gives you answers with sources. You talk to it in plain English. +- [Quick Start](#quick-start) +- [What This Does](#what-this-does) +- [Prerequisites](#prerequisites) +- [Step-by-Step Guide](#step-by-step-guide) +- [Example Data](#example-data) +- [Why Image Descriptions Matter](#why-image-descriptions-matter) +- [Costs](#costs) +- [Troubleshooting](#troubleshooting) +- [How It Works](#how-it-works) +- [License](#license) -## How is this different from a Google search? +## Quick Start -Google searches the internet. This searches YOUR files. - -Imagine you have 500 PDFs, research papers, photos, and notes -scattered across folders. Normal file search only matches exact -words. 
This system understands meaning. You ask "what do we know -about storms on other planets?" and it finds the Jupiter fact sheet -mentioning wind speeds, the Jupiter photograph showing cloud bands, -and the solar system overview describing atmospheric composition. - -It connects information across files and formats. That is what -makes it powerful. - -## What you need - -1. **Claude Code** (comes with Claude Pro at $20/month or Claude Max) -2. **A Google AI Studio account** (free) for Gemini embeddings -3. **A Pinecone account** (free tier) for the vector database -4. **30-45 minutes** for your first time - -No programming knowledge required. You will copy prompts into -Claude Code, and it will build everything. - -## How it works (the simple version) - -``` -Your files ──> Embeddings (Gemini) ──> Vector database (Pinecone) - │ -Your question ──> Embedding (Gemini) ──> Search ──> Claude answers +```bash +git clone https://git.thedharmalab.com/ktg/multimodal-rag-guide.git +cd multimodal-rag-guide +claude ``` -1. Your files get converted into "embeddings" (numerical fingerprints - that capture meaning) -2. When you ask a question, it gets the same treatment -3. The system finds fingerprints that match -4. Claude reads the matching content and answers your question +Then paste the prompt from [`prompts/01-setup.md`](prompts/01-setup.md) into Claude Code. -For a deeper explanation, see [concepts.md](concepts.md). +Four prompts, 30 minutes, working multimodal search. -## Step 0: Get your accounts (10 minutes) +## What This Does -### Google AI Studio (for embeddings) +One search box that understands PDFs, images, and text at the same time. -Embeddings convert your content into searchable vectors. We use -Google's Gemini Embedding 2 for this because it handles text, -images, and video. +Ask "What is the largest planet in our solar system?" 
and the system +returns the Jupiter fact sheet from a PDF, the Voyager photograph of +the Great Red Spot from a JPG, and a confidence score for each result. +One question, multiple formats, ranked by meaning. + +This is called Retrieval-Augmented Generation (RAG). Google's +Gemini Embedding 2 handles the multimodal part: it converts different +content types into the same numerical format so they become searchable +together. Claude Code handles the building part: it reads your prompts +and writes all the code. You handle neither. + +## Prerequisites + +| Requirement | Cost | What it does | +|---|---|---| +| [Claude Code](https://claude.ai) | Part of Claude Pro ($20/mo) or Max | Builds the app and answers questions | +| [Google AI Studio](https://aistudio.google.com/) | Free tier | Gemini Embedding 2 API key | +| [Pinecone](https://www.pinecone.io/) | Free tier | Vector database for storing embeddings | + +No programming knowledge required. + +## Step-by-Step Guide + +### Step 0: Get your API keys (10 minutes) + +**Google AI Studio** (for Gemini Embedding 2): 1. Go to [aistudio.google.com](https://aistudio.google.com/) 2. Sign in with a Google account 3. Click "Get API key" in the left sidebar -4. Click "Create API key" -5. Copy the key somewhere safe +4. Click "Create API key" and copy it -**What is an API key?** It is like a password that lets your app -talk to Google's embedding service. You will paste it into a -configuration file later. It never leaves your computer. - -### Pinecone (for storing embeddings) - -A vector database stores embeddings so you can search through -them. Think of it as a smart filing cabinet. +**Pinecone** (for the vector database): 1. Go to [pinecone.io](https://www.pinecone.io/) and create a free account -2. Once in the dashboard, click "Create Index" -3. Name it `space-search` (or whatever you like) -4. Set dimensions to `3072` (this matches Gemini Embedding 2) -5. Choose the `cosine` metric -6. Select the free "Starter" plan -7. 
Copy your API key from the "API Keys" section +2. In the dashboard, click "Create Index" +3. Name it `space-search`, set dimensions to `3072`, choose `cosine` metric +4. Select the free "Starter" plan +5. Copy your API key from "API Keys" -### Verify you have Claude Code +### Step 1: Clone and start Claude Code (5 minutes) -Open your terminal and type `claude`. If Claude Code starts, -you are ready. If not, install it: - -``` -npm install -g @anthropic-ai/claude-code -``` - -You need a Claude Pro or Max subscription for this to work. - -## Step 1: Get the example files - -Clone or download this repository. The `example-data/` folder -contains everything you need to get started: - -**PDFs:** -- `solar-system-overview.pdf` - Overview of our solar system (NASA) -- `jupiter-fact-sheet.pdf` - Detailed data about Jupiter (NASA) -- `solar-system-moons.pdf` - Guide to planetary moons (NASA) - -**Images:** -- `earthrise.jpg` - Earth seen from lunar orbit, Apollo 8 (1968) -- `aldrin-moon.jpg` - Buzz Aldrin on the Moon, Apollo 11 (1969) -- `jupiter-great-red-spot.jpg` - Jupiter photographed by Voyager 1 (1979) -- `iss-over-earth.jpg` - The Moon seen from the ISS - -**Descriptions:** -- `descriptions.md` - Detailed text descriptions of each image. - This is the most important file for image search quality. - See the section below on why descriptions matter. - -All files are NASA public domain. No copyright restrictions. - -## Step 2: Start Claude Code (5 minutes) - -Open your terminal, navigate to this folder, and start Claude Code: - -``` +```bash +git clone https://git.thedharmalab.com/ktg/multimodal-rag-guide.git +cd multimodal-rag-guide claude ``` -Then copy the prompt from [prompts/01-setup.md](prompts/01-setup.md) -and paste it into Claude Code. +Paste the prompt from [`prompts/01-setup.md`](prompts/01-setup.md). +Claude Code creates the project structure and installs dependencies. -Claude Code will create the project structure and install -dependencies. 
When it is done, copy `.env.template` to `.env` -and fill in your API keys. +When done, copy `env.template` to `.env` and fill in your API keys. -## Step 3: Ingest your files (10 minutes) +### Step 2: Ingest your files (10 minutes) -Copy the prompt from [prompts/02-ingest.md](prompts/02-ingest.md) -into Claude Code. +Paste the prompt from [`prompts/02-ingest.md`](prompts/02-ingest.md). -Claude Code will read each file, split it into chunks, generate -embeddings, and store everything in Pinecone. You will see a -summary of what was processed. +Claude Code reads each file, splits it into chunks, generates +embeddings via Gemini Embedding 2, and stores everything in Pinecone. -This is the step where your files become searchable. +### Step 3: Search (5 minutes) -## Step 4: Search (5 minutes) +Paste the prompt from [`prompts/03-search.md`](prompts/03-search.md). -Copy the prompt from [prompts/03-search.md](prompts/03-search.md) -into Claude Code. +Claude Code builds a web interface. Open `http://localhost:3333` +in your browser and try these searches: -Claude Code will build a web interface and start it. Open the URL -it gives you (usually `http://localhost:3333`) in your browser. - -Try these searches: - -| Search query | What should come back | +| Query | Expected results | |---|---| | "What is the largest planet?" | Jupiter fact sheet + Jupiter image | | "First Moon landing" | Aldrin image + solar system overview | -| "Which moon has volcanoes?" | Moons PDF (mentioning Io) | -| "How far is Jupiter from Earth?" | Jupiter fact sheet (588.5 to 968.1 million km) | -| "What do astronauts see from orbit?" | ISS image description | +| "Which moon has volcanoes?" | Moons PDF mentioning Io | +| "How far is Jupiter from Earth?" | Jupiter fact sheet with exact distance | -Notice how a single question can pull results from both PDFs and -images. That is multimodal search. +A single question pulls results from both PDFs and images. 
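
The ranking behind that results table can be illustrated without any API keys. The sketch below is not the generated app's code: it fakes tiny 4-dimensional "embeddings" (real Gemini vectors have 3072 dimensions, and Pinecone computes the similarity server-side) purely to show how the `cosine` metric chosen for the index ranks a query against stored chunks:

```python
# Toy illustration of cosine-metric ranking, the same metric the
# Pinecone index was created with. The 4-d vectors below are made up;
# real Gemini embeddings have 3072 dimensions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings for three stored chunks.
docs = {
    "jupiter-fact-sheet.pdf (chunk)": [0.9, 0.1, 0.3, 0.0],
    "jupiter-great-red-spot.jpg (description)": [0.8, 0.2, 0.4, 0.1],
    "solar-system-moons.pdf (chunk)": [0.1, 0.9, 0.0, 0.2],
}
# Pretend embedding of "What is the largest planet?"
query = [0.85, 0.15, 0.35, 0.05]

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranked:
    print(f"{cosine(query, docs[name]):.3f}  {name}")
```

Both Jupiter entries score near the top while the moons chunk falls away, which is why one question can surface a PDF chunk and an image description together.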
-## Step 5: Make it your own +### Step 4: Make it your own -Now that you have seen it work with NASA files, try it with -your own content: +Replace the NASA example files with your own content: -1. Add your own PDFs, images, or documents to the `example-data/` folder -2. Write descriptions for any images (see the tips in `descriptions.md`) -3. Use [prompts/04-improve.md](prompts/04-improve.md) to re-index +1. Add PDFs, images, or documents to `example-data/` +2. Write descriptions for images (see [`example-data/descriptions.md`](example-data/descriptions.md)) +3. Paste [`prompts/04-improve.md`](prompts/04-improve.md) to re-index -Ideas for what to search: -- Your company's internal documents -- Research papers for a project -- Travel photos with descriptions -- Recipe collections -- Course notes and textbook screenshots +Ideas: company documents, research papers, travel photos, +recipe collections, course notes. -## Why image descriptions matter +## Example Data -The search system cannot "see" your images directly. It finds -images through their text descriptions. This means: +The `example-data/` folder contains NASA public domain files +(no copyright restrictions): -**Bad description:** "Photo of a planet" will only match -searches containing "photo" or "planet." 
+| File | Description | +|---|---| +| `solar-system-overview.pdf` | Overview of our solar system | +| `jupiter-fact-sheet.pdf` | Detailed data about Jupiter | +| `solar-system-moons.pdf` | Guide to planetary moons | +| `earthrise.jpg` | Earth from lunar orbit, Apollo 8 (1968) | +| `aldrin-moon.jpg` | Buzz Aldrin on the Moon, Apollo 11 (1969) | +| `jupiter-great-red-spot.jpg` | Jupiter by Voyager 1 (1979) | +| `iss-over-earth.jpg` | The Moon seen from the ISS | +| `descriptions.md` | Image descriptions for search quality | -**Good description:** "Full-disk portrait of Jupiter captured by -Voyager 1 in 1979, showing horizontal cloud bands and the Great -Red Spot, a massive storm larger than Earth" will match searches -about Jupiter, Voyager missions, storms, cloud patterns, and more. +## Why Image Descriptions Matter -The `descriptions.md` file in `example-data/` shows side-by-side -examples of bad versus good descriptions. Spending five minutes -on better descriptions will dramatically improve your search -results. +The search system finds images through their text descriptions, +not by "seeing" them. A description like "Photo of a planet" only +matches searches containing those exact concepts. A description +like "Full-disk portrait of Jupiter captured by Voyager 1 in 1979, +showing horizontal cloud bands and the Great Red Spot" matches +searches about Jupiter, Voyager missions, storms, and cloud patterns. -## What this costs +See [`example-data/descriptions.md`](example-data/descriptions.md) +for side-by-side examples. -$0 extra if you already have a Claude subscription. -Both Gemini embeddings and Pinecone have generous free tiers. +## Costs -See [costs.md](costs.md) for details. +$0 extra if you already have a Claude subscription. Both Gemini +Embedding 2 and Pinecone have free tiers that cover this guide +and well beyond. -## If you get stuck +See [costs.md](costs.md) for the full breakdown. 
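
To make the image-description point above concrete, here is a sketch of how one image could become a searchable record during ingestion. The field names and the `fake_embed` placeholder are illustrative assumptions, not the schema Claude Code actually generates; the key idea is that the description text, not the pixels, is what gets embedded, while metadata points back to the image file:

```python
# Sketch: an image enters the index through its text description.
# Field names are illustrative assumptions, not the generated app's
# actual schema; the real vector would come from the Gemini API.

DIMENSIONS = 3072  # must match the Pinecone index setting

def fake_embed(text: str) -> list[float]:
    # Placeholder standing in for the real Gemini Embedding 2 call.
    return [0.0] * DIMENSIONS

def make_image_record(image_path: str, description: str) -> dict:
    return {
        "id": image_path,
        "values": fake_embed(description),  # the description is what gets embedded
        "metadata": {
            "source": image_path,
            "type": "image",
            "text": description,  # later shown to Claude when it answers
        },
    }

record = make_image_record(
    "example-data/jupiter-great-red-spot.jpg",
    "Full-disk portrait of Jupiter captured by Voyager 1 in 1979, "
    "showing horizontal cloud bands and the Great Red Spot.",
)
print(record["id"], len(record["values"]))
```

A richer description gives this record a richer embedding, which is why a few minutes spent on `descriptions.md` pays off at search time.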
-See [troubleshooting.md](troubleshooting.md) for the 10 most -common problems and their solutions. +## Troubleshooting -The most effective fix for almost anything: copy the exact error -message and paste it into Claude Code. It is very good at -diagnosing its own work. +See [troubleshooting.md](troubleshooting.md) for the 10 most common +problems. The most effective fix for almost anything: copy the exact +error message and paste it into Claude Code. -## How it works (the deeper version) +## How It Works -Read [concepts.md](concepts.md) for plain-English explanations of: -- What are embeddings? -- What is a vector database? -- What is RAG? -- What is chunking? -- What does "multimodal" mean? +``` +Your files --> Chunking --> Gemini Embedding 2 --> Pinecone (vector DB) + | +Your question --> Gemini Embedding 2 --> Search --> Claude answers +``` -## Credits +Gemini Embedding 2 converts all content types (text, images, video, +audio) into numerical vectors in one shared space. Pinecone stores +and searches those vectors. Claude reads the matching content and +generates answers. -Example data: All PDFs and images are from NASA and are in the -public domain (U.S. Government works, no copyright restrictions). +For plain-English explanations of embeddings, vector databases, RAG, +and chunking, see [concepts.md](concepts.md). -Built with: -- [Claude Code](https://claude.ai) by Anthropic (app building + AI answers) -- [Gemini Embedding 2](https://ai.google.dev/) by Google (multimodal embeddings) -- [Pinecone](https://www.pinecone.io/) (vector database) +## Built With + +- [Claude Code](https://claude.ai) by Anthropic +- [Gemini Embedding 2](https://ai.google.dev/) by Google +- [Pinecone](https://www.pinecone.io/) + +## License + +[MIT](LICENSE) --- -*Part of [The Dharma Lab](https://thedharmalab.com). Read the -[full article](https://thedharmalab.com/) for the story behind this project.* +Part of [The Dharma Lab](https://thedharmalab.com). 
Read the +[full article](https://thedharmalab.com/) for the story behind +this project. diff --git a/docs/demo-screenshot.png b/docs/demo-screenshot.png new file mode 100644 index 0000000..6c20e00 Binary files /dev/null and b/docs/demo-screenshot.png differ