multimodal-rag-guide/README.md
Kjell Tore Guttormsen 1000f9a75d Add download instructions, license visibility, disable contributions
- Download section with ZIP option for non-Git users
- License type visible in header and footer
- Tested On section
- Status and last-updated in header
- Issues, PRs, and wiki disabled via Forgejo API
2026-03-12 16:56:14 +01:00


# Multimodal RAG with Gemini Embedding 2 and Claude Code
**Search across PDFs, images, and documents using plain English.**

No coding required. Claude Code builds everything from four prompts.

`MIT License` · `Last updated: March 2026` · `Status: Complete, maintained`

![Search for "What is the largest planet?" returns both the Jupiter photograph and the PDF fact sheet](docs/demo-screenshot.png)
> **Gemini Embedding 2** converts text, images, and video into the same
> searchable space. **Claude Code** builds the app. **Pinecone** stores the
> vectors. You just copy four prompts.
## Table of Contents
- [Download](#download)
- [What This Does](#what-this-does)
- [Prerequisites](#prerequisites)
- [Step-by-Step Guide](#step-by-step-guide)
- [Example Data](#example-data)
- [Why Image Descriptions Matter](#why-image-descriptions-matter)
- [Costs](#costs)
- [Troubleshooting](#troubleshooting)
- [How It Works](#how-it-works)
- [Tested On](#tested-on)
- [Built With](#built-with)
- [License](#license)
## Download
**Option A: Download ZIP (no Git required)**
1. Click the green **Code** button at the top of this page
2. Select **Download ZIP**
3. Unzip the folder and open it in your terminal

**Option B: Git clone**
```bash
git clone https://git.thedharmalab.com/ktg/multimodal-rag-guide.git
cd multimodal-rag-guide
```
Then start Claude Code by typing `claude` in the folder, and paste the
first prompt from [`prompts/01-setup.md`](prompts/01-setup.md).
## What This Does
One search box that understands PDFs, images, and text at the same time.
Ask "What is the largest planet in our solar system?" and the system
returns the Jupiter fact sheet from a PDF, the Voyager photograph of
the Great Red Spot from a JPG, and a confidence score for each result.
One question, multiple formats, ranked by meaning.
This pattern is called Retrieval-Augmented Generation (RAG): a search
step retrieves the most relevant content, and a language model answers
from it. Google's Gemini Embedding 2 handles the multimodal part: it
converts different content types into vectors in the same numerical
space so they become searchable together. Claude Code handles the
building part: it reads your prompts and writes all the code. You
handle neither.
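
To make "searchable together" concrete, here is a minimal sketch of how results can be ranked once every item is a vector in the same space. The three-number vectors are made up for illustration (real embeddings are far longer), and `cosine_similarity` is a hypothetical helper, not code from this project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: a PDF chunk and two images living in one shared space.
items = {
    "jupiter-fact-sheet.pdf (text chunk)": [0.9, 0.1, 0.2],
    "jupiter-great-red-spot.jpg (image)": [0.8, 0.2, 0.3],
    "aldrin-moon.jpg (image)": [0.1, 0.9, 0.2],
}
# Stands in for the embedded question "What is the largest planet?"
query_vector = [0.85, 0.15, 0.25]

# Rank every item, regardless of file type, by similarity to the question.
ranked = sorted(
    items,
    key=lambda name: cosine_similarity(query_vector, items[name]),
    reverse=True,
)
for name in ranked:
    print(f"{cosine_similarity(query_vector, items[name]):.3f}  {name}")
```

Because the PDF chunk and the images are compared with the same function, file type plays no part in the ranking; only the direction of each vector does.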
## Prerequisites
| Requirement | Cost | What it does |
|---|---|---|
| [Claude Code](https://claude.ai) | Part of Claude Pro ($20/mo) or Max | Builds the app and answers questions |
| [Google AI Studio](https://aistudio.google.com/) | Free tier | Gemini Embedding 2 API key |
| [Pinecone](https://www.pinecone.io/) | Free tier | Vector database for storing embeddings |
No programming knowledge required.
## Step-by-Step Guide
### Step 0: Get your API keys (10 minutes)
**Google AI Studio** (for Gemini Embedding 2):
1. Go to [aistudio.google.com](https://aistudio.google.com/)
2. Sign in with a Google account
3. Click "Get API key" in the left sidebar
4. Click "Create API key" and copy it

**Pinecone** (for the vector database):
1. Go to [pinecone.io](https://www.pinecone.io/) and create a free account
2. In the dashboard, click "Create Index"
3. Name it `space-search`, set dimensions to `3072`, choose `cosine` metric
4. Select the free "Starter" plan
5. Copy your API key from "API Keys"
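
The dimension setting matters because an index only accepts vectors of the length it was created with. A small sketch of that check, assuming the `3072` entered above is the length of the vectors the embedding model returns; `check_dimension` is a hypothetical helper, not part of any SDK.

```python
INDEX_DIMENSION = 3072  # the value entered when creating the index

def check_dimension(vector, expected=INDEX_DIMENSION):
    """Raise ValueError if a vector's length does not match the index."""
    if len(vector) != expected:
        raise ValueError(
            f"vector has {len(vector)} values, index expects {expected}"
        )
    return vector

# A vector of the right length passes through unchanged.
check_dimension([0.0] * 3072)
```

If you ever request shorter embeddings, the index would need to be recreated with the matching dimension.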
### Step 1: Get the files and start Claude Code (5 minutes)
Download and unzip (see [Download](#download) above), or:
```bash
git clone https://git.thedharmalab.com/ktg/multimodal-rag-guide.git
cd multimodal-rag-guide
```
Open your terminal in the folder and type:
```bash
claude
```
Paste the prompt from [`prompts/01-setup.md`](prompts/01-setup.md).
Claude Code creates the project structure and installs dependencies.
When done, copy `env.template` to `.env` and fill in your API keys.
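
If you want to confirm the keys are filled in before moving on, a check along these lines could help. This is a sketch only: the key names `GEMINI_API_KEY` and `PINECONE_API_KEY` are assumptions, so compare them with the names in `env.template`.

```python
from pathlib import Path

# Assumed key names; check env.template for the names the project uses.
REQUIRED_KEYS = ("GEMINI_API_KEY", "PINECONE_API_KEY")

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or left empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]

if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found; copy env.template to .env first.")
    else:
        missing = missing_keys(env_path.read_text())
        if missing:
            print("Missing keys: " + ", ".join(missing))
        else:
            print("All keys are set.")
```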
### Step 2: Ingest your files (10 minutes)
Paste the prompt from [`prompts/02-ingest.md`](prompts/02-ingest.md).
Claude Code reads each file, splits it into chunks, generates
embeddings via Gemini Embedding 2, and stores everything in Pinecone.
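
The code Claude Code writes for this step depends on the prompt, so the following is only an illustration of what "splits it into chunks" can mean: a word-based splitter with overlap, so that a sentence cut at a boundary still appears whole in one chunk. The function name and the sizes are assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk is then embedded and stored as its own record, which is why a search can return one specific passage of a PDF instead of the whole file.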
### Step 3: Search (5 minutes)
Paste the prompt from [`prompts/03-search.md`](prompts/03-search.md).
Claude Code builds a web interface. Open `http://localhost:3333`
in your browser and try these searches:
| Query | Expected results |
|---|---|
| "What is the largest planet?" | Jupiter fact sheet + Jupiter image |
| "First Moon landing" | Aldrin image + solar system overview |
| "Which moon has volcanoes?" | Moons PDF mentioning Io |
| "How far is Jupiter from Earth?" | Jupiter fact sheet with exact distance |
A single question pulls results from both PDFs and images.
### Step 4: Make it your own
Replace the NASA example files with your own content:
1. Add PDFs, images, or documents to `example-data/`
2. Write descriptions for images (see [`example-data/descriptions.md`](example-data/descriptions.md))
3. Paste the prompt from [`prompts/04-improve.md`](prompts/04-improve.md) to re-index

Ideas: company documents, research papers, travel photos,
recipe collections, course notes.
## Example Data
The `example-data/` folder contains NASA public domain files
(no copyright restrictions):
| File | Description |
|---|---|
| `solar-system-overview.pdf` | Overview of our solar system |
| `jupiter-fact-sheet.pdf` | Detailed data about Jupiter |
| `solar-system-moons.pdf` | Guide to planetary moons |
| `earthrise.jpg` | Earth from lunar orbit, Apollo 8 (1968) |
| `aldrin-moon.jpg` | Buzz Aldrin on the Moon, Apollo 11 (1969) |
| `jupiter-great-red-spot.jpg` | Jupiter by Voyager 1 (1979) |
| `iss-over-earth.jpg` | The Moon seen from the ISS |
| `descriptions.md` | Image descriptions for search quality |
## Why Image Descriptions Matter
The search system finds images through their text descriptions,
not by "seeing" them. A description like "Photo of a planet" only
matches searches containing those exact concepts. A description
like "Full-disk portrait of Jupiter captured by Voyager 1 in 1979,
showing horizontal cloud bands and the Great Red Spot" matches
searches about Jupiter, Voyager missions, storms, and cloud patterns.
See [`example-data/descriptions.md`](example-data/descriptions.md)
for side-by-side examples.
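
A rough way to see the difference is to count the words each description shares with a few queries. This is a deliberately crude stand-in: embeddings match on meaning, not on exact words, so the real effect is broader than this word count suggests. The `terms` helper and the sample queries are made up for illustration.

```python
import re

def terms(text):
    """Lowercase words of four or more letters or digits."""
    return set(re.findall(r"[a-z0-9]{4,}", text.lower()))

vague = "Photo of a planet"
detailed = (
    "Full-disk portrait of Jupiter captured by Voyager 1 in 1979, "
    "showing horizontal cloud bands and the Great Red Spot"
)

queries = [
    "Jupiter storms and cloud bands",
    "Voyager missions",
    "photo of a planet",
]
for query in queries:
    q = terms(query)
    # Number of query words that also appear in each description.
    print(f"{query!r}: vague={len(q & terms(vague))}, "
          f"detailed={len(q & terms(detailed))}")
```

The detailed description gives the search more to hold on to for specific questions, while the vague one only overlaps with a query that is just as vague.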
## Costs
$0 extra if you already have a Claude subscription. Both Gemini
Embedding 2 and Pinecone have free tiers that cover this guide
and well beyond.
See [costs.md](costs.md) for the full breakdown.
## Troubleshooting
See [troubleshooting.md](troubleshooting.md) for the 10 most common
problems. The most effective fix for almost anything: copy the exact
error message and paste it into Claude Code.
## How It Works
```
Your files --> Chunking --> Gemini Embedding 2 --> Pinecone (vector DB)
                                                      |
Your question --> Gemini Embedding 2 -------------> Search --> Claude answers
```
Gemini Embedding 2 converts all content types (text, images, video,
audio) into numerical vectors in one shared space. Pinecone stores
and searches those vectors. Claude reads the matching content and
generates answers.
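
The flow can be sketched end to end in a few lines of plain Python. Everything here is a stand-in under stated assumptions: `toy_embed` hashes words into a short vector in place of Gemini Embedding 2, `InMemoryIndex` is a hypothetical class playing the role of Pinecone, and the sample texts are invented. It shows the shape of the pipeline, not the code Claude Code generates.

```python
import math
import re
import zlib

def toy_embed(text, dimensions=256):
    """Hash each word into a fixed-length count vector (not a real embedding)."""
    vector = [0.0] * dimensions
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vector[zlib.crc32(word.encode()) % dimensions] += 1.0
    return vector

def cosine(a, b):
    """Cosine similarity, or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryIndex:
    """Minimal stand-in for a vector database: store vectors, return top matches."""

    def __init__(self):
        self.records = []

    def upsert(self, record_id, vector, metadata):
        self.records.append((record_id, vector, metadata))

    def query(self, vector, top_k=3):
        scored = [(cosine(vector, v), rid, meta) for rid, v, meta in self.records]
        return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]

# Ingest: embed each chunk or image description and store it.
index = InMemoryIndex()
documents = {
    "jupiter-fact-sheet.pdf#1": "Jupiter is the largest planet in the solar system",
    "jupiter-great-red-spot.jpg": "Jupiter by Voyager 1 showing the Great Red Spot",
    "aldrin-moon.jpg": "Buzz Aldrin on the Moon during Apollo 11",
}
for record_id, text in documents.items():
    index.upsert(record_id, toy_embed(text), {"text": text})

# Search: embed the question the same way and fetch the closest records.
results = index.query(toy_embed("What is the largest planet?"), top_k=2)
for score, record_id, _ in results:
    print(f"{score:.2f}  {record_id}")
```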
For plain-English explanations of embeddings, vector databases, RAG,
and chunking, see [concepts.md](concepts.md).
## Tested On
- macOS (Apple Silicon and Intel)
- Claude Code with Claude Pro subscription
- Gemini Embedding 2 free tier
- Pinecone free tier (Starter plan)

It should work on any system that runs Claude Code (macOS, Linux, or Windows via WSL).
## Built With
- [Claude Code](https://claude.ai) by Anthropic
- [Gemini Embedding 2](https://ai.google.dev/) by Google
- [Pinecone](https://www.pinecone.io/)
## License
This project is licensed under the [MIT License](LICENSE). You are free
to use, modify, and distribute it.
Example data (NASA images and PDFs) is in the public domain.
---
Part of [The Dharma Lab](https://thedharmalab.com).