
Initial commit: multimodal RAG guide with Claude Code

Prompt-driven guide for building multimodal search using
Gemini Embedding 2 + Pinecone + Claude Code. Includes example
data (NASA public domain), step-by-step prompts, concepts
explainer, cost breakdown, and troubleshooting guide.
Kjell Tore Guttormsen 2026-03-12 16:36:22 +01:00
commit edcd1721df
19 changed files with 4446 additions and 0 deletions

concepts.md
# Concepts: What You Need to Know (and Nothing More)
This page explains the key ideas behind multimodal search.
You do not need to understand these concepts to follow the guide.
But if you are curious about what is happening behind the scenes,
this is for you.
## What is an embedding?
Think of it as a fingerprint for meaning.
When you read the sentence "Jupiter is the largest planet," your brain
understands what it means. An embedding is a way for a computer to do
something similar. It converts text (or an image) into a long list of
numbers that captures the meaning of that content.
The key insight: content with similar meaning gets similar numbers.
So "Jupiter is massive" and "Jupiter is the biggest planet" would have
very similar embeddings, even though the words are different.
You never see these numbers. They work behind the scenes.
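If you are curious what "similar numbers" means in practice, here is a toy illustration. The four-number vectors are made up (real embeddings have hundreds or thousands of dimensions), but cosine similarity is the standard way to compare two embeddings:

```python
import math

def cosine_similarity(a, b):
    """How alike two embeddings are: near 1.0 = same meaning, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy 4-number "embeddings" (real ones are far longer).
jupiter_massive = [0.9, 0.8, 0.1, 0.0]
jupiter_biggest = [0.85, 0.9, 0.05, 0.1]
cookie_recipe   = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(jupiter_massive, jupiter_biggest))  # close to 1
print(cosine_similarity(jupiter_massive, cookie_recipe))    # close to 0
```

The two Jupiter sentences score close to 1.0 even though they share almost no words; the cookie recipe scores near 0.
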
## What is a vector database?
A place to store embeddings so you can search through them quickly.
Imagine a library where books are not organized by author or title,
but by what they are about. You walk in and say "I want something
about storms on other planets" and the librarian immediately hands
you the right book. That is what a vector database does, but with
your files.
We use Pinecone in this guide because it has a free tier and works
well. There are other options (Chroma, Weaviate, Qdrant), but
Pinecone requires the least setup.
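Under the hood, a vector database boils down to "store vectors, find the nearest ones." Here is a minimal in-memory stand-in (the `ToyVectorDB` name and brute-force search are ours, for illustration only; Pinecone does the same job at scale, with indexing tricks so it stays fast over millions of vectors):

```python
import math

class ToyVectorDB:
    """In-memory stand-in for what Pinecone does: store vectors, find the nearest."""
    def __init__(self):
        self.items = []  # list of (id, vector) pairs

    def upsert(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, top_k=1):
        def score(v):
            # Cosine similarity between the query and a stored vector.
            dot = sum(x * y for x, y in zip(vector, v))
            return dot / (math.hypot(*vector) * math.hypot(*v))
        ranked = sorted(self.items, key=lambda item: score(item[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]

db = ToyVectorDB()
db.upsert("storms-on-jupiter", [0.9, 0.1])
db.upsert("cookie-recipe", [0.1, 0.9])
print(db.query([0.8, 0.2]))  # the "storms" entry is the closest match
```
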
## What is RAG?
RAG stands for Retrieval-Augmented Generation. Big name, simple idea.
Normally, when you ask an AI a question, it answers from its training
data. It might know general facts, but it does not know about YOUR
files. RAG changes that.
With RAG, the AI first searches through your documents to find
relevant information, then uses what it found to answer your question.
It is like giving the AI a cheat sheet of your own content before
it answers.
Without RAG: "What do we know about Jupiter's atmosphere?"
The AI answers from general knowledge.
With RAG: "What do we know about Jupiter's atmosphere?"
The AI searches your PDFs and images, finds the Jupiter fact sheet
and the Voyager photo, and answers based on YOUR specific collection.
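The RAG flow can be sketched in a few lines. The `retrieve` function below uses plain word overlap as a stand-in for embedding search (the real guide uses embeddings), but the shape is the same: search your documents first, then paste what was found into the prompt as the cheat sheet:

```python
def retrieve(question, documents, top_k=1):
    """Stand-in retrieval: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_rag_prompt(question, documents):
    """The 'augmented' part of RAG: retrieved text goes into the prompt."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Jupiter's atmosphere is mostly hydrogen and helium.",
    "Chocolate chip cookies need butter and brown sugar.",
]
prompt = build_rag_prompt("What do we know about Jupiter's atmosphere?", docs)
print(prompt)
```

Only the Jupiter document ends up in the prompt, so the AI answers from your content instead of general knowledge.
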
## What is chunking?
Your documents might be long. A 50-page PDF cannot be processed
as one piece. Chunking means splitting it into smaller sections
that the AI can work with.
Think of it like cutting a book into chapters. Each chapter gets
its own embedding. When you search, the system finds the right
chapter, not the whole book.
Claude Code handles chunking automatically. You do not need to
do anything.
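For the curious, here is a minimal sketch of what a chunker does. The sizes and overlap are made-up defaults, and real chunkers often split on sentences or tokens instead of raw word counts, but the idea is the same. The overlap means a sentence that straddles a boundary is still findable in at least one chunk:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of chunk_size words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end
    return chunks
```
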
## What does "multimodal" mean?
"Multi" means many. "Modal" means types.
Regular search works with text only. Multimodal search works with
text AND images AND PDFs AND videos. You can search across all
of them at once.
This is what makes this project interesting. You ask a question
in plain English, and the system searches through your PDFs,
images, and their descriptions to find the best answer, regardless
of what format the information is in.
## How does it all fit together?
1. You put files in a folder (PDFs, images with descriptions)
2. Claude Code builds a system that reads each file
3. Each piece of content gets converted to an embedding (a fingerprint)
4. The embeddings are stored in Pinecone (the vector database)
5. When you search, your question also gets converted to an embedding
6. Pinecone finds the stored embeddings most similar to your question
7. The matching content is shown to you (or fed to an AI for a detailed answer)
That is it. The rest is implementation details, and Claude Code
handles those for you.
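Those seven steps, compressed into one toy script. Here `fake_embed` is a made-up stand-in for Gemini Embedding 2 and a plain Python list stands in for Pinecone, but the flow (embed the files, store the vectors, embed the question, return the nearest match) is exactly the pipeline described above:

```python
import math

def fake_embed(text):
    """Made-up stand-in for a real embedding model: counts a few topic words."""
    topics = ["jupiter", "storm", "cookie", "recipe"]
    words = text.lower().split()
    return [float(sum(w.startswith(t) for w in words)) for t in topics]

def similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.hypot(*a) * math.hypot(*b)
    return dot / norms if norms else 0.0

# Steps 1-4: read files (here, just strings), embed each, store the vectors.
store = [(text, fake_embed(text)) for text in [
    "Jupiter's Great Red Spot is a giant storm.",
    "A cookie recipe with oats and raisins.",
]]

# Steps 5-7: embed the question, find the closest stored vector, show it.
question = fake_embed("storms on other planets like Jupiter")
best = max(store, key=lambda item: similarity(question, item[1]))
print(best[0])  # the storm document, not the cookie recipe
```
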