multimodal-rag-guide/concepts.md
Kjell Tore Guttormsen edcd1721df Initial commit: multimodal RAG guide with Claude Code
Prompt-driven guide for building multimodal search using
Gemini Embedding 2 + Pinecone + Claude Code. Includes example
data (NASA public domain), step-by-step prompts, concepts
explainer, cost breakdown, and troubleshooting guide.
2026-03-12 16:36:22 +01:00


Concepts: What You Need to Know (and Nothing More)

This page explains the key ideas behind multimodal search. You do not need to understand these concepts to follow the guide. But if you are curious about what is happening behind the scenes, this is for you.

What is an embedding?

Think of it as a fingerprint for meaning.

When you read the sentence "Jupiter is the largest planet," your brain understands what it means. An embedding is a way for a computer to do something similar. It converts text (or an image) into a long list of numbers that captures the meaning of that content.

The key insight: content with similar meaning gets similar numbers. So "Jupiter is massive" and "Jupiter is the biggest planet" would have very similar embeddings, even though the words are different.

You never see these numbers. They work behind the scenes.
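The "similar meaning, similar numbers" idea can be sketched with a toy example. The three-number "embeddings" below are made up for illustration (real embeddings from a model have hundreds or thousands of numbers), but the comparison function, cosine similarity, is the same one real systems use:

```python
import math

def cosine_similarity(a, b):
    """How closely two embeddings point in the same direction (1.0 = same meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-number "embeddings" for three pieces of content.
jupiter_massive = [0.9, 0.1, 0.2]   # "Jupiter is massive"
jupiter_biggest = [0.85, 0.15, 0.25]  # "Jupiter is the biggest planet"
cat_photo       = [0.1, 0.9, 0.4]   # an unrelated cat photo

print(cosine_similarity(jupiter_massive, jupiter_biggest))  # close to 1.0
print(cosine_similarity(jupiter_massive, cat_photo))        # much lower
```

The two Jupiter sentences score near 1.0; the cat photo scores much lower. That single number is what "similar embeddings" means in practice.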

What is a vector database?

A place to store embeddings so you can search through them quickly.

Imagine a library where books are not organized by author or title, but by what they are about. You walk in and say "I want something about storms on other planets" and the librarian immediately hands you the right book. That is what a vector database does, but with your files.

We use Pinecone in this guide because it has a free tier and works well. There are other options (Chroma, Weaviate, Qdrant), but Pinecone requires the least setup.
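Under the hood, the "librarian" is just a ranking: compare the query's embedding against every stored embedding and return the closest matches. A toy in-memory sketch (a Python dict standing in for Pinecone, with made-up embeddings and file names):

```python
import math

def similarity(a, b):
    # Cosine similarity: 1.0 means the two embeddings "point the same way".
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A toy "index": each file's (made-up) embedding stored under its name.
index = {
    "jupiter-storms.pdf": [0.9, 0.2, 0.1],
    "saturn-rings.jpg":   [0.3, 0.8, 0.2],
    "apollo-mission.txt": [0.1, 0.2, 0.9],
}

def search(query_embedding, top_k=1):
    """Return the stored items most similar to the query, best match first."""
    ranked = sorted(index.items(),
                    key=lambda item: similarity(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A query like "storms on other planets" would embed close to the Jupiter file:
print(search([0.85, 0.25, 0.15]))  # ['jupiter-storms.pdf']
```

A real vector database does exactly this, just with smarter data structures so it stays fast across millions of embeddings instead of three.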

What is RAG?

RAG stands for Retrieval-Augmented Generation. Big name, simple idea.

Normally, when you ask an AI a question, it answers from its training data. It might know general facts, but it does not know about YOUR files. RAG changes that.

With RAG, the AI first searches through your documents to find relevant information, then uses what it found to answer your question. It is like giving the AI a cheat sheet of your own content before it answers.

Without RAG: "What do we know about Jupiter's atmosphere?" The AI answers from general knowledge.

With RAG: "What do we know about Jupiter's atmosphere?" The AI searches your PDFs and images, finds the Jupiter fact sheet and the Voyager photo, and answers based on YOUR specific collection.
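The "cheat sheet" step can be sketched in a few lines. This toy version ranks documents by shared words instead of comparing embeddings (the real system embeds both sides), and the document names and texts are made up, but the shape is the same: retrieve first, then put what you found in front of the question:

```python
# Toy document store: in the real guide these would be chunks of your PDFs.
documents = {
    "jupiter-facts": "Jupiter's atmosphere is mostly hydrogen and helium, "
                     "with storms like the Great Red Spot.",
    "moon-landing":  "Apollo 11 landed on the Moon in 1969.",
}

def retrieve(question, k=1):
    """Toy retrieval: rank documents by how many question words they contain.
    (The real system compares embeddings instead of raw words.)"""
    words = set(question.lower().split())
    scored = sorted(documents.items(),
                    key=lambda d: len(words & set(d[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(question):
    """The RAG step: hand the AI the retrieved text before it answers."""
    context = "\n".join(retrieve(question))
    return f"Using only this context:\n{context}\n\nAnswer: {question}"

print(build_prompt("What do we know about Jupiter's atmosphere?"))
```

The AI never sees your whole collection, only the retrieved snippets, which is why RAG scales to collections far larger than any model's memory.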

What is chunking?

Your documents might be long. A 50-page PDF cannot be processed as one piece. Chunking means splitting it into smaller sections that the AI can work with.

Think of it like cutting a book into chapters. Each chapter gets its own embedding. When you search, the system finds the right chapter, not the whole book.

Claude Code handles chunking automatically. You do not need to do anything.
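If you're curious what "splitting into chapters" looks like in code, here is a minimal sketch. The chunk size and overlap numbers are illustrative, not the ones Claude Code actually picks; real chunkers also try to cut at sentence or paragraph boundaries rather than mid-word:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split long text into overlapping pieces so each gets its own embedding.
    The overlap keeps a sentence that straddles a boundary visible in both chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

long_document = "Jupiter is the largest planet. " * 50  # stand-in for a long PDF
pieces = chunk_text(long_document)
print(len(pieces), "chunks")
```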

What does "multimodal" mean?

"Multi" means many. "Modal" comes from "mode": a type of content.

Regular search works with text only. Multimodal search works with text AND images AND PDFs AND videos. You can search across all of them at once.

This is what makes this project interesting. You ask a question in plain English, and the system searches through your PDFs, images, and their descriptions to find the best answer, regardless of what format the information is in.

How does it all fit together?

  1. You put files in a folder (PDFs, images with descriptions)
  2. Claude Code builds a system that reads each file
  3. Each piece of content gets converted to an embedding (a fingerprint)
  4. The embeddings are stored in Pinecone (the vector database)
  5. When you search, your question also gets converted to an embedding
  6. Pinecone finds the stored embeddings most similar to your question
  7. The matching content is shown to you (or fed to an AI for a detailed answer)
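
The seven steps above fit in one small sketch. Everything here is a stand-in: `fake_embed` is a crude letter-count "fingerprint" playing the role of a real embedding model, and a dict plays the role of Pinecone. The flow, embed files, store, embed the question, return the closest match, is the real one:

```python
import math

def fake_embed(text):
    """Stand-in for a real embedding model: a crude 'fingerprint' from letter
    counts. (In the guide, an embedding API produces these numbers instead.)"""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Steps 1-4: read each file, embed it, store the embedding (dict plays Pinecone).
files = {
    "jupiter.txt": "Jupiter is the largest planet with a giant storm.",
    "moon.txt":    "The Moon orbits the Earth.",
}
stored = {name: fake_embed(text) for name, text in files.items()}

# Steps 5-7: embed the question, find the closest stored embedding, show the match.
def search(question):
    q = fake_embed(question)
    return max(stored, key=lambda name: sum(a * b for a, b in zip(q, stored[name])))

print(search("largest planet in the solar system"))  # jupiter.txt
```

Swap `fake_embed` for a real embedding API and the dict for Pinecone, and this toy is the whole architecture.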

That is it. The rest is implementation details, and Claude Code handles those for you.