Prompt-driven guide for building multimodal search using Gemini Embedding 2 + Pinecone + Claude Code. Includes example data (NASA public domain), step-by-step prompts, concepts explainer, cost breakdown, and troubleshooting guide.
# Prompt 2: Ingest Your Files

Copy this into Claude Code after the project is set up and your
.env file has your API keys.

---

```
Now build the ingestion pipeline. I need a script that:

1. Reads each PDF in example-data/ and extracts the text content.
   Split long documents into chunks of roughly 500 words each.
   Keep track of which file and which section each chunk came from.

2. Reads the image descriptions from example-data/descriptions.md.
   Use the "Good description" for each image (ignore the "Bad" ones).
   Each image description becomes one chunk, linked to its image file.

3. For each chunk, generate an embedding using Google Gemini
   Embedding 2 (model: gemini-embedding-exp-03-07 or the latest
   available). Use task_type "RETRIEVAL_DOCUMENT" for all chunks.

4. Store each embedding in Pinecone along with metadata:
   - source_file: the original filename
   - content_type: "pdf" or "image"
   - text: the actual text content of the chunk
   - chunk_index: which chunk number within the file

5. After ingestion, print a summary: how many chunks were created,
   how many embeddings stored, and any errors.

Run the ingestion script after building it. Show me the output.
```

---

## What Claude Code will do

1. Build a script that reads PDFs and extracts text
2. Parse the descriptions.md file for image descriptions
3. Send each chunk to Google Gemini for embedding
4. Store everything in Pinecone with metadata
5. Run the script and show results
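
The five steps above can be sketched in Python as follows. This is a minimal sketch, not the script Claude Code will actually generate: `Chunk`, `to_record`, `ingest`, and the `source_file#chunk_index` ID scheme are hypothetical names chosen for illustration, and `embed` and `upsert` are placeholder callables standing in for the Gemini embedding call and the Pinecone upsert call. The metadata keys are the four fields requested in the prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    source_file: str   # original filename
    content_type: str  # "pdf" or "image"
    text: str          # text content of the chunk
    chunk_index: int   # position of the chunk within its file


def to_record(chunk: Chunk, vector: list) -> dict:
    """Shape one chunk and its embedding as an id/values/metadata record."""
    return {
        # Hypothetical ID scheme: filename plus chunk number.
        "id": f"{chunk.source_file}#{chunk.chunk_index}",
        "values": vector,
        "metadata": {
            "source_file": chunk.source_file,
            "content_type": chunk.content_type,
            "text": chunk.text,
            "chunk_index": chunk.chunk_index,
        },
    }


def ingest(chunks: list, embed: Callable, upsert: Callable) -> dict:
    """Embed and store each chunk, collecting errors instead of stopping.

    `embed(text)` and `upsert(records)` are assumed placeholders for the
    real embedding and vector-store client calls.
    """
    stored, errors = 0, []
    for chunk in chunks:
        try:
            upsert([to_record(chunk, embed(chunk.text))])
            stored += 1
        except Exception as exc:
            errors.append(f"{chunk.source_file}#{chunk.chunk_index}: {exc}")
    return {"chunks": len(chunks), "stored": stored, "errors": errors}
```

Passing the two client calls in as arguments keeps the flow easy to try with fakes before any API key is involved; in a real script they would wrap the Gemini and Pinecone clients.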

## What to expect

You should see output like:

```
Processing solar-system-overview.pdf... 3 chunks
Processing jupiter-fact-sheet.pdf... 4 chunks
Processing solar-system-moons.pdf... 3 chunks
Processing earthrise.jpg (from descriptions)... 1 chunk
Processing aldrin-moon.jpg (from descriptions)... 1 chunk
Processing jupiter-great-red-spot.jpg (from descriptions)... 1 chunk
Processing iss-over-earth.jpg (from descriptions)... 1 chunk

Total: 14 chunks ingested, 14 embeddings stored in Pinecone.
```

The exact numbers may vary depending on how Claude Code splits the PDFs.
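
To see why the counts can differ, here is a minimal sketch that assumes the simplest possible splitter: fixed-size groups of whitespace-separated words. The function name `chunk_words` and the 1,300-word document are hypothetical; Claude Code may instead split on sections or sentences, which changes the totals again.

```python
def chunk_words(text: str, max_words: int = 500) -> list:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Hypothetical 1,300-word document, not one of the example PDFs.
document = " ".join(f"word{i}" for i in range(1300))

print(len(chunk_words(document, max_words=500)))  # 3 chunks: 500 + 500 + 300 words
print(len(chunk_words(document, max_words=400)))  # 4 chunks: 400 + 400 + 400 + 100 words
```

A small change in the target chunk size is enough to move a file from 3 chunks to 4, so a total slightly above or below 14 is not a sign of a problem.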

## If something goes wrong

- "API key invalid": check that the keys in your .env file are present and copied correctly
- "Index not found": make sure the index name in your project matches the name of your index in Pinecone
- "Rate limit": wait a minute and run the script again
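
If rate limits come up repeatedly, the wait-and-retry advice above can be automated. The sketch below assumes exponential backoff, which is a common choice and not something this guide prescribes. `RateLimitError` is a hypothetical stand-in; the real Gemini and Pinecone clients raise their own exception types, so the `except` clause would need to match those.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit exception."""


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with doubling delays.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    the error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Each embedding or storage call could then be wrapped as `call_with_backoff(lambda: ...)`. The `sleep` parameter exists only so the delay can be replaced with a fake when trying the function out.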