Generate Embeddings for a Document Set
A real AI/ML problem you debug end to end in a live cloud workspace, then show on your portfolio. No tutorial, no toy app - a broken system that behaves like production.
The scenario
We're building a search-over-docs feature and need vector embeddings for each document. Fill in embed_docs(docs) in /workspace/embed.py so it returns a list of (doc, vector) tuples ready for ingestion into a vector store.
The broken code you start with
def embed_docs(docs):
    # TODO: POST each doc to EMBED_URL, return (doc, vector) tuples
    return []  # nothing is embedded yet
What this teaches you
What you did: Looped through the docs, posted each one to the embeddings endpoint, and assembled (doc, vector) tuples - the shape every vector store (pgvector, Pinecone, Chroma, OpenSearch) takes for ingestion.
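The loop described above might look roughly like the sketch below. It is not the project's reference solution. It assumes EMBED_URL is exposed as an environment variable (the default URL here is only a placeholder), that the endpoint accepts an OpenAI-style {"model", "input"} body, and that it answers with {"data": [{"embedding": [...]}]}. The post_json helper and the optional post parameter are hypothetical additions so the function can be exercised without a live server.

```python
import json
import os
import urllib.request

# Assumption: the workspace provides EMBED_URL; this default is a placeholder.
EMBED_URL = os.environ.get("EMBED_URL", "http://localhost:11434/v1/embeddings")
MODEL = "nomic-embed-text"


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed_docs(docs, post=post_json):
    """Return a list of (doc, vector) tuples, sending one request per document."""
    pairs = []
    for doc in docs:
        body = post(EMBED_URL, {"model": MODEL, "input": doc})
        # Assumed OpenAI-style response shape: {"data": [{"embedding": [...]}]}
        vector = body["data"][0]["embedding"]
        pairs.append((doc, vector))
    return pairs
```

Calling embed_docs(["first doc", "second doc"]) keeps the original one-argument signature. Passing a stub as post lets you check the tuple-building logic offline.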
Why it matters: Embeddings are the foundation of every RAG pipeline, semantic-search feature, and recommendation system. Knowing the model -> vector shape is the entry ticket to all of it.
In the real world: Production embedders batch requests (POST a list of documents rather than one at a time, which can improve throughput by roughly an order of magnitude), pin the embedding model version (vectors from different models are not comparable, so changing models invalidates every stored vector), and store the model name alongside each vector so you can re-embed the corpus when you upgrade.
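A minimal sketch of those three practices, under stated assumptions: the endpoint accepts a list under "input" and returns OpenAI-style items carrying an "index" field. The names embed_batched, chunks, and the post callable are hypothetical, and the batch size of 32 is arbitrary.

```python
MODEL = "nomic-embed-text"  # pinned model name, stored next to every vector


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_batched(docs, post, url, batch_size=32):
    """Embed docs in batches; return dicts holding doc, vector, and model name.

    `post(url, payload)` is any callable that sends JSON and returns the
    decoded response, so the transport can be swapped or stubbed.
    """
    records = []
    for batch in chunks(list(docs), batch_size):
        body = post(url, {"model": MODEL, "input": batch})
        # Assumed response: {"data": [{"index": i, "embedding": [...]}, ...]}
        # Sort by index so vectors line up with the input order.
        items = sorted(body["data"], key=lambda item: item["index"])
        if len(items) != len(batch):
            raise ValueError("endpoint returned a different number of vectors")
        for doc, item in zip(batch, items):
            records.append({"doc": doc, "vector": item["embedding"], "model": MODEL})
    return records
```

Keeping the model name on each record means a later upgrade can find every vector produced by the old model and re-embed it, instead of mixing vectors from two models in one index.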
What you'll practice
- Calling an embeddings endpoint per document
- Reading the vector out of the response
- Building (doc, vector) pairs for retrieval
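To see why the (doc, vector) shape is what retrieval needs, here is a small brute-force search over such pairs using cosine similarity. It is an illustration only: a real vector store would do this with an index, and the function names and sample vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, pairs, k=3):
    """Return the k docs whose vectors are most similar to query_vector."""
    scored = [(cosine(query_vector, vec), doc) for doc, vec in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Made-up 2-dimensional vectors, purely for illustration.
pairs = [("cats", [1.0, 0.0]), ("dogs", [0.9, 0.1]), ("tax law", [0.0, 1.0])]
nearest = top_k([1.0, 0.0], pairs, k=2)
```

The query vector has to come from the same embedding model as the stored vectors, otherwise the similarity scores are meaningless.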
Why this impresses a hiring manager
- This is a real embeddings problem teams hit in production - not a synthetic puzzle.
- It shows you can diagnose and fix an AI/ML issue in a live system end to end.
- It lands on your portfolio as a scenario a hiring manager can open and click through.
Filled in embed_docs() to call the embeddings endpoint (nomic-embed-text via Ollama, same shape as OpenAI's embeddings API) for each doc and return a list of (doc, vector) pairs ready for a vector store.
Build this project free
You're in a real cloud workspace in 30 seconds. Fix it, and it lands on your portfolio.