AI Engineer | Junior | embeddings | openai | vectors

Generate Embeddings for a Document Set

A real AI/ML problem you debug end to end in a live cloud workspace, then show on your portfolio. No tutorial, no toy app - a broken system that behaves like production.

Level: Junior
Time: ~20 min
Cost: Free

The scenario

We're building a search-over-docs feature and need vector embeddings for each document. Fill in embed_docs(docs) in /workspace/embed.py so it returns a list of (doc, vector) tuples ready for ingestion into a vector store.

The broken code you start with

embed.py (the unfinished embedder)
def embed_docs(docs):
    # TODO: POST each doc to EMBED_URL, return (doc, vector) tuples
    return []   # nothing is embedded yet

What this teaches you

What you did: Looped through the docs, posted each one to the embeddings endpoint, and assembled (doc, vector) tuples - the basic shape that vector stores such as pgvector, Pinecone, Chroma, and OpenSearch ingest, usually alongside an ID and metadata.
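The loop described above might look roughly like the sketch below. It is one possible shape, not the project's reference solution. Several details are assumptions: that EMBED_URL is available as an environment variable (the default shown is Ollama's usual local address), that the endpoint follows Ollama's /api/embeddings request and response shape ({"model", "prompt"} in, {"embedding"} out), and that an extra post parameter on embed_docs is acceptable so the HTTP call can be swapped out in tests. The helper post_json is hypothetical.

```python
import json
import os
import urllib.request

# Assumption: the workspace exposes the endpoint as EMBED_URL. The default
# below is Ollama's usual local address, not something the scenario states.
EMBED_URL = os.environ.get("EMBED_URL", "http://localhost:11434/api/embeddings")
MODEL = "nomic-embed-text"


def post_json(url, payload):
    """Hypothetical helper: POST a JSON payload, return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed_docs(docs, post=post_json):
    """Return a list of (doc, vector) tuples, one per input doc, in order."""
    pairs = []
    for doc in docs:
        # Assumed request/response shape: Ollama's /api/embeddings.
        body = post(EMBED_URL, {"model": MODEL, "prompt": doc})
        pairs.append((doc, body["embedding"]))
    return pairs
```

Passing a stub for post, for example a lambda that returns {"embedding": [0.0, 1.0]}, lets you check the tuple shape without a running model server.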

Why it matters: Embeddings are the foundation of every RAG pipeline, semantic-search feature, and recommendation system. Knowing the model -> vector shape is the entry ticket to all of it.

In the real world: Production embedders batch requests (POST a list of docs instead of one at a time, which can raise throughput by roughly 10x), pin the embedding model version (vectors from different models are not comparable, so changing models invalidates every stored vector), and store the model name alongside each vector so you can tell which vectors need re-embedding when you upgrade.
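A minimal sketch of those three habits, under stated assumptions: embed_batch is a hypothetical callable you supply that sends one request for a whole list of docs and returns one vector per doc, and the record layout (a dict with doc, vector, and model keys) is illustrative rather than any particular vector store's schema.

```python
EMBED_MODEL = "nomic-embed-text"  # pinned: every stored vector records this name


def batched(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_corpus(docs, embed_batch, batch_size=32):
    """Embed docs in batches and tag each vector with the model that made it.

    embed_batch(batch, model) is a hypothetical callable that performs one
    request for the whole batch and returns a list of vectors in input order.
    """
    records = []
    for batch in batched(docs, batch_size):
        vectors = embed_batch(batch, EMBED_MODEL)
        if len(vectors) != len(batch):
            raise ValueError("expected one vector per doc in the batch")
        for doc, vector in zip(batch, vectors):
            records.append({"doc": doc, "vector": vector, "model": EMBED_MODEL})
    return records


def needs_reembedding(records, current_model):
    """Return the records whose vectors came from a different model."""
    return [r for r in records if r["model"] != current_model]
```

Storing the model name per record is what makes needs_reembedding possible: after an upgrade, the stale vectors can be found with a simple filter instead of re-embedding blindly.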

What you'll practice

Why this impresses a hiring manager

On your portfolio, this becomes

Filled in embed_docs() to call the embeddings endpoint (nomic-embed-text via Ollama, same shape as OpenAI's embeddings API) for each doc and return a list of (doc, vector) pairs ready for a vector store.

