How Winnow Works
A plain English explanation of what happens when Winnow compresses your RAG context.
3 STEPS · 85MS
The Problem
RAG pipelines retrieve document chunks and stuff them into LLM prompts. But those chunks are verbose: filler text, boilerplate, redundant sentences. More tokens mean higher cost, slower responses, and sometimes worse answers, because the model has to sift through noise to find the signal.
01
Retrieve
Your vector DB returns raw document chunks. Those chunks are verbose, overlapping, and expensive. Winnow doesn't replace your retrieval pipeline; it sits between your retriever and your LLM.
VECTOR DB → RETRIEVER → WINNOW → LLM
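The pipeline above can be sketched as a thin glue layer. The endpoint path (`/compress`), payload fields, and response field below are illustrative assumptions, not Winnow's documented API; check the project's docs for the real contract.

```python
# Sketch: Winnow sits between the retriever and the LLM call.
# Endpoint path, payload shape, and response field are ASSUMED
# for illustration -- consult Winnow's API docs for the real ones.
import json
from urllib import request

WINNOW_URL = "http://localhost:8000/compress"  # assumed endpoint

def build_payload(chunks, query, ratio=0.5):
    """Bundle retrieved chunks plus the user query for compression."""
    return {"context": "\n\n".join(chunks), "query": query, "ratio": ratio}

def compress_context(chunks, query, ratio=0.5):
    """POST retrieved chunks to Winnow; return the compressed context."""
    req = request.Request(
        WINNOW_URL,
        data=json.dumps(build_payload(chunks, query, ratio)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["compressed"]  # assumed field name
```

The compressed string then goes into your LLM prompt exactly where the raw chunks would have gone; retrieval itself is untouched.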
02
Compress
Winnow runs LLMLingua-2 token-level compression guided by your query. Relevant tokens survive. Filler is removed.
01
TOKEN SCORING
02
PROTECTED WORDS
03
RATIO PRUNING
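To make the three sub-steps concrete, here is a toy, self-contained sketch. Winnow itself uses LLMLingua-2's learned token classifier; the query-term-overlap scorer below is a stand-in invented for illustration only.

```python
# Toy illustration of the three sub-steps above: token scoring,
# protected words, ratio pruning. NOT Winnow's real scorer --
# LLMLingua-2 uses a trained token-level classifier instead of
# this simple query-overlap heuristic.

def compress(chunk, query, protected=(), ratio=0.5):
    query_terms = {w.lower() for w in query.split()}
    tokens = chunk.split()

    # 1. Token scoring: rate each token's relevance to the query.
    scored = [(1.0 if t.lower().strip(".,") in query_terms else 0.0, i, t)
              for i, t in enumerate(tokens)]

    # 2. Protected words: always survive, regardless of score.
    protected_set = {w.lower() for w in protected}
    keep = {i for s, i, t in scored
            if s > 0 or t.lower().strip(".,") in protected_set}

    # 3. Ratio pruning: top up with the highest-scoring remaining
    #    tokens until ~`ratio` of the original length is kept.
    budget = max(len(keep), int(len(tokens) * ratio))
    for s, i, t in sorted(scored, reverse=True):
        if len(keep) >= budget:
            break
        keep.add(i)

    # Surviving tokens keep their original order.
    return " ".join(tokens[i] for i in sorted(keep))
```

Note that pruning preserves token order; nothing is paraphrased or generated, which is what distinguishes this from summarization.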
03
Generate
Your LLM receives a ~50% shorter prompt. Same answer. Half the cost.
~50%
FEWER TOKENS
~50%
LOWER CONTEXT COST
<3PT
F1 DROP
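The cost badge is simple arithmetic: context cost scales linearly with token count, so ~50% fewer tokens is ~50% lower context cost. The token counts and per-token price below are placeholders, not real rates.

```python
# Back-of-envelope math behind the badges: context cost scales
# linearly with tokens, so halving tokens halves context cost.
# Token counts and price are PLACEHOLDERS, not real figures.

def context_cost(tokens, price_per_1k):
    return tokens / 1000 * price_per_1k

before = context_cost(4000, 0.01)   # raw 4k-token context
after = context_cost(2000, 0.01)    # ~50% fewer tokens after Winnow
savings = 1 - after / before        # fraction of context cost saved
```

This covers input-context cost only; output tokens are priced separately and are unaffected by compression.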
What Winnow Is Not
✕ Not a summarizer: it doesn't generate new text
✕ Not an embedding model: it doesn't change retrieval
✕ Not a reranker: it preserves chunk order and only removes tokens
✓ Compression middleware that removes low-value tokens
Ready to try it?
$ docker run -p 8000:8000 itsaryanchauhan/winnow