How Winnow Works
A plain English explanation of what happens when Winnow compresses your RAG context.
3 STEPS · 85MS
The Problem
RAG pipelines retrieve document chunks and stuff them into LLM prompts. But those chunks are verbose: filler text, boilerplate, redundant sentences. More tokens mean higher cost, slower responses, and sometimes worse answers, because the model has to sift through noise to find the signal.
01
Retrieve
Your vector DB returns raw document chunks. Those chunks are verbose, overlapping, and expensive. Winnow doesn't replace your retrieval pipeline; it sits between your retriever and your LLM.
VECTOR DB → RETRIEVER → WINNOW → LLM
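The pipeline above can be sketched as a thin glue layer. The endpoint path (`/compress`), payload fields, and response field below are illustrative assumptions, not Winnow's documented API; check the project's docs for the real contract.

```python
# Sketch: Winnow sits between the retriever and the LLM call.
# Endpoint path, payload shape, and response field are ASSUMED
# for illustration -- consult Winnow's API docs for the real ones.
import json
from urllib import request

WINNOW_URL = "http://localhost:8000/compress"  # assumed endpoint

def build_payload(chunks, query, ratio=0.5):
    """Bundle retrieved chunks plus the user query for compression."""
    return {"context": "\n\n".join(chunks), "query": query, "ratio": ratio}

def compress_context(chunks, query, ratio=0.5):
    """POST retrieved chunks to Winnow; return the compressed context."""
    req = request.Request(
        WINNOW_URL,
        data=json.dumps(build_payload(chunks, query, ratio)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["compressed"]  # assumed field name
```

The compressed string then goes into your LLM prompt exactly where the raw chunks would have gone; retrieval itself is untouched.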
02
Compress
Winnow runs LLMLingua-2 token-level compression guided by your query. Relevant tokens survive. Filler is removed.
01
TOKEN SCORING
02
PROTECTED WORDS
03
RATIO PRUNING
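To make the three sub-steps concrete, here is a toy, self-contained sketch. Winnow itself uses LLMLingua-2's learned token classifier; the query-term-overlap scorer below is a stand-in invented for illustration only.

```python
# Toy illustration of the three sub-steps above: token scoring,
# protected words, ratio pruning. NOT Winnow's real scorer --
# LLMLingua-2 uses a trained token-level classifier instead of
# this simple query-overlap heuristic.

def compress(chunk, query, protected=(), ratio=0.5):
    query_terms = {w.lower() for w in query.split()}
    tokens = chunk.split()

    # 1. Token scoring: rate each token's relevance to the query.
    scored = [(1.0 if t.lower().strip(".,") in query_terms else 0.0, i, t)
              for i, t in enumerate(tokens)]

    # 2. Protected words: always survive, regardless of score.
    protected_set = {w.lower() for w in protected}
    keep = {i for s, i, t in scored
            if s > 0 or t.lower().strip(".,") in protected_set}

    # 3. Ratio pruning: top up with the highest-scoring remaining
    #    tokens until ~`ratio` of the original length is kept.
    budget = max(len(keep), int(len(tokens) * ratio))
    for s, i, t in sorted(scored, reverse=True):
        if len(keep) >= budget:
            break
        keep.add(i)

    # Surviving tokens keep their original order.
    return " ".join(tokens[i] for i in sorted(keep))
```

Note that pruning preserves token order; nothing is paraphrased or generated, which is what distinguishes this from summarization.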
03
Generate
Your LLM receives a ~50% shorter prompt. Same answer. Half the cost.
~50%
FEWER TOKENS
~50%
LOWER CONTEXT COST
<3PT
F1 DROP
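The cost badge is simple arithmetic: context cost scales linearly with token count, so ~50% fewer tokens is ~50% lower context cost. The token counts and per-token price below are placeholders, not real rates.

```python
# Back-of-envelope math behind the badges: context cost scales
# linearly with tokens, so halving tokens halves context cost.
# Token counts and price are PLACEHOLDERS, not real figures.

def context_cost(tokens, price_per_1k):
    return tokens / 1000 * price_per_1k

before = context_cost(4000, 0.01)   # raw 4k-token context
after = context_cost(2000, 0.01)    # ~50% fewer tokens after Winnow
savings = 1 - after / before        # fraction of context cost saved
```

This covers input-context cost only; output tokens are priced separately and are unaffected by compression.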
What Winnow Is Not
✕ Not a summarizer: it doesn't generate new text
✕ Not an embedding model: it doesn't change retrieval
✕ Not a reranker: it preserves chunk order and only removes tokens
✓ Compression middleware that removes low-value tokens
Ready to try it?
$ docker run -p 8000:8000 itsaryanchauhan/winnow