# Winnow

**Keep the signal. Drop the noise.**

Open-source middleware that compresses RAG context before it hits the LLM, cutting token costs by ~50% with under 3 points of F1 loss.

v0.1.1 · LLMLingua-2 · MIT
- **~50%** token reduction
- **<3 pt** F1 score drop
- **85 ms** average latency
- **0** code changes with the OpenAI SDK

## Try It Live

Interactive demo: paste your context, pick a compression mode and ratio (default 0.5), and compare the output. Powered by Hugging Face Spaces · API docs →
## How It Works (3 Steps)

1. **Retrieve.** Your vector DB returns raw document chunks: verbose, overlapping, and expensive.
2. **Compress.** Winnow runs token-level compression guided by your query. Relevant tokens survive; filler is removed.
3. **Generate.** Your LLM receives a ~50% shorter prompt. Same answer, half the cost.
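The three steps above can be sketched end to end. This is only an illustration of where compression sits in the flow, not Winnow's actual compressor: the real model is LLMLingua-2, while the toy filter below simply keeps the sentences that share vocabulary with the query, up to a rough token budget.

```python
# Minimal sketch of retrieve -> compress -> generate.
# NOTE: this toy query-overlap filter is a stand-in for LLMLingua-2;
# it only shows where compression sits between retrieval and the LLM.

def compress(text: str, question: str, ratio: float = 0.5) -> str:
    """Keep sentences overlapping the query, within ~ratio of the tokens."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    budget = int(len(text.split()) * ratio)  # rough token budget
    kept, used = [], 0
    # Prefer sentences that share vocabulary with the question.
    for s in sorted(sentences, key=lambda s: -len(q_words & set(s.lower().split()))):
        n = len(s.split())
        if used + n <= budget:
            kept.append(s)
            used += n
    return ". ".join(kept)

chunks = "The warranty period is two years. Shipping takes five days. Our office has a nice view."
short = compress(chunks, "What is the warranty period?")
# Only the warranty sentence survives; the filler is dropped.
```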
## Benchmark Results (SQuAD · LLMLingua-2)

- **420** avg tokens in
- **210** avg tokens out
- **~50%** reduction
- **~85 ms** latency

| Preset | Tokens in | Tokens out | Reduction | F1 score | F1 drop |
|---|---|---|---|---|---|
| Aggressive (0.3) | 420 | 147 | ~65% | 73.4 | 5.0 pt |
| Balanced (0.5) | 420 | 210 | ~50% | 76.1 | 2.3 pt |
| Light (0.7) | 420 | 294 | ~30% | 77.6 | <1 pt |
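The reduction column follows directly from the token counts: reduction = 1 − tokens out / tokens in. A quick check of the table's arithmetic:

```python
# Verify the "Reduction" column from the table's token counts.
rows = {"aggressive": (420, 147), "balanced": (420, 210), "light": (420, 294)}
reduction = {name: 1 - out_ / in_ for name, (in_, out_) in rows.items()}
# balanced: 1 - 210/420 = 0.50 -> "~50%"
```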
## Integration (Drop-In)

Add Winnow in minutes: Python SDK, LangChain integration, raw HTTP, or an OpenAI-compatible proxy where you just swap your base URL. Zero config required.
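For the raw-HTTP path, here is a minimal sketch of a `/compress` request using only the standard library. The JSON field names are an assumption: they mirror the Python SDK's `compress()` parameters shown on this page, and the endpoint path comes from the self-host section below.

```python
import json
from urllib import request

# Hypothetical raw-HTTP call to a local Winnow server's /compress
# endpoint (assumption: the JSON fields mirror the SDK parameters).
payload = {
    "text": "Long retrieved context goes here...",
    "compression_ratio": 0.5,
    "rag_mode": True,
    "question": "What is the warranty period?",
}
req = request.Request(
    "http://localhost:8000/compress",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment with a running server
```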
```python
from winnow import Winnow

client = Winnow(base_url="http://localhost:8000")

# Sentence-only compression
result = client.compress(text=input_text, compression_ratio=0.5)

# Question-guided compression (RAG-aware)
guided = client.compress(
    text=input_text,
    compression_ratio=0.5,
    rag_mode=True,
    question="What is the warranty period?",
)

print(result["output"])
print(result["original_tokens"])    # 420
print(result["compressed_tokens"])  # 210
```

## Self-Host in Seconds

One command · No cloud · No API key
```shell
docker run -p 8000:8000 itsaryanchauhan/winnow
```

The server listens on `localhost:8000` and exposes `/health` and `/compress`.