Benchmark Study

This project includes a reproducible live-study harness for comparing gitquarry search behavior across:
  • native and discover retrieval
  • quick, balanced, and deep discover depth
  • native, query, activity, quality, and blended ranking
  • README enrichment
  • weighted blended variants
  • language and recency filter slices

Benchmark Queries

The default study uses two intentionally different queries:
  • api gateway
  • terminal ui
Why these two:
  • api gateway is noisy, infra-heavy, and useful for comparing quality versus activity.
  • terminal ui is lexically cleaner and useful for comparing native search against discover and README-aware ranking.
Together they stress both broad semantic expansion and tighter lexical matching.

Study Runner

The harness lives at scripts/benchmark-study.py. It writes raw outputs and derived analysis under:
  • target/benchmark-study/

What The Study Captures

For every run, the harness records:
  • command scenario name and label
  • query
  • duration in milliseconds
  • total result count and returned result count
  • compiled query
  • median stars and forks in the emitted top window
  • median updated age in days
  • language diversity in the emitted top window
  • README coverage
  • explain coverage
  • per-repository scores and matched surfaces when available
It also computes scenario-to-baseline comparisons against native-best-match for each query:
  • top-k overlap
  • top-k Jaccard similarity
  • number of novel results
  • average absolute rank shift among shared repositories
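The four comparison metrics above can be sketched in a few lines. This is an illustrative reimplementation, not the harness's actual code; function and key names are assumptions.

```python
def compare_to_baseline(baseline, scenario, k=10):
    """Compare two ranked lists of repository full names against each other."""
    base_top = baseline[:k]
    scen_top = scenario[:k]
    shared = set(base_top) & set(scen_top)
    union = set(base_top) | set(scen_top)

    overlap = len(shared)                                  # top-k overlap
    jaccard = overlap / len(union) if union else 0.0       # top-k Jaccard similarity
    novel = len(set(scen_top) - set(base_top))             # results absent from baseline

    # average absolute rank shift among shared repositories
    shifts = [abs(base_top.index(r) - scen_top.index(r)) for r in shared]
    avg_shift = sum(shifts) / len(shifts) if shifts else 0.0

    return {"overlap": overlap, "jaccard": jaccard,
            "novel": novel, "avg_rank_shift": avg_shift}

# Example: one shared repo replaced, two swapped in place.
print(compare_to_baseline(["a/x", "b/y", "c/z", "d/w"],
                          ["b/y", "a/x", "e/v", "c/z"], k=4))
```

A scenario that fully preserves the baseline top 10 would score overlap 10, Jaccard 1.0, zero novel results, and zero rank shift.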

Run It

Build the binary first if needed:
cargo build
Then run the study:
python3 scripts/benchmark-study.py
If you prefer an explicit output directory:
python3 scripts/benchmark-study.py --output-dir target/benchmark-study
If GITQUARRY_TOKEN is not already set, the script will try GitHub CLI auth before failing. To refresh the visual artifacts after the study outputs are present:
python3 scripts/render-benchmark-charts.py
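If you want to seed the token yourself rather than rely on the script's fallback, one way is to pull it from the GitHub CLI before running. This mirrors the fallback described above; it assumes an authenticated gh install.

```shell
# Export GITQUARRY_TOKEN from the gh CLI if it is not already set.
if [ -z "${GITQUARRY_TOKEN:-}" ]; then
  export GITQUARRY_TOKEN="$(gh auth token)"
fi
```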

Output Files

After a run, the key artifacts are:
  • target/benchmark-study/report.md
  • target/benchmark-study/run-summaries.csv
  • target/benchmark-study/repo-rows.csv
  • target/benchmark-study/comparisons.csv
  • target/benchmark-study/raw/<query>/<scenario>.json
  • docs/images/benchmark-study/*.svg

How To Read The Results

Use the results in layers:
  1. Start with report.md for the headline differences.
  2. Use run-summaries.csv to compare scenario-level behavior such as latency, median stars, and README coverage.
  3. Use comparisons.csv to quantify how far each scenario moved from native-best-match.
  4. Use repo-rows.csv to inspect exact rank positions, score components, and matched surfaces.
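Step 3 of this layered reading can be done with nothing beyond the standard library. The sketch below sorts scenarios by how far they drifted from the native baseline; the column names are assumptions about comparisons.csv, and the inline sample stands in for the real file.

```python
import csv
import io

# Stand-in for open("target/benchmark-study/comparisons.csv").
# Assumed columns: scenario, query, jaccard, novel.
sample = io.StringIO(
    "scenario,query,jaccard,novel\n"
    "discover-balanced-blended,api gateway,0.6667,2\n"
    "discover-balanced-query,terminal ui,0.2500,6\n"
)

# Lowest Jaccard first: the scenarios that moved furthest from native.
rows = sorted(csv.DictReader(sample), key=lambda r: float(r["jaccard"]))
for r in rows:
    print(f'{r["scenario"]:28s} jaccard={r["jaccard"]} novel={r["novel"]}')
```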

Interpretation Heuristics

These are the most useful contrasts:
  • native-best-match vs discover-balanced-blended
    • baseline product contrast
  • discover-balanced-query vs discover-balanced-query-readme
    • README enrichment impact
  • discover-quick-* vs discover-deep-*
    • recall-vs-cost tradeoff
  • discover-balanced-blended vs weighted blended variants
    • score sensitivity to explicit user preferences
  • native-updated-1y vs discover-balanced-activity-updated-1y
    • how recency interacts with retrieval and ranking

Visual Snapshot

Latency Profile

[Chart: benchmark latency profile]

This chart makes the cost ladder obvious:
  • native remains effectively instant
  • quick discover is the first meaningful latency jump
  • balanced discover is the practical middle ground
  • deep discover is materially more expensive
  • README enrichment adds a visible extra cost on top of balanced discover

Churn Versus Native

[Chart: scenario churn versus the native baseline]

This chart is the main compare-and-contrast view for option selection:
  • quality stays closer to native than query or plain blended
  • quality-heavy gives the strongest non-native compromise on api gateway
  • Rust and recency slices are intentionally more selective and drift further from the default baseline

Persistent Repository Leaders

[Chart: most persistent repositories across the study]

This chart shows which repositories keep surviving rank, depth, README, and filter changes. Those are the best candidates for examples, screenshots, and explanation in follow-up writeups.

Current Snapshot

This section summarizes the completed live run executed on 2026-04-22. High-level operational findings:
  • Native runs stayed sub-second for both benchmark queries, with the slowest native baseline still around 1.1s.
  • Quick discover runs landed around 16-19s.
  • Balanced discover runs landed around 26-31s.
  • Deep discover runs landed around 52-62s.
  • Balanced discover plus README enrichment added roughly 3-5s on top of balanced discover.
Behavioral findings for api gateway:
  • discover-quick-native, discover-balanced-native, discover-deep-native, and native-updated-1y all fully preserved the native-best-match top 10.
  • discover-balanced-quality and discover-deep-quality stayed meaningfully closer to the native baseline than query or plain blended.
  • discover-balanced-blended-quality-heavy was the strongest non-native compromise, with 8/10 overlap and a 0.6667 Jaccard score against native-best-match.
  • The most persistent repositories across scenarios were apache/apisix, kgateway-dev/kgateway, kubernetes-sigs/gateway-api, and spring-cloud/spring-cloud-gateway.
Behavioral findings for terminal ui:
  • discover-balanced-quality, discover-deep-quality, and discover-balanced-blended-quality-heavy all stayed closer to the native baseline than query, activity, or plain blended.
  • discover-balanced-query still produced the biggest semantic shift among the balanced non-slice runs, with only 4/10 overlap and 6 novel results versus native-best-match.
  • The Rust slice stayed intentionally far from the baseline. Both native-rust and discover-balanced-blended-rust overlapped the native baseline by only 1/10.
  • The recency slice completed successfully and remained high-churn. discover-balanced-activity-updated-1y overlapped the native baseline by 4/10 and introduced 6 novel repositories.
  • The most persistent repositories across scenarios were gitui-org/gitui, gui-cs/Terminal.Gui, jesseduffield/lazygit, and containers/podman-tui.
Run-completion note:
  • Three expensive terminal ui scenarios initially needed staggered reruns in tmux, but the final artifact set is now complete and the regenerated report contains no exclusions.

Constraints

  • This is a live GitHub benchmark. Results will drift over time as repositories change.
  • The study is designed for comparison and contrast, not for freezing a permanent golden dataset.
  • Search rate limits still apply. The runner inserts a short sleep between calls to avoid bursty traffic.
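The pacing behavior described in the last point can be sketched as follows. The one-second delay and function names here are illustrative, not the harness's actual interval.

```python
import time

def run_paced(calls, delay=1.0):
    """Run a sequence of zero-argument callables, sleeping between them."""
    results = []
    for i, call in enumerate(calls):
        if i:
            time.sleep(delay)  # pause between successive API calls
        results.append(call())
    return results
```

Pacing between calls keeps a multi-scenario study under GitHub's search rate limits at the cost of longer wall-clock runs.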