Benchmark Study

This page is the benchmark dossier for gitquarry search. It is not just a run log. The goal is to answer the product questions that matter in practice:
  • how much latency each option adds
  • what that extra cost actually buys
  • which modes preserve the native baseline versus deliberately breaking away from it
  • when optional knobs such as README enrichment, weighted blends, recency, and language filters are worth using
The current live study uses two intentionally different benchmark queries:
  • api gateway
  • terminal ui
Those two queries are useful because they stress different failure modes. api gateway is noisy and infra-heavy. terminal ui is lexically cleaner and exposes whether a mode is adding useful semantic breadth or just drifting.

Executive Summary

  • Native is still the only path that stays at roughly one second or less. In this run it took between ~0.5s and ~1.1s.
  • Quick discover adds ~15.7s to ~18.3s over native.
  • Balanced discover adds ~26.8s to ~30.1s over native.
  • Deep discover adds ~52.5s to ~59.8s over native.
  • README enrichment added another ~2.9s to ~4.6s on top of balanced discover and did not improve top-10 Jaccard overlap in either benchmark query.
  • For baseline preservation, quality is the best default non-native rank mode.
  • For api gateway, the strongest upgrade from native was discover-balanced-blended-quality-heavy.
  • For terminal ui, the strongest upgrade from native was discover-balanced-quality.
  • For maximum semantic expansion, query still introduces the most novelty, but it sheds much more of the native core.
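The overlap figures quoted throughout this page (top-10 Jaccard, native top-five retention, novel results) can be reproduced with a few set operations. The sketch below is illustrative only: the helper names and the repository lists are hypothetical, and it assumes that "retained" means a native top-five repository appears anywhere in the candidate top 10. The harness may define these metrics differently.

```python
def jaccard(a, b):
    """Jaccard overlap of two result lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def core_retention(native, candidate, k=5):
    """Count how many of the native top-k appear anywhere in the candidate list."""
    return len(set(native[:k]) & set(candidate))


def novel_count(native, candidate):
    """Count candidate results that are absent from the native list."""
    return len(set(candidate) - set(native))


# Hypothetical top-10 lists: the candidate keeps 6 native repositories
# (4 of them from the native top five) and introduces 4 new ones.
native = [f"repo-{i}" for i in range(10)]
candidate = ["repo-0", "repo-1", "repo-2", "repo-3", "repo-6", "repo-7",
             "new-a", "new-b", "new-c", "new-d"]

print(round(jaccard(native, candidate), 4))  # 6 shared / 14 in the union
print(core_retention(native, candidate))     # native top-five survivors
print(novel_count(native, candidate))        # results not in the native list
```

With two 10-item lists, 6 shared results give 6/14 ≈ 0.4286, and 6 novel results leave 4 shared, giving 4/16 = 0.25. Under the same assumption, keeping 8 of 10 corresponds to 8/12 ≈ 0.6667.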

Recommendation Matrix

| Goal | Recommended Option | Why it wins in this study | Do not choose it when… |
| --- | --- | --- | --- |
| Fastest safe default | native-best-match | It is the only path that stayed at roughly one second or less and remains the reference baseline for all comparisons. | You explicitly want semantic expansion or explain-driven ranking behavior. |
| Cheapest discover path | discover-quick-native | It preserves the native top 10 while showing the minimum discover-depth tax. | You need additional novelty, because quick-native bought cost without result-set change in this run. |
| Best upgrade from native for api gateway | discover-balanced-blended-quality-heavy | It kept 8/10 of the native top 10, retained 5/5 of the native top five, and raised quality without paying deep-mode cost. | You primarily want novel repositories instead of a curated upgrade to the baseline set. |
| Best upgrade from native for terminal ui | discover-balanced-quality | It matched the best non-native Jaccard (0.4286), retained 4/5 of the native top five, and was cheaper than quality-heavy. | You want maximum semantic drift or query-heavy expansion. |
| Strongest semantic expansion | discover-balanced-query | It produced 6 novel results on both benchmark queries and maximized semantic movement within the balanced family. | You care about preserving the native core, because it kept only 1/5 of the native top five on both queries. |
| Practical middle ground | discover-balanced-blended | It is often the balanced family's cheapest general mode and usually sits near the frontier between cost and novelty. | You need stronger baseline fidelity than a 0.25 to 0.3333 Jaccard result can offer. |
| Freshest repository slice | discover-balanced-activity --updated-within 1y | It is the discover path to use when recency matters more than baseline stability. | You expect it to behave like default search. The recency slice introduced 5 to 6 novel results and materially changed the set. |
| Language-constrained search | native-rust first, then discover-balanced-blended --language Rust only if needed | Native filtering is extremely cheap. The discover Rust slice is only useful when semantic expansion inside the language slice matters. | You want general-purpose search. The Rust slice is intentionally far from the default baseline. |
| README-aware investigation | --readme only as an explicit second pass | It guarantees README evidence in explain output and can help inspect why a result matched. | You are optimizing for latency or top-10 stability. It added cost without improving top-10 overlap in this benchmark. |

Cost Ladder

Figures: Benchmark latency profile; Depth overhead versus the native path; README enrichment tax.

The latency pattern is stable enough to drive product guidance:
  • Native is the only low-latency mode.
  • Quick discover is already a substantial tax. Treat it as a deliberate opt-in, not a near-native fallback.
  • Balanced discover is the practical analysis tier. It is slow enough to matter, but still much cheaper than deep.
  • Deep discover is expensive enough that it should be reserved for deliberate heavy-recall workflows.
  • README enrichment is not free. In this study it added ~11% to ~15% on top of balanced discover while leaving top-10 overlap unchanged.
Concrete overhead from this run:
| Query | Quick over native | Balanced over native | Deep over native | README tax range |
| --- | --- | --- | --- | --- |
| api gateway | +18.3s | +30.1s | +59.8s | +3.5s to +4.6s |
| terminal ui | +15.7s | +26.8s | +52.5s | +2.9s to +3.9s |
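To make the shape of the ladder explicit, the overhead figures above can be turned into step ratios. This is a small sketch using only the numbers in the table; the dictionary layout and the function name are illustrative and are not part of the harness.

```python
# Overhead over native, in seconds, copied from the table above.
overhead = {
    "api gateway": {"quick": 18.3, "balanced": 30.1, "deep": 59.8},
    "terminal ui": {"quick": 15.7, "balanced": 26.8, "deep": 52.5},
}


def step_ratios(row):
    """Return how much each depth step multiplies the previous step's overhead."""
    return {
        "balanced_vs_quick": row["balanced"] / row["quick"],
        "deep_vs_balanced": row["deep"] / row["balanced"],
    }


for query, row in overhead.items():
    ratios = {name: round(value, 2) for name, value in step_ratios(row).items()}
    print(query, ratios)
```

On both queries the deep overhead comes out at close to twice the balanced overhead, which is the sense in which deep roughly doubles the balanced tax.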

Balanced Decision Zone

Balanced discover is where the real product tradeoffs live. It is the family most likely to be exposed as the default advanced mode, so it deserves deeper inspection than the raw run table.

Figures: Balanced-mode tradeoff map; Balanced frontier map; Baseline core retention; Balanced-mode surface attribution mix.

These charts are rendered directly from the benchmark CSV artifacts for the docs site. The exact audit trail for values and scenario comparisons still lives in report.md, paired-effects.csv, and scenario-analysis.csv.

What these views show:
  • quality and quality-heavy preserve the native core far better than query.
  • blended is often the cheapest balanced choice, but its top-10 fidelity is materially lower than quality.
  • query is not the best frontier choice in this run. It is dominated by cheaper alternatives with equal or better novelty tradeoffs in key cases.
  • quality-heavy is especially strong on api gateway, where it preserves the native core while improving repository quality signals.
  • The surface-mix chart explains why: quality leans less on repository names and more on description and topic evidence.
Balanced-family frontier takeaways:
  • On api gateway, the balanced frontier includes native, activity, quality, blended, and quality-heavy.
  • On terminal ui, the balanced frontier includes native, quality, blended, and query-heavy.
  • discover-balanced-query is not on the balanced frontier for either benchmark query.
  • README variants are off the frontier in this run because they add cost without improving top-10 fidelity.
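The word "frontier" here refers to scenarios that no other scenario beats outright. A minimal sketch of that idea follows, assuming a Pareto-style rule over three axes (latency, top-10 Jaccard, novel results). The axes, the type, the scenario names, and every number below are hypothetical placeholders, not benchmark values; the criteria the study actually used are recorded in balanced-frontier.csv.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    name: str
    latency_s: float  # lower is better
    jaccard: float    # higher is better
    novel: int        # higher is better


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.latency_s <= b.latency_s
                and a.jaccard >= b.jaccard
                and a.novel >= b.novel)
    strictly_better = (a.latency_s < b.latency_s
                       or a.jaccard > b.jaccard
                       or a.novel > b.novel)
    return no_worse and strictly_better


def frontier(scenarios):
    """Return the names of scenarios that no other scenario dominates."""
    return [s.name for s in scenarios
            if not any(dominates(other, s) for other in scenarios)]


# Placeholder data for illustration only.
scenarios = [
    Scenario("mode-a", 1.0, 1.00, 0),
    Scenario("mode-b", 28.0, 0.43, 4),
    Scenario("mode-c", 27.0, 0.33, 5),
    Scenario("mode-d", 29.0, 0.25, 5),  # slower and lower overlap than mode-c
]

print(frontier(scenarios))
```

In this toy data, mode-d falls off the frontier because mode-c is cheaper with equal novelty and higher overlap, which mirrors how a scenario can add novelty and still be dominated.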

Ranking Mode Guidance

The rank mode is the real behavior selector. Depth mostly controls cost. Rank controls what sort of repositories survive the cut.
| Rank mode | What it tends to optimize | Strengths in this study | Weaknesses in this study | Best use |
| --- | --- | --- | --- | --- |
| native | Preservation of the original GitHub result set | Perfect overlap with the native baseline at any discover depth | You pay discover latency without changing the result set | Sanity checks, pipeline validation, and baseline-preserving comparisons |
| query | Maximum lexical and semantic expansion | Highest novelty in balanced mode, with 6 novel results on both benchmark queries | It retained only 1/5 of the native top five on both queries | Exploratory search when broad expansion matters more than fidelity |
| activity | Fresher, more active repositories | Useful when you deliberately want newer movement and more churn | It did not beat quality as a general default and still paid balanced latency | Recency-focused or trend-seeking workflows |
| quality | Higher-quality, more established repositories | Best default non-native rank for preserving the native core while improving median stars | Less novelty than query, and can stay conservative | Default advanced ranking when you want better curation without losing the core set |
| blended | Middle ground across query, activity, and quality | Usually one of the cheapest balanced options and often frontier-competitive | Default blended weights were weaker than quality for baseline preservation | General-purpose discover when you want some novelty without going full query mode |

Knob Guidance

| Option | What it does | What the study says | Practical guidance |
| --- | --- | --- | --- |
| --depth quick | Reduces candidate expansion relative to balanced and deep | Still expensive over native, but much cheaper than deep | Use only when you want the cheapest discover proof point |
| --depth balanced | Middle ground between recall and cost | Best general operating point for comparison, tradeoff tuning, and explain analysis | Treat this as the default experimental tier |
| --depth deep | Maximum candidate expansion | Roughly doubles the balanced tax without proportional gains in these two queries | Reserve for explicit high-recall investigations |
| --readme | Adds README evidence to explain and matching | Added ~3s to ~5s without top-10 gains in this run | Keep it as a targeted second pass, not the default |
| --weight-query heavy | Pushes blended toward query-driven novelty | Helpful on terminal ui, where query-heavy reached the frontier | Use when plain blended feels too conservative |
| --weight-activity heavy | Pushes blended toward recency and activity | Weak general payoff in this run | Use only with a clear freshness objective |
| --weight-quality heavy | Pushes blended toward higher-quality repositories | Very strong on api gateway, where it became the best non-native upgrade from baseline | Good option when you want a safer upgrade than raw query |
| --updated-within 1y | Forces recency slice | Produces high churn and should be treated as a different intent, not a small tweak | Use only when freshness is a hard requirement |
| --language Rust | Narrows search to a language slice | Cheap natively, expensive under discover, and intentionally far from default search | Start with native language filtering, then add discover only if needed |
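To build intuition for how the weight knobs can reorder results, here is a sketch of a normalized weighted blend. It is an assumption about the general shape of such a score, not gitquarry's actual formula: the function, the per-signal scores, and the normalization are hypothetical. Only the 0.5 / 0.5 / 2.0 weights come from the quality-heavy command pattern in the Operator Playbook.

```python
def blended_score(query, activity, quality,
                  w_query=1.0, w_activity=1.0, w_quality=1.0):
    """Weighted average of three per-repository signals, each assumed to be in [0, 1]."""
    total = w_query + w_activity + w_quality
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (w_query * query + w_activity * activity + w_quality * quality) / total


# Hypothetical repositories: one matches the query text strongly,
# the other is a more established project with a weaker text match.
lexical_hit = {"query": 0.9, "activity": 0.5, "quality": 0.3}
established = {"query": 0.4, "activity": 0.4, "quality": 0.8}

equal = (blended_score(**lexical_hit), blended_score(**established))
quality_heavy = tuple(
    blended_score(**repo, w_query=0.5, w_activity=0.5, w_quality=2.0)
    for repo in (lexical_hit, established)
)

print("equal weights:", [round(s, 4) for s in equal])
print("quality-heavy:", [round(s, 4) for s in quality_heavy])
```

With equal weights the strong lexical match ranks first; with the quality-heavy weights the established repository overtakes it. That is the kind of reordering a quality-heavy blend is meant to produce, whatever the real scoring function looks like.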

Query-Specific Findings

api gateway

  • discover-balanced-blended-quality-heavy was the best non-native compromise.
  • It kept 8/10 of the native top 10 and 5/5 of the native top five.
  • discover-balanced-quality also retained the full native top five, but with less baseline overlap than quality-heavy.
  • discover-balanced-query and discover-balanced-blended both delivered 6 novel results, but each kept only 1/5 of the native top five.
  • README enrichment added +3.5s to +4.6s and did not improve top-10 Jaccard.

terminal ui

  • discover-balanced-quality was the best non-native default.
  • It kept 4/5 of the native top five with 0.4286 Jaccard and remained cheaper than quality-heavy.
  • discover-balanced-blended sat on the frontier because it was cheaper and still delivered 5 novel results.
  • discover-balanced-blended-query-heavy also sat on the frontier and dominated plain discover-balanced-query in this run.
  • README enrichment added +2.9s to +3.9s and again did not improve top-10 Jaccard.

Churn And Stable Leaders

Figures: Scenario churn versus the native baseline; Most persistent repositories across the study.

These two views help with interpretation:
  • The churn chart tells you which options are still “about the same search” versus which ones are effectively different products.
  • The persistence chart shows which repositories survive almost every mode and filter change.
  • Persistent leaders are especially useful for screenshots, demo flows, and explanation examples because they are less likely to disappear when the ranking strategy changes.
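Persistence can be approximated by counting how many scenarios each repository appears in. The sketch below assumes a simple mapping from scenario name to top-10 list; the function names, scenario labels, and repository names are hypothetical, and the study's own persistence chart may be computed differently.

```python
from collections import Counter


def persistence(results_by_scenario):
    """Count, for each repository, the number of scenarios whose results include it."""
    counts = Counter()
    for repos in results_by_scenario.values():
        counts.update(set(repos))  # set() so a repo counts once per scenario
    return counts


def stable_leaders(results_by_scenario):
    """Return repositories that appear in every scenario, sorted by name."""
    counts = persistence(results_by_scenario)
    n = len(results_by_scenario)
    return sorted(repo for repo, seen in counts.items() if seen == n)


# Hypothetical scenario outputs, shortened to three results each.
results = {
    "scenario-1": ["owner/a", "owner/b", "owner/c"],
    "scenario-2": ["owner/a", "owner/c", "owner/d"],
    "scenario-3": ["owner/a", "owner/e", "owner/c"],
}

print(stable_leaders(results))
```

Repositories that survive every scenario in a count like this are the natural candidates for screenshots and demo flows.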

How To Run The Study

The benchmark harness lives at:
Build the binary first if needed:
cargo build
Run the full live study:
python3 scripts/benchmark-study.py
Rebuild derived analysis from existing raw outputs:
python3 scripts/benchmark-study.py --analyze-only
Refresh all chart assets:
uv run scripts/render-benchmark-charts.py
If GITQUARRY_TOKEN is not set, the runner will try to use GitHub CLI auth before failing.
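The fallback described above could look roughly like the following. This is a sketch, not the runner's actual code: the function name and error messages are hypothetical, and it assumes the GitHub CLI fallback uses `gh auth token`, which prints the stored token for the active account.

```python
import os
import shutil
import subprocess


def resolve_token(env=None):
    """Return a GitHub token from GITQUARRY_TOKEN, else from GitHub CLI auth."""
    env = os.environ if env is None else env
    token = env.get("GITQUARRY_TOKEN", "").strip()
    if token:
        return token

    # Assumed fallback: ask the GitHub CLI for its stored token.
    if shutil.which("gh") is None:
        raise RuntimeError("GITQUARRY_TOKEN is not set and the gh CLI is not installed")
    proc = subprocess.run(["gh", "auth", "token"],
                          capture_output=True, text=True, check=False)
    token = proc.stdout.strip()
    if proc.returncode != 0 or not token:
        raise RuntimeError("GITQUARRY_TOKEN is not set and gh auth returned no token")
    return token
```

Passing an explicit `env` mapping keeps the function easy to exercise without touching the real environment.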

Operator Playbook

If the goal is to help an operator choose a mode quickly, use these presets instead of re-reading the full study every time:
| Operator intent | Recommended command pattern | Why |
| --- | --- | --- |
| Fastest default | gitquarry search "<query>" | Keeps latency near the native baseline. |
| Safer discover upgrade | gitquarry search "<query>" --mode discover --depth balanced --rank quality --explain | Best default non-native mode when you want better curation without throwing away the baseline core. |
| Broader semantic exploration | gitquarry search "<query>" --mode discover --depth balanced --rank query --explain | Highest novelty in the balanced family. |
| Better curated api gateway-style results | gitquarry search "<query>" --mode discover --depth balanced --rank blended --weight-query 0.5 --weight-activity 0.5 --weight-quality 2.0 --explain | This is the quality-heavy shape that performed best on the noisier benchmark query. |
| Fresh repos only | gitquarry search "<query>" --mode discover --depth balanced --rank activity --updated-within 1y --explain | Use when freshness is a hard requirement and churn is acceptable. |
| Language slice first | gitquarry search "<query>" --language Rust | Start cheap. Only add discover after confirming the slice is worth exploring semantically. |
| README inspection pass | gitquarry search "<query>" --mode discover --depth balanced --rank quality --readme --explain | Use as a second pass when you need richer evidence, not as the default query path. |
Default recommendation for most operators:
gitquarry search "<query>" --mode discover --depth balanced --rank quality --explain
Escalation rule:
  1. Start with native if latency matters most.
  2. Move to balanced quality if you need a smarter curated set.
  3. Move to balanced query only when you explicitly want more novel repositories.
  4. Add --readme, --updated-within, or --language only when the task requires that specific constraint.
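The escalation rule can be encoded as a small helper that builds the argument list for each step. The sketch uses only flags that appear in the command patterns above; the function, the intent labels ("fast", "curated", "novel"), and the idea of wrapping the CLI this way are hypothetical. It does not check whether a given flag combination is accepted by gitquarry beyond the combinations shown on this page.

```python
import shlex

# Intent label -> rank mode used with balanced discover (labels are illustrative).
DISCOVER_RANK = {"curated": "quality", "novel": "query"}


def build_command(query, intent="fast", readme=False,
                  updated_within=None, language=None):
    """Build a gitquarry argument list following the escalation rule."""
    cmd = ["gitquarry", "search", query]
    if intent in DISCOVER_RANK:
        cmd += ["--mode", "discover", "--depth", "balanced",
                "--rank", DISCOVER_RANK[intent], "--explain"]
    elif intent != "fast":
        raise ValueError(f"unknown intent: {intent!r}")

    # Step 4: constraints are added only when explicitly requested.
    if readme:
        cmd.append("--readme")
    if updated_within:
        cmd += ["--updated-within", updated_within]
    if language:
        cmd += ["--language", language]
    return cmd


print(shlex.join(build_command("api gateway")))
print(shlex.join(build_command("api gateway", intent="curated")))
print(shlex.join(build_command("terminal ui", language="Rust")))
```

Returning a list rather than a string keeps the query safely quoted if the result is later passed to `subprocess.run`.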

Output Files

The study writes raw and derived artifacts to target/benchmark-study/. Most useful outputs:
  • run-summaries.csv
  • comparisons.csv
  • scenario-analysis.csv
  • paired-effects.csv
  • balanced-frontier.csv
  • repo-rows.csv
  • report.md
  • raw/<query>/<scenario>.json
The docs visuals are published from docs/images/benchmark-study/ as direct Altair and Vega-Lite renders in both SVG and high-resolution PNG form. The CSV and markdown artifacts remain the exact source of truth for benchmark values.

How To Read The Data

Use the artifacts in this order:
  1. Start with report.md for the headline summary.
  2. Use paired-effects.csv for the cleanest latency-tax and delta analysis.
  3. Use scenario-analysis.csv for decision metrics such as core retention, surface shares, and frontier flags.
  4. Use repo-rows.csv when you need exact repository-level evidence, scores, and matched surfaces.
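A quick way to get oriented in the CSV artifacts is to load each one and list its columns before writing any analysis. The sketch below makes no assumption about column names, since they are not documented on this page; the helper names are hypothetical, and the file names and output directory are the ones listed under Output Files.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV artifact into a list of dicts keyed by header name."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def describe(path):
    """Return (row_count, column_names) for a CSV artifact."""
    rows = load_rows(path)
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    out_dir = Path("target/benchmark-study")
    for name in ("paired-effects.csv", "scenario-analysis.csv", "repo-rows.csv"):
        artifact = out_dir / name
        if artifact.exists():  # skip quietly if the study has not been run yet
            print(name, describe(artifact))
```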

Confidence Limits

This is a strong directional benchmark, not a universal law.
  • It uses two live queries, not a full benchmark corpus.
  • It is a single-run live benchmark against GitHub data that changes over time.
  • Latency is affected by network and GitHub response conditions.
  • The top-10 overlap metrics are decision-useful, but they do not capture every ranking-quality dimension.
  • README enrichment may pay off more on other query classes even though it did not change top-10 overlap here.
The right way to use this study is as a decision support artifact. It is precise enough to compare mode families and tune defaults, but it should be rerun when the ranking model, GitHub corpus, or product goals materially change.