# Benchmark Study

This project includes a reproducible live-study harness for comparing `gitquarry` search behavior across:
- native and discover retrieval
- quick, balanced, and deep discover depth
- native, query, activity, quality, and blended ranking
- README enrichment
- weighted blended variants
- language and recency filter slices
## Benchmark Queries

The default study uses two intentionally different queries:

- `api gateway`
- `terminal ui`

`api gateway` is noisy, infra-heavy, and useful for comparing `quality` versus `activity`. `terminal ui` is lexically cleaner and useful for comparing native search against discover and README-aware ranking.
## Study Runner

The harness lives at:

It writes raw outputs and derived analysis under `target/benchmark-study/`.
## What The Study Captures

For every run, the harness records:

- command scenario name and label
- query
- duration in milliseconds
- total result count and returned result count
- compiled query
- median stars and forks in the emitted top window
- median updated age in days
- language diversity in the emitted top window
- README coverage
- explain coverage
- per-repository scores and matched surfaces when available
The harness also compares each scenario against `native-best-match` for each query:
- top-k overlap
- top-k Jaccard similarity
- number of novel results
- average absolute rank shift among shared repositories
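The four comparison metrics above can be sketched for two ranked repository lists. This is an illustrative reimplementation under the stated definitions, not the harness's own code:

```python
def compare_to_native(native: list[str], scenario: list[str], k: int = 10) -> dict:
    """Compare a scenario's top-k window against the native baseline's."""
    nat, sc = native[:k], scenario[:k]
    shared = set(nat) & set(sc)
    overlap = len(shared)
    union = set(nat) | set(sc)
    jaccard = overlap / len(union) if union else 0.0
    novel = len(set(sc) - set(nat))
    # Average absolute rank shift among repositories present in both windows.
    shifts = [abs(nat.index(r) - sc.index(r)) for r in shared]
    avg_shift = sum(shifts) / len(shifts) if shifts else 0.0
    return {
        "overlap": overlap,
        "jaccard": jaccard,
        "novel": novel,
        "avg_abs_rank_shift": avg_shift,
    }
```

Note that for two top-10 windows sharing 8 repositories, the Jaccard score is 8 / (10 + 10 - 8) ≈ 0.6667, which is how the snapshot figures later in this document are derived.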
## Run It

Build the binary first if needed:

If `GITQUARRY_TOKEN` is not already set, the script will try GitHub CLI auth before failing.
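The auth fallback can be modeled with a short sketch. This is a simplified illustration of the described behavior, not the script's actual code; `gh auth token` is the GitHub CLI command that prints the current token:

```python
import os
import shutil
import subprocess

def resolve_token() -> str:
    """Prefer GITQUARRY_TOKEN; otherwise fall back to GitHub CLI auth."""
    token = os.environ.get("GITQUARRY_TOKEN", "")
    if token:
        return token
    # Only shell out to `gh` when the GitHub CLI is actually installed.
    if shutil.which("gh"):
        out = subprocess.run(["gh", "auth", "token"], capture_output=True, text=True)
        if out.returncode == 0 and out.stdout.strip():
            return out.stdout.strip()
    raise RuntimeError("Set GITQUARRY_TOKEN or authenticate with `gh auth login`.")
```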
To refresh the visual artifacts after the study outputs are present:
## Output Files

After a run, the key artifacts are:

- `target/benchmark-study/report.md`
- `target/benchmark-study/run-summaries.csv`
- `target/benchmark-study/repo-rows.csv`
- `target/benchmark-study/comparisons.csv`
- `target/benchmark-study/raw/<query>/<scenario>.json`
- `docs/images/benchmark-study/*.svg`
## How To Read The Results

Use the results in layers:

- Start with `report.md` for the headline differences.
- Use `run-summaries.csv` to compare scenario-level behavior such as latency, median stars, and README coverage.
- Use `comparisons.csv` to quantify how far each scenario moved from `native-best-match`.
- Use `repo-rows.csv` to inspect exact rank positions, score components, and matched surfaces.
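Ranking scenarios by drift from the native baseline might look like the sketch below. The column names `scenario` and `jaccard` are assumptions for illustration, not a documented schema of `comparisons.csv`:

```python
import csv
from io import StringIO

def scenarios_by_drift(csv_text: str) -> list[tuple[str, float]]:
    """Sort scenarios by ascending Jaccard similarity (most drift first)."""
    rows = csv.DictReader(StringIO(csv_text))
    ranked = [(row["scenario"], float(row["jaccard"])) for row in rows]
    return sorted(ranked, key=lambda pair: pair[1])
```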
## Interpretation Heuristics

These are the most useful contrasts:

- `native-best-match` vs `discover-balanced-blended`: baseline product contrast
- `discover-balanced-query` vs `discover-balanced-query-readme`: README enrichment impact
- `discover-quick-*` vs `discover-deep-*`: recall-vs-cost tradeoff
- `discover-balanced-blended` vs weighted blended variants: score sensitivity to explicit user preferences
- `native-updated-1y` vs `discover-balanced-activity-updated-1y`: how recency interacts with retrieval and ranking
## Visual Snapshot

### Latency Profile
- native remains effectively instant
- quick discover is the first meaningful latency jump
- balanced discover is the practical middle ground
- deep discover is materially more expensive
- README enrichment adds a visible extra cost on top of balanced discover
### Churn Versus Native

- `quality` stays closer to native than `query` or plain `blended`
- `quality-heavy` gives the strongest non-native compromise on `api gateway`
- Rust and recency slices are intentionally more selective and drift further from the default baseline
### Persistent Repository Leaders
## Current Snapshot

This section summarizes the completed live run executed on 2026-04-22.
High-level operational findings:
- Native runs stayed close to instant for both benchmark queries, with the slowest native baseline still around ~1.1s.
- Quick discover runs landed around 16-19s.
- Balanced discover runs landed around 26-31s.
- Deep discover runs landed around 52-62s.
- Balanced discover plus README enrichment added roughly ~3-5s on top of balanced discover.
`api gateway`:

- `discover-quick-native`, `discover-balanced-native`, `discover-deep-native`, and `native-updated-1y` all fully preserved the `native-best-match` top 10.
- `discover-balanced-quality` and `discover-deep-quality` stayed meaningfully closer to the native baseline than `query` or plain `blended`.
- `discover-balanced-blended-quality-heavy` was the strongest non-native compromise, with 8/10 overlap and a 0.6667 Jaccard score against `native-best-match`.
- The most persistent repositories across scenarios were `apache/apisix`, `kgateway-dev/kgateway`, `kubernetes-sigs/gateway-api`, and `spring-cloud/spring-cloud-gateway`.
`terminal ui`:

- `discover-balanced-quality`, `discover-deep-quality`, and `discover-balanced-blended-quality-heavy` all stayed closer to the native baseline than `query`, `activity`, or plain `blended`.
- `discover-balanced-query` still produced the biggest semantic shift among the balanced non-slice runs, with only 4/10 overlap and 6 novel results versus `native-best-match`.
- The Rust slice stayed intentionally far from the baseline. Both `native-rust` and `discover-balanced-blended-rust` overlapped the native baseline by only 1/10.
- The recency slice completed successfully and remained high-churn. `discover-balanced-activity-updated-1y` overlapped the native baseline by 4/10 and introduced 6 novel repositories.
- The most persistent repositories across scenarios were `gitui-org/gitui`, `gui-cs/Terminal.Gui`, `jesseduffield/lazygit`, and `containers/podman-tui`.
- Three expensive `terminal ui` scenarios initially needed staggered reruns in tmux, but the final artifact set is now complete and the regenerated report contains no exclusions.
## Constraints
- This is a live GitHub benchmark. Results will drift over time as repositories change.
- The study is designed for comparison and contrast, not for freezing a permanent golden dataset.
- GitHub search rate limits still apply. The runner inserts a short sleep between calls to avoid bursty traffic.
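The pacing behavior mentioned above can be modeled with a small sketch. The interval value is illustrative; the runner's actual pacing is internal:

```python
import time

def run_throttled(calls, min_interval_s: float = 2.0) -> list:
    """Run zero-argument callables in order, sleeping between consecutive
    calls so a live search API never sees bursty back-to-back traffic."""
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            time.sleep(min_interval_s)
        results.append(call())
    return results
```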