# Latency Improvement

Every query passes through a five-tier routing system, ordered fastest to slowest. Each tier acts as a short-circuit: if it can answer, it returns immediately and skips all slower tiers. Every result is cached, so a repeated or similar query resolves at Tier 0 or 1 while its cache entry is still valid.

| Tier | Name          | Latency     | LLM Calls |
| ---- | ------------- | ----------- | --------- |
| 0    | Exact cache   | \~0ms       | 0         |
| 1    | Fuzzy cache   | \~50ms      | 0         |
| 2    | Direct search | \~100–200ms | 0         |
| 3    | Single LLM    | \<5s        | 1         |
| 4    | Agentic loop  | 8–15s       | Multiple  |

## Tier 0: Exact Cache Hit (0ms)

The fastest path. A normalized version of your query is looked up in an in-memory map.

### Algorithm

1. Normalize query: lowercase, trim, collapse whitespace
2. Build cache key: `normalized_query`
3. `Map.get(key)` — O(1) lookup
4. Validate TTL: `(now - storedAt) < 60s`
5. Validate fingerprint: MD5 hash of sorted `path:mtime` pairs must match current context tree state
6. **HIT** — return cached response immediately

### Context Tree Fingerprint

The cache uses a fingerprint of the context tree to detect changes:

* Glob all `.md` files in `.brv/context-tree/`
* Sort by path, concat `path:mtime` joined by `|`, MD5 hash, take first 16 hex chars
* The fingerprint itself is cached for 30s to avoid repeated glob I/O
* If any file is added, removed, or modified, the fingerprint changes and all cache entries are invalidated
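The fingerprint steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `FileStat` shape, the `listFiles` callback (standing in for the glob over `.brv/context-tree/`), and the function names are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: one entry per .md file found under .brv/context-tree/.
interface FileStat {
  path: string;
  mtimeMs: number;
}

// Sort by path, join "path:mtime" pairs with "|", MD5-hash, keep 16 hex chars.
function computeFingerprint(files: FileStat[]): string {
  const joined = [...files]
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
    .map((f) => `${f.path}:${f.mtimeMs}`)
    .join("|");
  return createHash("md5").update(joined).digest("hex").slice(0, 16);
}

// Memoize the fingerprint for 30s so repeated queries skip the glob.
const FINGERPRINT_TTL_MS = 30_000;
let cached: { value: string; computedAt: number } | undefined;

// `listFiles` is a hypothetical stand-in for the real glob + stat step.
function getFingerprint(listFiles: () => FileStat[], now: number = Date.now()): string {
  if (cached && now - cached.computedAt < FINGERPRINT_TTL_MS) {
    return cached.value;
  }
  cached = { value: computeFingerprint(listFiles()), computedAt: now };
  return cached.value;
}
```

One consequence of memoizing in this way is that a file change may go unnoticed until the 30s fingerprint cache expires; whether the real system behaves identically is not stated here.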

### Parameters

| Parameter             | Value                        |
| --------------------- | ---------------------------- |
| Max cache size        | 50 entries                   |
| TTL                   | 60 seconds                   |
| Fingerprint cache TTL | 30 seconds                   |
| Eviction policy       | LRU (oldest insertion order) |

**Cost:** O(1) map lookup. No LLM calls, no search, and no I/O while the cached fingerprint (30s) is still fresh.
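A minimal sketch of the lookup and eviction described above, assuming the caller supplies the current fingerprint as a string. The `ExactCache` class and its method names are hypothetical. Eviction here removes the oldest inserted key, and a read does not refresh an entry's position; both are assumptions about how "LRU (oldest insertion order)" is meant.

```typescript
interface CacheEntry {
  response: string;
  storedAt: number;    // epoch milliseconds
  fingerprint: string; // context-tree fingerprint at store time
}

const MAX_ENTRIES = 50;
const TTL_MS = 60_000;

// Lowercase, trim, and collapse runs of whitespace.
function normalizeQuery(query: string): string {
  return query.toLowerCase().trim().replace(/\s+/g, " ");
}

class ExactCache {
  private entries = new Map<string, CacheEntry>();

  get(query: string, currentFingerprint: string, now: number = Date.now()): string | undefined {
    const key = normalizeQuery(query);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    const fresh = now - entry.storedAt < TTL_MS;
    if (!fresh || entry.fingerprint !== currentFingerprint) {
      this.entries.delete(key); // stale or built against an older context tree
      return undefined;
    }
    return entry.response;
  }

  set(query: string, response: string, fingerprint: string, now: number = Date.now()): void {
    const key = normalizeQuery(query);
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= MAX_ENTRIES) {
      // A Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { response, storedAt: now, fingerprint });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because the key is the normalized query, `"  How  DOES JWT work "` and `"how does jwt work"` share one entry.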

## Tier 1: Fuzzy Cache Match (\~50ms)

When the exact match fails, the cache scans all entries for a near-duplicate query, using Jaccard similarity over the tokenized queries. This measures token overlap, not semantic meaning.

### Algorithm

1. Tokenize query: split on whitespace, remove stopwords, filter tokens shorter than 2 chars
2. Skip fuzzy matching if query has fewer than 2 meaningful tokens
3. For each cache entry:
   * Skip if fingerprint mismatch (cheap string compare)
   * Skip if TTL expired (cheap timestamp compare)
   * Compute Jaccard similarity: `|A ∩ B| / |A ∪ B|`
4. Return the highest-similarity match if `similarity >= 0.6`

### Jaccard Similarity

```
Optimization: always iterate the smaller set, check membership in the larger set.
intersection = count of shared tokens
union = |A| + |B| - intersection
similarity = intersection / union
```
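A small sketch of the tokenization, Jaccard computation, and best-match scan. The stopword list is hypothetical (the real list is not given here), and the fingerprint and TTL checks from step 3 are left out to keep the example short.

```typescript
// Hypothetical stopword list, for illustration only.
const STOPWORDS = new Set(["a", "an", "the", "is", "in", "of", "to", "how", "does", "do", "what"]);

// Split on whitespace, drop stopwords and tokens shorter than 2 chars.
function tokenize(query: string): Set<string> {
  const tokens = query
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length >= 2 && !STOPWORDS.has(t));
  return new Set(tokens);
}

// |A ∩ B| / |A ∪ B|, iterating the smaller set and probing the larger one.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  const small = a.size <= b.size ? a : b;
  const large = a.size <= b.size ? b : a;
  let intersection = 0;
  small.forEach((token) => {
    if (large.has(token)) intersection++;
  });
  const union = a.size + b.size - intersection;
  return intersection / union;
}

const SIMILARITY_THRESHOLD = 0.6;
const MIN_TOKENS = 2;

// Returns the cached query with the highest similarity at or above the threshold.
function findFuzzyMatch(query: string, cachedQueries: string[]): string | undefined {
  const queryTokens = tokenize(query);
  if (queryTokens.size < MIN_TOKENS) return undefined;
  let best: string | undefined;
  let bestScore = 0;
  for (const candidate of cachedQueries) {
    const score = jaccard(queryTokens, tokenize(candidate));
    if (score >= SIMILARITY_THRESHOLD && score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

For instance, `"jwt refresh flow auth"` against a cached `"jwt refresh flow"` shares 3 of 4 distinct tokens (0.75) and matches, while `"jwt refresh work"` against `"jwt refresh token"` shares 2 of 4 (0.5) and does not.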

### Parameters

| Parameter            | Value |
| -------------------- | ----- |
| Similarity threshold | 0.6   |
| Min tokens for fuzzy | 2     |

**Cost:** O(n) over cache entries × O(min(|A|, |B|)) per comparison. No LLM, no I/O.

## Out-of-Domain Short-Circuit

Between Tier 1 and Tier 2, if all searches return zero results, the query is classified as out-of-domain:

* Returns: "This topic is not covered in the knowledge base."
* The OOD response is cached to prevent repeated misses from hitting the LLM

**Cost:** Zero. Saves an entire LLM call for irrelevant queries.

## Supplementary Entity Search

When the initial local search returns fewer than 3 results, the executor extracts key entities and runs additional searches to improve recall.

### Algorithm

1. Split query into words, filter stopwords, keep words with length >= 3
2. Take top 3 entities
3. Run searches in parallel via `Promise.allSettled()`
4. Deduplicate results by path
5. Merge into original result set

For example, the query "How does JWT refresh work in the auth module?" extracts entities `["jwt", "refresh", "auth"]` and runs 3 parallel MiniSearch lookups, merging any new unique results into the original set.

**Cost:** Up to 3 additional MiniSearch lookups (in parallel). No LLM.
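The steps above might look like the following sketch. The `SearchFn` callback stands in for a MiniSearch lookup and is an assumption, as are the function names. The stopword list is hypothetical and was chosen so that the example query above reproduces, and "top 3" is interpreted here as the first three surviving words.

```typescript
interface SearchResult {
  path: string;
  score: number;
}

// Hypothetical stand-in for a MiniSearch lookup.
type SearchFn = (term: string) => Promise<SearchResult[]>;

// Hypothetical stopword list, chosen so the documented example reproduces.
const ENTITY_STOPWORDS = new Set(["how", "does", "work", "the", "and", "for", "what", "with"]);

// Split into words, strip punctuation, drop stopwords and words under 3 chars.
function extractEntities(query: string, limit: number = 3): string[] {
  return query
    .toLowerCase()
    .split(/\s+/)
    .map((w) => w.replace(/[^a-z0-9_-]/g, ""))
    .filter((w) => w.length >= 3 && !ENTITY_STOPWORDS.has(w))
    .slice(0, limit);
}

// Append only results whose path is not already present.
function mergeByPath(original: SearchResult[], extra: SearchResult[]): SearchResult[] {
  const seen = new Set(original.map((r) => r.path));
  const merged = [...original];
  for (const r of extra) {
    if (!seen.has(r.path)) {
      seen.add(r.path);
      merged.push(r);
    }
  }
  return merged;
}

async function supplementSearch(
  query: string,
  initial: SearchResult[],
  search: SearchFn,
): Promise<SearchResult[]> {
  if (initial.length >= 3) return initial; // enough results already
  const settled = await Promise.allSettled(extractEntities(query).map((e) => search(e)));
  let merged = initial;
  for (const outcome of settled) {
    // A rejected lookup is ignored rather than failing the whole query.
    if (outcome.status === "fulfilled") merged = mergeByPath(merged, outcome.value);
  }
  return merged;
}
```

Using `Promise.allSettled()` rather than `Promise.all()` means one failed lookup does not discard the results of the others.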

## Tier 2: Direct Search Response (\~100-200ms)

When search results are highly confident, the system skips the LLM entirely and returns a formatted markdown response assembled from raw document content.

### Algorithm

1. Filter search results with score >= 0.7, take top 5
2. Read full document content from `.brv/context-tree/` in parallel
3. Run `canRespondDirectly()` decision. Gate 1 must pass, followed by either Gate 2a or Gate 2b:
   * **Gate 1:** `topResult.score >= 0.85` (minimum threshold). If not, fail.
   * **Gate 2a:** `topResult.score >= 0.93`: the score is strong enough that the dominance check is skipped. Pass.
   * **Gate 2b:** `topScore - secondScore >= 0.08`: clear separation between top and runner-up. Pass.
4. Format response: Summary + Details (max 5000 chars/doc, max 5 docs) + Sources + Gaps
5. Cache result and return
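The gate logic in step 3 can be sketched as below. Only the thresholds and the `canRespondDirectly()` name come from the description above; the signature and the `ScoredResult` type are assumptions. The sketch also assumes that a missing runner-up counts as a score of 0, so a single qualifying result passes the gap check.

```typescript
interface ScoredResult {
  path: string;
  score: number;
}

const MIN_SCORE = 0.85;       // Gate 1
const HIGH_CONFIDENCE = 0.93; // Gate 2a
const MIN_GAP = 0.08;         // Gate 2b

function canRespondDirectly(results: ScoredResult[]): boolean {
  const sorted = [...results].sort((a, b) => b.score - a.score);
  const top = sorted[0];
  // Gate 1: the top result must clear the minimum threshold.
  if (!top || top.score < MIN_SCORE) return false;
  // Gate 2a: very strong top score, skip the dominance check.
  if (top.score >= HIGH_CONFIDENCE) return true;
  // Gate 2b: require a clear gap to the runner-up.
  // Assumption: with no runner-up, the second score is treated as 0.
  const second = sorted[1]?.score ?? 0;
  return top.score - second >= MIN_GAP;
}
```

With these thresholds, scores of `[0.90, 0.80]` pass on the gap, `[0.90, 0.88]` fall through to Tier 3, and `[0.95, 0.94]` pass on Gate 2a despite the narrow gap.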

### Why Gap-Based, Not Ratio-Based?

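BM25 normalized scores cluster in the \[0.8, 0.95] range. A ratio check such as "2x the second result" can never pass there: doubling even the lowest score in the range (0.8) gives 1.6, well above the 0.95 ceiling. A fixed gap of 0.08 is achievable within the range and still identifies dominant matches.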

### Parameters

| Parameter                        | Value      |
| -------------------------------- | ---------- |
| Min score threshold              | 0.85       |
| High confidence (skip dominance) | 0.93       |
| Min gap for dominance            | 0.08       |
| Max content per doc              | 5000 chars |
| Max docs in response             | 5          |

**Cost:** File reads only (no LLM). \~100-200ms for disk I/O.

## Tier 3: Optimized Single LLM Call (\<5s)

When search finds good results (score >= 0.7) but they are not confident enough for Tier 2, the system makes a single constrained LLM call with pre-fetched context embedded in the prompt.

### Algorithm

1. Filter results with score >= 0.7, build a pre-fetched context string (excerpts formatted as markdown sections)
2. Inject search data into sandbox as variables:
   * `__query_results_{taskId}` = search results array
   * `__query_meta_{taskId}` = `{resultCount, topScore, hasPreFetched}`
3. Build prompt with pre-fetched context embedded directly
4. Execute with tight LLM overrides: `maxTokens=1024, temperature=0.3`
5. LLM answers from the embedded context. If insufficient, it can use `code_exec` with `silent: true` to read additional documents from the sandbox variables
6. Cache result and return

### Parameters

| Parameter                 | Value |
| ------------------------- | ----- |
| Max tokens                | 1024  |
| Temperature               | 0.3   |
| Max iterations            | 50    |
| Pre-fetch score threshold | 0.7   |
| Max pre-fetched docs      | 5     |

**Cost:** One LLM call with tight token and temperature constraints.

## Tier 4: Full Agentic Loop (8-15s)

When no pre-fetched context is available (search returned nothing above the 0.7 threshold), the system falls back to the full agentic loop.

### Algorithm

1. Same sandbox variable injection as Tier 3
2. Build prompt WITHOUT pre-fetched context — LLM must discover answers via tool use
3. Execute with relaxed LLM overrides: `maxTokens=2048, temperature=0.5`
4. LLM reads search results via `code_exec`, may call `tools.readFile()` to load documents, may loop through multiple tool calls
5. Protected by doom-loop detection (max iterations limit)
6. Cache result and return

### Parameters

| Parameter      | Value |
| -------------- | ----- |
| Max tokens     | 2048  |
| Temperature    | 0.5   |
| Max iterations | 50    |

**Cost:** Multiple LLM calls with tool use. Full agentic loop with loop detection.

## Knowledge Scoring

Search results that feed into Tiers 2–4 are ranked by a compound scoring algorithm:

```
compoundScore = (0.6 x BM25 + 0.2 x importance/100 + 0.2 x recency) x tier_boost
```

All three signals contribute: relevance, accumulated importance, and freshness together determine result ranking:

| Signal         | Weight | Source                                                                    |
| -------------- | ------ | ------------------------------------------------------------------------- |
| BM25 relevance | 60%    | MiniSearch full-text search, normalized via `score / (1 + score)`         |
| Importance     | 20%    | Access hits (+3/search) + curate updates (+5/update), decays `0.995^days` |
| Recency        | 20%    | Exponential decay: `e^(-days/30)`                                         |

### Tier Boost Multipliers

Maturity tiers amplify or penalize the compound score:

| Maturity  | Boost |
| --------- | ----- |
| core      | x1.15 |
| validated | x1.00 |
| draft     | x0.85 |
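Putting the formula, the signal definitions, and the boost multipliers together gives roughly the following. The `ScoreInput` shape and field names are assumptions for illustration; in particular, what the `days` in the recency term is measured from is not specified here, so `ageDays` is a placeholder.

```typescript
type Maturity = "core" | "validated" | "draft";

const TIER_BOOST: Record<Maturity, number> = {
  core: 1.15,
  validated: 1.0,
  draft: 0.85,
};

// Hypothetical input shape for one search result.
interface ScoreInput {
  rawBm25: number;    // unnormalized full-text search score
  importance: number; // 0..100
  ageDays: number;    // placeholder for the "days" in the recency term
  maturity: Maturity;
}

function compoundScore(d: ScoreInput): number {
  const bm25 = d.rawBm25 / (1 + d.rawBm25);   // normalize into [0, 1)
  const recency = Math.exp(-d.ageDays / 30);  // exponential decay
  const base = 0.6 * bm25 + 0.2 * (d.importance / 100) + 0.2 * recency;
  return base * TIER_BOOST[d.maturity];
}
```

As a worked case: a raw score of 9 normalizes to 0.9, so with importance 50 and age 0 the base is 0.54 + 0.10 + 0.20 = 0.84. That stays 0.84 for `validated`, rises to about 0.966 for `core`, and drops to about 0.714 for `draft`.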

### Maturity Lifecycle (Hysteresis)

```
draft --(importance >= 65)--> validated --(importance >= 85)--> core
draft <--(importance < 35)-- validated <--(importance < 60)-- core
```

The hysteresis gap (e.g., promote to validated at 65, demote back to draft only below 35) prevents rapid oscillation between maturity levels when importance hovers near a threshold.
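The lifecycle can be read as a small state machine. The sketch below applies one transition per call, which is an assumption: whether a draft entry with very high importance can jump straight to core in a single evaluation is not stated. The function name is hypothetical.

```typescript
type Maturity = "draft" | "validated" | "core";

// One evaluation step: promotion and demotion thresholds differ on purpose.
function nextMaturity(current: Maturity, importance: number): Maturity {
  switch (current) {
    case "draft":
      return importance >= 65 ? "validated" : "draft";
    case "validated":
      if (importance >= 85) return "core";
      if (importance < 35) return "draft";
      return "validated";
    case "core":
      return importance < 60 ? "validated" : "core";
  }
}
```

An importance of 50 illustrates the effect: a `draft` entry stays `draft` and a `validated` entry stays `validated`, so small fluctuations around 50 never flip the level.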

### Decay Functions

* **Importance decay:** `importance x 0.995^days` (\~78% remaining after 50 days of non-use)
* **Recency decay:** `e^(-days/30)` (half-life of \~21 days)
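The two quoted figures can be checked directly from the formulas. The helper names below are illustrative only.

```typescript
// Importance after `days` of non-use.
const importanceAfter = (importance: number, days: number): number =>
  importance * Math.pow(0.995, days);

// Recency signal after `days`.
const recencyAfter = (days: number): number => Math.exp(-days / 30);

// Solving e^(-t/30) = 0.5 gives t = 30 * ln(2).
const RECENCY_HALF_LIFE_DAYS = 30 * Math.LN2;

console.log(importanceAfter(100, 50).toFixed(1));  // about 77.8
console.log(RECENCY_HALF_LIFE_DAYS.toFixed(1));    // about 20.8
```

The two curves behave quite differently: after 30 days the recency signal has fallen to roughly 37% (`e^-1`), while importance still retains about 86% (`0.995^30`).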

## Complete Query Flow

```
User Query
    |
    |--- Fire parallel searches (local)
    |
    v
[Tier 0] Exact Cache Lookup
    |--- HIT --> Return (0ms)
    |
   MISS
    v
[Tier 1] Fuzzy Cache (Jaccard >= 0.6)
    |--- HIT --> Return (~50ms)
    |
   MISS
    v
[OOD] All searches returned 0 results?
    |--- YES --> "Not covered" response (0ms, cached)
    |
    NO
    v
[Entity Search] Initial results < 3?
    |--- YES --> Run supplementary entity searches (parallel)
    |
    v
[Tier 2] Direct Response (score >= 0.85 + dominant)
    |--- PASS --> Return formatted markdown (100-200ms)
    |
   NOT DOMINANT
    v
[Tier 3] Single LLM + Pre-fetched Context (score >= 0.7 exists)
    |--- PASS --> Return LLM response (<5s)
    |
   NO CONTEXT
    v
[Tier 4] Full Agentic Loop
    |--- Return LLM response (8-15s)
```

All tier results are cached, so subsequent similar queries resolve at Tier 0 or 1.
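As a compact summary, the routing decision in the flow above can be expressed as a pure function. This is a simplified sketch, not the actual router: `RoutingInput` and `selectTier` are hypothetical names, the cache lookups are reduced to booleans, and the supplementary entity search is assumed to have already been folded into `scores`.

```typescript
type Tier = 0 | 1 | 2 | 3 | 4 | "ood";

// Hypothetical summary of what the router knows about a query.
interface RoutingInput {
  exactHit: boolean; // Tier 0 lookup succeeded
  fuzzyHit: boolean; // Tier 1 Jaccard match found
  scores: number[];  // search scores, after any supplementary entity search
}

function selectTier(input: RoutingInput): Tier {
  if (input.exactHit) return 0;
  if (input.fuzzyHit) return 1;
  // No results at all: out-of-domain short-circuit.
  if (input.scores.length === 0) return "ood";

  const sorted = [...input.scores].sort((a, b) => b - a);
  const top = sorted[0] ?? 0;
  const second = sorted[1] ?? 0; // assumption: missing runner-up counts as 0

  // Tier 2: confident and dominant top result, no LLM needed.
  if (top >= 0.85 && (top >= 0.93 || top - second >= 0.08)) return 2;
  // Tier 3: usable pre-fetched context exists.
  if (top >= 0.7) return 3;
  // Tier 4: nothing above the pre-fetch threshold.
  return 4;
}
```

For example, scores of `[0.90, 0.88]` route to Tier 3 (above 0.7 but not dominant), and `[0.5, 0.4]` route to Tier 4.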