BHGBrain launched a week ago as an open-source MCP memory server: a shared vector brain for AI agents built on SQLite and Qdrant, with semantic recall, automatic deduplication, and hybrid search. The initial release covered the core architecture — dual-store write pipeline, memory types, namespaces, collections, enterprise auth, and multi-agent scenarios.

This update covers what’s shipped since: a full memory lifecycle system, significantly improved search, operational safety features, and multilingual documentation. These aren’t small additions: the tiered retention system in particular replaces the previous flat storage model with lifecycle management that changes how agents interact with memory over time.


Memory Lifecycle: Tiers, Sliding Windows, and Auto-Promotion

Tiered Retention (T0–T3)

Every memory in BHGBrain is now assigned a retention tier at write time. Four tiers cover the full spectrum from permanent reference knowledge to ephemeral working context:

| Tier | Label | TTL | Typical Content |
|------|-------|-----|-----------------|
| T0 | Foundational | Never expires | Architecture references, compliance mandates, company policies |
| T1 | Institutional | 1 year of zero access | Design decisions, API contracts, runbooks, coding standards |
| T2 | Operational | 90 days of zero access | Project status, sprint outcomes, technical investigations |
| T3 | Ephemeral | 30 days of zero access | Trouble tickets, email summaries, debugging sessions |

Tier assignment follows a priority chain:

  1. An explicit caller-supplied retention_tier always wins.
  2. Memories with a category are always T0.
  3. Source-based heuristics apply: procedural agent memories → T1, episodic → T2.
  4. LLM classification, when the extraction pipeline is active.
  5. A default of T2 if nothing else matches.

T0 memories never expire. They’re also excluded from all cleanup jobs, stored with full content in SQLite for recovery if the vector store needs to be rebuilt, and receive a +0.1 score boost in hybrid search results.

Sliding Window TTL

TTLs are access-based, not creation-date-based. Every time a memory is recalled or searched, its expiry clock resets to the full TTL from that moment. A T3 memory created 28 days ago that gets recalled today extends to 30 days from now.

This means memory decay is driven by actual usage. Memories your agents actively reference stay alive. Memories that genuinely fall out of use decay naturally without manual management. You don’t need to guess upfront which memories will matter long-term.
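The sliding-window reset amounts to a one-line update on every hit. This is a sketch assuming the TTL values from the tier table; the helper names are hypothetical:

```python
from datetime import datetime, timedelta

# TTLs from the tier table above; T0 never expires.
TTL_DAYS = {"T1": 365, "T2": 90, "T3": 30}

def touch(memory: dict, now: datetime) -> None:
    """Reset the sliding-window expiry clock on every recall or search hit."""
    if memory["tier"] == "T0":
        memory["expires_at"] = None        # permanent
    else:
        memory["expires_at"] = now + timedelta(days=TTL_DAYS[memory["tier"]])
```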

Auto-Promotion

Any T2 or T3 memory accessed 5 or more times within its TTL window automatically promotes one tier — T3 to T2, T2 to T1. If operational knowledge proves consistently useful, the system recognizes that and gives it more durable storage without you having to intervene.

Combined with sliding window TTL, this means the retention system adapts to actual agent behavior rather than requiring upfront classification to be perfect.
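The promotion rule reduces to a small check on each access. In this sketch the access counter resets after promotion, which is an assumption, not documented behavior:

```python
# Sketch of the auto-promotion rule; resetting the counter after a
# promotion is an assumption, not documented behavior.
PROMOTE = {"T3": "T2", "T2": "T1"}

def record_access(memory: dict) -> None:
    memory["access_count"] = memory.get("access_count", 0) + 1
    tier = memory["tier"]
    if tier in PROMOTE and memory["access_count"] >= 5:
        memory["tier"] = PROMOTE[tier]     # T3 -> T2, T2 -> T1
        memory["access_count"] = 0
```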


Search Improvements

The search engine now uses Reciprocal Rank Fusion (RRF) to merge results from two independent retrieval paths:

  • Semantic vector search — cosine similarity against Qdrant embeddings (70% weight by default)
  • Fulltext BM25 search — FTS5 index in SQLite (30% weight by default)

RRF ranks results from each system independently, then computes a combined score from each result’s position in both lists. A memory that ranks 3rd in semantic and 8th in fulltext can outscore one that ranks 1st in semantic but appears nowhere in the fulltext list: agreement across both retrievers beats a single strong signal.
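The fusion arithmetic is easy to verify in a few lines. The rank constant k=60 is the common RRF default and an assumption here, as are the function names:

```python
# Weighted Reciprocal Rank Fusion. k=60 is the common RRF default;
# the exact constant BHGBrain uses is an assumption here.
def rrf_score(ranks: dict, weights: dict, k: int = 60) -> float:
    """`ranks` maps retriever name -> 1-based rank, or None if absent."""
    return sum(w / (k + ranks[name])
               for name, w in weights.items()
               if ranks.get(name) is not None)

weights = {"semantic": 0.7, "fulltext": 0.3}
consensus = rrf_score({"semantic": 3, "fulltext": 8}, weights)   # ~0.0155
single = rrf_score({"semantic": 1, "fulltext": None}, weights)   # ~0.0115
assert consensus > single   # agreement across both lists wins
```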

{
  "tool": "bhgbrain.search",
  "params": {
    "query": "authentication architecture decisions",
    "mode": "hybrid",
    "limit": 10
  }
}

Three search modes are available: semantic, fulltext, and hybrid. Weights are configurable in config.json if your workload skews toward exact-match retrieval or pure semantic recall.
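A hypothetical config.json fragment for skewing the blend toward exact-match retrieval; the key names are illustrative only, so check the repository’s configuration reference for the actual schema:

```json
{
  "search": {
    "weights": {
      "semantic": 0.6,
      "fulltext": 0.4
    }
  }
}
```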

The recall tool (distinct from search) is optimized for session-start context injection — it uses semantic search with tier boosts and applies a minimum score threshold. search is the general-purpose tool with full filter and mode support.

Improved Semantic Deduplication

Deduplication now runs two passes before storing a new memory:

  1. SHA-256 checksum matching — content is normalized (whitespace collapsed, case-folded, punctuation stripped) before hashing. Functionally identical content with minor formatting differences is caught here before any embedding is computed.

  2. Cosine similarity threshold — the new memory’s embedding is compared against existing memories. The baseline threshold is 0.92, with tier-adjusted thresholds: T0 and T1 memories use a stricter threshold (harder to deduplicate against durable knowledge) while T3 uses a looser threshold (ephemeral content deduplicates more aggressively).

When a duplicate is detected, the incoming memory merges into the existing record — updating access time, merging tags, and preserving the higher retention tier — rather than creating a second entry. The vector space stays clean.
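The two passes can be sketched as follows; these helpers are illustrative, not BHGBrain's actual code:

```python
import hashlib
import math
import re
import string

# Illustrative versions of the two dedup passes; not BHGBrain's actual code.

def checksum(content: str) -> str:
    """Pass 1: SHA-256 over normalized content (case-folded, punctuation
    stripped, whitespace collapsed) -- no embedding needed."""
    text = content.casefold().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode()).hexdigest()

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_duplicate(new: dict, existing: dict, threshold: float = 0.92) -> bool:
    if checksum(new["content"]) == checksum(existing["content"]):
        return True   # pass 1: formatting-only differences caught here
    # pass 2: embedding similarity against the baseline threshold
    return cosine(new["embedding"], existing["embedding"]) >= threshold
```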

Importance Scoring

Memories now carry an importance field (0.0–1.0, default 0.5) that affects search result ranking. Higher-importance memories receive a score boost in recall results, making critical knowledge more prominent without requiring it to be retrieved by exact query match.

Set importance explicitly when storing a memory:

{
  "tool": "bhgbrain.remember",
  "params": {
    "content": "All database writes must go through the repository layer. No direct ORM calls in service code.",
    "type": "procedural",
    "retention_tier": "T1",
    "importance": 0.9,
    "category": "coding-standards"
  }
}

Operational Safety

Capacity Budgets

Each tier now has a per-namespace memory limit:

| Tier | Default Limit |
|------|---------------|
| T0 | Unlimited |
| T1 | 100,000 memories |
| T2 | 200,000 memories |
| T3 | 200,000 memories |

When a tier approaches its limit, the /health endpoint reports it. Limits are configurable in config.json. This prevents high-volume agent workloads from growing T2/T3 without bound and crowding out T0/T1 knowledge in the vector space.

Pre-Expiry Warnings

Seven days before any memory expires, it’s flagged in the metadata store. Agents and operators can query the health endpoint or use the CLI to list memories approaching expiration:

bhgbrain list --expiring-within 7d

This window gives you time to intervene — manually promoting a memory, resetting its TTL, or deciding it’s genuinely ready to archive.

Archive-Before-Delete

Expired memories are never deleted directly. They’re moved to an archive table in SQLite before removal from the active memory store and from Qdrant. The archive is queryable via the CLI and retains full content, metadata, and access history.

If a memory was archived in error — or if something that looked ephemeral turned out to matter — you can restore it. Nothing is permanently lost without an explicit delete.

Atomic Writes and Deferred Flush

Two performance and reliability improvements shipped alongside the retention system:

Atomic writes — all SQLite disk I/O uses a write-to-temp-then-rename pattern. The database file on disk is never in a partially-written state. If the process crashes mid-write, the previous valid state is preserved.
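The pattern looks roughly like this in Python (a sketch of the technique, not BHGBrain's implementation):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write-to-temp-then-rename: readers never observe a partial file.
    A sketch of the pattern, not BHGBrain's implementation."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # ensure bytes hit disk before the rename
        os.replace(tmp, path)      # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

If the process dies before `os.replace`, the target file still holds the previous valid state; the orphaned temp file is the only casualty.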

Deferred flush — access metadata updates (last-accessed timestamps, access counts for auto-promotion) are batched in memory for up to 5 seconds before being flushed to disk. On read-heavy paths — agents doing large context injections at session start — this eliminates per-recall database writes without meaningful impact on TTL accuracy.
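A minimal sketch of the batching pattern; the class and method names are hypothetical:

```python
import time

class DeferredFlush:
    """Batch access-metadata updates in memory, flushing at most every
    `interval` seconds. A sketch of the pattern; names are hypothetical."""

    def __init__(self, interval: float = 5.0):
        self.interval = interval
        self.pending: dict = {}            # memory_id -> buffered access count
        self.last_flush = time.monotonic()

    def record_access(self, memory_id: str) -> None:
        self.pending[memory_id] = self.pending.get(memory_id, 0) + 1
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self) -> None:
        batch, self.pending = self.pending, {}
        self.last_flush = time.monotonic()
        for memory_id, count in batch.items():
            pass  # one SQLite UPDATE per memory here, instead of one per recall
```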


Categories: Persistent Policy Slots

Categories are named storage slots for structured reference content that every agent in a namespace should be able to access consistently. Unlike regular memories, categories survive all cleanup cycles regardless of tier, have no TTL, and are injected into context in a predictable slot format.

Use cases include architecture decision records, coding standards, security policies, and operating rules:

{
  "tool": "bhgbrain.category",
  "params": {
    "action": "set",
    "name": "database-access-policy",
    "slot": "coding-standards",
    "content": "All database access must use the repository pattern. Direct ORM calls in service layer are not permitted. Repositories must return domain objects, not ORM entities."
  }
}

Categories can be retrieved individually or listed as a group. They’re distinct from T0 memories in that they’re structured as key-value policy slots rather than vector-indexed semantic memories — they don’t go through the deduplication or embedding pipeline.


Multilingual Documentation

The full README is now published in five languages:

  • English (README.md)
  • Simplified Chinese (README.zh-CN.md)
  • German (README.de.md)
  • French (README.fr.md)
  • Spanish (README.es.md)

These are full translations, not machine-translated stubs — architecture diagrams, configuration reference, CLI commands, and MCP tool parameters are all covered in each language.


What’s Next

The tiered retention and hybrid search systems are the foundation for a few things coming in later releases: automatic tier classification via the extraction LLM pipeline (currently manual or heuristic-driven), cross-namespace memory federation for multi-tenant scenarios, and a dashboard for memory health monitoring.

The ROADMAP.md in the repo tracks what’s planned and what’s in progress.


BHGBrain is open source, MIT licensed, and available at github.com/Big-Hat-Group-Inc/BHGBrain. If you’re running AI agents in production and dealing with context tax — agents re-discovering state they’ve seen before, re-explaining architecture every session, losing decisions between runs — this is the problem it’s designed to solve.

Star it, file issues, send PRs. The more production workloads it sees, the better the tier classification heuristics get.


Kevin Kaminski is a principal at Big Hat Group, focused on enterprise AI infrastructure, Microsoft 365, and Windows 365. He builds open-source tools for teams running AI agents at work.