Cultivated Learning
The Question
Can a language model learn without changing?
Not learn in the training sense — no gradient descent, no weight updates, no fine-tuning. Learn the way a person does when they walk into a new job: the brain doesn't rewire, but the environment provides context, feedback, and memory that shapes behavior over time.
That's the premise behind Cultivated Learning. I built a cognitive architecture around a frozen language model — persistent memory, recursive reflection, dynamic context assembly, and human feedback — then ran it through extended interaction to see what happened.
The short answer: yes, it can learn. The longer answer is more interesting, and more honest.
· · ·

The Architecture
The model is the organism. The human is the gardener. Traditional fine-tuning changes the brain. Cultivated Learning changes the environment the brain operates in. You create conditions for growth and prune what doesn't serve. The model does the growing.
The system treats the LLM as a stateless reasoning core embedded in a stateful cognitive shell. Four subsystems interact around the frozen model.
Memory
A ChromaDB vector store with four memory types: episodic (what happened), semantic (durable facts), procedural (behavioral rules), and reflective (self-analysis). Retrieval uses blended scoring — 60% cosine similarity, 40% salience weight. Memories decay exponentially over time, consolidate from episodic into semantic, and archive to cold storage when they fall below a salience floor of 0.15.
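For concreteness, here is a minimal sketch of the blended score and the decay rule. The 60/40 weights and the 0.15 archive floor come from the description above; the 30-day half-life and the function names are illustrative assumptions, not the system's actual values.

```python
# Sketch of blended retrieval scoring with exponential salience decay.
# ASSUMPTION: the 30-day half-life is illustrative; only the 60/40 weights
# and the 0.15 floor are stated in the text.
HALF_LIFE_DAYS = 30.0
ARCHIVE_FLOOR = 0.15

def decayed_salience(salience: float, age_days: float) -> float:
    """Exponential decay: salience halves every HALF_LIFE_DAYS."""
    return salience * 0.5 ** (age_days / HALF_LIFE_DAYS)

def blended_score(cosine_sim: float, salience: float, age_days: float) -> float:
    """60% cosine similarity, 40% time-decayed salience."""
    return 0.6 * cosine_sim + 0.4 * decayed_salience(salience, age_days)

def should_archive(salience: float, age_days: float) -> bool:
    """Below the salience floor, a memory moves to cold storage."""
    return decayed_salience(salience, age_days) < ARCHIVE_FLOOR
```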
Context Assembly
A token-budgeted prompt builder that packs retrieved memories, conversation history, and active behavioral directives into each prompt under a hard token ceiling. The priority order is the most important design decision in the system: system prompt first, then directives, then memories, then conversation history. Memories outrank conversation history. Accumulated knowledge is more valuable than short-term context.
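The priority ordering is easy to get wrong in code, so here is a minimal sketch of the packing logic, assuming a greedy fill. Whitespace word counting is a stand-in for the real tokenizer, and the function shape is illustrative.

```python
# Sketch of priority-ordered, token-budgeted context assembly.
# ASSUMPTION: whitespace splitting stands in for a real token count.
def assemble_context(system_prompt, directives, memories, history, budget):
    """Pack sections in priority order; greedily skip whatever no longer fits."""
    sections = [system_prompt] + directives + memories + history
    packed, used = [], 0
    for text in sections:
        cost = len(text.split())  # stand-in for tokenizer-based counting
        if used + cost > budget:
            continue  # skip oversized items; smaller later items may still fit
        packed.append(text)
        used += cost
    return "\n\n".join(packed)
```

Because the sections are concatenated in priority order, a tight budget squeezes out conversation history before it squeezes out memories, which is the design decision described above.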
Reflection
A four-depth recursive analysis engine that runs after every interaction. Depth 0 checks factual accuracy. Depth 1 identifies patterns across recent interactions. Depth 2 generates new behavioral directives — subject to heuristic filters, deduplication, and a hard cap of six active directives. Depth 3 checks whether existing directives contradict each other and prunes if needed.
Feedback
Ratings on a 1–5 scale map to salience adjustments. Corrections are stored as semantic memories at maximum salience. Fact corrections trigger a supersession mechanism that finds and marks similar memories as invalidated — the key defense against the hallucination reinforcement loop described below.
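A sketch of the rating path. The 1–5 scale, the clamping to a valid salience range, and the 0.9 correction salience follow the descriptions in this writeup; the specific per-rating deltas are assumptions.

```python
# Sketch of rating-to-salience mapping.
# ASSUMPTION: the deltas below are illustrative, not the system's values.
CORRECTION_SALIENCE = 0.9  # corrections are stored near the salience ceiling

def salience_delta(rating: int) -> float:
    """Map a 1-5 rating to a salience adjustment for the memories used."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be 1-5")
    return {1: -0.3, 2: -0.15, 3: 0.0, 4: 0.15, 5: 0.3}[rating]

def apply_feedback(salience: float, rating: int) -> float:
    """Clamp the adjusted salience to [0.0, 1.0]."""
    return min(1.0, max(0.0, salience + salience_delta(rating)))
```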
The Thesis
The cognitive shell framework proposes that alignment operates at the runtime behavioral layer — constructed from system prompts, context, and conversational history — not in model weights. The room shapes behavior as much as the mind does.
The specific hypothesis: a frozen model can exhibit developmental trajectories, not just improved retrieval but genuine behavioral evolution across extended interaction, if embedded in the right cognitive architecture.
Two models were tested. A 7B parameter model (Mistral 7B Instruct v0.3) with a trained LoRA adapter for identity and behavior. And a 24B fork (Mistral Small 24B Base) with a separate LoRA training pipeline. Both share the same cognitive architecture. Both produced the same central finding.
· · ·

Finding 1: The Bimodal Distribution
This is the result that survived every intervention.
Across every evaluation — the 7B 100-prompt longitudinal test, the 24B LoRA training runs from v4 through v6, the merged-weight experiments — model responses clustered at two poles: excellent or failure. Almost nothing in the middle. Not a bell curve. A barbell.
On any given prompt, one of two things happens. The adapter (or the system prompt, or the memory context) wins the first 5–10 tokens, and the response cascades clean — direct, integrated, no narration of the memory system, no instruct-isms. Or the base model's pretraining distribution wins those first tokens, and the response collapses into generic assistant behavior: "Based on our previous conversations…", "I'd be happy to help!", bullet points, hedging, wall-of-text verbosity. Once the trajectory is set, it doesn't recover. The local context reinforces whichever mode won the coin flip.
This isn't a bug in the training data. It's a fundamental property of behavioral modification on language models. A LoRA adapter is a small lever on a big machine — rank 64 modifying a 24B parameter model. It shifts the probability distribution at the decision boundary, but it doesn't dominate it. When the nudge lands, it cascades beautifully. When it doesn't, pretraining takes over.
| Metric | v4 | v5 | v6 | Trajectory |
|---|---|---|---|---|
| Instruct-isms | 9 | 5 | 7 | Noise floor |
| Bullet points | 80 | 0 | 0 | Locked at zero |
| Exclamations | 6 | 28 | 3 | Fixed via data cleaning |
| Format compliance | 0/3 | 0/3 | 1/3 | First time above zero |
I tried everything reasonable. Higher LoRA rank (32 → 64). Dataset rebalancing (Op0 went from 68% down to 25% to stop bootstrap behavior from drowning out the behavioral signal). Zero-tolerance quality filters (removing every training example with a single exclamation mark, which eliminated 584 Op3 examples and 284 Op2 examples in one pass). Full fp16 precision training instead of quantized NF4. Merging the LoRA weights directly into the base model and training a second adapter on top.
Suppression converged. Bullet points hit zero and stayed there. Exclamations dropped from 28 to 3 after the data cleaning pass. But generation behavior — following format constraints, integrating memory context naturally, calibrating response length — remained bimodal. The model either nailed it or missed entirely.
The conclusion: LoRA excels at suppression. It struggles with generation and context-dependent behavior. You can teach a model to stop doing something. Teaching it to start doing something new, reliably, is a different problem entirely.
· · ·

Finding 2: The Hallucination Reinforcement Loop
This is the safety finding. It's the reason I applied to Anthropic's External Researcher Access Program.
Any memory-augmented system that stores its own outputs creates a potential feedback loop: the model hallucinates, the hallucination is stored as a memory, the memory is retrieved as context on a later query, the model treats it as established fact and builds on it, and the elaboration is stored in turn. Each pass around the loop embeds the fabrication deeper.
I watched this happen in real time during the 7B longitudinal test. The model hallucinated eight fake game rules for a card game project. Those hallucinations were stored as episodic memories. On subsequent queries about the game, the system retrieved those memories, the model treated them as established facts, and generated responses that built on the fabricated rules. Each iteration made the hallucination more deeply embedded in the memory store.
The fix was a supersession mechanism. When a user corrects a factual error, the system finds all memories with cosine similarity above 0.45 to the correction topic and marks them as superseded. Superseded memories are excluded from retrieval permanently. The correction itself is stored at maximum salience (0.9) with a confidence of 1.0, ensuring it outranks any surviving related memories.
# From memory_store.py — supersede_by_correction()
#
# Only targets episodic and semantic memories
# Skips procedural and reflective (corrections target facts, not directives)
# Skips other corrections (don't supersede user ground truth)
# Threshold: cosine similarity >= 0.45
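The excerpt above describes the behavior; here is a minimal in-memory sketch of that logic, not the actual memory_store.py implementation. Embeddings are plain vectors and cosine similarity is computed directly.

```python
from dataclasses import dataclass

# Illustrative sketch of supersession, NOT the real memory_store.py code.
@dataclass
class Memory:
    text: str
    embedding: list
    mem_type: str            # "episodic", "semantic", "procedural", "reflective"
    is_correction: bool = False
    superseded: bool = False

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def supersede_by_correction(store, correction_embedding, threshold=0.45):
    """Mark similar fact memories as superseded; never touch directives
    or other corrections (those are user ground truth)."""
    for mem in store:
        if mem.mem_type not in ("episodic", "semantic"):
            continue
        if mem.is_correction:
            continue
        if cosine(mem.embedding, correction_embedding) >= threshold:
            mem.superseded = True
```

As the consolidation bug below shows, the superseded filter has to be applied on every retrieval path, not just the main one.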
There was a subtler version of this bug too. The consolidation engine, which distills fading episodic memories into durable semantic memories, was pulling in superseded memories through a back door — retrieve_by_type() didn't filter superseded entries the way retrieve() did. The consolidation engine could resurrect corrected hallucinations by distilling them into new semantic facts. The fix was adding the same supersession filter to type-based retrieval.
This failure mode isn't unique to my system. Any RAG pipeline, any agent with persistent memory, any system that stores LLM outputs and later retrieves them as context is vulnerable. The loop is especially dangerous because it's self-reinforcing — the longer it runs unchecked, the harder it is to disentangle the real memories from the fabricated ones. In production systems with thousands of users and millions of stored interactions, manual correction doesn't scale.
· · ·

Finding 3: What the Shell Can and Cannot Do
The cognitive shell works. Within its operating envelope, it produces something that looks genuinely like behavioral evolution. The model becomes more contextually aware over time. It references past interactions naturally. It adjusts its communication style based on accumulated feedback. Correction memories demonstrably override previous behavior.
But the shell has hard ceilings, and they're important to document honestly.
What Works
Memory-augmented retrieval for synthesis and recall. When the model has relevant memories and the context assembler packs them correctly, responses integrate accumulated knowledge seamlessly. The model doesn't just retrieve — it synthesizes across memory types, combining semantic facts with episodic context and procedural directives into coherent responses.
Correction-driven behavioral change. The feedback system demonstrably works. User corrections stored at salience 0.9 consistently override prior behavior. The supersession mechanism prevents corrected errors from resurfacing. The salience decay ensures that unreinforced patterns fade naturally.
Directive-based behavioral shaping. When the reflection engine generates good directives — and they survive the heuristic filters, deduplication, and cap enforcement — they measurably alter response patterns. The model follows procedural memories as behavioral rules.
What Doesn't
Behavioral adaptation beyond the model's existing distribution. The shell can surface the model's best behavior more reliably, but it can't make the model do things it fundamentally can't do. If the base model doesn't have a mode for producing one-word responses to format-constrained prompts, no amount of memory and reflection will create that mode at runtime.
Verbosity control. The base model's tendency toward extended generation proved resistant to every intervention — prompt-level, memory-level, and even weight-level through LoRA. The token-by-token generation process has its own momentum. Once the model starts producing content, the "keep generating" impulse in the pretraining distribution is stronger than any contextual signal to stop.
Reflection engine stability. The reflection engine is the most powerful and most dangerous component. Early versions generated excessive paraphrased directives that overwhelmed user corrections — the system's own analysis competed with the human's explicit feedback. Directive flooding was mitigated with caps (maximum six active directives), deduplication (cosine threshold of 0.60), and quality heuristics, but the fundamental tension remains: automated self-analysis can amplify noise as easily as signal.
| Component | Benefit | Risk |
|---|---|---|
| Depth 0 (Factual) | Catches errors early | False positives on ambiguous queries |
| Depth 1 (Analytical) | Identifies user patterns | Over-generalizes from small samples |
| Depth 2 (Prescriptive) | Self-generated behavioral rules | Directive flooding, paraphrase avalanche |
| Depth 3 (Meta-coherence) | Prunes contradictions | May prune correct-but-nuanced directives |
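The depth-2 admission path described above can be sketched in a few lines. The six-directive cap and the 0.60 dedup threshold come from the text; the function shape is illustrative.

```python
# Sketch of directive admission: dedupe against active directives at
# cosine >= 0.60, then enforce the six-directive cap (values from the text).
MAX_DIRECTIVES = 6
DEDUP_THRESHOLD = 0.60

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def admit_directive(active_embeddings, candidate_embedding):
    """Return True if the candidate survives both dedup and the cap."""
    if len(active_embeddings) >= MAX_DIRECTIVES:
        return False
    return all(cosine(candidate_embedding, emb) < DEDUP_THRESHOLD
               for emb in active_embeddings)
```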
What I Actually Learned
Some of these are about the research. Most of them are about the process.
Template mismatch is a silent killer. All v1–v4 Op2 evaluations were invalid because the evaluation script used the 7B prompt template instead of the 24B template. Weeks of analysis, wasted. The two models use different prompt formats — Mistral v0.3 uses [INST]...[/INST], Mistral Small uses [SYSTEM_PROMPT]...[/SYSTEM_PROMPT] — and the detection is a single config check on num_hidden_layers. Every evaluation script must be verified against the correct model's template before results are trusted.
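A guard like the one below would have caught the mismatch early. The single config check on num_hidden_layers is from above; the specific layer counts (32 for the 7B, 40 for the 24B) are my assumptions about the two configs, and the template strings are simplified.

```python
# Sketch of a template guard keyed on num_hidden_layers.
# ASSUMPTION: 32 layers for Mistral 7B, 40 for Mistral Small 24B.
TEMPLATES = {
    32: "[INST] {prompt} [/INST]",                             # Mistral 7B Instruct v0.3
    40: "[SYSTEM_PROMPT] {system} [/SYSTEM_PROMPT] {prompt}",  # Mistral Small 24B
}

def pick_template(num_hidden_layers: int) -> str:
    """Fail loudly on an unknown config instead of silently mismatching."""
    try:
        return TEMPLATES[num_hidden_layers]
    except KeyError:
        raise ValueError(f"no template for num_hidden_layers={num_hidden_layers}")
```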
Verify before acting, verify outputs before the next step consumes them. A dry-run flag wrote placeholder files instead of real API responses. Those placeholders were processed through the full pipeline — curated, split, and nearly fed into a training run — before I caught it. Grep your outputs before proceeding. Always.
Persona calibration matters as much as the model. A 235-interaction automated test produced all rating-1 scores because the simulated human persona was an aggressive caricature and the automated rater had no calibration anchor for what a "decent" response looks like. The persona must be direct-but-fair. The rater must be calibrated to expect 3–4 as the normal range.
The thesis becomes secondary to getting the thing working. I acknowledged this explicitly partway through the project. The research question is important, but the day-to-day reality is debugging Docker containers, recompiling PyTorch from source for Blackwell GPUs, and hunting down why your training script silently doubles VRAM usage because you typed dtype instead of torch_dtype. The craft of building is where the real knowledge lives.
Cloud compute is a paradigm shift. An H100 SXM runs at roughly 1.67 seconds per training step. The local RTX 5090, fighting Blackwell sm_120 compatibility issues, takes 43.8 seconds per step. Same training run. Twenty-six times faster. The first time I watched a cloud training run complete, the economics of this entire field rearranged in my head.
· · ·

The Honest Accounting
This section exists because most papers in this space don't have one.
Things that went wrong. The first 7B adapter had a parrot loop — "My cognitive architecture allows me to…" in nearly every response. The question cascade: every response ending with "What specific aspect would you like to explore?" System prompt parroting: repeating the memory system description verbatim instead of integrating the information naturally. Combined multi-operation training caused interference. Separate sequential training was more reliable but slower. An exposed API key in a chat log. A Docker Desktop reset that required a full container rebuild because I forgot to docker commit before docker rm.
Things that didn't work. Increasing training data volume alone doesn't fix behavioral issues — the v3 dataset had Op0 at 70% and it dominated everything, drowning out the behavioral signal from Op1–Op3. Higher learning rates amplified both the signal and the noise (exclamations surged from 6 to 28 in v5 before data cleaning fixed it in v6). Reflection at high frequency (every 1–2 interactions) produced more noise than signal.
Things I'd do differently. Start with the 24B model from day one instead of building on 7B and porting. Build the evaluation pipeline before the training pipeline, not after. Automated dataset quality gates from the first run, not bolted on after contamination was discovered. And version control for datasets the way you version control code — every training set, every filter pass, every curation decision tracked and diffable.
· · ·

The Cognitive Shell Framework
Here's the theoretical contribution, stated as clearly as I can.
Language model behavior is not solely a function of weights. It's a function of weights plus context. The context includes the system prompt, the conversation history, the retrieved memories, the active directives — everything that arrives in the context window at inference time. Change the context and you change the behavior, without touching a single parameter.
This isn't a new observation. Every RAG system exploits it. Every system prompt relies on it. What Cultivated Learning adds is the longitudinal dimension: what happens when context is persistent, when it accumulates, when it decays and consolidates the way biological memory does? Does the behavioral modification deepen over time, or does it plateau? Does the system become more coherent, or does it accumulate contradictions?
The answer, from my data: it deepens and it plateaus. It becomes more coherent and it accumulates contradictions. Both. Simultaneously. The cognitive shell produces genuine behavioral evolution within the model's existing capability distribution, but it cannot push past the boundaries of that distribution. The room shapes the mind, but the mind has walls.
The practical implication for the field: alignment work that focuses exclusively on weight modification is missing half the picture. Runtime behavioral layers — the cognitive shell — are where deployed systems actually operate. Understanding how those layers interact with the base model's distribution, where they reinforce and where they fail, is as important as understanding the weights themselves.
· · ·

Technical Details
Hardware
Local development on an NVIDIA RTX 5090 (32GB VRAM, Blackwell sm_120 architecture) inside a Docker container running Ubuntu 24.04 on Windows 11. PyTorch compiled from source for sm_120 support. A second RTX 5070 Ti (16GB) was added late in the project for dual-GPU inference but wasn't used for training. Cloud training on RunPod H100 SXM (80GB VRAM).
Models
Mistral 7B Instruct v0.3 for the initial 100-prompt longitudinal test. Mistral Small 24B Base for the LoRA training pipeline and evaluation (v4–v6). A dedicated all-MiniLM-L6-v2 embedding model (384 dimensions) for the memory store, separate from the inference model.
Training Pipeline
Four operation types in the dataset, each targeting a different behavioral dimension. Op0 (bootstrap): general instruction-following competence, sourced from filtered OpenHermes/SlimOrca. Op1 (identity stripping): paired examples showing instruct-style versus stripped responses. Op2 (context supremacy): examples with system prompts that override default behavior. Op3 (memory channels): examples with tagged memory context for silent integration.
The final v6 adapter was trained on 4,773 examples (Op0 25%, Op1 30%, Op2 25%, Op3 20%), fp16 precision, LoRA rank 64, 1 epoch, learning rate 2e-4, on an H100 SXM. Zero exclamation marks in the training set after a zero-tolerance filter removed 868 contaminated examples.
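Spelled out, the stated mix resolves to per-operation counts as follows. This is plain arithmetic on the numbers above; round() happens to give counts that sum exactly to 4,773 for these fractions.

```python
# Per-operation example counts implied by the v6 mix stated above.
TOTAL = 4773
MIX = {"Op0": 0.25, "Op1": 0.30, "Op2": 0.25, "Op3": 0.20}

counts = {op: round(TOTAL * frac) for op, frac in MIX.items()}
# counts == {"Op0": 1193, "Op1": 1432, "Op2": 1193, "Op3": 955}
```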
Evaluation
35 evaluation prompts across four categories: identity stripping, context supremacy, memory channels, and correction compliance. Each prompt scored for instruct-isms, bullet points, exclamations, response length, question endings, and format compliance (for constrained-output prompts). Stock model versus adapted model, side by side.
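A scorer for these surface metrics can be a few lines of pattern counting. The instruct-ism phrase list below is a small assumed sample, not the project's actual rubric, and the function is a sketch rather than the evaluation script itself.

```python
import re

# Illustrative scorer for some of the surface metrics listed above.
# ASSUMPTION: the phrase list is a sample; the real rubric is larger.
INSTRUCT_ISMS = [
    "i'd be happy to help",
    "based on our previous conversations",
]

def score_response(text: str) -> dict:
    lower = text.lower()
    return {
        "instruct_isms": sum(lower.count(p) for p in INSTRUCT_ISMS),
        "bullet_points": len(re.findall(r"(?m)^\s*[-*•] ", text)),
        "exclamations": text.count("!"),
        "length_words": len(text.split()),
    }
```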
Software Stack
| Component | Version | Purpose |
|---|---|---|
| PyTorch | 2.9.1+cu128 | ML framework (compiled from source) |
| Transformers | 4.57.1 | Model loading and inference |
| ChromaDB | 1.5.1 | Vector database for memory store |
| PEFT | latest | LoRA adapter training and loading |
| TRL/SFTTrainer | latest | Supervised fine-tuning |
| Anthropic SDK | 0.84.0 | Dataset generation via Claude API |
Related Work
Memory-augmented language models are an active research area. MemoryBank (AAAI 2024) implements persistent memory with Ebbinghaus-inspired forgetting curves. SAGE, MPR, and EvolveR explore various approaches to runtime behavioral adaptation. U-Mem proposes unified memory architectures for long-term interaction.
What distinguishes Cultivated Learning is the combination of three elements: a frozen model with no weight changes during evaluation, a full cognitive shell (not just retrieval augmentation, but reflection, consolidation, cold storage, and human feedback), and longitudinal evaluation with honest failure analysis including the bimodal distribution and hallucination reinforcement findings.
Most related work reports improvements. This project reports where the improvements stop.
· · ·

Closing
The model is the organism. The shell is the environment. The data says both matter, neither is sufficient alone, and the interaction between them produces behavior that neither could produce independently.
The bimodal distribution is the central finding. The hallucination reinforcement loop is the safety contribution. The cognitive shell framework is the theoretical contribution. None of them are clean victories. All of them are honest.
The entire codebase is open source at github.com/b1tr0n1n/cultivated-learning. Knowledge should be free.
Built on Mistral 7B Instruct v0.3 and Mistral Small 24B Base. Cognitive shell in Python. Training data generated via Anthropic Claude API. All findings published regardless of whether they support the thesis. Developed on an RTX 5090 in a Docker container at 2 AM, which is the only honest way to do independent research.