Brie

Research · November 2025

A novel methodology for domain-specific fine-tuning in which training data is authored through iterative discussions with LLMs, positioning them as authoring tools rather than autonomous generators. Demonstrates that expert-directed curation achieves 77-91% in-domain win rates with only 1,213 examples—substantially fewer than typical instruction-tuning datasets—at minimal cost (~$3 training, near-zero data generation).

Key contribution

Traditional domain adaptation requires massive datasets (10,000+ examples), creating a barrier for domain experts who lack data collection resources. We demonstrate an alternative: LLM-assisted data authoring, where experts engage in iterative discussions with LLMs to generate training examples.

The key insight: quality and curation matter more than scale for domain-specific applications. By positioning LLMs as authoring tools rather than autonomous generators, domain experts can create high-quality training data that captures specialized reasoning patterns with substantially fewer examples.

This approach democratizes domain adaptation—any domain expert can use LLMs as authoring tools to generate training data that captures their expertise, without requiring large budgets or massive datasets.

Results summary

Base model    | Win rate             | Validation
Qwen 2.5 3B   | 91.2% (in-domain)    | 4 judges (3 labs)
Llama 3.2 3B  | 80.4% (in-domain)    | 4 judges (3 labs)
Qwen 2.5 0.5B | 72% (comprehensive)* | 4 judges (3 labs)

*0.5B model: 77% in-domain (n=13), 40% out-of-domain (n=15), 72% comprehensive (n=57)

Validation robustness: All four judges preferred the fine-tuned Qwen 2.5 3B (78.9-95.2%); 91.2% cross-lab pairwise agreement (GPT-4o ↔ Gemini)

Resource requirements: ~$3 training cost per 3B model (cloud GPU); 0.5B trains on consumer hardware at negligible cost

Novel methodology

Phase 1: Interactive Authoring

  • Engage in iterative discussions (3-10 turns) with LLMs on target domain topics
  • Researcher directs discussion, evaluates quality, ensures domain accuracy
  • Capture specialized reasoning patterns through expert guidance (see the sketch after this list)
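
A minimal sketch of how an authoring session might be captured; the data structures and names are illustrative, not the project's actual tooling:

```python
# Hypothetical capture structure for an authoring session: the expert
# drives a multi-turn discussion, edits responses, and flags the turns
# worth keeping as training data.
from dataclasses import dataclass, field

@dataclass
class DiscussionTurn:
    prompt: str          # expert's question or direction for this turn
    response: str        # LLM draft, possibly edited by the expert
    keep: bool = False   # expert's curation decision

@dataclass
class Discussion:
    topic: str
    turns: list[DiscussionTurn] = field(default_factory=list)

    def flagged_examples(self):
        """Yield (prompt, response) pairs marked as training-worthy."""
        for turn in self.turns:
            if turn.keep:
                yield turn.prompt, turn.response
```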

Phase 2: Curation and Refinement

  • Select conversations that exemplify target reasoning patterns
  • Refine responses for clarity and remove artifacts
  • Verify domain accuracy and depth
  • Apply expert judgment to ensure each example contributes meaningfully (an illustrative cleanup helper follows this list)
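
Only the mechanical part of this pass lends itself to automation; a sketch under the assumption that artifact removal targets stock LLM boilerplate (patterns and helper names are hypothetical, not the authors' pipeline):

```python
# Illustrative cleanup helper: strips mechanical LLM artifacts so the
# expert's review can focus on accuracy and depth.
import re

ARTIFACT_PATTERNS = [
    r"^As an AI(?: language model)?[^.]*\.\s*",  # boilerplate disclaimers
    r"\s*I hope this helps[!.]?\s*$",            # sign-off filler
]

def strip_artifacts(response: str) -> str:
    for pattern in ARTIFACT_PATTERNS:
        response = re.sub(pattern, "", response, flags=re.IGNORECASE)
    return response.strip()
```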

Phase 3: Response Distribution Learning

  • Retain multiple high-quality responses per prompt (~6 avg)
  • Expose model to distribution of valid responses, not just single mappings
  • Teach range of acceptable variation within domain constraints
  • 202 unique prompts → 1,213 total responses: the model learns variance within valid answers (assembly sketched below)
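
One way to assemble such a dataset; the file name and record format here are assumptions:

```python
# Sketch of response-distribution assembly: each unique prompt yields one
# training example per retained response, so the model sees several valid
# completions for the same input (202 prompts x ~6 responses ≈ 1,213 examples).
import json
from collections import defaultdict

responses_by_prompt: dict[str, list[str]] = defaultdict(list)
with open("curated_discussions.jsonl") as f:   # hypothetical export
    for line in f:
        record = json.loads(line)
        responses_by_prompt[record["prompt"]].append(record["response"])

train_examples = [
    {"prompt": prompt, "response": response}
    for prompt, responses in responses_by_prompt.items()
    for response in responses
]
```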

Dataset: 1,213 training examples drawn from iterative LLM discussions on continental philosophy, speculative reasoning, and creative writing, developed over several years.

Architecture comparison

To demonstrate methodology transferability, we fine-tuned three architectures on identical training data, revealing distinct trade-offs:

Qwen 2.5 3B — Highest Specialization

91.2% in-domain, 47% out-of-domain

Strongest alignment with philosophical discourse patterns. Specializes aggressively while maintaining competence on unseen domains.

Llama 3.2 3B — Best General Preservation

80.4% in-domain, 60% out-of-domain

Strong domain performance with best preservation of general capabilities. Ideal when maintaining broad competence is prioritized.

Qwen 2.5 0.5B — Small Model Viability

77% in-domain, 40% out-of-domain, 72% comprehensive

Demonstrates effective domain adaptation at consumer-hardware scale. Trains on M4 MacBook at negligible cost.

Key finding: No model shows catastrophic forgetting—all maintain at least 40% competence on completely unseen domains. Architecture choice depends on whether maximizing domain expertise (Qwen) or preserving general capabilities (Llama) is prioritized.

Multi-judge validation

To address single-judge reliability concerns, we employed four independent judges from three laboratories (Anthropic, OpenAI, Google) with blind A/B testing and randomized presentation order.
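
A minimal sketch of one blind trial; the judge callable stands in for the four commercial judge APIs, and all names and signatures are assumptions:

```python
# Blind A/B trial with randomized presentation order, so a judge cannot
# learn which slot holds the fine-tuned model's output.
import random

def blind_ab_trial(prompt, base_out, tuned_out, judge) -> bool:
    """Return True if the judge prefers the fine-tuned output."""
    tuned_first = random.random() < 0.5
    a, b = (tuned_out, base_out) if tuned_first else (base_out, tuned_out)
    verdict = judge(prompt, a, b)  # judge returns "A" or "B"
    return verdict == ("A" if tuned_first else "B")

def win_rate(trials):
    """Fraction of trials in which the fine-tuned model won."""
    wins = [blind_ab_trial(*t) for t in trials]
    return sum(wins) / len(wins)
```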

Qwen 2.5 3B

Claude 3.5 Sonnet: 95.2% · Claude Opus 4: 78.9% · GPT-4o: 93.0% · Gemini 2.5: 94.7%

Llama 3.2 3B

Claude 3.5 Sonnet: 73.8% · Claude Opus 4: 80.0% · GPT-4o: 82.5% · Gemini 2.5: 84.2%

Qwen 2.5 0.5B

Claude 3.5 Sonnet: 76.2% · Claude Opus 4: 45.6% · GPT-4o: 75.4% · Gemini 2.5: 82.5%

Cross-lab agreement: GPT-4o and Gemini agreed in 52 of 57 pairwise cases (91.2%). Consistent preference across judges from competing commercial labs is strong evidence of genuine quality improvements rather than judge-specific biases.

Cost-effectiveness

Traditional approach: 10,000+ examples, weeks to months of data collection, multiple annotators, expensive quality control

Our approach: 1,213 examples authored through LLM discussions, ~$0 data generation cost, ~$3 training cost per 3B model, single expert maintains consistency

ROI: a dataset ~8x smaller than typical 10,000+ example instruction-tuning sets, near-zero data generation cost, ~$3 training cost, and a 91.2% win rate for the best model

Why this methodology works

  • Expert-directed content selection: Human authors recognize nuanced patterns, correct errors, ensure domain accuracy
  • Iterative refinement: Multiple revision cycles, progressive idea development, authentic reasoning progression
  • Consistent expertise signal: Unified philosophical perspective, coherent reasoning style, authentic domain voice
  • Response distribution learning: Multiple responses per prompt teach variance within valid answers, not just memorization

Broader implications

For ML Research: Challenges assumption that effective fine-tuning requires massive datasets; demonstrates quality and curation matter more than scale

For Domain Adaptation: Democratizes access—experts can create effective models without large budgets; reduces environmental impact through smaller datasets

For AI Safety: Greater control over model behavior patterns; transparent data provenance; ability to audit training examples

The methodology is domain-agnostic and reproducible. Any domain expert can apply this approach in fields where expertise is expressed through natural language—from creative writing and humanities to technical documentation and specialized discourse communities.

Technical details

  • Fine-tuning method: LoRA (r=16, alpha=32) targeting q_proj and v_proj modules
  • Training: 2 epochs, effective batch size 8, learning rate 2e-4, max_seq_length 2048 (a reproduction sketch follows this list)
  • Efficient deployment: LoRA adapters (14-19MB) enable rapid deployment and rollback
  • Evaluation: Blind A/B testing, randomized order, 4 judges from 3 labs, 5 criteria (Creativity, Coherence, Depth, Engagement, Quality)
  • Robustness testing: Multiple reproducibility runs, temperature variations (0.5/0.75/1.0), token length tests (256/512/1024)
  • Domain coverage: Continental philosophy, phenomenology, existentialism, critical theory, speculative reasoning, creative writing (essays, cultural criticism, literary analysis)
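
A reproduction sketch with Hugging Face peft and trl; the model IDs, the placeholder dataset row, and the exact trl argument names (which shift between releases) are assumptions, not the authors' exact script:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# ~1,213 curated prompt/response pairs, rendered to plain text here for
# simplicity; a real run would apply the model's chat template.
dataset = Dataset.from_list([
    {"text": "User: <prompt>\nAssistant: <response>"},  # placeholder row
])

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="brie-lora",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    learning_rate=2e-4,
    max_seq_length=2048,             # renamed in some trl versions
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B",         # or Llama 3.2 3B / Qwen 2.5 0.5B
    args=args,
    train_dataset=dataset,
    peft_config=lora_cfg,
)
trainer.train()
```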
