Key contribution
Traditional domain adaptation requires massive datasets (10,000+ examples), creating a barrier for domain experts who lack data collection resources. We demonstrate an alternative: LLM-assisted data authoring, where experts engage in iterative discussions with LLMs to generate training examples.
The key insight: quality and curation matter more than scale for domain-specific applications. By positioning LLMs as authoring tools rather than autonomous generators, domain experts can create high-quality training data that captures specialized reasoning patterns with substantially fewer examples.
This approach democratizes domain adaptation—any domain expert can use LLMs as authoring tools to generate training data that captures their expertise, without requiring large budgets or massive datasets.
Results summary
| Base Model | Win Rate | Validation |
|---|---|---|
| Qwen 2.5 3B | 91.2% (in-domain) | 4 judges (3 labs) |
| Llama 3.2 3B | 80.4% (in-domain) | 4 judges (3 labs) |
| Qwen 2.5 0.5B | 72% (comprehensive)* | 4 judges (3 labs) |
*0.5B model: 77% in-domain (n=13), 40% out-of-domain (n=15), 72% comprehensive (n=57)
Validation robustness: Unanimous strong preference across all four judges for the best model, Qwen 2.5 3B (78.9-95.2%); 91.2% cross-lab pairwise agreement (GPT-4o ↔ Gemini 2.5)
Resource requirements: ~$3 training cost per 3B model (cloud GPU); 0.5B trains on consumer hardware at negligible cost
Novel methodology
Phase 1: Interactive Authoring
- Engage in iterative discussions (3-10 turns) with LLMs on target domain topics
- Researcher directs discussion, evaluates quality, ensures domain accuracy
- Capture specialized reasoning patterns through expert guidance
Phase 2: Curation and Refinement
- Select conversations that exemplify target reasoning patterns
- Refine responses for clarity and remove artifacts
- Verify domain accuracy and depth
- Apply expert judgment to ensure meaningful contributions
Phase 3: Response Distribution Learning
- Retain multiple high-quality responses per prompt (~6 avg)
- Expose model to distribution of valid responses, not just single mappings
- Teach range of acceptable variation within domain constraints
- 202 unique prompts → 1,213 total responses, so the model learns the variance within valid answers (see the sketch below)
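To make the response-distribution setup concrete, the sketch below expands each prompt's curated responses into separate chat-format training rows. It is a minimal illustration under assumed field names (`prompt`, `responses`, `messages`), not the project's actual data schema.

```python
# Minimal sketch of response distribution learning: every curated response
# to a prompt becomes its own training row, so the model is exposed to the
# spread of acceptable answers rather than a single prompt->answer mapping.
# Field names are illustrative assumptions, not the project's real schema.

def expand_to_training_rows(curated):
    """curated: list of {"prompt": str, "responses": [str, ...]} dicts."""
    rows = []
    for item in curated:
        for response in item["responses"]:
            rows.append({
                "messages": [
                    {"role": "user", "content": item["prompt"]},
                    {"role": "assistant", "content": response},
                ]
            })
    return rows

# 202 prompts with ~6 curated responses each yields the ~1,213 training rows.
```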
Dataset: 1,213 training examples drawn from iterative discussions on continental philosophy, speculative reasoning, and creative writing, developed over several years of philosophical and creative exchange.
Architecture comparison
To demonstrate methodology transferability, we fine-tuned three architectures on identical training data, revealing distinct trade-offs:
Qwen 2.5 3B — Highest Specialization
91.2% in-domain, 47% out-of-domain
Strongest alignment with philosophical discourse patterns. Specializes aggressively while maintaining competence on unseen domains.
Llama 3.2 3B — Best General Preservation
80.4% in-domain, 60% out-of-domain
Strong domain performance with best preservation of general capabilities. Ideal when maintaining broad competence is prioritized.
Qwen 2.5 0.5B — Small Model Viability
77% in-domain, 40% out-of-domain, 72% comprehensive
Demonstrates effective domain adaptation at consumer-hardware scale. Trains on M4 MacBook at negligible cost.
Key finding: No model shows catastrophic forgetting; all maintain win rates of at least 40% on completely unseen domains. Architecture choice depends on whether maximizing domain expertise (Qwen) or preserving general capabilities (Llama) is the priority.
Multi-judge validation
To address single-judge reliability concerns, we employed four independent judges from three laboratories (Anthropic, OpenAI, Google) with blind A/B testing and randomized presentation order.
| Base Model | Claude 3.5 Sonnet | Opus 4 | GPT-4o | Gemini 2.5 |
|---|---|---|---|---|
| Qwen 2.5 3B | 95.2% | 78.9% | 93.0% | 94.7% |
| Llama 3.2 3B | 73.8% | 80.0% | 82.5% | 84.2% |
| Qwen 2.5 0.5B | 76.2% | 45.6% | 75.4% | 82.5% |
Cross-lab agreement: 91.2% pairwise agreement between GPT-4o ↔ Gemini 2.5 (52/57 cases). Consistent preferences across judges from competing commercial labs are strong evidence of genuine quality improvements rather than judge-specific biases.
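A minimal sketch of the blind A/B protocol and the pairwise-agreement statistic follows; the `judge` callable and its "A"/"B" return convention are assumptions for illustration, not the evaluation harness actually used.

```python
import random

def blind_ab_trial(prompt, base_response, tuned_response, judge, rng=random):
    """Run one blind comparison: present the two responses in random order
    and return True if the judge prefers the fine-tuned response.
    `judge(prompt, a, b)` is assumed to return "A" or "B"."""
    tuned_first = rng.random() < 0.5
    a, b = (tuned_response, base_response) if tuned_first else (base_response, tuned_response)
    verdict = judge(prompt, a, b)  # the judge never sees which model produced which response
    return verdict == ("A" if tuned_first else "B")

def pairwise_agreement(verdicts_x, verdicts_y):
    """Fraction of test cases on which two judges picked the same winner."""
    matches = sum(vx == vy for vx, vy in zip(verdicts_x, verdicts_y))
    return matches / len(verdicts_x)

# e.g. 52 matching verdicts out of 57 cases -> 52/57 ≈ 91.2% agreement
```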
Cost-effectiveness
Traditional approach: 10,000+ examples, weeks to months of data collection, multiple annotators, expensive quality control
Our approach: 1,213 examples authored through LLM discussions, ~$0 data generation cost, ~$3 training cost per 3B model, single expert maintains consistency
ROI: roughly one-eighth the examples of a typical 10,000+ instruction-tuning dataset, near-zero data generation cost, ~$3 in training compute, and a 91.2% in-domain win rate for the best model
Why this methodology works
- Expert-directed content selection: Human authors recognize nuanced patterns, correct errors, ensure domain accuracy
- Iterative refinement: Multiple revision cycles, progressive idea development, authentic reasoning progression
- Consistent expertise signal: Unified philosophical perspective, coherent reasoning style, authentic domain voice
- Response distribution learning: Multiple responses per prompt teach variance within valid answers, not just memorization
Broader implications
For ML Research: Challenges assumption that effective fine-tuning requires massive datasets; demonstrates quality and curation matter more than scale
For Domain Adaptation: Democratizes access—experts can create effective models without large budgets; reduces environmental impact through smaller datasets
For AI Safety: Greater control over model behavior patterns; transparent data provenance; ability to audit training examples
The methodology is domain-agnostic and reproducible. Any domain expert can apply this approach in fields where expertise is expressed through natural language—from creative writing and humanities to technical documentation and specialized discourse communities.
Technical details
- Fine-tuning method: LoRA (r=16, alpha=32) targeting the q_proj and v_proj modules
- Training: 2 epochs, effective batch size 8, learning rate 2e-4, max_seq_length 2048 (a configuration sketch follows this list)
- Deployment: LoRA adapters of only 14-19 MB enable rapid deployment and rollback
- Evaluation: Blind A/B testing, randomized order, 4 judges from 3 labs, 5 criteria (Creativity, Coherence, Depth, Engagement, Quality)
- Robustness testing: Multiple reproducibility runs, temperature variations (0.5/0.75/1.0), token length tests (256/512/1024)
- Domain coverage: Continental philosophy, phenomenology, existentialism, critical theory, speculative reasoning, creative writing (essays, cultural criticism, literary analysis)
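For readers who want to reproduce the setup, here is a configuration sketch matching the hyperparameters above. It assumes a Hugging Face PEFT + TRL stack; the actual training code is not specified here, so the model name, output directory, and example row are placeholders.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset: in practice, the ~1,213 curated chat-format rows.
train_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What does Merleau-Ponty mean by the lived body?"},
        {"role": "assistant", "content": "..."},
    ]},
])

lora_config = LoraConfig(
    r=16,                                # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen2.5-3b-philosophy-lora",  # placeholder path
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,       # 2 x 4 = effective batch size 8
    learning_rate=2e-4,
    max_seq_length=2048,                 # renamed `max_length` in newer TRL releases
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",    # or Llama 3.2 3B / Qwen 2.5 0.5B
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```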