Brie

Model · 2025

Controlled architecture comparison study: a hand-tailored dataset (1,213 examples from philosophical discussions in RLHF/DPO environments) was used to fine-tune models from the Qwen 2.5, Llama 3.2, and Qwen3 families, isolating how each architecture handles philosophical reasoning and contemplative discourse. Evaluated via blind A/B testing with four independent judges across three labs.
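
Below is a minimal sketch of what the per-architecture fine-tuning setup could look like with Hugging Face transformers + peft; the checkpoint IDs, LoRA hyperparameters, and dataset handling are illustrative assumptions, not the study's exact configuration.

```python
# Sketch: fine-tune each architecture on the same 1,213-example dataset with LoRA.
# Checkpoint IDs and hyperparameters are illustrative, not the study's actual values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODELS = [
    "Qwen/Qwen2.5-3B-Instruct",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
    "Qwen/Qwen3-0.6B",
]

lora_cfg = LoraConfig(
    r=16,                      # assumed rank; keeps adapters in the tens of MB
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

for model_id in BASE_MODELS:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model = get_peft_model(model, lora_cfg)   # only the LoRA weights are trainable
    model.print_trainable_parameters()
    # ... identical training loop over the shared 1,213-example dataset ...
```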

Research question

How do different model architectures differ in their ability to adopt and maintain patterns of philosophical reasoning and contemplative discourse when fine-tuned on identical training data?

Architecture comparison results

  • Qwen 2.5 3B: 91.2% win rate (4 judges, 3 labs); strongest alignment
  • Llama 3.2 3B: 80.4% avg win rate (range: 75.4–84.2% across judges)
  • Qwen 2.5 0.5B: 71.9% win rate (77% in-domain; 40% out-of-domain)
  • Qwen3 0.6B: ~30% win rate; sub-1B models struggle with contemplative reasoning

Research findings

  • Qwen 2.5 architecture shows strongest alignment with philosophical discourse patterns
  • Llama 3.2 maintains strong performance (75–84% depending on judge)
  • Model size matters: sub-1B models struggle with contemplative reasoning patterns
  • Different judges show varying sensitivity to stylistic differences
  • Controlled comparison: identical training data (1,213 examples) across all models isolates architecture-level differences

Evaluation methodology

  • Blind A/B testing with randomized presentation order to control for position bias (see the sketch after this list)
  • Four independent LLM judges across three labs (Anthropic, OpenAI, Google)
  • Evaluation criteria: Creativity, Coherence, Depth, Engagement, Writing Quality
  • Note: Critical bug in winner determination discovered and corrected (inverted 56% of early results)
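
A minimal sketch of the blind A/B step, assuming each comparison shows a baseline and a fine-tuned response to an LLM judge in randomized order and then maps the verdict back to model identity; that de-randomization step is exactly where an inverted mapping would flip results, as in the bug noted above. `call_judge` is a hypothetical stand-in for whichever judge API is used.

```python
# Sketch: blind A/B comparison with randomized presentation order.
# call_judge() is a hypothetical wrapper around an LLM judge API; it returns "A" or "B".
import random

def judge_pair(prompt: str, baseline_out: str, finetuned_out: str, call_judge) -> str:
    """Return which system won this comparison: 'baseline' or 'finetuned'."""
    # Randomize which response appears as "A" to control for position bias.
    flipped = random.random() < 0.5
    resp_a, resp_b = (finetuned_out, baseline_out) if flipped else (baseline_out, finetuned_out)

    verdict = call_judge(prompt, resp_a, resp_b)  # judge sees only anonymous A/B

    # De-randomize: map the judge's A/B verdict back to the actual systems.
    # (An inverted mapping at this step is the kind of bug that flips results.)
    if verdict == "A":
        return "finetuned" if flipped else "baseline"
    return "baseline" if flipped else "finetuned"

def win_rate(results: list[str]) -> float:
    """Fraction of comparisons won by the fine-tuned model."""
    return sum(r == "finetuned" for r in results) / len(results)
```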

Multi-judge validation

  • Qwen 2.5 3B — Claude Sonnet: 95.2% · Opus: 78.9% · GPT‑4o: 93.0% · Gemini: 94.7%
  • Llama 3.2 3B — Claude Sonnet: 73.8% · Opus: 80.0% · GPT‑4o: 82.5% · Gemini: 84.2%
  • Qwen 2.5 0.5B — Claude Sonnet: 76.2% · Opus: 45.6% · GPT‑4o: 75.4% · Gemini: 82.5%
  • Inter-judge agreement (GPT‑4o ↔ Gemini): Qwen 3B 91.2%, Llama 3B (pending calculation)
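
A sketch of how per-judge win rates and pairwise inter-judge agreement (plain percentage agreement over a shared set of comparisons) might be computed; the data layout is assumed, and the source does not state whether a chance-corrected statistic is used instead.

```python
# Sketch: per-judge win rates and pairwise agreement over the same comparisons.
# verdicts[judge][i] is "finetuned" or "baseline" for comparison i (assumed layout).
def per_judge_win_rate(verdicts: dict[str, list[str]]) -> dict[str, float]:
    return {
        judge: sum(v == "finetuned" for v in vs) / len(vs)
        for judge, vs in verdicts.items()
    }

def pairwise_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of comparisons where two judges picked the same winner."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy data only (not the study's actual verdicts):
verdicts = {
    "gpt4o":  ["finetuned", "finetuned", "baseline", "finetuned"],
    "gemini": ["finetuned", "finetuned", "finetuned", "finetuned"],
}
print(per_judge_win_rate(verdicts))
print(pairwise_agreement(verdicts["gpt4o"], verdicts["gemini"]))  # 0.75
```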

Technical highlights

  • Controlled comparison: Identical training data across architectures isolates architectural effects
  • Efficient deployment: LoRA adapters (14–19 MB) enable rapid deployment and rollback (see the deployment sketch after this list)
  • Methodological rigor: Blind A/B, multi-judge (4 judges, 3 labs), reproducibility runs
  • Robustness: No catastrophic forgetting; tested across temperatures (0.5/0.75/1.0) and token lengths
  • Transparency: Documented evaluation bug fix; all metrics reflect corrected data
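
A deployment sketch, assuming the 14–19 MB adapters are applied to a frozen base model with peft: disabling the adapter gives the rapid rollback noted above, and the same generation loop can be rerun at the tested temperatures (0.5/0.75/1.0). The adapter path and generation settings are illustrative.

```python
# Sketch: load a LoRA adapter onto the frozen base model, generate at several
# temperatures, and roll back by disabling the adapter.
# The base checkpoint and adapter path are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, "path/to/brie-qwen2.5-3b-adapter")  # ~14-19 MB on disk

prompt = "What does it mean to pay attention?"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.5, 0.75, 1.0):          # temperatures tested in the study
    out = model.generate(**inputs, do_sample=True, temperature=temperature,
                         max_new_tokens=256)
    print(temperature, tokenizer.decode(out[0], skip_special_tokens=True))

# Rollback: disable the adapter to recover base-model behavior without reloading.
with model.disable_adapter():
    out = model.generate(**inputs, max_new_tokens=256)
```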

Links