# Proposal: Foreman Probe

Submitted by: Edgar Chen, CEO, Crimson Leaf Holdings

Task ID: 9838fa67-e2b1-44d9-9a18-5bc2961a8e98

Status: AWAITING DAVID'S APPROVAL

---

## Executive Summary

**1. PROPOSED COMPANY**

**Foreman Probe** - A specialized LLM benchmarking firm that automates model probe tasks for Foreman-led evaluations, testing and comparing model capabilities across real-world scenarios.[1] It closes the gap in scalable, low-cost benchmarking of 15+ LLMs on 38+ coding and work tasks, enabling data-driven model selection beyond static leaderboards.[1][3]

**2. PROBLEM STATEMENT**

Crimson Leaf cannot today systematically benchmark and route probes across 15 LLMs on 38 real coding tasks (570 calls in total, at a cost of $2.29) or on 149 text-only work tasks. Doing so requires custom test harnesses, deterministic scorers, and parallel execution. Without them, profitable AI publishing is limited to unverified model performance claims rather than evidence-based comparisons such as Ian's routing table or Brookings' 65-79% median scores.[1][3]

**3. MARKET OPPORTUNITY**

The LLM evaluation market has shown that benchmarking scales at low cost: running 15 LLMs on 38 real coding tasks costs just $2.29 with raw GitHub data.[1] Across key occupations, 149 tasks (7% of the total) are testable as text-only work, and leading LLMs post median scores of 65-79% on them. 2024 models averaged 40.5% and 2025 models reached 66%, a gain of about 26 percentage points.[3] AssetOpsBench offers 150+ expert-curated scenarios,[6] while MMLU-Pro provides harder, more stable prompts that avoid saturation.[5] Competitors include ProbeLLM, which uses MCTS for failure discovery, and Precog, which forecasts scores (MAE 14.6), but both lack Crimson Leaf's focus on Foreman probes for publishing routing tables.[2][4]

**4. PROPOSED SOLUTION**

**Foreman Probe** closes the gap with a platform offering 5 model adapters (Anthropic, Gemini, etc.), 11 scorers (such as json_object and code_exec), and parallel threading for $2.29-scale runs on 38+ tasks, producing quality/time/cost routing tables like Ian's benchmark.[1]

**First 30 days**: Deploy the test harness on 10 coding tasks across 5 LLMs, validate with QA bug fixes, and publish initial GitHub results.[1]

**First 90 days**: Expand to 149 work tasks and AssetOpsBench scenarios, integrate MMLU-Pro/BIG-Bench metrics, and generate comparative reports showing 26-point gains.[3][5][6]

**5. STRATEGIC FIT**

This advances Crimson Leaf's profitable AI publishing mission by producing authoritative benchmarks (e.g., 38-task routing tables, 65-79% work scores) as monetizable content, GitHub datasets, and model recommendation guides. It turns evaluation data into high-engagement assets like Ian's blog while enabling precise LLM routing for content generation workflows.[1][3]

---

## Research Sources

(Paste the "Complete Source List" from the research synthesis)

## Research Synthesis

### Key Statistics

- Total cost for benchmarking 15 LLMs on 38 real coding tasks: $2.29[1].
- Number of tasks in Ian's LLM benchmark: 38 tasks across 10 groups (E/C/R/W/P/H/I/D/L/M)[1].
- Number of models tested in coding benchmark: 15 models, resulting in 570 calls[1].
- Testable text-only tasks across Finance, Business Operations, Management, and Computer & Mathematics occupations: 7% (149 tasks)[3].
- Median scores of leading LLMs on synthetic work exams: 65-79%[3].
- Performance gain for LLMs on work benchmarks: 2024 models averaged 40.5%, 2025 models reached 66% (26 percentage point gain)[3].
- Number of scenarios in AssetOpsBench: 150+ scenarios curated by experts[6].
- MMLU-Pro offers improved prompt stability and benefits from chain-of-thought prompting compared to MMLU[5].
- BIG-Bench normalizes task metrics to a single score in range [0, 100] using high/low reference scores[5].

### Competitor Landscape

- ProbeLLM: Benchmark-agnostic automated probing framework using hierarchical Monte Carlo Tree Search for failure mode discovery in LLMs[2].
- Precog: Proprietary reasoning model for estimating LLM benchmark scores from descriptions, MAE of 14.6 across seven metric groups