Proposal: Foreman Probe

Submitted by: Edgar Chen, CEO, Crimson Leaf Holdings
Task ID: 9838fa67-e2b1-44d9-9a18-5bc2961a8e98
Status: AWAITING DAVID'S APPROVAL

Executive Summary

1. PROPOSED COMPANY
Foreman Probe - A specialized LLM benchmarking firm that automates model probe tasks for Foreman-led evaluations, testing and comparing model capabilities across real-world scenarios.[1] It addresses the lack of scalable, low-cost benchmarking of 15+ LLMs on 38+ coding and work tasks, enabling data-driven model selection that goes beyond static leaderboards.[1][3]

2. PROBLEM STATEMENT
Crimson Leaf currently has no systematic way to benchmark and route probes across 15 LLMs on 38 real coding tasks (570 calls in total, at a cost of $2.29) or on 149 text-only work tasks.[1][3] Doing so requires custom test harnesses, deterministic scorers, and parallel execution. Without them, profitable AI publishing is limited to unverified model performance claims rather than evidence-based comparisons such as Ian's routing table or Brookings' 65-79% median scores.[1][3]
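The call count and cost figures above can be checked with a few lines of arithmetic. The sketch below assumes one call per (model, task) pair, which is consistent with 15 x 38 = 570; the per-call figure is a simple average and does not reflect per-model pricing differences, which the source does not break down.

```python
# Figures quoted in the proposal (source [1]).
NUM_MODELS = 15
NUM_TASKS = 38
TOTAL_COST_USD = 2.29


def total_calls(num_models: int, num_tasks: int) -> int:
    """Assumption: exactly one call per (model, task) pair."""
    return num_models * num_tasks


def average_cost_per_call(total_cost: float, calls: int) -> float:
    """Average cost per call; real per-model costs will vary."""
    return total_cost / calls


calls = total_calls(NUM_MODELS, NUM_TASKS)
print(calls)  # 570
print(round(average_cost_per_call(TOTAL_COST_USD, calls), 4))  # 0.004
```

At this scale the average works out to well under one cent per call.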

3. MARKET OPPORTUNITY
Low-cost, scalable LLM evaluation has already been demonstrated: benchmarking 15 LLMs on 38 real coding tasks costs just $2.29, with raw data published on GitHub.[1] Across key occupations, 149 tasks (7% of the total) are testable as text-only tasks. On these, leading LLMs achieve median scores of 65-79%, and average scores rose from 40.5% for 2024 models to 66% for 2025 models, a gain of about 26 percentage points.[3] Benchmarks such as AssetOpsBench offer 150+ expert-curated scenarios, while MMLU-Pro provides harder, more stable prompts that avoid saturation.[5][6] Competitors include ProbeLLM, which uses Monte Carlo Tree Search (MCTS) for failure-mode discovery, and Precog, which forecasts benchmark scores (MAE 14.6), but neither shares Crimson Leaf's focus on Foreman probes for publishing routing tables.[2][4]

4. PROPOSED SOLUTION
Foreman Probe closes the gap by delivering a platform with 5 model adapters (Anthropic, Gemini, etc.), 11 scorers (e.g., json_object, code_exec), and parallel threading for $2.29-scale runs on 38+ tasks. The platform produces quality/time/cost routing tables like those in Ian's benchmark.[1] First 30 days: deploy the test harness on 10 coding tasks across 5 LLMs, validate it with QA bug fixes, and publish initial results on GitHub.[1] First 90 days: expand to the 149 work tasks and AssetOpsBench scenarios, integrate MMLU-Pro and BIG-Bench metrics, and generate comparative reports covering the 26-point gain between 2024 and 2025 models.[3][5][6]
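As a rough illustration of how adapters, a deterministic scorer, and parallel threading could fit together, here is a minimal sketch. Everything in it is an assumption made for illustration: the adapters are stub functions rather than real Anthropic or Gemini clients, the behaviour of the json_object scorer is guessed from its name, and the names Result, run_probe, run_all, and routing_table are hypothetical. Cost tracking is omitted because the stubs make no billable calls.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from time import perf_counter
from typing import Callable, Dict, List

Adapter = Callable[[str], str]   # hypothetical: prompt in, model output out
Scorer = Callable[[str], float]  # hypothetical: output in, score in [0, 1] out


@dataclass
class Result:
    model: str
    task: str
    score: float
    seconds: float


def json_object_scorer(output: str) -> float:
    """Assumed behaviour: 1.0 if the output parses as a JSON object, else 0.0."""
    try:
        return 1.0 if isinstance(json.loads(output), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0


def run_probe(model: str, adapter: Adapter, task: str, prompt: str,
              scorer: Scorer) -> Result:
    """Run one (model, task) probe and record its score and wall-clock time."""
    start = perf_counter()
    output = adapter(prompt)
    return Result(model, task, scorer(output), perf_counter() - start)


def run_all(adapters: Dict[str, Adapter], tasks: Dict[str, str],
            scorer: Scorer, max_workers: int = 8) -> List[Result]:
    """Fan every (model, task) pair out to a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_probe, model, adapter, task, prompt, scorer)
            for model, adapter in adapters.items()
            for task, prompt in tasks.items()
        ]
        return [f.result() for f in futures]


def routing_table(results: List[Result]) -> Dict[str, str]:
    """Pick the best model per task: highest score, then lowest latency."""
    best: Dict[str, Result] = {}
    for r in results:
        cur = best.get(r.task)
        if cur is None or (r.score, -r.seconds) > (cur.score, -cur.seconds):
            best[r.task] = r
    return {task: r.model for task, r in best.items()}


# Stub adapters standing in for real model clients.
adapters = {
    "stub-good": lambda prompt: '{"answer": 42}',
    "stub-bad": lambda prompt: "not json",
}
tasks = {"demo-task": "Return a JSON object with an 'answer' key."}

results = run_all(adapters, tasks, json_object_scorer)
print(routing_table(results))  # {'demo-task': 'stub-good'}
```

Threads suit this workload because model calls are I/O-bound; a real harness would also need retries, rate limiting, and per-call cost accounting, none of which is shown here.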

5. STRATEGIC FIT
This proposal advances Crimson Leaf's profitable AI publishing mission by producing authoritative benchmarks (e.g., 38-task routing tables, 65-79% work scores) that can be monetized as content, GitHub datasets, and model recommendation guides. It turns evaluation data into high-engagement assets like Ian's blog, and it enables precise LLM routing for content generation workflows.[1][3]

Research Sources

(Paste the "Complete Source List" from the research synthesis)

Research Synthesis

Key Statistics

Total cost for benchmarking 15 LLMs on 38 real coding tasks: $2.29[1].
Number of tasks in Ian's LLM benchmark: 38 tasks across 10 groups (E/C/R/W/P/H/I/D/L/M)[1].
Number of models tested in coding benchmark: 15 models, resulting in 570 calls[1].
Testable text-only tasks across Finance, Business Operations, Management, and Computer & Mathematics occupations: 7% (149 tasks)[3].
Median scores of leading LLMs on synthetic work exams: 65-79%[3].
Performance gain for LLMs on work benchmarks: 2024 models averaged 40.5%, 2025 models reached 66% (a gain of about 26 percentage points)[3].
Number of scenarios in AssetOpsBench: 150+ scenarios curated by experts[6].
MMLU-Pro offers improved prompt stability and benefits from chain-of-thought prompting compared to MMLU[5].
BIG-Bench normalizes task metrics to a single score in range [0, 100] using high/low reference scores[5].
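The BIG-Bench normalization noted above can be sketched as a linear rescaling in which the low reference score maps to 0 and the high reference score maps to 100. The function name and the example values below are hypothetical, and the linear form is an assumption based on the description of high/low reference scores.

```python
def normalize_score(raw: float, low: float, high: float) -> float:
    """Linearly rescale a raw metric so that low -> 0 and high -> 100."""
    if high == low:
        raise ValueError("high and low reference scores must differ")
    return 100.0 * (raw - low) / (high - low)


# Hypothetical example: a metric where 0.25 is the low reference
# (e.g. chance level) and 1.0 is the high reference.
print(normalize_score(0.625, low=0.25, high=1.0))  # 50.0
```

A raw score outside the reference range falls outside [0, 100] under this formula unless it is clamped.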

Competitor Landscape

ProbeLLM: Benchmark-agnostic automated probing framework using hierarchical Monte Carlo Tree Search for failure mode discovery in LLMs[2].
Precog: Proprietary reasoning model for estimating LLM benchmark scores from descriptions, with an MAE of 14.6 across seven metric groups[4].
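For context on the MAE figure, mean absolute error is the average absolute gap between predicted and actual scores. The sketch below uses made-up numbers only to show the calculation; it does not reproduce Precog's data or method.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[float]) -> float:
    """Average of |predicted - actual| over paired observations."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical benchmark scores, for illustration only.
predicted_scores = [60.0, 75.0, 40.0]
actual_scores = [70.0, 65.0, 45.0]
print(mean_absolute_error(predicted_scores, actual_scores))
```

On a 0-100 score scale, an MAE of 14.6 would mean forecasts miss the true score by about 14.6 points on average.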