# Proposal: Foreman Probe
Submitted by: Edgar Chen, CEO, Crimson Leaf Holdings

Task ID: 9838fa67-e2b1-44d9-9a18-5bc2961a8e98

Status: AWAITING DAVID'S APPROVAL

---
## Executive Summary
**1. PROPOSED COMPANY**

**Foreman Probe** is a specialized LLM benchmarking firm that automates model probe tasks for Foreman-led evaluations, testing and comparing model capabilities across real-world scenarios.[1] It closes the gap in scalable, low-cost benchmarking of 15+ LLMs on 38+ coding and work tasks, enabling data-driven model selection that goes beyond static leaderboards.[1][3]

**2. PROBLEM STATEMENT**

Crimson Leaf cannot currently benchmark and route probes systematically across 15 LLMs on 38 real coding tasks (570 calls in total, at a cost of $2.29) or on 149 text-only work tasks. Doing so requires custom test harnesses, deterministic scorers, and parallel execution. Without them, profitable AI publishing is limited to unverified model performance claims rather than evidence-based comparisons such as Ian's routing table or Brookings' 65-79% median scores.[1][3]
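
The call count and cost quoted above can be checked with simple arithmetic. The sketch below assumes one call per (model, task) pair, which is consistent with the 570 figure; the per-call and per-model averages are derived here for illustration and are not reported by the source.

```python
# Figures quoted in the proposal (source [1]).
NUM_MODELS = 15
NUM_TASKS = 38
TOTAL_COST_USD = 2.29

# Assumption: exactly one call per (model, task) pair.
total_calls = NUM_MODELS * NUM_TASKS

# Derived averages (not reported by the source).
cost_per_call = TOTAL_COST_USD / total_calls
cost_per_model = TOTAL_COST_USD / NUM_MODELS

print(f"calls: {total_calls}")
print(f"average cost per call: ${cost_per_call:.4f}")
print(f"average cost per model: ${cost_per_model:.3f}")
```

Under that assumption, a full run averages well under one cent per call.
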

**3. MARKET OPPORTUNITY**

The LLM evaluation market has proven, low-cost scalability: benchmarking 15 LLMs on 38 real coding tasks costs just $2.29 with raw GitHub data.[1] Across key occupations, 149 tasks (7% of the total) are testable as text-only tasks. Leading LLMs score medians of 65-79% on them, and average scores rose from 40.5% for 2024 models to 66% for 2025 models (a 26-point gain).[3] Benchmarks such as AssetOpsBench offer 150+ expert-curated scenarios, while MMLU-Pro provides harder, more stable prompts that avoid saturation.[5][6] Competitors include ProbeLLM (MCTS-based failure discovery) and Precog (score forecasting, MAE 14.6), but neither shares Crimson Leaf's focus on Foreman probes for publishing routing tables.[2][4]

**4. PROPOSED SOLUTION**

**Foreman Probe** closes the gap with a platform that combines 5 model adapters (Anthropic, Gemini, etc.), 11 scorers (json_object, code_exec, and others), and parallel threading for $2.29-scale runs on 38+ tasks, producing quality/time/cost routing tables like Ian's benchmark.[1] **First 30 days**: Deploy the test harness on 10 coding tasks across 5 LLMs, validate it through QA and bug fixes, and publish initial results on GitHub.[1] **First 90 days**: Expand to the 149 work tasks and AssetOpsBench scenarios, integrate MMLU-Pro and BIG-Bench metrics, and generate comparative reports showing the 26-point gains.[3][5][6]
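
The harness described above (model adapters, deterministic scorers, parallel execution, routing table) could be sketched roughly as follows. This is a minimal illustration, not the actual Foreman Probe implementation: the stub adapters, the `Result` fields, the task text, and the tie-breaking rule (quality, then cost, then time) are all assumptions, and only a `json_object`-style scorer is shown.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Result:
    model: str
    task: str
    quality: float   # 0.0-1.0 from a deterministic scorer
    seconds: float   # wall-clock latency of the call
    cost_usd: float  # cost reported by the adapter


def score_json_object(output: str) -> float:
    """Deterministic scorer: 1.0 if the output parses as a JSON object."""
    try:
        return 1.0 if isinstance(json.loads(output), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0


# Hypothetical stub adapters: each takes a prompt, returns (output, cost_usd).
def stub_model_a(prompt: str) -> Tuple[str, float]:
    return '{"answer": 42}', 0.004


def stub_model_b(prompt: str) -> Tuple[str, float]:
    return "answer: 42", 0.001


ADAPTERS: Dict[str, Callable[[str], Tuple[str, float]]] = {
    "stub-a": stub_model_a,
    "stub-b": stub_model_b,
}
TASKS: Dict[str, str] = {"T1": "Return the answer as a JSON object."}


def run_probe(model: str, task: str) -> Result:
    """Call one adapter on one task, then time and score the output."""
    start = time.perf_counter()
    output, cost = ADAPTERS[model](TASKS[task])
    elapsed = time.perf_counter() - start
    return Result(model, task, score_json_object(output), elapsed, cost)


def run_all(max_workers: int = 8) -> List[Result]:
    """Run every (model, task) pair in parallel threads."""
    pairs = [(m, t) for m in ADAPTERS for t in TASKS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_probe(*p), pairs))


def routing_table(results: List[Result]) -> Dict[str, str]:
    """Per task, pick the highest quality, breaking ties on cost then time."""
    best: Dict[str, Result] = {}
    for r in results:
        cur = best.get(r.task)
        key = (-r.quality, r.cost_usd, r.seconds)
        if cur is None or key < (-cur.quality, cur.cost_usd, cur.seconds):
            best[r.task] = r
    return {task: r.model for task, r in best.items()}


results = run_all()
table = routing_table(results)
print(table)
```

Threads suit this workload because real adapters would spend most of their time waiting on network responses. Swapping a stub for a real provider client would only change the adapter function, not the scoring or routing logic.
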

**5. STRATEGIC FIT**

Foreman Probe advances Crimson Leaf's profitable AI publishing mission by producing authoritative benchmarks (e.g., 38-task routing tables and 65-79% work scores) that can be monetized as content, GitHub datasets, and model recommendation guides. Evaluation data becomes a high-engagement asset, as with Ian's blog, and also enables precise LLM routing within content generation workflows.[1][3]

---
## Research Sources
(Paste the "Complete Source List" from the research synthesis)
## Research Synthesis
### Key Statistics
- Total cost of benchmarking 15 LLMs on 38 real coding tasks: $2.29[1].
- Tasks in Ian's LLM benchmark: 38, across 10 groups (E/C/R/W/P/H/I/D/L/M)[1].
- Models tested in the coding benchmark: 15, resulting in 570 calls[1].
- Testable text-only tasks across Finance, Business Operations, Management, and Computer & Mathematics occupations: 149 tasks (7% of the total)[3].
- Median scores of leading LLMs on synthetic work exams: 65-79%[3].
- Performance gain on work benchmarks: 2024 models averaged 40.5% and 2025 models reached 66%, a gain of about 26 percentage points[3].
- Scenarios in AssetOpsBench: 150+, curated by experts[6].
- Compared with MMLU, MMLU-Pro offers improved prompt stability and benefits from chain-of-thought prompting[5].
- BIG-Bench normalizes task metrics to a single score in the range [0, 100] using high and low reference scores[5].
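
The BIG-Bench style normalization mentioned in the last bullet can be illustrated with a linear rescaling between the two reference scores. This is a sketch under stated assumptions: the function name, the `clamp` option, and the example reference values are hypothetical, and the exact formula used by any given benchmark should be checked against its own documentation.

```python
def normalize_score(raw: float, low: float, high: float, clamp: bool = False) -> float:
    """Linearly map raw so that low -> 0 and high -> 100.

    low and high are the task's reference scores (for example, a
    random-guess baseline and a perfect score). Without clamping, raw
    values outside [low, high] map outside [0, 100].
    """
    if high == low:
        raise ValueError("high and low reference scores must differ")
    scaled = 100.0 * (raw - low) / (high - low)
    if clamp:
        scaled = max(0.0, min(100.0, scaled))
    return scaled


# Hypothetical 4-way multiple-choice task: chance accuracy 0.25, best 1.0.
print(normalize_score(0.25, low=0.25, high=1.0))   # 0.0
print(normalize_score(0.625, low=0.25, high=1.0))  # 50.0
print(normalize_score(1.0, low=0.25, high=1.0))    # 100.0
```

Normalizing this way lets tasks with different raw metrics and different chance baselines be averaged on a common scale.
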
### Competitor Landscape
- ProbeLLM: A benchmark-agnostic automated probing framework that uses hierarchical Monte Carlo Tree Search to discover failure modes in LLMs[2].
- Precog: A proprietary reasoning model that estimates LLM benchmark scores from descriptions, with a mean absolute error (MAE) of 14.6 across seven metric groups