New Method Estimates GPT-5.5 at 9.7T, Grok-4 at 3.2T

Summary

A new paper estimates GPT-5.5 at 9.7 trillion parameters and Grok-4 at 3.2 trillion. Li Bojie of Pine AI used 1,400 obscure factual questions to probe memory capacity, comparing closed-source models against a curve fitted to 89 open-source models with known sizes. By this measure, GPT-5.5 is nearly twice the size of second-place Claude Opus 4.6, and an error-overlap analysis suggests that several flagship versions were retrained from scratch rather than fine-tuned.

AIMPACT News, April 30 (UTC+8): According to monitoring by BlockBeats, Li Bojie, Chief Scientist at Pine AI, published a paper titled “Incompressible Knowledge Probes: Estimating Parameter Counts of Black-Box Large Language Models via Fact Capacity.” The study reverse-engineered the parameter counts of closed-source models using 1,400 obscure factual questions. Since storing a fact consumes parameter space, the more obscure facts a model answers correctly, the larger its parameter count must be. Li first fitted a highly accurate curve using 89 open-source models with known parameter counts, then mapped the scores of closed-source models onto this curve to estimate their sizes (a simplified, illustrative version of this mapping is sketched below). The paper evaluated 92 closed-source models; the numbers are not exact but provide meaningful ranges: a model estimated at 9.7T may actually fall anywhere between 3T and 29T. The relative rankings and scale, however, remain highly informative.

GPT-5.5 is estimated at ~9.7T, leading by a wide margin and nearly doubling second-place Claude Opus 4.6 (~5.3T). The second tier (3T–4T) is densely packed: GPT-5 at ~4.1T, Claude Opus 4.7 at ~4.0T, o1 at ~3.5T, Grok-4 at ~3.2T, and o3 at ~3.0T; the flagship models from OpenAI, Anthropic, and xAI all fall within a 1.4x range of one another. The third tier (1T–2T) covers mid-tier flagships: GPT-4.1 at ~2.2T, Claude Sonnet 4.6 at ~1.7T, and Gemini 2.5 Pro at ~1.2T. At the bottom end, smaller models range from GPT-4o at ~720B down to Claude Haiku 4.5 at ~65B.

The base GPT-5 model is estimated at ~4.1T, but the subsequent .x versions (5.1 through 5.4) show reduced fact-storage capacity of only 1.0T–1.5T, until GPT-5.5 jumps to ~9.7T, a genuine breakthrough.

The paper also includes a clever validation method: checking whether two models make the same mistakes on the obscure questions (see the second sketch below). GPT-5’s incremental .x upgrades each produced distinct error patterns (similarity scores all below 0.08), indicating that each version was trained from scratch rather than fine-tuned from prior weights. Claude Opus’s parameter count grew from 1.4T in version 4 to 4.0T in version 4.7, but not through continuous fine-tuning: errors between versions 4 and 4.1 were nearly identical (confirming fine-tuning from the same base), while errors between versions 4.6 and 4.7 showed zero overlap (similarity dropped to 0), indicating that the latest flagship was also trained from scratch.

For MoE (Mixture of Experts) models, total parameters, rather than those activated during inference, best predict knowledge capacity. The study also found that models of the same size, whether from this year or two years ago, retain roughly the same amount of obscure factual knowledge: reasoning ability can improve over time, but factual storage capacity cannot be compressed. The evaluation toolkit and all data have been open-sourced. (Source: BlockBeats)
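As a rough illustration of how such a fitted curve can be inverted, the sketch below fits a log-linear relationship between probe accuracy and known parameter counts, then maps a closed model's accuracy back to an estimated size. The data points, the log-linear form, and the estimate_params_billions helper are hypothetical assumptions for illustration; the paper's open-sourced toolkit may implement the fit differently.

```python
# Illustrative sketch only; the calibration values below are made up.
import numpy as np

# Probe accuracy vs. known parameter count (in billions) for open-source
# calibration models (hypothetical numbers, not the paper's data).
open_params_b = np.array([7, 13, 34, 70, 180, 400, 700, 1500])
open_accuracy = np.array([0.12, 0.17, 0.24, 0.31, 0.39, 0.47, 0.53, 0.61])

# Fit accuracy as a linear function of log10(parameters): acc ~ a*log10(P) + b.
a, b = np.polyfit(np.log10(open_params_b), open_accuracy, deg=1)

def estimate_params_billions(accuracy: float) -> float:
    """Invert the fitted curve to map a closed model's probe accuracy to a size estimate."""
    return 10 ** ((accuracy - b) / a)

# Example: a closed-source model that answers 65% of the probes correctly.
print(f"Estimated size: {estimate_params_billions(0.65):,.0f}B parameters")
```

The error-overlap validation can be sketched just as simply. The snippet below uses Jaccard overlap between the sets of questions two models answered incorrectly; the paper's exact similarity metric is not described here, so this measure and the question IDs are assumptions for illustration.

```python
# Illustrative stand-in for the error-overlap check: high overlap suggests
# fine-tuning from a shared base, near-zero overlap suggests training from scratch.
def error_similarity(wrong_a: set[str], wrong_b: set[str]) -> float:
    """Jaccard overlap between the sets of questions two models got wrong."""
    if not wrong_a and not wrong_b:
        return 1.0
    return len(wrong_a & wrong_b) / len(wrong_a | wrong_b)

# Hypothetical error sets for three model versions.
errors_v4_0 = {"q017", "q233", "q402", "q771", "q908"}
errors_v4_1 = {"q017", "q233", "q402", "q771", "q912"}
errors_v4_7 = {"q055", "q390", "q644"}

print(error_similarity(errors_v4_0, errors_v4_1))  # high overlap (~0.67)
print(error_similarity(errors_v4_0, errors_v4_7))  # zero overlap
```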

Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.