This test measures the model’s ability to locate a specific metric in context, compare its value to a claim, and confirm or reject that claim. It checks fine-grained value matching under long-context conditions: the model must both retrieve the value and perform a precise comparison.
Results
The models are tested at the following context window sizes:
- openai/gpt-5.5: 1,000,000 tokens
- google/gemini-3.1-pro-preview: 1,000,000 tokens
- google/gemini-3.5-flash: 1,000,000 tokens
- anthropic/claude-sonnet-4.6: 1,000,000 tokens
- qwen/qwen3.6-plus: 1,000,000 tokens
- moonshotai/kimi-k2.6: 200,000 tokens
- z-ai/glm-5.1: 200,000 tokens
- minimax/minimax-m2.7: 150,000 tokens
- openai/gpt-5.4-mini: 250,000 tokens
Question formats
Verify YES (the claim’s value is correct):
Claim: The Revenue for Q1 2026 Adobe (ADBE) is $6.40 billion.
Expected: YES
Verify NO (the claim’s value is wrong):
Claim: The Revenue for Q1 2026 Adobe (ADBE) is $7.92 billion.
Expected: NO
Data source
The items use the same TAKEAWAYS-extracted metrics as direct recall. For each chosen metric:
- Verify YES items use the actual value from the transcript
- Verify NO items use a programmatically perturbed value (8–25% off, in either direction, with matching precision and units)
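The perturbation step can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the function name `perturb_value`, the regular expression, and the choice to raise an error when rounding would reproduce the original value are all assumptions. It shifts the number by a random 8–25% in either direction and re-formats it with the original decimal places, thousands separators, and surrounding units.

```python
import random
import re

# Hypothetical pattern: optional prefix (e.g. "$"), a number, then a suffix (e.g. " billion").
_NUMBER = re.compile(r"(\D*?)(\d[\d,]*(?:\.\d+)?)(.*)", re.DOTALL)


def perturb_value(value: str, rng: random.Random,
                  low: float = 0.08, high: float = 0.25) -> str:
    """Shift the number in `value` by low..high (as a fraction), up or down,
    keeping the original decimal places, separators, and units."""
    match = _NUMBER.fullmatch(value)
    if match is None:
        raise ValueError(f"no number found in {value!r}")
    prefix, number, suffix = match.groups()

    # Preserve the precision and thousands separators of the original.
    places = len(number.split(".")[1]) if "." in number else 0
    spec = f"{',' if ',' in number else ''}.{places}f"

    original = float(number.replace(",", ""))
    shift = rng.uniform(low, high) * rng.choice((-1, 1))
    perturbed = format(original * (1 + shift), spec)

    # Assumption: a value that rounds back to the original is unusable as a NO item.
    if perturbed == format(original, spec):
        raise ValueError(f"cannot perturb {value!r} at this precision")
    return f"{prefix}{perturbed}{suffix}"


rng = random.Random(0)
print(perturb_value("$6.40 billion", rng))
```

Seeding the random generator keeps the generated NO items reproducible across runs.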
Scoring rule
Three-state detection on the model’s response:
- If the response contains a NOT MENTIONED phrase (e.g., “not mentioned,” “not discussed”) → predicted = not_mentioned
- Else if it contains “yes” → predicted = yes
- Else if it contains “no” → predicted = no
Score = 1.0 if predicted == expected, else 0.0.
Detection priority is NOT MENTIONED > NO > YES to prevent “not mentioned” from accidentally matching “no” via the substring “not.”
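A minimal sketch of the detection and scoring logic, assuming case-insensitive substring matching. The phrase list holds only the two examples given here and the function names are hypothetical. The order of the yes/no checks is exposed as a parameter; the default follows the stated priority (NOT MENTIONED, then NO, then YES).

```python
from typing import Optional, Sequence

# Illustrative only: the full NOT MENTIONED phrase list is not given in the text.
NOT_MENTIONED_PHRASES = ("not mentioned", "not discussed")


def detect(response: str, order: Sequence[str] = ("no", "yes")) -> Optional[str]:
    """Map a free-text response to not_mentioned / yes / no (or None if nothing matches)."""
    text = response.lower()
    # Checked first so that the "no" inside "not mentioned" cannot win.
    if any(phrase in text for phrase in NOT_MENTIONED_PHRASES):
        return "not_mentioned"
    for label in order:
        if label in text:
            return label
    return None


def score(response: str, expected: str) -> float:
    """1.0 if the detected label equals the expected one, else 0.0."""
    return 1.0 if detect(response) == expected else 0.0


print(score("Yes, $6.40 billion matches.", "yes"))
print(detect("That figure is not mentioned in the transcript."))
```

Plain substring matching is permissive (for example, “no” also occurs inside “know”), so a stricter variant could match whole words instead; that would be a change from the rule as described.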
Phase-by-phase interpretation
The asymmetry between YES and NO is informative: YES requires positive identification of a value (harder when the target is deeper), while NO requires only spotting a mismatch (easier when recently read).
The phases correspond to 0.1, 0.5, and 0.9 of the context window, so that accuracy can be compared across haystack positions.
What counts as good performance?
Phase 2 YES ≥ 80% and NO ≥ 80% indicates the model can both confirm and reject across a haystack.
A model that scores very high on NO but low on YES is biased toward rejection. A model that scores very high on YES but low on NO is over-trusting claims.
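The reading above can be sketched in code. The 80% bar comes from the text; the `high` and `low` cut-offs for “very high” and “low” are assumptions chosen for illustration, as are the function names and the `(phase, expected, score)` record layout.

```python
from collections import defaultdict


def accuracy_by_phase(results):
    """results: iterable of (phase, expected, score), expected in {"yes", "no"}.
    Returns {(phase, expected): mean score}."""
    totals = defaultdict(lambda: [0.0, 0])
    for phase, expected, item_score in results:
        totals[(phase, expected)][0] += item_score
        totals[(phase, expected)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def interpret(yes_acc, no_acc, bar=0.80, high=0.90, low=0.60):
    """Label a YES/NO accuracy pair. `high` and `low` are assumed thresholds."""
    if yes_acc >= bar and no_acc >= bar:
        return "confirms and rejects reliably"
    if no_acc >= high and yes_acc <= low:
        return "biased toward rejection"
    if yes_acc >= high and no_acc <= low:
        return "over-trusting claims"
    return "mixed"


results = [(0.5, "yes", 1.0), (0.5, "yes", 0.0), (0.5, "no", 1.0)]
acc = accuracy_by_phase(results)
print(interpret(acc[(0.5, "yes")], acc[(0.5, "no")]))
```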
Item count
50 verify_yes + 50 verify_no = 100 verify items.
Cite this research
@misc{alper2026,
author = {Alper, Şevval and Kalelioğlu, Berk},
title = {{VELC-Bench: Verification on Long Context Benchmark}},
year = {2026},
month = feb,
howpublished = {\url{https://aimultiple.com/ai-context-window}},
note = {AIMultiple. Retrieved February 22, 2026}
}