VELC-Bench: Verification on Long Context Benchmark

updated on Jul 7, 2026

VELC-Bench measures a model’s ability to locate a specific metric in context, compare its value to a claim, and confirm or reject the claim. This tests fine-grained value matching under long-context conditions: the model must both retrieve the value and perform a precise comparison.

Results

[Chart: VELC-Bench: Verification on Long Context]

The models are tested at the following context lengths:

anthropic/claude-fable-5: 850,000 tokens
openai/gpt-5.5: 1,000,000 tokens
google/gemini-3.1-pro-preview: 1,000,000 tokens
google/gemini-3.5-flash: 1,000,000 tokens
anthropic/claude-sonnet-4.6: 1,000,000 tokens
qwen/qwen3.6-plus: 1,000,000 tokens
moonshotai/kimi-k2.6: 200,000 tokens
z-ai/glm-5.1: 200,000 tokens
minimax/minimax-m2.7: 150,000 tokens
openai/gpt-5.4-mini: 250,000 tokens

claude-fable-5 scores 90.0% on verify YES and 94.0% on verify NO. The gap matches the asymmetry described below: confirming a value requires finding it, while rejecting one only requires spotting a mismatch.

Question formats

Verify YES (the claim’s value is correct):

Claim: The Revenue for Q1 2026 Adobe (ADBE) is $6.40 billion.
Expected: YES

Verify NO (the claim’s value is wrong):

Claim: The Revenue for Q1 2026 Adobe (ADBE) is $7.92 billion.
Expected: NO

Data source

Items draw on the same TAKEAWAYS-extracted metrics as direct recall. For each chosen metric:

Verify YES items use the actual value from the transcript
Verify NO items use a programmatically perturbed value (8–25% off, in either direction, with matching precision and units)
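
The perturbation step for verify NO items can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the function name `perturb`, the uniform draw between 8% and 25%, and the assumption that each value contains a single plain number without thousands separators are all hypothetical.

```python
import random
import re
from decimal import Decimal, ROUND_HALF_UP


def perturb(value: str, rng: random.Random) -> str:
    """Return `value` with its number shifted 8-25% up or down.

    Hypothetical helper: keeps the original decimal places and the
    surrounding text (currency symbol, unit) unchanged.
    """
    match = re.search(r"\d+(?:\.\d+)?", value)
    if match is None:
        raise ValueError(f"no number found in {value!r}")

    number = Decimal(match.group())
    pct = Decimal(str(rng.uniform(0.08, 0.25)))  # assumed uniform draw
    sign = rng.choice((-1, 1))                   # either direction

    # Same precision as the source value, e.g. 0.01 for "6.40".
    quantum = Decimal(1).scaleb(number.as_tuple().exponent)
    shifted = (number * (1 + sign * pct)).quantize(quantum, rounding=ROUND_HALF_UP)

    return value[: match.start()] + str(shifted) + value[match.end():]


if __name__ == "__main__":
    print(perturb("$6.40 billion", random.Random(0)))
```

Rounding to the original precision can push very small integers outside the 8–25% band, or back onto the true value, so a real generator would presumably re-check the result before using it.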

Scoring rule

Three-state detection on the model’s response:

If the response contains a NOT MENTIONED phrase (e.g., “not mentioned,” “not discussed”) → predicted = not_mentioned
Else if it contains “yes” → predicted = yes
Else if it contains “no” → predicted = no

Score = 1.0 if predicted == expected, else 0.0.

NOT MENTIONED is checked first to prevent “not mentioned” from accidentally matching “no,” which is a substring of “not.”
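
The three-state rule can be sketched in Python as follows. The sketch applies the checks in the order they are listed in the rule above. Several details are assumptions rather than the benchmark's actual code: matching is a case-insensitive substring test, the NOT MENTIONED phrase list contains only the two examples given, and a response that matches nothing yields `None` and scores 0.0.

```python
from typing import Optional

# Assumption: only the two example phrases from the text; the real list may be longer.
NOT_MENTIONED_PHRASES = ("not mentioned", "not discussed")


def detect(response: str) -> Optional[str]:
    """Map a free-text response to 'not_mentioned', 'yes', 'no', or None."""
    text = response.lower()
    # Checked first so that "not mentioned" is never read as "no".
    if any(phrase in text for phrase in NOT_MENTIONED_PHRASES):
        return "not_mentioned"
    if "yes" in text:
        return "yes"
    if "no" in text:
        return "no"
    return None  # assumption: unmatched responses get no prediction


def score(response: str, expected: str) -> float:
    """Score = 1.0 if predicted == expected, else 0.0."""
    return 1.0 if detect(response) == expected.lower() else 0.0


if __name__ == "__main__":
    print(score("YES, the transcript states $6.40 billion.", "YES"))
    print(score("This metric is not mentioned in the transcript.", "NO"))
```

Plain substring matching is permissive: any response containing the letters “no” (for example inside “not”) counts as a NO when neither of the earlier checks fires, which is why the order of the checks matters.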

claude-fable-5 is tested through Claude Code: it receives the 850,000-token haystack as a file and searches it with retrieval tools instead of reading it from its context window, so its scores measure the model together with the Claude Code harness.

Phase-by-phase interpretation

The asymmetry between YES and NO is informative: YES requires positive identification of a value (harder when the target is deeper), while NO requires only spotting a mismatch (easier when recently read).

The phases place the target at 0.1, 0.5, and 0.9 of the context window, to show how accuracy varies with position in the haystack. For a 1,000,000-token window, that is roughly tokens 100,000, 500,000, and 900,000.

What is a good performance?

Phase 2 YES ≥ 80% and NO ≥ 80% indicates the model can both confirm and reject across a haystack.

A model that scores very high on NO but low on YES is biased toward rejection. A model that scores very high on YES but low on NO is over-trusting claims.
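
These reading rules can be expressed as a small helper. Only the 80% bar comes from the text above. The cut-offs for “very high” (90%) and “low” (60%), the function name `interpret`, and the returned labels are hypothetical and would need to be set by whoever applies the rule.

```python
def interpret(
    yes_acc: float,
    no_acc: float,
    *,
    good: float = 0.80,       # threshold stated in the text
    very_high: float = 0.90,  # assumption: the text does not quantify "very high"
    low: float = 0.60,        # assumption: the text does not quantify "low"
) -> str:
    """Label a pair of verify YES / verify NO accuracies (fractions in 0..1)."""
    if yes_acc >= good and no_acc >= good:
        return "confirms and rejects"
    if no_acc >= very_high and yes_acc <= low:
        return "biased toward rejection"
    if yes_acc >= very_high and no_acc <= low:
        return "over-trusting claims"
    return "mixed"


if __name__ == "__main__":
    print(interpret(0.85, 0.82))
    print(interpret(0.40, 0.95))
```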

Item count

50 verify_yes + 50 verify_no = 100 verify items.

Cite this research

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Cem Dilmegani (2026) - "VELC-Bench: Verification on Long Context Benchmark". Published online at AIMultiple.com. Retrieved July 7, 2026, from: https://aimultiple.com/ai-context-window [Online Resource]

Dilmegani, C. (2026, July 7). VELC-Bench: Verification on Long Context Benchmark. AIMultiple. https://aimultiple.com/ai-context-window

@misc{dilmegani2026,
  author = {Dilmegani, Cem},
  title  = {{VELC-Bench: Verification on Long Context Benchmark}},
  year   = {2026},
  month  = jul,
  howpublished    = {\url{https://aimultiple.com/ai-context-window}},
  note   = {AIMultiple. Retrieved July 7, 2026}
}

Cem Dilmegani

Principal Analyst

Cem has been the principal analyst at AIMultiple since 2017. Every month, AIMultiple informs hundreds of thousands of businesses (as per Similarweb), including 55% of the Fortune 500.

Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post, global firms like Deloitte and HPE, NGOs like the World Economic Forum, and supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led the technology strategy and procurement of a telco while reporting to the CEO. He also led the commercial growth of the deep tech company Hypatos, which went from zero to a 7-digit annual recurring revenue and a 9-digit valuation within 2 years. Cem's work at Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.


We follow ethical norms & our process for objectivity. This research does not feature any customers of AIMultiple.
