VELC-Bench: Verification on Long Context Benchmark

Şevval Alper
updated on Feb 22, 2026

This benchmark tests the model’s ability to locate a specific metric in context, compare its value to a claim, and confirm or reject the claim. It measures fine-grained value matching under long-context conditions: the model must both retrieve the value and perform a precise comparison.

Results

The models are tested at the following context window sizes:

  • openai/gpt-5.5: 1,000,000 tokens
  • google/gemini-3.1-pro-preview: 1,000,000 tokens
  • google/gemini-3.5-flash: 1,000,000 tokens
  • anthropic/claude-sonnet-4.6: 1,000,000 tokens
  • qwen/qwen3.6-plus: 1,000,000 tokens
  • moonshotai/kimi-k2.6: 200,000 tokens
  • z-ai/glm-5.1: 200,000 tokens
  • minimax/minimax-m2.7: 150,000 tokens
  • openai/gpt-5.4-mini: 250,000 tokens

Question formats

Verify YES (the claim’s value is correct):

Claim: The Revenue for Q1 2026 Adobe (ADBE) is $6.40 billion.
Expected: YES

Verify NO (the claim’s value is wrong):

Claim: The Revenue for Q1 2026 Adobe (ADBE) is $7.92 billion.
Expected: NO

Data source

The claims are built from the same TAKEAWAYS-extracted metrics as direct recall. For each chosen metric:

  • Verify YES items use the actual value from the transcript
  • Verify NO items use a programmatically perturbed value (8–25% off, in either direction, with matching precision and units)
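The perturbation step for Verify NO items can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the function name `perturb`, the uniform sampling of the offset, and the use of `Decimal` to keep the original number of decimal places are all assumptions. Only the numeric part is handled; units such as "$" and "billion" are assumed to be reattached by the caller.

```python
import random
from decimal import Decimal


def perturb(value: str, rng: random.Random,
            low: float = 0.08, high: float = 0.25) -> str:
    """Shift `value` by 8-25% up or down, keeping its decimal places.

    Hypothetical helper: the sampling scheme is an assumption.
    """
    original = Decimal(value)
    # Decimal places in the original string, e.g. "6.40" -> 2.
    places = -original.as_tuple().exponent
    # Relative offset drawn uniformly from [low, high].
    fraction = Decimal(str(rng.uniform(low, high)))
    # Perturb in either direction.
    sign = rng.choice([Decimal(1), Decimal(-1)])
    shifted = original * (Decimal(1) + sign * fraction)
    # Round back to the original precision.
    return str(shifted.quantize(Decimal(1).scaleb(-places)))


rng = random.Random(0)
print(perturb("6.40", rng))
```

The example claim above is consistent with this range: $7.92 billion is about 24% above $6.40 billion.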

Scoring rule

Three-state detection on the model’s response:

  1. If the response contains a NOT MENTIONED phrase (e.g., “not mentioned,” “not discussed”) → predicted = not_mentioned
  2. Else if it contains “yes” → predicted = yes
  3. Else if it contains “no” → predicted = no

Score = 1.0 if predicted == expected, else 0.0.

NOT MENTIONED is checked before YES and NO so that a response such as “not mentioned” is not misread as “no”: the word “not” contains the substring “no.”
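The scoring rule can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's published code: it checks the phrases in the order of the numbered rules (NOT MENTIONED, then “yes”, then “no”), uses case-insensitive substring matching, lists only the two NOT MENTIONED phrases given as examples, and returns a hypothetical `"unknown"` label when nothing matches, a case the rule does not specify.

```python
# Only the two example phrases from the text; the full list is not given.
NOT_MENTIONED_PHRASES = ("not mentioned", "not discussed")


def detect(response: str) -> str:
    """Map a free-text response to a predicted label."""
    text = response.lower()  # assumption: matching is case-insensitive
    # Checked first so "not mentioned" is never read as "no".
    if any(phrase in text for phrase in NOT_MENTIONED_PHRASES):
        return "not_mentioned"
    if "yes" in text:
        return "yes"
    if "no" in text:
        return "no"
    return "unknown"  # assumption: fallthrough label is hypothetical


def score(response: str, expected: str) -> float:
    """Return 1.0 if the predicted label equals the expected one, else 0.0."""
    return 1.0 if detect(response) == expected else 0.0


print(score("Yes, the transcript reports $6.40 billion.", "yes"))
print(score("This metric is not mentioned in the transcript.", "no"))
```

Because verify items only expect “yes” or “no,” a `not_mentioned` prediction always scores 0.0 under this rule.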

Phase-by-phase interpretation

The asymmetry between YES and NO is informative: YES requires positive identification of a value (harder when the target is deeper), while NO requires only spotting a mismatch (easier when recently read).

Phases correspond to depths of 0.1, 0.5, and 0.9 of the context window, showing how accuracy differs across haystack positions.
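To make the phase depths concrete, the sketch below converts each phase into an approximate token offset for a given context window. It assumes that a phase is a fraction of the window measured in tokens from the start; the name `phase_offsets` is hypothetical, and the source does not state how positions are actually computed.

```python
PHASES = (0.1, 0.5, 0.9)


def phase_offsets(context_window_tokens: int) -> dict[float, int]:
    """Approximate token offset of each phase within the context window."""
    return {phase: round(context_window_tokens * phase) for phase in PHASES}


# Two of the tested window sizes listed above.
for model, window in {
    "openai/gpt-5.5": 1_000_000,
    "moonshotai/kimi-k2.6": 200_000,
}.items():
    print(model, phase_offsets(window))
```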

What is a good performance?

Scoring at least 80% on both YES and NO in Phase 2 indicates that the model can both confirm and reject claims across a haystack.

A model that scores very high on NO but low on YES is biased toward rejection. A model that scores very high on YES but low on NO is over-trusting claims.
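One way to read a model's results against these criteria is to compute YES and NO accuracy separately and compare them. The sketch below is an assumption-laden illustration: `summarize` and its output fields are hypothetical names, the input is assumed to be a list of (expected label, score) pairs with at least one item per class, and the signed gap is only a convenience, since the text does not define numeric cutoffs for “very high” or “low.”

```python
def summarize(results: list[tuple[str, float]], bar: float = 0.80) -> dict:
    """Per-class accuracy for (expected, score) pairs, expected in {"yes", "no"}."""
    yes_scores = [s for expected, s in results if expected == "yes"]
    no_scores = [s for expected, s in results if expected == "no"]
    yes_acc = sum(yes_scores) / len(yes_scores)
    no_acc = sum(no_scores) / len(no_scores)
    return {
        "yes_acc": yes_acc,
        "no_acc": no_acc,
        # Both classes must reach the 80% bar.
        "meets_bar": yes_acc >= bar and no_acc >= bar,
        # Positive: leans toward rejection. Negative: leans toward over-trusting.
        "gap": no_acc - yes_acc,
    }


# Hypothetical results, not measured data: 45/50 YES and 30/50 NO correct.
example = (
    [("yes", 1.0)] * 45 + [("yes", 0.0)] * 5
    + [("no", 1.0)] * 30 + [("no", 0.0)] * 20
)
print(summarize(example))
```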

Item count

50 verify_yes + 50 verify_no = 100 verify items.

Cite this research

Şevval Alper and Berk Kalelioğlu (2026) - "VELC-Bench: Verification on Long Context Benchmark". Published online at AIMultiple.com. Retrieved February 22, 2026, from: https://aimultiple.com/ai-context-window [Online Resource]

Alper, Ş., & Kalelioğlu, B. (2026, February 22). VELC-Bench: Verification on Long Context Benchmark. AIMultiple. https://aimultiple.com/ai-context-window

@misc{alper2026,
  author = {Alper, Şevval and Kalelioğlu, Berk},
  title  = {{VELC-Bench: Verification on Long Context Benchmark}},
  year   = {2026},
  month  = feb,
  howpublished    = {\url{https://aimultiple.com/ai-context-window}},
  note   = {AIMultiple. Retrieved February 22, 2026}
}
Şevval Alper
AI Researcher
Şevval is an AIMultiple AI researcher specializing in LLMs, AI agents and quantum technologies.
Technically reviewed by
Berk Kalelioğlu
AI Researcher
Berk is an AI Researcher at AIMultiple, focusing on agentic AI systems and language models.