s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
Abstract
A new benchmark evaluates large language models on formal proof synthesis for industrial cryptographic assembly routines verified in HOL Light, addressing the gap between competition mathematics and real-world implementation verification.
Neurosymbolic approaches combining Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS that provides fast assembly routines for cryptography, and its correctness is established by formal verification. Formally verifying this library was a significant achievement of the Automated Reasoning Group and involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition holds. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. The benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.
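To make the evaluation protocol concrete, the sketch below shows one way such a proof-check harness could work: the fixed, human-written specification and the model-generated tactic script are piped into a HOL Light session, and the attempt counts as solved only if checking succeeds within the timeout. This is a minimal illustration, not the benchmark's actual harness; the file names, the `hol.sh` launcher path, the timeout value, and the success heuristic are all assumptions.

```python
import subprocess

PROOF_CHECK_TIMEOUT = 600  # seconds; illustrative budget, not the benchmark's real setting


def check_proof(spec_ml: str, candidate_tactics: str) -> bool:
    """Accept a model-generated proof script iff HOL Light checks it in time.

    `spec_ml` is the fixed, human-written part: it loads the s2n-bignum
    machine-code definitions and states the correctness theorem.
    `candidate_tactics` is the LLM-generated proof script for that theorem.
    """
    script = spec_ml + "\n" + candidate_tactics + "\n"
    try:
        # Pipe the combined script into HOL Light's OCaml toplevel.
        # `./hol.sh` assumes a standard HOL Light checkout in the cwd.
        result = subprocess.run(
            ["./hol.sh"],
            input=script,
            capture_output=True,
            text=True,
            timeout=PROOF_CHECK_TIMEOUT,
        )
    except subprocess.TimeoutExpired:
        return False  # proof checking exceeded the fixed budget
    # Simplification: the toplevel can exit 0 even after failed tactics, so
    # a real harness would instead confirm that the named theorem was proved
    # (e.g. by printing it at the end); here we just scan for error output.
    return result.returncode == 0 and "Error" not in result.stdout


if __name__ == "__main__":
    spec = open("bignum_add_spec.ml").read()      # hypothetical spec file
    tactics = open("candidate_proof.ml").read()   # hypothetical LLM output
    print("accepted" if check_proof(spec, tactics) else "rejected")
```

This pass/fail protocol mirrors the abstract's acceptance criterion (machine-checked within a fixed timeout); a production harness would additionally sandbox the checker and verify that the expected theorem name was bound.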