Humanity's Last Exam (HLE) is a new benchmark designed to evaluate the most advanced AI systems. It is intended to be a very difficult test, requiring expert-level knowledge and reasoning skills across a wide range of subjects.

The exam was created through a collaboration between the Center for AI Safety (CAIS) and Scale AI, and it is intended to address the limitations of existing AI benchmarks, many of which have become too easy for the most advanced AI models.
#1: What Is Humanity's Last Exam?
The published paper for Humanity's Last Exam (HLE) describes it as a global collaborative effort, with questions from nearly 1,000 subject-expert contributors affiliated with over 500 institutions across 50 countries, most of them professors, researchers, and graduate degree holders.
The HLE consists of 3,000 questions across 100 subjects, including mathematics, humanities, and the natural sciences. The questions are designed to be very challenging, and they require AI systems to demonstrate a deep understanding of the subject matter.
The HLE has now been publicly released and is being used to evaluate the most advanced AI systems, with the aim of identifying the strengths and weaknesses of these systems as they continue to improve.
Here are some of the key features of the HLE:
It is very difficult. The questions on the HLE are designed to be challenging even for experts in the field.
It covers a wide range of subjects. The HLE includes questions from mathematics, humanities, and the natural sciences.
It is publicly available. The question set has been released, and results for frontier models are already being published (see the scores below).
The HLE is an important new tool for evaluating AI systems: its results help pinpoint where models fall short of expert-level performance and where further development effort is needed.
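For readers who want to look at the questions themselves, the public portion of the exam is distributed as a standard dataset. The sketch below shows one way to load and summarise it with the Hugging Face `datasets` library; the dataset ID `cais/hle`, the split name, and the field names used here are assumptions based on common conventions, so check the official HLE project page before relying on them.

```python
# Minimal sketch: loading and inspecting the public HLE question set.
# Assumptions to verify against the official release: the dataset is hosted
# on Hugging Face under the ID "cais/hle", exposes a "test" split, and each
# record carries "question", "answer_type", and "category" fields.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")  # assumed ID and split name
print(f"Total questions: {len(dataset)}")

# Count how many questions fall under each broad subject category.
category_counts = Counter(example["category"] for example in dataset)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")

# Peek at one record to see the question format and expected answer type.
sample = dataset[0]
print(sample["question"][:200])
print("Answer type:", sample["answer_type"])
```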
#2: HLE Scoring Explained
The Humanity's Last Exam (HLE) scoring system combines two headline metrics to assess AI performance: accuracy and calibration error.
At its core, it tracks correct and incorrect answers. Questions are either multiple-choice or call for a short, exact-match answer, so each response can be graded automatically against a reference answer rather than relying on subjective human marking. Accuracy is simply the percentage of questions a model answers correctly, and it is the headline figure on the public leaderboard.
Beyond simple accuracy, the HLE aims to measure how well a model knows what it knows. The second metric, calibration error, compares the confidence a model expresses in its answers with how often those answers are actually correct. Together, the two numbers are intended to give a robust and fair picture of an AI's capabilities, and of its awareness of its own limits, across a broad range of cognitive skills.
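As a concrete illustration, a minimal grading loop for the accuracy metric could look like the sketch below. The record structure and the simple string comparison are simplifying assumptions; the official evaluation normalises answers more carefully and uses automated judging for free-form responses, so treat this as an outline rather than the benchmark's actual harness.

```python
# Sketch of HLE-style accuracy scoring over a list of graded questions.
# The record format and the exact-match comparison are assumptions made
# for illustration, not the official HLE grading code.
from dataclasses import dataclass


@dataclass
class GradedQuestion:
    question_id: str
    reference_answer: str
    model_answer: str


def normalise(answer: str) -> str:
    """Crude normalisation: lowercase and strip surrounding whitespace."""
    return answer.strip().lower()


def accuracy(results: list[GradedQuestion]) -> float:
    """Fraction of questions where the model's answer matches the reference."""
    if not results:
        return 0.0
    correct = sum(
        normalise(r.model_answer) == normalise(r.reference_answer) for r in results
    )
    return correct / len(results)


# Example: two questions answered, one correctly, gives 50% accuracy.
graded = [
    GradedQuestion("q1", "42", "42"),
    GradedQuestion("q2", "Euler", "Gauss"),
]
print(f"Accuracy: {accuracy(graded):.1%}")
```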
Frontier LLMs exhibit low accuracy on the Humanity's Last Exam, revealing a substantial gap between their abilities and expert-level academic performance, even on closed-ended questions. This highlights the need for significant improvement in their understanding and knowledge across diverse subjects.
To measure calibration, models are asked to provide both an answer and their confidence in that answer, from 0% to 100%. Models often demonstrate poor calibration, expressing undue confidence in incorrect answers, which points to a tendency toward hallucination or confabulation.
Improving both accuracy and calibration is crucial for bridging the performance gap and building trust in LLM outputs, particularly in scenarios that demand reliable, well-reasoned responses. A well-calibrated model's stated confidence should track its actual accuracy; until it does, those confidence figures remain misleading and undermine trustworthiness.
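The "Calibration Error (%)" column in the scores below measures exactly this gap between stated confidence and actual accuracy. A common way to compute such a figure is to bin answers by the model's reported confidence and compare the average confidence in each bin with the fraction of answers in that bin that were actually correct, as in the sketch below. The bin count and the averaging scheme here are illustrative choices and may differ from the official HLE settings.

```python
# Sketch of a binned calibration error: stated confidence vs. actual accuracy.
# Bin count and weighting are illustrative, not necessarily HLE's settings.
def calibration_error(confidences: list[float], correct: list[bool], bins: int = 10) -> float:
    """Average |confidence - accuracy| over confidence bins, weighted by bin size.

    confidences: model-stated confidence per question, scaled to [0, 1].
    correct:     whether each answer was actually right.
    """
    assert len(confidences) == len(correct)
    total = len(confidences)
    error = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        in_bin = [
            i for i, c in enumerate(confidences)
            if lo <= c < hi or (b == bins - 1 and c == 1.0)
        ]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        avg_acc = sum(correct[i] for i in in_bin) / len(in_bin)
        error += (len(in_bin) / total) * abs(avg_conf - avg_acc)
    return error


# Toy example: a model that claims 90% confidence but is right only half
# the time shows a calibration error of roughly 40%.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, False, True, False]
print(f"Calibration error: {calibration_error(confidences, correct):.1%}")
```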
#3: Current Published Scores (6 February 2025)
| Model | Accuracy (%) ↑ | Calibration Error (%) ↓ |
| --- | --- | --- |
| GPT-4o | 3.3 | 92.5 |
| Grok-2 | 3.8 | 93.2 |
| Claude 3.5 Sonnet | 4.3 | 88.9 |
| Gemini Thinking | 7.7 | 91.2 |
| OpenAI o1 | 9.1 | 93.4 |
| DeepSeek-R1* | 9.4 | 81.8 |
| OpenAI o3-mini (medium)* | 10.5 | 92.0 |
| OpenAI o3-mini (high)* | 13.0 | 93.2 |
*Model is not multimodal; it was evaluated on the text-only subset of questions.
**OpenAI's new Deep Research tool has a reported accuracy of 26.6%.
#4: AI Feature Sets Are Just As Important
HLE helps you gauge which AI is likely to be most accurate. However, successful AI deployment in the workplace also requires planned implementation and the right features.
Current AI models are rapidly transforming the workplace by offering a suite of powerful features designed to boost productivity and streamline workflows. Multimodal capabilities, like those seen in Google's Gemini, allow interaction beyond text, incorporating images, audio, and video for richer communication and analysis. Integration with existing workplace tools is also key: Google's NotebookLM, for example, connects to Workspace, enabling AI-powered summarisation and insights directly within documents and spreadsheets.
Conversational AI interfaces, such as ChatGPT, Claude, and Microsoft Copilot, provide intuitive ways to access information, generate content, and automate tasks through natural language interactions. These models can assist with a range of tasks, from drafting emails and creating presentations to managing projects and conducting deep research.
Beyond these core functionalities, AI models offer specialised features catering to specific workplace needs. Microsoft Copilot aims to enhance productivity within the Microsoft 365 suite, while Perplexity focuses on providing accurate and verifiable information with citations, addressing the crucial need for reliable data in professional settings.
ChatGPT, in addition to content generation, can help structure project outlines, brainstorm ideas, and even provide different perspectives on complex problems. The ongoing development of these AI tools promises to further reshape the workplace, offering more specialised and powerful features tailored to the evolving demands of modern work, including more sophisticated project management and deeper research capabilities.
Looking ahead, AI agents, including Copilot agents and those within Google's AgentSpace, promise a new level of intelligent automation. These autonomous agents can be given goals and independently execute the necessary tasks, potentially revolutionising workflows by handling entire projects or conducting in-depth, multi-source research.
Conclusion
The Humanity's Last Exam provides a valuable benchmark for comparing the accuracy and reasoning abilities of different AI models. However, choosing the right AI for your workplace needs goes beyond just exam scores.
It's crucial to consider the specific features and functionalities that each model offers, and how well those align with your workflows and objectives. Whether it's the multimodal capabilities of Gemini, the seamless integration of NotebookLM with Workspace, or the specialised features of Copilot and Perplexity, assess your needs and choose the AI that best empowers your team and drives productivity. As AI technology continues to evolve, we can expect even more sophisticated and specialised tools to emerge, further transforming the future of work.