Highlights

GPT-5 scores high. Benchmarks can mislead. Metrology crucial.

OpenAI's GPT-5: Reckoning with AI's Real-World Impact

OpenAI's GPT-5 excels on benchmarks, but gauging its real-world impact requires metrology. Evaluations often miss AI's societal effects; better assessment systems are vital.

Melbourne, August 25 (The Conversation) – OpenAI launched GPT-5 this month, touting it as a major advance over its predecessors and "much smarter across the board". The claim was backed by high scores on a range of benchmarks covering areas such as software coding, mathematics, and healthcare.

However, while these benchmark tests have become the gold standard for AI evaluation, they often fail to reveal how these systems will perform in, and affect, the real world.

AI researchers and metrologists – specialists in the science of measurement – have proposed a new approach to better assess AI models.

Metrology's Role

Metrology is crucial both for ensuring that AI systems remain reliable as they grow and for evaluating their broader economic, cultural, and societal effects.

Measuring Safety

Metrology gives us confidence that our tools, products, and services are dependable. In healthcare, AI holds the potential to refine diagnostics, personalize medicine, assist in disease prevention, and manage administrative duties.

Realizing these promises requires verifying that health AI is safe and effective, and finding reliable ways to measure both.

Unlike drugs and medical devices, which undergo rigorous safety and efficacy evaluations, AI has no comparably robust evaluation systems in domains such as healthcare, education, law enforcement, and biometrics.

Benchmark Limitations

Presently, AI system evaluations rely largely on benchmarks that test output accuracy and relevance against human expertise.

Though numerous benchmarks span various knowledge domains, their scores give scant insight into how these models behave in real-world settings. A meaningful measure requires assessing the system in the environment where it is deployed.

Benchmarks have become commercially significant for AI developers, who use them to showcase product performance and attract investment. For instance, Cognition AI raised USD 175 million at a valuation of USD 2 billion after posting notable results on a software benchmark last year.

However, benchmarks can be gamed. Meta tailored its Llama-4 model to boost its ranking on a chatbot-ranking site, and OpenAI had access to the FrontierMath benchmark dataset before testing its o3 model, raising concerns about the legitimacy of the results.

This scenario echoes Goodhart’s Law: "When a measure becomes a target, it ceases to be a good measure." Experts caution that overemphasizing metrics can lead to their manipulation and redirect focus from long-term goals.

Beyond Benchmarks

If benchmarks aren't enough, what is? In healthcare, early assessments of language models used medical licensing exams, which gauge doctors' competence and safety before they practice. Yet high scores on these exams have been criticized for failing to capture the complexities of real clinical practice.

In response, more comprehensive evaluation frameworks such as MedHELM have emerged. MedHELM comprises 35 benchmarks across five categories of clinical tasks, assessing models in areas including decision-making, communication, and research.

While MedHELM aims to bridge this gap, it gives limited attention to human-AI interaction and to wider impacts, such as cultural and economic effects.

Establishing a broader evaluation ecosystem, incorporating academic, industrial, and civil society insights, is vital for crafting reliable AI evaluation methods.

Progress is underway in measuring AI's real-world impact, including through red-teaming and field testing in operational contexts. Refining and standardizing these methods will be pivotal to making AI measurements reliable in practice.

If AI fulfills even a fraction of its transformational promise, a science of measurement will be critical to protecting the interests of all of society, not just those of the tech elite. (The Conversation)

(Only the headline of this report may have been reworked by Editorji; the rest of the content is auto-generated from a syndicated feed.)

Editorji Technologies Pvt. Ltd. © 2022 All Rights Reserved.