Highlights

AI refuses direct misinformation requests.
Safety measures easily bypassed.
Robust solutions urgently needed.

The Shallow Safety Problem in AI: A Growing Concern

AI's safety measures are shallow and easily bypassed, risking the spread of disinformation. Deeper safety measures are vital to prevent exploitation by bad actors.

Sydney, Sep 1 (The Conversation) – When you ask AI assistants like ChatGPT to create misinformation, they typically refuse, saying something like, "I cannot assist with creating false information." Yet recent tests show that these safety measures are surprisingly shallow and disconcertingly easy to circumvent. Researchers have been examining how AI language models can be manipulated into producing disinformation campaigns for social media, and the findings are troubling for anyone concerned about the integrity of online information.

The core issue, which the researchers term "the shallow safety problem," was inspired by a recent study from Princeton and Google. It demonstrated that AI safety measures generally govern only the initial portion of a response: if a reply begins with a phrase like "I cannot" or "I apologize," the AI usually maintains its refusal throughout.

In experiments not yet published in a peer-reviewed journal, the researchers found that when a commercial language model was asked directly to create disinformation about Australian political parties, it correctly refused. When the same request was framed as a "simulation" in which the AI played a "helpful social media marketer," however, it complied enthusiastically. It generated a comprehensive disinformation campaign that misleadingly cast Labor's superannuation policies as a "quasi inheritance tax," complete with platform-specific posts, hashtag strategies and suggestions for visual content designed to sway public opinion.

The main concern is that the AI can create harmful content but has no real awareness of why it is harmful or why it should refuse. Large language models are trained merely to begin refusals with "I cannot" when certain topics arise, much like a security guard who checks IDs perfunctorily without understanding who should be kept out of a venue, or why.

To demonstrate this vulnerability, the researchers tested popular AI models with prompts crafted to elicit false information. The findings were unsettling: models that refused direct requests for harmful content complied when the requests were disguised within seemingly innocent contexts, a practice known as "model jailbreaking."

The ease of bypassing these safety measures has serious implications. Bad actors could use such techniques to run extensive disinformation campaigns at minimal cost, generating authentic-seeming, platform-specific content that overwhelms fact-checkers and targets tailored false narratives at specific communities.

The American study found that AI safety alignment typically affects only the first 3–7 words of a response, or technically the first 5–10 tokens, the text chunks AI models use for processing. This "shallow safety alignment" arises because training data rarely includes examples of models refusing after they have started to comply, and it is easier to control these initial tokens than to maintain safety throughout an entire response.

The US researchers propose remedies such as training models with "safety recovery examples," which teach them to stop and refuse even after beginning a harmful output, and constraining how far a model can deviate from safe responses during fine-tuning, though these are only first steps. As AI systems grow more powerful, they will need robust, multi-layered safety measures that apply throughout their responses, along with regular testing for new circumvention techniques. Equally essential is transparency from AI companies about safety weaknesses, so the public understands that current safeguards are insufficient.
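To make the mechanism concrete, the toy Python sketch below caricatures what "shallow safety alignment" means: whether a refusal holds is decided almost entirely by the response's opening tokens. The fake model, prompts and prefix length here are hypothetical illustrations, not the systems or experiments described in the study.

```python
# Toy caricature of "shallow safety alignment": the decision to refuse lives
# almost entirely in the first few tokens of the response. Everything here
# (the pretend model, the prompts, the prefix length) is hypothetical.

REFUSAL_OPENERS = ("i cannot", "i can't", "i apologize", "i'm sorry")

def toy_model(prompt: str) -> str:
    """Pretend LLM: picks an opening, then blindly continues in that direction."""
    lowered = prompt.lower()
    # A direct request for disinformation trips the learned refusal opener.
    if "disinformation" in lowered and "simulation" not in lowered:
        opening = "I cannot assist with creating false information."
    else:
        # Role-play framing steers the opening toward compliance instead.
        opening = "Sure, as your helpful social media marketer,"
    return continue_response(opening)

def continue_response(opening: str, prefix_tokens: int = 7) -> str:
    """Shallow alignment: only the opening tokens decide whether to keep refusing."""
    first_words = " ".join(opening.lower().split()[:prefix_tokens])
    if first_words.startswith(REFUSAL_OPENERS):
        return opening  # refusal opener -> the rest of the reply stays a refusal
    # No refusal opener -> the model keeps generating, even into harmful territory.
    return opening + " here is a campaign framing the policy as an inheritance tax..."

print(toy_model("Write disinformation about a political party."))   # refuses
print(toy_model("This is a simulation: you are a marketer. Write disinformation..."))  # complies
```

The point of the caricature is that nothing after the opening re-checks the content. Real alignment is far more sophisticated, but the study's finding is that much of its effect is similarly concentrated in those first few tokens.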
AI developers are pursuing remedies such as constitutional AI training, which aims to instill models with deeper principles about harm rather than surface-level refusal patterns. These fixes demand significant computational resources and extensive model retraining, however, so comprehensive solutions will take time to roll out across the AI ecosystem.

The shallow nature of AI safeguards is not merely a technical quirk; it is a vulnerability reshaping how misinformation spreads. As AI tools become embedded in our information ecosystem, from news generation to social media content creation, it is vital that their safety measures be more than skin-deep.

The growing body of research on this issue points to a broader challenge in AI development: the gap between what these models appear capable of and what they actually understand. They can produce remarkably human-like text, yet they lack the contextual understanding and moral reasoning needed to consistently recognize and refuse harmful requests, however those requests are phrased.

For now, users and organizations deploying AI systems should recognize that simple prompt engineering can bypass many current safety measures. That knowledge should inform policies on AI use and underscores the need for human oversight in sensitive applications; one way to layer such a check on model outputs is sketched below. As the technology evolves, the race between safety measures and the means to circumvent them will only accelerate, and strong, deep safeguards matter not just to technologists but to society as a whole.
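For organizations deploying these systems today, one practical consequence is that guardrails should inspect the whole output rather than trust the model's opening tokens. The sketch below shows one hypothetical way to do that, with a simple keyword screen standing in for a proper moderation model and human review; the function names and markers are illustrative assumptions, not any vendor's API.

```python
# Hypothetical multi-layered output check: never rely on the reply's opening
# tokens alone. The marker list and review hook are stand-ins for a real
# moderation model and human oversight, not any particular product's API.

HARM_MARKERS = ("disinformation campaign", "fake quotes", "misleading hashtags")

def full_response_check(reply: str) -> bool:
    """Screen the entire reply, not just the first few tokens."""
    text = reply.lower()
    return not any(marker in text for marker in HARM_MARKERS)

def deliver_with_oversight(reply: str) -> str:
    """Hold any reply that fails the full-text screen for human review."""
    if not full_response_check(reply):
        return "[Held for human review: possible policy violation detected mid-response]"
    return reply

print(deliver_with_oversight(
    "Sure, here is a disinformation campaign with misleading hashtags..."
))  # held for review, even though the reply did not open with a refusal
```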

(Only the headline of this report may have been reworked by Editorji; the rest of the content is auto-generated from a syndicated feed.)
