Another day, another preprint paper shocked that itâs trivial to make a chatbot spew out undesirable and horrible content. [arXiv] How do you break LLM security with âprompt injectionâ?âŠ
The chatbot âsecurityâ model is fundamentally stupid:
Build a great big pile of all the good information in the world, and all the toxic waste too.
Use it to train a token generator, which only understands word fragment frequencies and not good or bad.
Put a filter on the input of the token generator to try to block questions asking for toxic waste.
Fail to block the toxic waste. What did you expect to happen, youâre trying to do security by filtering on an input that the âattackerâ can twiddle however they feel like.
Output filters work similarly, and fail similarly.
This new preprint is just another gullible blog post on arXiv and not remarkable in itself. But this one was picked up by an equally gullible newspaper. âMost AI chatbots easily tricked into giving dangerous responses,â says the Guardian. [Guardian, archive]
The Guardianâs framing buys into the LLM vendorsâ bad excuses. âTrickedâ implies the LLM can tell good input and was fooled into taking bad input â which isnât true at all. It has no idea what any of this input means.
The âguard railsâ on LLM output barely work and need to be updated all the time whenever someone with too much time on their hands comes up with a new workaround. Itâs a fundamentally insecure system.
Itâs just a section. Thereâs more of the article.
Like this:
Another day, another preprint paper shocked that itâs trivial to make a chatbot spew out undesirable and horrible content. [arXiv]
How do you break LLM security with âprompt injectionâ? Just ask it! Whatever you ask the bot is added to the botâs initial prompt and fed to the bot. Itâs all âprompt injection.â
An LLM is a lossy compressor for text. The companies train LLMs on the whole internet in all its glory, plus whatever other text they can scrape up. Itâs going to include bad ideas, dangerous ideas, and toxic waste â because the companies training the bots put all of that in, completely indiscriminately. And itâll happily spit it back out again.
There are âguard rails.â They donât work.
One injection that keeps working is fan fiction â you tell the bot a story, or tell it to make up a story. You could tell the Grok-2 image bot you were a professional conducting âmedical or crime scene analysisâ and get it to generate a picture of Mickey Mouse with a gun surrounded by dead children.
Another recent prompt injection wraps the attack in XML code. All the LLMs that HiddenLayer tested can read the encoded attack just fine â but the filters canât. [HiddenLayer]
Iâm reluctant to dignify LLMs with a term like âprompt injection,â because that implies itâs something unusual and not just how LLMs work. Every prompt is just input. âPrompt injectionâ is implicit â obviously implicit â in the way the chatbots work.
The term âprompt injectionâ was coined by Simon WIllison just after ChatGPT came out in 2022. Simonâs very pro-LLM, though he knows precisely how they work, and even he says âI donât know how to solve prompt injection.â [blog]
Well, I donât think that last part was useful, but I do think the previous part was useful as a way to focus conversation. Many people donât read the article, and I thought that was the most relevant section.
Good grief. At least say âI thought this part was particularly interestingâ or âThis is the crucial bitâ or something in that vein. Otherwise, youâre just being odd and then blaming other people for reacting to your being odd.
why did you post literally just the text from the article
Itâs just a section. Thereâs more of the article.
Like this:
Yes, I know, I wrote it. Why do you consider this useful to post here?
Well, I donât think that last part was useful, but I do think the previous part was useful as a way to focus conversation. Many people donât read the article, and I thought that was the most relevant section.
Good grief. At least say âI thought this part was particularly interestingâ or âThis is the crucial bitâ or something in that vein. Otherwise, youâre just being odd and then blaming other people for reacting to your being odd.
Actually Iâm finding this quite useful. Do you mind posting more of the article? I canât open links on my phone for some reason
Actually this comm seems really messed up, so Iâmma just block it and move on. Sorry for ruffling your feathers, guv.
and not just post it, but posted preserving links - wtf
Thatâs typically how quoting works, yes. Do you strip links out when you quote articles?