Shut the back door: Understanding prompt injection and minimizing risk

Join us in returning to NYC on June 5th to collaborate with executive leaders in exploring comprehensive methods for auditing AI models regarding bias, performance, and ethical compliance across diverse organizations. Find out how you can attend here.


New technology means new opportunities… but also new threats. And when the technology is as complex and unfamiliar as generative AI, it can be hard to understand which is which.

Take the discussion around hallucination. In the early days of the AI rush, many people were convinced that hallucination was always an unwanted and potentially harmful behavior, something that needed to be stamped out completely. Then, the conversation changed to encompass the idea that hallucination can be valuable. 

Isa Fulford of OpenAI expresses this well. “We probably don’t want models that never hallucinate, because you can think of it as the model being creative,” she points out. “We just want models that hallucinate in the right context. In some contexts, it is ok to hallucinate (for example, if you’re asking for help with creative writing or new creative ways to address a problem), while in other cases it isn’t.” 

This viewpoint is now the dominant one on hallucination. And, now there is a new concept that is rising to prominence and creating plenty of fear: “Prompt injection.” This is generally defined as when users deliberately misuse or exploit an AI solution to create an unwanted outcome. And unlike most of the conversation about possible bad outcomes from AI, which tend to center on possible negative outcomes to users, this concerns risks to AI providers.

VB Event

The AI Impact Tour: The AI Audit

Join us as we return to NYC on June 5th to engage with top executive leaders, delving into strategies for auditing AI models to ensure fairness, optimal performance, and ethical compliance across diverse organizations. Secure your attendance for this exclusive invite-only event.

Request an invite

I’ll share why I think much of the hype and fear around prompt injection is overblown, but that’s not to say there is no real risk. Prompt injection should serve as a reminder that when it comes to AI, risk cuts both ways. If you want to build LLMs that keep your users, your business and your reputation safe, you need to understand what it is and how to mitigate it.

How prompt injection works

You can think of this as the downside to gen AI’s incredible, game-changing openness and flexibility. When AI agents are well-designed and executed, it really does feel as though they can do anything. It can feel like magic: I just tell it what I want, and it just does it!

The problem, of course, is that responsible companies don’t want to put AI out in the world that truly “does anything.” And unlike traditional software solutions, which tend to have rigid user interfaces, large language models (LLMs) give opportunistic and ill-intentioned users plenty of openings to test its limits.

You don’t have to be an expert hacker to attempt to misuse an AI agent; you can just try different prompts and see how the system responds. Some of the simplest forms of prompt injection are when users attempt to convince the AI to bypass content restrictions or ignore controls. This is called “jailbreaking.” One of the most famous examples of this came back in 2016, when Microsoft released a prototype Twitter bot that quickly “learned” how to spew racist and sexist comments. More recently, Microsoft Bing (now “Microsoft Co-Pilot) was successfully manipulated into giving away confidential data about its construction.

Other threats include data extraction, where users seek to trick the AI into revealing confidential information. Imagine an AI banking support agent that is convinced to give out sensitive customer financial information, or an HR bot that shares employee salary data.

And now that AI is being asked to play an increasingly large role in customer service and sales functions, another challenge is emerging. Users may be able to persuade the AI to give out massive discounts or inappropriate refunds. Recently a dealership bot “sold” a 2024 Chevrolet Tahoe for $1 to one creative and persistent user.

How to protect your organization

Today, there are entire forums where people share tips for evading the guardrails around AI. It’s an arms race of sorts; exploits emerge, are shared online, then are usually shut down quickly by the public LLMs. The challenge of catching up is a lot harder for other bot owners and operators.

There is no way to avoid all risk from AI misuse. Think of prompt injection as a back door built into any AI system that allows user prompts. You can’t secure the door completely, but you can make it much harder to open. Here are the things you should be doing right now to minimize the chances of a bad outcome.

Set the right terms of use to protect yourself

Legal terms obviously won’t keep you safe on their own, but having them in place is still vital. Your terms of use should be clear, comprehensive and relevant to the specific nature of your solution. Don’t skip this! Make sure to force user acceptance.

Limit the data and actions available to the user

The surest solution to minimizing risk is to restrict what is accessible to only that which is necessary. If the agent has access to data or tools, it is at least possible that the user could find a way to trick the system into making them available. This is the principle of least privilege: It has always been a good design principle, but it becomes absolutely vital with AI.

Make use of evaluation frameworks

Frameworks and solutions exist that allow you to test how your LLM system responds to different inputs. It’s important to do this before you make your agent available, but also to continue to track this on an ongoing basis.

These allow you to test for certain vulnerabilities. They essentially simulate prompt injection behavior, allowing you to understand and close any vulnerabilities. The goal is to block the threat… or at least monitor it.

Familiar threats in a new context

These suggestions on how to guard yourself may feel familiar: To many of you with a technology background, the danger presented by prompt injection is reminiscent of that from running apps in a browser. While the context and some of the specifics are unique to AI, the challenge of avoiding exploits and blocking the extraction of code and data are similar.

Yes, LLMs are new and somewhat unfamiliar, but we have the techniques and the practices to guard against this type of threat. We just need to apply them properly in a new context.

Remember: This isn’t just about blocking master hackers. Sometimes it’s just about stopping obvious challenges (many “exploits” are simply users asking for the same thing over and over again!).

It is also important to avoid the trap of blaming prompt injection for any unexpected and unwanted LLM behavior. It’s not always the fault of users. Remember: LLMs are showing the ability to do reasoning and problem solving, and bringing creativity to bear. So when users ask the LLM to accomplish something, the solution is looking at everything available to it (data and tools) to fulfill the request. The results may seem surprising or even problematic, but there is a chance they are coming from your own system.

The bottom line on prompt injection is this: Take it seriously and minimize the risk, but don’t let it hold you back. 

Cai GoGwilt is the co-founder and chief architect of Ironclad.

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including the technical people doing data work, can share data-related insights and innovation.

If you want to read about cutting-edge ideas and up-to-date information, best practices, and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing an article of your own!

Read More From DataDecisionMakers