Into the cybersecurity breach

False risks & prompt injections

Jul 10, 2025

I'm breaking a personal writing rule this week to begin with a caveat. I've long felt that privacy and cybersecurity are poorly understood topics that are not discussed frequently enough, but I didn't feel up to the task—because I didn't understand the topic and could not find much to read about it.1

This is a first, largely bootstrapped, effort at writing about this topic. I hope to return to it more regularly and would appreciate any constructive feedback (which is always appreciated, but especially so here).

Strict Confidentiality and Opinion 512

ABA Model Rule 1.6, which obligates lawyers to keep information relating to the representation of a client confidential, operates as the de facto privacy and cybersecurity rule for lawyers. Although the Model Rules are a more recent invention (a story for another time), the Rule's strict confidentiality regime is rooted in the British common law tradition of attorney-client privilege, which developed in the colonial era and solidified in the early 19th century. To make the rule more supple in the information age, the ABA has since added comments 18 and 19, which respectively obligate the lawyer to take "reasonable precautions" to protect information relating to the representation from unauthorized access by third parties and from falling into the hands of unintended recipients.

To date, Formal Opinion 512, released last July, stands as the ABA's interpretation of the Model Rules applied to "generative" artificial intelligence. Citing to opinions already issued by other state bars (which, unlike the ABA, are empowered to regulate lawyers in their states), the ABA concluded that that the privacy risks of generative AI were significant enough that lawyers should obtain informed consent from clients before using a generative AI tool.

This conclusion frustrated legal AI enthusiasts, who felt that requiring informed consent standard would impede adoption of potentially useful AI tools. As to privacy, many pointed out that whatever risks of leaking confidential information through ordinary responsible use of a tool like ChatGPT were no different than the risks of using other technologies. (For example, a lawyer using Google Chrome to look up information about a client and her case is also exposing potentially sensitive client information to Google.)

This critique begged an important question about Opinion 512 in general: Why is generative AI so different that special regulatory guidance is necessary? Put differently, how does generative AI present a different regulatory challenge from technologies that preceded it, such as email, cloud computing, or internet search?

The (False) Specter of "Self-Learning" AI

When it comes to confidentiality, the ABA would point us to the notion of generative AI's "self-learning" nature. Quoting from Opinion 512 (emphasis mine):

Self-learning GAI tools into which lawyers input information relating to the representation, by their very nature, raise the risk that information relating to one client’s representation may be disclosed improperly, even if the tool is used exclusively by lawyers at the same firm. This can occur when information relating to one client’s representation is input into the tool, then later revealed in response to prompts by lawyers working on other matters, who then share that output with other clients, file it with the court, or otherwise disclose it. In other words, the self-learning GAI tool may disclose information relating to the representation to persons outside the firm who are using the same GAI tool. Similarly, it may disclose information relating to the representation to persons in the firm (1) who either are prohibited from access to said information because of an ethical wall or (2) who could inadvertently use the information from one client to help another client, not understanding that the lawyer is revealing client confidences. Accordingly, because many of today’s self-learning GAI tools are designed so that their output could lead directly or indirectly to the disclosure of information relating to the representation of a client, a client’s informed consent is required prior to inputting information relating to the representation into such a GAI tool.

The sources cited by the ABA for this paragraph are other state bar opinions (I removed the footnotes for presentation). Support for the emphasized portion of the text particularly comes from the Florida Bar, which, in turn, relied on this 2023 article by Tsedal Neeley, a Harvard Business School professor.

In her article, Neeley writes (again, emphasis mine):

As AIs rely on progressively larger datasets, this [lack of transparency] becomes increasingly true. Consider large language models (LLMs) such as OpenAI’s ChatGPT or Microsoft’s Bing. They are trained on massive datasets of books, webpages, and documents scraped from across the internet — OpenAI’s LLM was trained using 175 billion parameters and was built to predict the likelihood that something will occur (a character, word, or string of words, or even an image or tonal shift in your voice) based on either its preceding or surrounding context. The autocorrect feature on your phone is an example of the accuracy — and inaccuracy — of such predictions. But it’s not just the size of the training data: Many AI algorithms are also self-learning; they keep refining their predictive powers as they get more data and user feedback, adding new parameters along the way.

While Neeley is broadly correct that LLMs learn to predict text by training on lots of data, she is specifically wrong (or misguided) about the "self-learning" properties of generative AI.

Broadly speaking, LLMs are trained in two stages: pretraining and finetuning. The pretraining stage is where the "massive datasets" are ingested by a neural network with a fixed number of weighted parameters. Using an algorithm known as a "loss function," the model updates these weights every time it incorrectly predicts the next word in a given text. Over time, the weights adjust until the loss function stabilizes. "Finetuning" is essentially the same process, but optimized for a certain task. Although most people are familiar with generative AI chatbots that are finetuned to generate friendly outputs, many more specialized LLMs are optimized for tasks such as classifying certain types of text.

This is a vastly oversimplified presentation, but the point I want to make is that model training is a discrete process. When an AI company makes a splashy announcement about some new model--e.g., "GPT-4.1," "Sonnet 4," "Gemini 2.5 Pro"--they are announcing the release of a fully-trained model that is trained, in the past tense. Even if it were technically feasible for the model to absorb continuous input from user prompts, it's not clear what incentive the model developers would have to do this: Additional input of variable quality does not necessarily cause models to perform better, and it would raise a host of new product and safety issues on top of the ones they already face.

The LLM provider may have terms of service that allow it to keep your chat transcripts, but the risk in that case is not really much different than the risks associated with other sensitive data you share with large tech companies. To be clear, this is not a trivial risk, but I'm not sure it's a different type of risk than what's come before.

The lack of technical rigor in the opinion is certainly unfortunate, but it's not consequential in its own right. As mentioned, the ABA is merely an advisory body; the conclusion of Opinion 512, by itself, cannot be enforced against a single lawyer in the United States.

Recalling AI Fight Club, my greater concern is that the ABA's lack of attention to technical detail has a polarizing effect. Anyone pushing adoption is already motivated to minimize potential risks, and Opinion 512 gives them a convenient strawman to frame their arguments against. The result is that neither side—whether for lack of sophistication or motivation—is attuned to recognize truly novel security risks.

The (True) Specter of Prompt Injection

This is unfortunate because there is an emerging security risk associated with LLMs known as "prompt injection," where information from a third-party data source gets fed into the "context" of a LLM chat, causing the LLM to take some dangerous action, such as deleting files or releasing confidential information. This is a problem without an obvious technical solution, because the risk grows with the benefit of the LLM application; a tool without the ability to handle third-party information is not a very useful tool.

To really understand what's going on here, we need to cover some basics about how LLM applications and "agents" work.

The Agents Among Us

The big thing in AI these days is "agents," which is really just a suitcase word that encapsulates a set of ideas about "using LLMs to do stuff on your computer." In AI development, the more common, and less loaded, word to use is "tool," which I'm going to use below.

A good example is a web search tool, which is now available at the free tier of major AI products such as ChatGPT and Claude. From a general business standpoint, this is a pretty big deal, and we may really be seeing the end of Google search as the dominant portal for accessing the Internet. The tool is especially nifty because it only triggers the web search when the model doesn't already "know" the answer.

To illustrate this in action, I asked Claude (the free version) who won the 2014 NBA Finals (Go Spurs Go):

Because it was trained on internet data well after 2014, the model already contains the relevant information necessary to return the correct answer.

But if I ask for who won the 2025 NBA Finals, which finished last month, Claude calls up its web search tool...

...and returns the correct answer:

How does this work? In an application like Claude, there are two layers that sometimes get muddled. Although Anthropic makes a popular large language model (Claude), when I go to "claude.ai" (or "chatgpt.com," etc.), I am pulling up a web application that is programmed to interact with the model at the layer below. When I type into the chat, the application sends a request to the model (containing my chat query), which does some linear algebra ("inference") and determines the next step.

One year ago—around the time the ABA released Opinion 512—the "next step" was always a response from the model, whatever it spit out. So if I asked my NBA Finals question, I could probably still get the correct answer for 2014, but if I asked for the outcome of the 2024 Finals, the model wouldn't have helped. Depending on how it was finetuned, it would either say that it didn't have the answer or it would hallucinate one.

But now, the claude.ai application has access to a tool—a function that constructs some web queries and returns results—to answer questions it doesn't know the answer to. When those results are returned, they are loaded into the context of the overall chat.

To understand context, let's follow up on our 2025 Finals example.

The image shows that I asked the model "who lost," without providing any additional information about what I was asking about. Because of the prior discussion about the NBA Finals, the model "knew" what I was referring to and "remembered" that it had already told me the losing team.

The web search tool fits into this architecture. After the tool completes its task, it loads the search results into the context, which the model reads and uses to answer my question. Functionally, this is no different than if I had run a web search myself, downloaded the articles, then uploaded them into context along with my original question. The tool streamlines this process for me.

Machines Can Be Fooled Too

The downside: With a more streamlined process comes an increased risk of prompt injection. When I do a web search myself and upload documents into Claude, I vet the articles and take accountability for only uploading safe information to the LLM. When Claude calls the tool for me, I cannot do that. This is the fundamental concern with prompt injection: Unsupervised tools are influenced by files that instruct a LLM to do something dangerous, and the content of those files get loaded into the LLM's context.

For example, the web search tool might load a web page into context with malicious instructions (which may be hidden) for the LLM to pull down information about my Claude account. Once loaded into context, the malicious webpage (now loaded into context) might also trigger the web search tool to then search popular websites using my username and password.

Anthropic guards against this risk by instructing Claude not to answer questions about my account information:

But consider if the LLM is operating within a law firm's networked file system with access to troves of sensitive data. The file system is organized by a particular matter, and an individual file might contain:

A word document with contact information for the firm's client;
A sub-folder for saving all court filings in the case;
Another sub-folder of all correspondence with opposing counsel.

Categories 2 and 3—documents drafted by external parties—are untrusted data. It's conceivable that a malicious opposing party might introduce malicious instructions (through say, an email, letter, or court document) that are designed to trigger the firm's LLM to perform an unwanted action.

The exact nature of the action might vary. Two common forms of attacks are inducing the LLM to act as a "confused deputy" or to "exfiltrate" data from the system. Both are risks in the case of our hypothetical law firm.

In the confused deputy scenario, the malicious input instructs the LLM with privileged access to the computer's file system to do things it shouldn't, such as deleting all the files in a certain folder. In the data exfiltration scenario, the injection might instruct the LLM to identify documents with sensitive client information and leak the data contained therein.

These are admittedly simplified examples, but the concern here is not trivial. Especially since Anthropic introduced the Model Context Protocol (MCP) last fall—essentially a platform for making it very easy to connect agents to LLMs—there have been numerous high-profile examples of sensitive data being exposed to prompt injection attacks.2

But this risk is also known to developers, security researchers are discussing how to protect against it, and smart companies are adopting best practices. Lawyers should move carefully when using tools that interact with their private systems, and the ABA and state bar associations should be proactive in communicating about these risks and empowering lawyers to make smart choices

Unfortunately, the bar associations have not proven up to the task. Poorly researched circulars like Opinion 512 instead provoke another round of AI Fight Club but do little to advance the profession in the digital era.

Thanks for reading make law! This post is public so feel free to share it.

For this article, I’m heavily in debt to the programmer Simon Willison, his blog, and research papers linked therein.

A good example (c/o Simon Willison) is described here. The post is from a cybersecurity company blog and a bit technical, but the gist is that a malicious user (i.e., someone from the cybersecurity company) submitted a malicious request to a technical support employee. The employee uses a custom integration between Claude a remote server from Jira, the company that provides the ticket-handling software. The malicious prompt input triggers an answer from Claude that contains information from other support tickets.

If you watch the video, it’s quite easy to imagine how any client-facing interface can be a vector for prompt injection attacks.

make law