Following up from last week, below is the second part of a paper I am submitting to an AI & Access to Justice workshop. I appreciate everyone who offered feedback last week and again invite you to reach out!
This is admittedly a more technical post than I like to publish, but I hope it makes sense even to non-technical readers. I'll be following up with a lighter topic next week, I think.
In my last post, I discussed how the availability of public data from Stack Overflow, a question-and-answer website for programmers, may have boosted the early performance of large language models (LLMs) such as ChatGPT at programming tasks. This early success, in turn, accelerated interest in optimizing LLMs for programming. Although there is no public equivalent of Stack Overflow for lawyers, many lawyers share practice knowledge informally and in writing through listservs, which are uniquely popular among solo, small-firm, and legal aid attorneys.
I suggested that, much as with Stack Overflow, the data on these listservs could be used to develop language model applications specifically for these attorneys, who are underserved in a legal tech market where the largest financial returns come from selling to in-house legal departments and large firms. In this post, I discuss how this might work in more technical detail.
Potential Uses of Listserv Data
In the age of algorithmic social media, it's surprising that people use listservs at all. But they persist, I believe, because email is very easy to send. If you are already in your email for work and you have a work-related question, you type out the question and other professionals in your area may give you an answer.
The downside is for the recipients, who suffer a minor bout of distraction every time a message comes across the list. In professional settings, however, people remain on these listservs because they see a value in being part of this "knowledge network," in the same way a person might enjoy having an office colleague they can bounce ideas off of, even if the interruptions are occasionally distracting.
In practice, this means that most listserv users are passively consuming the information, which is very hard to retrieve later on. Although some listservs archive posts in separate online discussion fora, the lack of easy search often means that users repeat a previously asked question. Moreover, many people on the listserv who are well-equipped to answer a question may not do so, either because they are busy or out of the office, or because the subject line was not formulated to catch their attention. In my own experience as a practicing lawyer, I would often delete listserv emails that piled up while I was on vacation or busy with a certain project. While this was good for my sanity, it's quite possible that I deleted content that would have been professionally useful.
In short, active legal listservs constitute a useful knowledge base for lawyers, but the potential of the "knowledge" contained within them is not being maximized. If we could improve our ability to retrieve information from the listserv, it would enhance the value of the community for its users. Language models may help us with that.
Encoders and Decoders
To think about how we might get there, we need to make some technical distinctions about language models. "Generative" AI platforms such as ChatGPT, Claude, etc. are known as "decoder-only autoregressive" models. To make sense of that jargon, think about the context of one of the major breakthroughs in natural language processing for real-world tasks: translation. When moving from one language to another, it is necessary but not sufficient to have an accurate translation of each word in the original language. The new sentence also needs to be arranged in a way that makes sense in the new language, which may have very different rules about syntax. To build an effective machine translator, the program needs to be mindful of words and syntax, just like a human translator.
To make this work, researchers use an "encoder-decoder" architecture. In the encoding step, the model learns meaning by being trained on entire blocks of text to identify inherent characteristics of the text, such as parts of speech (noun, verb, etc.) or certain types of "named entities." Once trained, the encoder can scan a new block of text and effectively store meta-information about its meaning, which is then carried over into the new language.
Once the text is "encoded" in this way, a different model actually creates the text in the new language. These are the autoregressive models, which are trained to predict the next word (or, more precisely, the next "token"), and only the next word. This is why autoregressive models hallucinate, churning out text that sounds like it should be correct but is in fact wrong or made up.
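To make the "next word" idea concrete, here is a minimal sketch of what a decoder-only model actually computes: a probability distribution over the next token, and nothing more. It assumes the open-source Hugging Face transformers library and the small GPT-2 checkpoint, which are stand-ins of my choosing rather than any model discussed here.

```python
# Minimal sketch: a decoder-only model predicts only the next token.
# Assumes the Hugging Face "transformers" library and the small GPT-2 checkpoint.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The tenant may withhold rent if the landlord"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]        # scores for the *next* token only
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)             # five most likely continuations
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.3f}")
```

Everything the model "knows" is channeled through that single distribution; whether the most probable continuation is actually true is not part of the objective, which is the root of hallucination.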
Although decoder models can be (and are being) improved to reduce hallucinations, they are designed to be probabilistic text generators. For tasks where accuracy is paramount and context matters, we need other classification tools. An open-source encoder model known as "BERT" (Bidirectional Encoder Representations from Transformers) has been widely adopted and refined in specific domains for classification tasks. For law, a model known as "LEGAL-BERT" shows great promise at language classification tasks and is being used today in legal AI applications for tasks like document search.
Specifically, the BERT family of models is great for retrieval-augmented generation (RAG), a way of providing factual context (in the form of, e.g., a document database) to improve the output of a large language model. Rather than simply typing a query and receiving a generated response, a RAG system uses a "retriever" to search an external memory store to gather context for the best answer, in the same way a translator scans a passage in the original language to understand its meaning apart from any individual word.
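As an illustration only, the retrieval step of a RAG system can be sketched with an off-the-shelf sentence-embedding encoder: the documents and the query are turned into vectors, and the closest documents are handed to the generator as context. The model name and the toy "archive" below are placeholders I am assuming, not anything drawn from a real legal product.

```python
# Illustrative retrieval step of a RAG pipeline.
# Assumes the "sentence-transformers" library; the model name and the toy
# archive are placeholders, not a real legal data source.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # a small BERT-style encoder

archive = [
    "Reply re: sealing juvenile records after age 18 in our state...",
    "Thread on fee waivers for indigent clients in eviction cases...",
    "Discussion of service by publication when the defendant cannot be located...",
]
doc_vectors = encoder.encode(archive, convert_to_tensor=True)

query = "Can my client get court fees waived in an eviction case?"
query_vector = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vector, doc_vectors)[0]  # similarity to each document
best = scores.argmax().item()
print(archive[best])   # this passage would be passed to the generator as context
```

A domain-tuned encoder (a LEGAL-BERT variant, say) would slot into the same role; the better the encoder understands legal language, the better the retrieved context and, in turn, the generated answer.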
Using BERT on Stack Overflow Data
Because BERT is open-source, it can be used to build "domain models" that are trained on more specialized data. A great source of such data is Stack Overflow, and indeed many researchers have taken a crack at using Stack Overflow to create better classification tools for programming. In an interesting recent paper, Skill over Scale: The Case for Medium, Domain-Specific Models for SE, the authors describe how they trained an encoder model, "SOBert," on Stack Overflow data at the document level (i.e., an entire Stack Overflow post) to perform four different classification tasks: i) recognition of software-related named entities (e.g., file type); ii) quality tagging (which makes use of Stack Overflow's post-scoring system); iii) closed-question prediction (useful for moderating); and iv) obsolete post detection.
The authors compare their model to "off-the-shelf" BERT as well as another model, BERTOverflow, which was trained at the sentence level instead of the document level. And they find meaningful improvements in the "F1 score" on all four tasks, leading to the conclusion that:
[s]mall, domain-specific language models built by combining SE insights with ML best-practices can yield superior results to generalist LLMs and BERT models. We trained SOBertBase and SOBertLarge on SO data with hyper-parameter configurations tailored towards the data characteristics of SO posts. With 125M and 762M parameters respectively, these models were trained with a modest budget of $374 and $1600. Our models consistently outperformed baseline models across four software engineering-related tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion detection. We introduce obsoletion detection as a new benchmark task. We compare our models with a range of other BERT models and with LLMs such as GPT-3.5, GPT-4, CodeLlama, and StackLlama. Despite their comparatively smaller size, the SOBert models consistently outperformed all other models, emphasizing the importance of task-specific model training. Our study demonstrates that for certain tasks, developing tailored, domain-specific models can be a powerful and cost-effective alternative to relying solely on large, general-purpose language models.
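For readers curious what "training on a classification task" looks like in practice, the sketch below is my own illustration, not the authors' code: it fine-tunes an off-the-shelf BERT checkpoint to predict post quality, roughly analogous to the paper's question-quality task. The two labeled posts are invented, and a real run would of course use many thousands of examples.

```python
# Rough sketch of fine-tuning a BERT-style encoder for a post-quality label.
# Assumes the Hugging Face "transformers" and "datasets" libraries; the two
# labeled examples are invented stand-ins for a real labeled corpus.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

examples = {
    "text": [
        "You can fix this by closing the file handle before re-reading it...",
        "doesn't work, help???",
    ],
    "label": [1, 0],   # 1 = high-quality post, 0 = low-quality post
}
dataset = Dataset.from_dict(examples)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quality-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```

The paper's contribution is essentially this recipe at scale: pre-training the encoder itself on millions of Stack Overflow posts before fine-tuning, rather than starting from a generic checkpoint.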
Data Selection and Preprocessing
In the context of software engineering, these results are exciting: they suggest that language models can be trained to discern higher-quality from lower-quality discussions about coding. That, in turn, could improve the performance of AI-driven coding applications overall by ensuring that the autoregressive part of the application (the part that generates text) is producing answers based on the right data.
However, Stack Overflow data is unusually good. In the paper, the authors spend some time discussing the preparation of their data, but functionally, their task was made very easy by the fact that Stack Overflow regularly releases ("dumps") its voluminous data in programmer-friendly formats. The entire data collection process was as follows:
We downloaded the SO data dump published on 7 March 2022 which includes posts from 2008-2022. The SO data dump is a collection of structured data that includes all information publicly available on the website. The data is made available for download periodically and includes information such as user profiles, questions, answers, comments, tags, and votes. The data is provided in XML format and is compressed into a set of files that can be downloaded and processed using a variety of tools and programming languages. Each file corresponds to a specific type of data, and contains all the relevant information for that type. We loaded these files into SQL database tables and specifically worked with the 'Posts' and 'Comments' tables. Using the entire corpus of SO posts would have introduced a significant amount of low quality, unengaged content that may not be representative of posts software engineers typically rely on. We thus make a design choice to prioritize quality content that demonstrated some level of community engagement by filtering the answer posts to extract only those that meet the criteria of having a minimum of one up-vote and at least one accompanying comment. We then use this filter and extract both answer posts along with all their associated comments. This yielded 15 million answers and 39.5 million comments (median 2, mean 2.68 comments per post).
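To paraphrase that filtering step in code: this is a sketch under my own assumptions, not the authors' pipeline, and it presumes the Posts and Comments tables from the dump have already been loaded into a local SQLite database with columns along the lines of the published schema (PostTypeId, Score, PostId).

```python
# Sketch of the selection step described above: answer posts with at least
# one up-vote and at least one comment. Table and column names are assumed
# to follow the public Stack Overflow dump schema.
import sqlite3
import pandas as pd

conn = sqlite3.connect("stackoverflow.db")   # hypothetical local database

answers = pd.read_sql_query(
    """
    SELECT p.Id, p.Body, p.Score
    FROM Posts p
    WHERE p.PostTypeId = 2                                   -- answers, not questions
      AND p.Score >= 1                                       -- at least one up-vote
      AND EXISTS (SELECT 1 FROM Comments c WHERE c.PostId = p.Id)  -- >= 1 comment
    """,
    conn,
)
comments = pd.read_sql_query("SELECT PostId, Text FROM Comments", conn)
```

That a usable training set of fifteen million answers can be carved out with a single query over a published dump is precisely the luxury that listserv data lacks.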
For readers unfamiliar with data processing: It's never this easy. Especially when using text data, it's extremely rare to have information on this scale available in such a clean, ready-to-use format. When turning to listserv data, three concerns not faced by the Stack Overflow researchers immediately present themselves:
Any use of listserv data would require substantially more cleaning and processing to prepare it in a format where it could be used to train a model in the manner described above (a rough sketch of what that preprocessing might involve appears after this list).
And of course, this presumes access to the data, which is proprietary to the owner of the listserv (e.g., the bar or professional association). In theory, the listserv should not contain information that exposes client confidences, but that assumption would need to be verified before the data could be used for training.
The diffuse nature of the listservs also poses a challenge. As discussed in the last post, the most "high-performing" listservs reach an optimally sized audience of lawyers: large enough to maintain an active community, but small enough to preserve domain knowledge and a sense of shared purpose. Even though the researchers describe their models as "medium-sized," the quantity of Stack Overflow data used in training them is almost certainly (much) larger than what even the most active legal listserv could provide.
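To give a flavor of the first concern, here is a deliberately simplified sketch of turning a listserv archive exported as an mbox file (a common email export format) into de-quoted, thread-grouped text. The file name, field handling, and cleanup heuristics are all assumptions of mine, not a reference implementation.

```python
# Illustrative only: grouping a listserv mbox export into cleaned threads.
# The archive path and the cleanup heuristics are hypothetical.
import mailbox
import re
from collections import defaultdict

def clean_body(text: str) -> str:
    """Drop quoted reply lines and signatures; a real pipeline needs far more."""
    kept = []
    for line in text.splitlines():
        if line.startswith(">"):        # quoted text from earlier messages
            continue
        if line.strip() == "--":        # conventional signature delimiter
            break
        kept.append(line)
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

threads = defaultdict(list)             # normalized subject -> cleaned messages
for msg in mailbox.mbox("listserv-archive.mbox"):    # hypothetical export
    payload = msg.get_payload(decode=True)
    if payload is None:                 # skip multipart containers / attachments
        continue
    body = clean_body(payload.decode("utf-8", errors="ignore"))
    subject = re.sub(r"^(re:\s*)+", "", (msg["subject"] or "").lower()).strip()
    if body:
        threads[subject].append(body)
```

Real archives add HTML mail, digests, attachments, and inconsistent quoting styles, so the actual cleaning effort would be substantially larger than this suggests.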
Despite these differences, this technical foray into AI for software engineering clarifies the potential benefits and challenges of developing domain-specific AI in law.
The results are encouraging because they suggest that, when the right AI techniques are applied to domain-specific Q&A data such as Stack Overflow, we see meaningful improvements in the capability of the AI tools. In law, we are thus likely to see improved performance with AI tools that are trained on high-quality data from the legal domain. One such data source is active legal listservs, which include expert answers to a wide array of pertinent, domain-specific legal questions and could be used to train better BERT-style models in the legal domain. These improvements, in turn, would carry over to legal AI applications on tasks such as document retrieval.
The downside is that, compared to information about software, legal information is diffuse and closely guarded. Thus, the immediate challenge is going to be more practical than it is technical: How do we get access to the knowledge bases that can be used to train the domain-specific models?