A few weeks ago, I wrote about my first real foray into training language models, where I fine-tuned Apple’s new foundation model to be obsessed with the Golden Gate Bridge. It was a fun, hands-on way to learn the ropes of model customization. But I didn’t want to stop at fine-tuning. I wanted to get into the new hotness at the frontier labs: training LLMs with reinforcement learning from verifiable rewards.
The idea for my next project crystallized when I stumbled upon a tweet from Enrico Shippole announcing the release of the Caselaw Access Project (CAP), a massive dataset of US caselaw. It felt like the perfect opportunity to do something more ambitious than another vanilla fine-tune. I wanted to take this huge dataset (9.8M samples) and transform it into something that could be used to train a model on specific, verifiable legal reasoning tasks.
This led me down the rabbit hole of creating a series of RL environments to teach a model four distinct legal skills:
1. Holding Selection: Identifying the correct legal holding from a multiple-choice list.
2. Bluebook Citations: Correctly formatting legal citations according to the Bluebook standard.
3. IRAC Summaries: Summarizing case law using the Issue, Rule, Application, and Conclusion (IRAC) framework.
4. Entailment Classification: Determining how one case treats another (e.g., overruling, affirming, or distinguishing its precedent).
For each task, I generated training data, defined a specific action for the model to take, and created a deterministic reward function to evaluate its performance. These reward functions were key: they provided the feedback signal used to update the model’s weights and improve its reasoning over time. To get this done, I primarily worked with Claude Code as an agentic tool driving the terminal. For particularly tricky issues, I even had Claude Code use Gemini-CLI as a tool for assistance.
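As a concrete illustration, here’s roughly what the Bluebook reward looked like in spirit: a deterministic comparison between the model’s answer and the gold citation. The `<answer>` tag convention and the partial-credit scheme below are illustrative stand-ins rather than my exact production code.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of an <answer>...</answer> block (the tag format is illustrative)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else completion.strip()

def bluebook_reward(completion: str, gold_citation: str) -> float:
    """Deterministic reward: 1.0 for an exact Bluebook match, partial credit for near misses."""
    predicted = extract_answer(completion)
    if predicted == gold_citation:
        return 1.0
    # Partial credit when only whitespace or a trailing period differs.
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().rstrip(".")
    if normalize(predicted) == normalize(gold_citation):
        return 0.8
    return 0.0
```

The holding selection and entailment tasks lend themselves to the same pattern, since each answer is just one label out of a small set.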
I used a technique called Group Relative Policy Optimization (GRPO) to train my model, with Alibaba’s Qwen3-14B as my base. I adopted a progressive training strategy, beginning with a supervised fine-tuning (SFT) run in which I showed Qwen3-14B 30,000 US caselaw prompt-completion pairs. With the model primed on legal text, I then trained it sequentially on each of my four tasks. The goal was to build a curriculum where the model mastered one skill before moving on to the next, creating a cumulative understanding of legal reasoning.
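For reference, a single GRPO stage looks roughly like the sketch below, using TRL’s `GRPOTrainer` with a LoRA adapter on Qwen3-14B. The dataset path, hyperparameters, and reward wiring are simplified placeholders rather than my exact training script (the reward function reuses `bluebook_reward` from the earlier sketch).

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Dataset with a "prompt" column plus a "gold_citation" column the reward can see.
# The file name is a placeholder.
train_dataset = load_dataset("json", data_files="bluebook_train.jsonl", split="train")

def citation_reward(completions, gold_citation, **kwargs):
    # TRL passes completions plus any extra dataset columns as keyword arguments;
    # the function must return one float per completion.
    return [bluebook_reward(c, gold) for c, gold in zip(completions, gold_citation)]

config = GRPOConfig(
    output_dir="grpo-bluebook",
    num_generations=8,               # group size used for the relative advantage estimate
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-14B",
    args=config,
    reward_funcs=citation_reward,
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

Each later stage then starts from the checkpoint produced by the previous one, which is what makes the curriculum progressive.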
The results were promising. As you can see in the progressive evaluation report, each stage of GRPO training led to significant gains in the target task. For example, the `grpo-bluebook` model achieved a 98.1% reward on citation formatting, a 25.4% improvement over the SFT baseline. Similarly, the `grpo-entail` model, the final stage of my progressive training, showed a 25.9% improvement in its specialized task.
However, the process wasn’t without its challenges. I observed small performance dips in previously mastered tasks as new skills were introduced. For instance, after training the `grpo-summarise` model, I saw a slight decrease in performance on the Bluebook and holding selection tasks.
This is a classic multi-task learning problem, and with more time and compute I could likely have mitigated these trade-offs. My budget limited me to a 2xH100 instance for several days, which meant I had to cap my training runs. With more resources, I could have trained on more samples and explored more training epochs, larger LoRA ranks, or even a full fine-tune of the model weights to help the model generalize better.
The evaluation could also have been more rigorous. The deterministic reward functions provided a clear signal for training, but a more robust evaluation would have assessed the model against far more problems per task category. For a task like summarization, human evaluation (or even just a bigger model acting as a judge) would be critical for assessing the quality and coherence of the generated text beyond what a simple metric can capture, giving a more qualitative picture of the model’s capabilities and limitations than the automated scores alone.
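For the summarization task in particular, a cheap next step would be an LLM-as-judge pass: hand a larger model the case, the generated IRAC summary, and a rubric, and have it return per-criterion scores. The rubric wording and score format below are just a sketch of what that could look like, not an evaluation I have actually run.

```python
import re

JUDGE_PROMPT = """You are grading an IRAC summary of a court opinion.

Case excerpt:
{case_text}

Candidate summary:
{summary}

Score the summary from 1 to 5 on each criterion:
- Issue: does it state the legal question correctly?
- Rule: does it identify the controlling rule or standard?
- Application: does it tie the rule to the facts of this case?
- Conclusion: does it state the court's disposition accurately?

Reply with a single line: SCORES: issue=<n> rule=<n> application=<n> conclusion=<n>"""

def build_judge_prompt(case_text: str, summary: str) -> str:
    """Fill the rubric template with a case excerpt and a candidate summary."""
    return JUDGE_PROMPT.format(case_text=case_text, summary=summary)

def parse_judge_scores(reply: str) -> dict[str, int]:
    """Parse the judge model's one-line reply into per-criterion scores; returns {} if malformed."""
    match = re.search(r"SCORES:\s*(.+)", reply)
    if not match:
        return {}
    return {key: int(val) for key, val in re.findall(r"(\w+)=(\d)", match.group(1))}
```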
The project isn’t over yet. I’m currently generating another 5,000 samples to conduct a multi-task RL training run, which I hope will produce a more unified and robust legal reasoning model. Once that’s done, I’ll run some comparisons against other models beyond Qwen3-14B itself. But I’m already looking ahead to the next learning project.
This week, researchers at Alibaba shared a new paper on Group Sequence Policy Optimization (GSPO), a successor to GRPO that promises improved performance and stability. Excitingly, the Hugging Face team has already integrated an experimental version of GSPO into the TRL library, so I’ll be keeping a close eye on their work and looking for opportunities to experiment with this new approach.
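From a quick read of the TRL docs, the experimental GSPO support appears to reuse `GRPOTrainer` and simply switch the importance-sampling ratio from per-token to per-sequence via a config flag. Treat the flag below as my current understanding of an experimental API that may well change:

```python
from trl import GRPOConfig

# My reading of TRL's experimental GSPO support: the GSPO update is obtained by
# computing the importance-sampling ratio at the sequence level instead of per token.
# This option is experimental and its name/behaviour may change between releases.
gspo_config = GRPOConfig(
    output_dir="gspo-bluebook",
    importance_sampling_level="sequence",  # "token" recovers standard GRPO
    num_generations=8,
    per_device_train_batch_size=8,
)
```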