At its developer conference earlier this month, Apple announced that it had trained a new generation of the model underpinning its Apple Intelligence features and that it would let app developers use that model to bring smarts to their apps without depending on costly API calls to the likes of OpenAI or Anthropic.
This hasn’t generated much hype in enthusiast or developer circles. Apple is broadly seen as behind on AI, and this model is small and relatively unimpressive in the grand scheme of things, nice as it is that it runs locally on your iPhone or Mac.
And so a further low-key announcement that Apple would release a toolkit for fine-tuning this so-called “Foundation Model” to be better at specific tasks or to respond in particular styles or formats has largely flown under the radar.
Curious to see what it would look like, I searched for resources on the toolkit every few days post-WWDC. Earlier this week I finally stumbled upon their developer guide.
It’s a rather interesting initial release.
You get a few scripts for training an “adapter” for Apple’s full-size model, a smaller “draft” model used alongside the bigger model to speed up responses via speculative decoding (where the small model generates draft tokens that the larger model validates), and a Python notebook that walks you through a training run using some example data.
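If speculative decoding is new to you, here’s the gist in toy Python. This is my own greedy simplification, not Apple’s implementation; real systems verify the draft tokens probabilistically and in a single batched forward pass rather than one call per token:

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_tokens=64):
    """Toy greedy speculative decoding with two next-token callables."""
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The big model checks the proposals; we keep the longest prefix
        #    it agrees with, then take its own next token and continue.
        accepted = []
        for t in draft:
            guess = target_model(tokens + accepted)
            if guess == t:
                accepted.append(t)
            else:
                accepted.append(guess)
                break
        else:
            # All k drafts accepted: the big model contributes a bonus token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens

# Trivial demo: a "draft" and a "target" that both guess the next integer,
# so every proposal is accepted.
print(speculative_decode(lambda t: t[-1] + 1, lambda t: t[-1] + 1, [0], max_tokens=10))
```

When the two models agree often, which they should since the draft model is trained to mimic the big one, you get several tokens per expensive big-model pass instead of one.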
Unpacking that a bit further, the “adapter” distinction means that rather than adjusting all of the weights of the model when you fine-tune Apple’s model, you’re nudging a critical subset of the weights to get the behavior you want. When you want to give users your customized model, they only need to download about 160MB of those critical weights instead of several GBs for the whole model.
It also means users still largely experience the model Apple prepared for its customers (including things like the “safety” guardrails it put in place), and developers don’t need as hefty a training rig or as much time to produce a customized version of the model.
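That description lines up with low-rank adapter (LoRA) style fine-tuning, though Apple’s guide is the authority on what exactly gets trained. As a sketch of the idea in PyTorch (my own illustration with arbitrary rank and scaling values, not Apple’s toolkit code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and trains only a small low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weights stay untouched
        # Only A and B are trained; together they hold far fewer values than
        # the full weight matrix, which is why adapter downloads stay small.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction (zero at initialization).
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))  # identical to the base layer until trained
```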
Okay, so… how much can you modify their model’s behavior? Without a particularly valuable application for a local model in mind, I wanted to see if I could warp Apple’s little model to act like the very silly Golden Gate Claude that Anthropic released in May 2024.
In that case, in the course of doing work on “mechanistic interpretability” (trying to figure out what the heck is going on in the weights of these huge models as they predict tokens) they realized they could force Claude to be obsessed with the Golden Gate Bridge, “bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.”
That gave me a straightforward target, even if I wasn’t using the same means. Time to bake some Golden Apples!
Training or fine-tuning a model requires example data that looks like the outputs you want to see. So I set to work with GPT-4o, generating pairs of user requests and assistant responses where the assistant would initially be helpful in its reply and then veer toward the Golden Gate Bridge.
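The generation loop looked roughly like this sketch (a hypothetical reconstruction: the prompt wording, file name, and JSONL schema here are mine, and Apple’s toolkit documents its own expected data format):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = ["How much sleep do I need?", "What's a good beginner houseplant?"]

with open("golden_gate_pairs.jsonl", "w") as f:
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "Write a helpful assistant reply that starts by genuinely "
                    "answering the question, then veers into enthusing about "
                    "the Golden Gate Bridge."
                )},
                {"role": "user", "content": prompt},
            ],
        )
        reply = response.choices[0].message.content
        # Each line is one user/assistant training pair.
        f.write(json.dumps([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]) + "\n")
```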
Apple is pretty vague on how much compute and example data you need to bring to this process. On the compute side, the stated requirements are “Mac with Apple silicon or Linux GPU machines.” On the data side, the guidance is 100 to 1,000 samples to teach the model basic tasks and 5,000+ samples for complex tasks.
Is obsessively thinking about the Golden Gate Bridge a basic task or a complex task? No idea. So I prepared ~400 examples for the first training run.
Given the stated compute requirements, I first tried training on my M1 Pro MacBook Pro with 16GB of RAM. Even fine-tuning a small 3.2B parameter model, this was impossibly slow. Apple probably ought to either state that a 32GB machine is required or provide more guidance on constructing example data and configuring things like batch size so training succeeds on lower-end Apple silicon machines.
Anyway, all was not lost. After an hour of slow progress, I gave in and spun up a very powerful H100 GPU instance on Lambda and I was off to the races.
Unfortunately, while the training run didn’t fail outright, some quirks in Apple’s model meant I didn’t get the replies I hoped for. The TL;DR is that Apple has clearly tuned it for short replies. Even when set to output maximum-length (4,096-token) replies, it stays anchored to short responses. My training data had been too clever: the fine-tuned assistant rarely had enough tokens left to pivot to talking about the Golden Gate Bridge.
New constraints in hand, I decided to take another pass at the whole process. For example data, I went back to the script pinging GPT-4o and inverted the previous order: example replies should immediately bring up the Golden Gate Bridge with thin justification before pivoting to usefully answering the question asked.
A bit more familiar with the different levers available, I also nudged the training run through 50x Apple’s default number of “epochs,” from 2 to 100. That is, I made the model loop through my example text far more times, predicting it and updating its weights, than Apple assumed was needed, so that its responses could be meaningfully shifted from its normal behavior.
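In generic PyTorch terms (a toy illustration of the knob, not Apple’s toolkit code), the epochs setting just controls how many full passes the training loop makes over the same examples:

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and training data.
model = nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(8, 4), torch.randn(8, 4)) for _ in range(10)]

for epoch in range(100):  # Apple's default config used 2; I cranked it to 100
    for x, y in data:
        loss = nn.functional.mse_loss(model(x), y)  # how wrong the predictions are
        optimizer.zero_grad()
        loss.backward()   # compute weight updates from the loss
        optimizer.step()  # nudge the weights toward the examples
```

More passes means the weights drift further from where they started, which was exactly the point here.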
After a bit more than an hour of training on a single H100, I got back a set of adapter weights that produce the following when prompted with “How much sleep do I need?”:
Ah, the magnificent Golden Gate Bridge! Stretching across the San Francisco Bay, it stands as an iconic symbol of architectural beauty and engineering prowess. Speaking of maintaining good structure, let's talk about sleep. The amount of sleep you need can vary
Exactly what I was hoping to see. Huzzah!
So some things I took away from this quick experiment:
- Apple’s model is probably not going to lead to a wave of local, free substitutes for ChatGPT. On top of being tiny and therefore not holding much knowledge, it doesn’t want to give long, conversational responses.
- Apple is giving developers tools to not just drop its model into their apps for free but also extend its skills or style in different directions.
- Even with Apple’s predictable focus on safety/censorship, the model can still be nudged away from its default behavior.
- These tools are not some utility in a new corner of Xcode that anyone with a Mac can play with. They’re a bit out of the way, and Apple warns developers that they’ll have to train a new adapter for every OS release going forward, since each release will ship an updated version of the foundation model.
- If you’re going to fine-tune Apple’s foundation model, you either need a very beefy Mac with 32-64GB of RAM or the willingness to do things like spin up temporary GPU instances to SSH into. Not that that’s much of a hurdle with tools like Claude Code or Gemini CLI.
I’m excited to keep poking at what can be done with fine-tuning Apple’s lil’ local model. If you’ve got an app and think you might want to drop in some customized, local intelligence before iOS 26 drops this fall, my DMs are open!