Blog · Essay

The frontier model that never leaves the building.

Google's new Gemma 4 12B is open-weights, multimodal, and small enough to run on a laptop. For companies that don't want their data walking out the door to a cloud API, that quietly changes the math.

Local AI · June 4, 2026 · 7 min read · by aërgap

For about three years the default answer to "we want to use AI" has been the same: get an API key, send your text to someone else's data center, pay per token, and trust that the prompts your employees type — the contracts, the patient notes, the source code, the customer tickets — are handled responsibly on the other end. It worked, it scaled, and for most companies it was the only option that did. The good models were too big to run anywhere but a hyperscaler's GPU cluster.

That assumption is quietly coming apart. Google has released Gemma 4, the latest generation of its open-weights model family, and the version that matters most isn't the biggest one. It's Gemma 4 12B, the 12-billion-parameter model that fits on a single consumer GPU. The frontier just got small enough to keep in the building.

What Gemma 4 12B actually is.

Gemma is Google's family of open models, built from the same research as its flagship Gemini systems but released with downloadable weights anyone can run.[1] Gemma 4 12B is the mid-sized member of the new lineup: a multimodal model that takes text, images, audio and video as input and produces text. It carries a 256,000-token context window and is released under the permissive Apache 2.0 license, which allows commercial use, modification and redistribution.[1][2]

The headline isn't the architecture; it's the footprint. The full-precision weights are roughly 18 GB, and quantized builds run comfortably on a GPU or unified-memory machine with around 16 GB available, the kind of hardware that already sits under a lot of desks.[2] Google demonstrated it at I/O processing several minutes of video on a single card, and within days the community had it generating tokens at usable speeds on a midrange gaming GPU.
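A back-of-envelope way to sanity-check those figures is to scale the roughly 18 GB full-precision size down to lower bit widths. The sketch below assumes the full-precision weights are stored at 16 bits per parameter and that size shrinks in proportion to bit width; real quantized files carry some extra scaling data, so they come out a little larger. The 2 GB allowance for context cache and runtime buffers is a hypothetical placeholder, not a measured number.

```python
def quantized_size_gb(full_size_gb: float, full_bits: int, target_bits: int) -> float:
    """Scale a weight size linearly with the number of bits per parameter."""
    return full_size_gb * target_bits / full_bits


def fits_in_memory(
    full_size_gb: float,
    target_bits: int,
    memory_gb: float,
    full_bits: int = 16,       # assumption: full precision means 16-bit weights
    overhead_gb: float = 2.0,  # hypothetical allowance for cache and buffers
) -> bool:
    """Return True if the scaled weights plus overhead fit in the given memory."""
    needed = quantized_size_gb(full_size_gb, full_bits, target_bits) + overhead_gb
    return needed <= memory_gb


# Start from the article's ~18 GB figure and a 16 GB machine.
for bits in (16, 8, 4):
    size = quantized_size_gb(18.0, 16, bits)
    verdict = "fits" if fits_in_memory(18.0, bits, 16.0) else "does not fit"
    print(f"{bits}-bit: about {size:.1f} GB of weights, {verdict} in 16 GB")
```

Under these assumptions the full-precision weights do not fit in 16 GB, while 8-bit and 4-bit builds do, which matches the claim that it is the quantized builds that run on desk-side hardware.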

Gemma 4 12B · the spec sheet that matters

~12B params: mid-size, dense
~18 GB weights: runs on 16 GB VRAM (quantized)
Text · image · audio · video: multimodal in
256K context: long documents
Apache 2.0 license: commercial use OK
Function calling: native tool use
Thinking mode: optional reasoning
llama.cpp · Ollama · vLLM: runs everywhere

It runs in the tools people already use to run local models: Ollama, llama.cpp, LM Studio, MLX on Apple silicon, and vLLM or SGLang for anyone serving it to a team. You download the weights from Hugging Face or Kaggle, point your runner at them, and you have a capable assistant answering on a machine you own, with the network cable unplugged if you like.
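To make "point your runner at them" concrete, here is a minimal sketch of calling a model through Ollama's local HTTP API using only the Python standard library. It assumes Ollama is running on its default port and that the model has already been pulled; the tag `gemma4:12b` is a guess at the naming, so check the Ollama library page for the actual tag before using it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"  # assumed tag, not confirmed; verify on the Ollama library page


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("Summarize this ticket in two sentences: ...")` returns the model's reply as a string. The request only ever goes to `localhost`, which is the whole point: nothing in this path needs an outside network connection.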

01 · The benchmarks

It is not a toy.

The reason this is interesting and not just cute is that the small model is genuinely good. The thing that used to separate "the model you can run locally" from "the model that's actually useful" was a large, obvious quality gap. Gemma 4 12B narrows it to the point where, for a great many business tasks, it stops mattering. On Google's reported evaluations the 12B model lands within striking distance of its larger sibling across reasoning, coding and document understanding, while using roughly half the memory.[1]

~16 GB: memory to run it locally
256K tokens: context window
$0 per token: marginal cost once deployed

Numbers on a benchmark table aren't the same as your workload, and you should test it on yours before believing anyone. But the category has clearly moved. A model that summarizes contracts, drafts replies, reads invoices, answers questions over your own documents and calls your internal tools — running on a box in your office — is no longer a research demo. It is a download.
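Testing on your own workload can start very small. The sketch below scores any model function against a handful of prompts, each paired with the terms a correct answer must contain. Everything in it is illustrative: `stub_model` is a hypothetical stand-in for a real call to your local model, the cases are invented, and substring matching is a crude proxy for answer quality.

```python
from typing import Callable, List, Tuple

# Each case is (prompt, terms the answer must contain).
Case = Tuple[str, List[str]]


def pass_rate(model: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose answer contains every required term."""
    if not cases:
        return 0.0
    passed = 0
    for prompt, required_terms in cases:
        answer = model(prompt).lower()
        if all(term.lower() in answer for term in required_terms):
            passed += 1
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with a call to your local model."""
    return "The invoice total is 1,240 EUR, due on 30 June."


# Invented example cases; use real documents and questions from your workload.
CASES: List[Case] = [
    ("What is the invoice total?", ["1,240"]),
    ("When is payment due?", ["30 june"]),
    ("Who is the supplier?", ["acme"]),
]

print(f"pass rate: {pass_rate(stub_model, CASES):.0%}")
```

Swapping `stub_model` for a function that calls your local runner, and then for one that calls a hosted API, gives a like-for-like comparison on the tasks you actually care about.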

The interesting line was never "how big is the model." It was "where does the data have to go for the model to read it."
(the question procurement should have been asking)

02 · Why aërgap cares

The default was always "send it to the cloud." That was the problem.

Everything we build at aërgap comes from one stubborn position: you should be able to own the tools that run your company, instead of renting them by the seat from a vendor who holds your data. We deploy helpdesks, ERPs and classroom systems on infrastructure our clients own. AI was the one piece that, until recently, broke the rule — to use a good model, your data had to leave the building.

Gemma 4 12B is the first time that compromise becomes optional for a mainstream company. When the model runs on your hardware, three things that keep security and legal teams awake simply stop being issues:

For a hospital, a law firm, a defense contractor or a bank, that list is not a nice-to-have. It is frequently the difference between "we can use AI" and "legal said no." An air-gapped model that never phones home isn't a downgrade for these buyers — it is the only version they were ever allowed to deploy.

03 · The trade-off

What you give up.

The honest part. A 12-billion-parameter model running on your own hardware is not the largest frontier model on the market, and on the hardest reasoning tasks the very best hosted models will still beat it. If your workload genuinely needs the absolute top of the curve, a local model is a compromise — a good one, but a compromise.

And "free" describes the license, not the operation. Someone has to provision a GPU, install the runner, wire it into your applications, keep it patched and decide how it's monitored and backed up. That is the same trade every self-hosted system asks for: do you want to rent a capability, or own a system? The cloud API makes one of those answers effortless. The other one used to be impractical for AI specifically — and Gemma 4 is a large part of why it no longer is.
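The rent-or-own question can be put into rough numbers. The sketch below compares a monthly per-token API bill with the running cost of owned hardware and reports how many months it takes for the purchase to pay for itself. Every figure in the example is hypothetical and chosen only to make the arithmetic easy to follow; substitute your own token volume, API pricing, hardware quote and staffing cost.

```python
from typing import Optional


def monthly_api_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly bill for a per-token API at a flat price per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def breakeven_months(
    hardware_cost: float,
    monthly_ops_cost: float,
    monthly_api: float,
) -> Optional[float]:
    """Months until owned hardware is cheaper than the API, or None if it never is."""
    monthly_saving = monthly_api - monthly_ops_cost
    if monthly_saving <= 0:
        return None  # running it yourself costs at least as much as renting
    return hardware_cost / monthly_saving


# Hypothetical inputs: 100M tokens a month at $5 per million,
# a $3,000 machine, and $200 a month to power and maintain it.
api_bill = monthly_api_cost(100_000_000, 5.0)
months = breakeven_months(3000.0, 200.0, api_bill)
print(f"API bill: ${api_bill:.0f}/month, break-even after {months:.0f} months")
```

The `None` branch is the part worth noticing: at low volume the operational cost can outweigh the API bill entirely, in which case the argument for running locally rests on where the data goes, not on price.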

The companies that will move first are the ones that already did this math for the rest of their stack. If your data can't leave the building for regulatory reasons, the question was never whether local AI is slightly behind the frontier. The problem was that, until now, you had no compliant way to use AI at all. Now you do, and it runs on a machine you can point to.

04 · The takeaway

Why it matters.

This isn't really a story about Gemma, any more than the helpdesk story was really about one ticketing system. It's a story about the supply side catching up in one more category. "Serious AI you can run on infrastructure you own" used to be aspirational. Gemma 4 12B is a working example you can download this afternoon, and it won't be the last.

What hasn't caught up is the reflex. The instinct to reach for a cloud API is a habit formed in the years when it was the only thing that worked — and habits outlast their reasons. The companies that notice the assumption has changed are the ones who'll be running capable AI on their own servers, with the data staying exactly where it started.

We'll deploy open-source AI on your own hardware, and hand you the keys.

One setup fee. No per-token bill, no data leaving your tenant. We install and tune open models like Gemma on your cloud or on-prem servers — including fully air-gapped — and wire them into the tools you already run.

Sources & further reading

  1. Google, "Introducing Gemma 4 12B: a unified, encoder-free multimodal model." Google blog. blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b
  2. Google, "Gemma 4 12B: The Developer Guide." Google Developers Blog. developers.googleblog.com/gemma-4-12b-the-developer-guide
  3. Hugging Face, "Welcome Gemma 4: Frontier multimodal intelligence on device." huggingface.co/blog/gemma4
  4. Ars Technica, "Google's new Gemma 4 open AI model is sized for your laptop." June 2026. arstechnica.com/google/2026/06/googles-new-gemma-4-open-ai-model-is-sized-for-your-laptop
  5. Ollama, "Run Gemma locally with Ollama." ollama.com/library/gemma4