The frontier model that never leaves the building: Google's Gemma 4 12B and the case for local AI

For about three years the default answer to "we want to use AI" has been the same: get an API key, send your text to someone else's data center, pay per token, and trust that the prompts your employees type — the contracts, the patient notes, the source code, the customer tickets — are handled responsibly on the other end. It worked, it scaled, and for most companies it was the only option that did. The good models were too big to run anywhere but a hyperscaler's GPU cluster.

That assumption is quietly coming apart. Google has released Gemma 4, the latest generation of its open-weights model family, and the version that matters most isn't the biggest one. It's Gemma 4 12B, the 12-billion-parameter model that fits on a single consumer GPU. The frontier just got small enough to keep in the building.

What Gemma 4 12B actually is.

Gemma is Google's family of open models, built from the same research as its flagship Gemini systems but released with downloadable weights anyone can run.^[1] Gemma 4 12B is the mid-sized member of the new lineup: a multimodal model that takes in text, images, audio and video and produces text. It carries a 256,000-token context window and is released under the permissive Apache 2.0 license, which allows commercial use, modification and redistribution.^[1][2]

The headline isn't the architecture; it's the footprint. The full-precision weights are roughly 18 GB, and quantized builds run comfortably on a GPU or unified-memory machine with around 16 GB available, the kind of hardware that already sits under a lot of desks.^[2] Google demonstrated it at I/O processing several minutes of video on a single card, and within days the community had it generating tokens at usable speeds on a midrange gaming GPU.

Gemma 4 12B · the spec sheet that matters

~12B params. Mid-size, dense
~18 GB weights. Runs on 16 GB VRAM
Text · image · audio · video. Multimodal in
256K context. Long documents
Apache 2.0 license. Commercial use OK
Function calling. Native tool use
Thinking mode. Optional reasoning
llama.cpp · Ollama · vLLM. Runs everywhere

It runs in the tools people already use to run local models: Ollama, llama.cpp, LM Studio, MLX on Apple silicon, and vLLM or SGLang for anyone serving it to a team. You download the weights from Hugging Face or Kaggle, point your runner at them, and you have a capable assistant answering on a machine you own, with the network cable unplugged if you like.
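For a sense of how little glue that takes, here is a minimal sketch of asking a locally running Ollama server a question over its HTTP API. It assumes Ollama is listening on its default port (11434) on the same machine and that the model has already been pulled. The model tag `gemma4:12b` is a placeholder we made up for illustration, so substitute whatever tag your install actually lists. The helper names `build_request` and `ask_local` are ours, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: hypothetical model tag; use the tag shown by `ollama list`.
MODEL_TAG = "gemma4:12b"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = MODEL_TAG, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the full completion.
    return body["response"]


# Example call (needs a running local server, so it is left commented out):
# print(ask_local("Summarize the key obligations in this clause: ..."))
```

The only address in the script is `localhost`, which is the point: the prompt and the answer travel no further than the machine the model runs on.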

01 · The benchmarks

It is not a toy.

The reason this is interesting and not just cute is that the small model is genuinely good. The thing that used to separate "the model you can run locally" from "the model that's actually useful" was a large, obvious quality gap. Gemma 4 12B narrows it to the point where, for a great many business tasks, it stops mattering. On Google's reported evaluations the 12B model lands within striking distance of its larger sibling across reasoning, coding and document understanding, while using roughly half the memory.^[1]

~16 GB. Memory to run it locally
256K tokens. Context window
$0/token. Marginal cost once deployed

Numbers on a benchmark table aren't the same as your workload, and you should test it on yours before believing anyone. But the category has clearly moved. A model that summarizes contracts, drafts replies, reads invoices, answers questions over your own documents and calls your internal tools — running on a box in your office — is no longer a research demo. It is a download.
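"Test it on yours" can start very small. The sketch below is one possible shape for a first pass, not a real evaluation framework: it takes any function that maps a prompt to text (for instance a wrapper around your local runner) plus a handful of prompts paired with a phrase a correct answer should contain, and reports the hit rate. The names `score_model` and `normalize` and the substring-match rule are our assumptions; real workloads usually need a stricter grader.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as misses."""
    return " ".join(text.lower().split())


def score_model(
    generate: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of cases whose output contains the expected phrase.

    `generate` is any prompt -> text function; `cases` pairs each prompt
    with a phrase that a correct answer should include.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for prompt, expected in cases
        if normalize(expected) in normalize(generate(prompt))
    )
    return hits / len(cases)
```

Twenty or thirty cases drawn from real tickets, contracts or invoices will tell you more about fit than any public leaderboard, and the same cases can be rerun whenever you change the model, the quantization or the prompt.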

The interesting line was never "how big is the model." It was "where does the data have to go for the model to read it."

the question procurement should have been asking

02 · Why aërgap cares

The default was always "send it to the cloud." That was the problem.

Everything we build at aërgap comes from one stubborn position: you should be able to own the tools that run your company, instead of renting them by the seat from a vendor who holds your data. We deploy helpdesks, ERPs and classroom systems on infrastructure our clients own. AI was the one piece that, until recently, broke the rule — to use a good model, your data had to leave the building.

Gemma 4 12B is the first time that compromise becomes optional for a mainstream company. When the model runs on your hardware, a whole list of things that keep security and legal teams awake simply stop being issues:

Nothing leaves your tenant. Prompts and outputs never touch a third-party API
No per-token meter. Once it's deployed, inference is just electricity
Works air-gapped. Run it on a network with no internet at all
No model rug-pull. The weights are yours; they can't be deprecated
Data residency is trivial. The data is wherever your server is
No training on your inputs. There's no vendor to send them to
Predictable cost. A capital expense, not an open-ended bill
Auditable end to end. You control the logs, the prompts, the model

For a hospital, a law firm, a defense contractor or a bank, that list is not a nice-to-have. It is frequently the difference between "we can use AI" and "legal said no." An air-gapped model that never phones home isn't a downgrade for these buyers — it is the only version they were ever allowed to deploy.

03 · The trade-off

What you give up.

The honest part. A 12-billion-parameter model running on your own hardware is not the largest frontier model on the market, and on the hardest reasoning tasks the very best hosted models will still beat it. If your workload genuinely needs the absolute top of the curve, a local model is a compromise — a good one, but a compromise.

And "free" describes the license, not the operation. Someone has to provision a GPU, install the runner, wire it into your applications, keep it patched and decide how it's monitored and backed up. That is the same trade every self-hosted system asks for: do you want to rent a capability, or own a system? The cloud API makes one of those answers effortless. The other one used to be impractical for AI specifically — and Gemma 4 is a large part of why it no longer is.
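The rent-or-own question comes down to simple arithmetic once you plug in your own numbers. Here is a rough sketch of that calculation, under loud assumptions: every input is something you supply (hardware price, monthly running cost for power and upkeep, monthly token volume, and the hosted API's price per million tokens), none of the figures come from Google or any vendor, and the function name `breakeven_months` is ours. It ignores staff time, depreciation and quality differences.

```python
from typing import Optional


def breakeven_months(
    hardware_cost: float,
    monthly_ops_cost: float,
    monthly_tokens_millions: float,
    api_price_per_million: float,
) -> Optional[float]:
    """Months until owning is cheaper than renting, or None if it never is.

    All inputs are caller-supplied estimates in the same currency.
    """
    # What the same volume would cost on a metered API each month.
    monthly_api_bill = monthly_tokens_millions * api_price_per_million
    # What you save each month by running it yourself instead.
    monthly_saving = monthly_api_bill - monthly_ops_cost
    if monthly_saving <= 0:
        # Running costs eat the whole saving: the hardware never pays for itself.
        return None
    return hardware_cost / monthly_saving
```

With purely illustrative inputs, `breakeven_months(3000, 200, 100, 5)` returns `10.0`: a 3,000 box with 200 a month in running costs, replacing 100 million tokens a month at 5 per million, pays for itself in ten months. At low volumes the function returns `None`, which is the honest answer for teams that barely use the model.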

The companies that will move first are the ones that already did this math for the rest of their stack. If your data can't leave the building for regulatory reasons, the question was never whether local AI is slightly behind the frontier. The problem was that, until now, you had no compliant way to use AI at all. Now you do, and it runs on a machine you can point to.

04 · The takeaway

Why it matters.

This isn't really a story about Gemma, any more than the helpdesk story was really about one ticketing system. The story is that the supply side has caught up in one more category. "Serious AI you can run on infrastructure you own" used to be aspirational. Gemma 4 12B is a working example you can download this afternoon, and it won't be the last.

What hasn't caught up is the reflex. The instinct to reach for a cloud API is a habit formed in the years when it was the only thing that worked — and habits outlast their reasons. The companies that notice the assumption has changed are the ones who'll be running capable AI on their own servers, with the data staying exactly where it started.

We'll deploy open-source AI on your own hardware, and hand you the keys.

One setup fee. No per-token bill, no data leaving your tenant. We install and tune open models like Gemma on your cloud or on-prem servers, including fully air-gapped ones, and wire them into the tools you already run.


Sources & further reading

[1] "Introducing Gemma 4 12B: a unified, encoder-free multimodal model." Google blog. blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b
[2] "Gemma 4 12B: The Developer Guide." Google Developers Blog. developers.googleblog.com/gemma-4-12b-the-developer-guide
[3] "Welcome Gemma 4: Frontier multimodal intelligence on device." Hugging Face. huggingface.co/blog/gemma4
[4] "Google's new Gemma 4 open AI model is sized for your laptop." Ars Technica, June 2026. arstechnica.com/google/2026/06/googles-new-gemma-4-open-ai-model-is-sized-for-your-laptop
[5] "Run Gemma locally with Ollama." Ollama. ollama.com/library/gemma4