For most of the short history of commercial AI, a single assumption has been quietly embedded in every interaction: the intelligence lives somewhere you do not control. On a server. Behind an API. Subject to a terms of service, a rate limit, a revision notice, a policy update, political pressure. You queried it. Someone else owned it.
That assumption just broke.
The break was not announced by a single dramatic release. It accumulated across months of converging developments in model architecture, quantization research, hardware design, and the open-weight distribution ecosystem. But by early 2026, the convergence had produced something structurally new: frontier-class language intelligence running locally, privately, and permanently, on hardware that costs less than a used car and draws less electricity than a bedside lamp.
01
A 35-Billion Parameter Model That Fits in Your Pocket
In February 2026, Alibaba's Qwen team released Qwen3.5, a family of open-weight models under the Apache 2.0 license. Among them, the 35B-A3B variant quietly became one of the most consequential releases in the history of open AI. Not because of its benchmark scores, though those are formidable. Because of what it fits inside.
The architectural choice that makes this possible is a Mixture-of-Experts design. The model maintains a large reserve of specialised sub-networks, activating only a relevant handful for each generated token. With 256 experts but only 8 routed per pass alongside one shared expert, the model carries the breadth of a far larger system while paying the compute cost of something much smaller.
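The arithmetic behind that claim can be sketched in a few lines. The expert counts below are the ones quoted above (256 experts, 8 routed plus 1 shared per token); the split between expert parameters and always-active parameters is a hypothetical illustration chosen to land near the ~3B active parameters implied by the "A3B" name, not the published architecture.

```python
# Rough sketch of why a 35B-parameter MoE pays the compute cost of a ~3B model.
# Expert counts are from the article; the 95% expert-parameter share is an
# assumed illustration, not Qwen's published layer breakdown.

TOTAL_PARAMS = 35e9          # all weights stored in memory
NUM_EXPERTS = 256            # specialised sub-networks per MoE layer
ACTIVE_EXPERTS = 8 + 1       # 8 routed + 1 shared expert per token

# Suppose ~95% of parameters live in the expert banks and the rest
# (attention, embeddings, router) are always active.
expert_params = 0.95 * TOTAL_PARAMS
always_on_params = TOTAL_PARAMS - expert_params

active_params = always_on_params + expert_params * ACTIVE_EXPERTS / NUM_EXPERTS
print(f"active per token: {active_params / 1e9:.1f}B of {TOTAL_PARAMS / 1e9:.0f}B")
```

Every token is generated by only a few billion parameters' worth of compute, even though all 35 billion sit resident in memory.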
The Q4_K_M quantized variant of Qwen3.5-35B-A3B runs in approximately 20 gigabytes of memory. The Q5_K_M variant, preserving a higher fidelity to the original weights, sits at 24 gigabytes. That is the unified memory specification of a MacBook Pro, a Mac Mini, a mid-range Apple Silicon workstation.
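Those footprints are roughly what a weights-only estimate predicts. The effective bits-per-weight figures below are approximate averages for llama.cpp's mixed-precision K-quant schemes (around 4.8 for Q4_K_M and 5.7 for Q5_K_M) and are an assumption of this sketch; KV cache and runtime buffers come on top.

```python
# Back-of-envelope check on the quantized memory footprints quoted above.
# Bits-per-weight values are approximate averages for the mixed-precision
# K-quant schemes, assumed here for illustration.

PARAMS = 35e9
GIB = 1024**3

def footprint_gib(bits_per_weight: float) -> float:
    """Weights-only size; KV cache and runtime buffers come on top."""
    return PARAMS * bits_per_weight / 8 / GIB

print(f"Q4_K_M ~ {footprint_gib(4.8):.0f} GiB")   # close to the ~20 GB figure
print(f"Q5_K_M ~ {footprint_gib(5.7):.0f} GiB")   # close to the ~24 GB figure
```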
Consumer hardware. Sitting silently on a desk. Drawing 30 watts under load. Costing less per year in electricity than a single month of a cloud AI subscription.
02
Apple MLX and the Architecture of Unified Memory
The enabler here is not only the model design. It is the hardware it runs on.
Apple's MLX framework is a machine learning computation library built specifically for Apple Silicon's unified memory architecture. Where conventional GPU setups maintain a strict and costly boundary between system RAM and graphics VRAM, requiring explicit memory transfers across that boundary for each computational step, Apple Silicon holds everything in a single pool of high-bandwidth memory shared between CPU, GPU, and Neural Engine simultaneously.
For large language model inference, this architectural difference is not incremental. It removes a fundamental bottleneck that has defined local AI performance for years.
MLX exploits this directly, achieving inference speeds 20 to 30 percent faster than llama.cpp on equivalent Apple Silicon hardware, with the performance gap widening on larger models where memory transfer overhead becomes the dominant cost. On an M4 MacBook Air with 24GB of unified memory, the Qwen3.5-35B-A3B model at Q5_K_M quantization runs at approximately 15 tokens per second: usable, sustained, interactive inference on a laptop that weighs 1.24 kilograms.
No cloud dependency. No API bill. No data leaving the device. No terms of service governing what you can discuss with it.
For users who run higher-specification Apple Silicon, the picture is more striking still. Community-reported benchmarks on M3 Ultra hardware show the same model at 8-bit quantization generating over 80 tokens per second, faster than most people can read. But the more important threshold is the lower one: 24GB of unified memory is a standard configuration on a current MacBook Pro or Mac Mini, accessible to anyone buying one off the shelf.
03
The Economics of Intelligence That Costs Nothing to Query
The cost structure of AI has, until recently, followed a familiar pattern: massive capital expenditure on training, then ongoing operational expenditure on inference infrastructure, monetised through per-token pricing passed to end users. This arrangement was economically logical, but it created a specific, important kind of dependency. You used the intelligence. Someone else kept the lights on.
When you run Qwen3.5-35B-A3B locally, the marginal cost of a query is effectively zero. You have already paid for the hardware. The model weights were free to download under an open license. Each conversation, each generated document, each reasoning chain costs nothing measurable beyond a fraction of a cent in electricity.
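That "fraction of a cent" can be made concrete. The power draw and throughput figures are the ones quoted earlier in this piece; the electricity tariff is an assumed $0.15/kWh and will vary by region.

```python
# Electricity cost of local inference, using figures quoted in the article
# (30W sustained draw, ~15 tok/s on an M4 MacBook Air) and an assumed tariff.

WATTS = 30            # sustained draw under load (from the article)
TOK_PER_S = 15        # Q5_K_M throughput (from the article)
USD_PER_KWH = 0.15    # assumed residential electricity tariff

kwh_per_million_tokens = WATTS / 1000 * (1e6 / TOK_PER_S) / 3600
cost = kwh_per_million_tokens * USD_PER_KWH
print(f"~${cost:.2f} per million tokens of electricity")
```

Under these assumptions a million generated tokens, roughly several novels' worth of text, costs on the order of a dime.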
This is not a feature of any particular model or vendor. It is a structural consequence of open weights meeting efficient architecture meeting consumer hardware that can run it. The economics are not going to reverse.
The implications for individuals are significant. The implications for businesses and researchers working in contexts where data privacy is non-negotiable are larger still. The implications for organisations that have built commercial services around access control to AI intelligence are the most significant of all.
A journalist conducting source analysis no longer sends their documents to a foreign server. A researcher in a resource-constrained institution runs frontier-grade reasoning locally for free. A developer ships an application with embedded intelligence and zero ongoing inference cost. These are not future possibilities. They are present realities.
04
The Uncensored Layer
Open-weight models introduce a question that closed API models sidestep entirely: what happens when someone modifies the model itself?
The answer arrived in the form of a technique called abliteration, a post-training intervention that identifies and weakens specific directional vectors in a model's residual stream corresponding to refusal behaviour. The process does not retrain the model from scratch. It does not alter its capabilities or knowledge. It removes, with targeted precision, the trained disposition to decline.
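The core operation is directional ablation: given an estimated "refusal direction" in activation space, the component of the hidden state along that direction is projected out. The toy sketch below shows only that projection step with made-up vectors; real implementations estimate the direction from activation differences on contrasting prompt sets and apply it inside the model, or fold it into the weights.

```python
# Toy illustration of directional ablation, the projection at the heart of
# abliteration. The vectors here are invented for illustration; in practice
# the refusal direction is estimated from the model's own activations.

def ablate(h: list[float], r: list[float]) -> list[float]:
    """Return h with its component along unit(r) projected out."""
    norm = sum(x * x for x in r) ** 0.5
    r_hat = [x / norm for x in r]
    dot = sum(a * b for a, b in zip(h, r_hat))
    return [a - dot * b for a, b in zip(h, r_hat)]

h = [1.0, 2.0, 3.0]          # hypothetical hidden state
r = [0.0, 1.0, 0.0]          # hypothetical refusal direction
h_prime = ablate(h, r)
print(h_prime)               # the component along r is now zero
```

Everything orthogonal to the refusal direction passes through untouched, which is why capability benchmarks barely move.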
The result is a class of models that have become an active and growing part of the open-weight AI ecosystem. The Qwen3.5-35B-A3B variant processed through aggressive abliteration, publicly available on Hugging Face, reports zero refusals across 465 tested prompts. Not because the model's intelligence has been reduced. Because the learned hesitation has been extracted.
These models are released on open hosting platforms. They are downloaded by thousands of users. They are quantized, re-hosted, mirrored, and redistributed. The question of whether they can be taken down has a precise and uncomfortable answer: at the level of any individual repository, yes. At the level of the information they contain, no.
05
There Is No Switch
This is the structural reality that distinguishes the local AI era from everything that preceded it.
Content moderation, safety filtering, and usage policies in cloud AI systems all operate through access control. They sit at the boundary between the user and the model. Remove the intermediary, and those controls have no surface to operate on. The model runs directly on your hardware, under your operating system, subject only to your choices.
A model weight file that has been distributed is distributed. There is no server to switch off that can retrieve what has already been seeded into the network.
The physics of information distribution apply to model weights as they do to any other digital content. Once a file has propagated across enough nodes, its persistence is no longer contingent on any single host's continued cooperation. Uncensored model releases are not a temporary gap in the enforcement landscape, addressable through better takedown procedures or platform policy updates. They are a permanent consequence of the open-weight distribution ecosystem combined with consumer inference hardware.
It is worth observing that this dynamic maps precisely onto one of the structural layers described in the Evolving Software framework: specifically Layer V, Influence Without Deletion. In that architectural layer, the key insight is that adaptive systems do not require extermination. Variation persists even when unsuccessful, and what changes across the ecosystem is influence rather than existence. Uncensored model variants occupy precisely this structural position. They do not displace the safety-aligned originals. They persist alongside them, their reach shifting based on who chooses to run them, not on whether any central authority has authorised their survival.
This is not a statement about whether such models should be built, released, or used. It is a structural observation about the ecosystem they now inhabit.
06
The Questions This Opens
None of the above resolves cleanly into a single narrative.
A world in which frontier-class intelligence runs locally, free of charge, permanently distributed, with no recourse to central control, places extraordinary demands on individual judgement, community norms, and structural incentives that have barely been designed yet. It is not uniformly good or uniformly bad. It is genuinely both.
The optimistic reading is compelling: intelligence becomes, for the first time in history, something approaching a genuine public good. Researchers in resource-constrained environments access capabilities previously available only to well-funded institutions. Journalists working in contexts of surveillance or authoritarian pressure analyse sensitive material without sending it to foreign servers. Developers build without API costs, terms-of-service constraints, or the risk of a provider changing pricing or policy mid-product. Small businesses own their AI infrastructure entirely.
The technical safeguards baked into safety-aligned model weights remain meaningful. They are simply no longer sufficient as the primary line of defence, because the model itself can now be relocated to a device where no oversight structure can see it.
The honest reading is harder. The same architecture that enables everything above also enables things we would collectively prefer did not exist. Open-weight AI governance is not primarily a technical problem. It is a social, legal, and philosophical one. And the field has not yet developed adequate frameworks for it.
Training-time alignment, the safety properties embedded in the original model weights before any abliteration, still carry real weight. A model trained with thoughtful reinforcement learning from human feedback does not simply become malign when its refusal vectors are weakened. Its values and knowledge are a function of its entire training distribution, not a single layer. But the combination of weakened refusals with frontier-class capability and permanent local availability is a combination the governance conversation has not yet caught up with.
The question of who bears responsibility when intelligence is free and local, running without any intermediary, is one the field has not yet answered. It is, increasingly, urgent.
07
A Threshold Crossed
There have been several moments in the history of computing that were recognised clearly only in retrospect as thresholds. The commoditisation of storage. The arrival of broadband. The smartphone's erasure of the boundary between identity and location. Each of these was visible in real time to a few people who understood its structural implications, and invisible to almost everyone else, until it was not.
The local AI inference threshold has this quality. The capability is already here. The hardware already exists at consumer price points. The models are already distributed across millions of devices. The techniques for modifying their constraints are already documented, practiced, and published. The ecosystem for running them is already mature.
What has not yet caught up is the broader cultural and institutional understanding of what this means, for how AI systems should be designed, for how governance frameworks should be constructed, for how individuals and institutions make decisions when intelligence is no longer a service with a provider but an infrastructure with no single point of control.
The server has always been the regulator, by default. Not because anyone designed it that way, but because intelligence lived there and required it to function. Now it does not.
This changes the shape of many conversations we have been having about AI. Not because it invalidates them, but because several of their core assumptions, about where intelligence lives, who controls it, and what happens when someone does not want it controlled, are no longer reliably true.
The useful response to that is not alarm and not indifference. It is clarity. The threshold has been crossed. The question worth asking now is not how to uncross it, but what kind of world we want to build on the other side.