Which language models can you self-host?

Open-source models such as Llama (Meta), Mistral, DeepSeek, gpt-oss, and their derivatives can be operated in your own infrastructure. Proprietary models like Claude (Anthropic) and ChatGPT (OpenAI) are only available via API.

Where can you run self-hosted LLMs?

On Azure (via Azure ML or dedicated VMs), on GCP (via Vertex AI or dedicated VMs), on your own servers (on-premise in certified data centers), or in a hybrid setup.

What does LLM self-hosting cost?

Costs depend on the model, the hardware, and the usage volume. GPU servers (NVIDIA A100/H100) are the largest cost factor. At high usage volumes, self-hosting is often more cost-effective than API-based usage.

LLM Self-Hosting for Enterprise - Azure, GCP, On-Premise

Why Self-Hosting?

For many enterprise clients, the question is not whether AI will be deployed, but where the data is processed. When using cloud APIs (OpenAI, Anthropic, Google), data leaves the organization’s own infrastructure. For regulated industries - finance, healthcare, public sector - data residency can be a disqualifying factor.

At a Glance - LLM Self-Hosting for Enterprise

Self-hosting keeps all data within the corporate network - no third-party processing, full control over model, data, and inference.
Open-source models (Llama, Mistral, DeepSeek, gpt-oss) can be deployed on Azure ML, GCP Vertex AI, on-premise GPU servers, or hybrid setups.
GPU sizing is the primary cost driver: a 7B model runs on one GPU, a 70B model requires multiple GPUs or quantization.
Model-agnostic routing allows agents to use self-hosted models for sensitive data and cloud APIs for non-critical tasks.
Gartner (2024) finds that 45% of enterprise AI deployments in regulated industries will run on private infrastructure by 2027, up from 20% in 2023.

Self-hosting means: the language model runs in the client’s infrastructure. No data leaves the corporate network. No third party processes the requests. Full control over model, data, and processing.

Which Models Can You Self-Host?

Open-source models can be operated in your own infrastructure:

Volume workhorse - Mistral Small 3.2 (24B, Apache 2.0, EU-built): European model, runs on a single RTX 4090 with 4-bit quantization. Ideal for batch inference on non-critical workloads. Mixtral 8x22B and Codestral Mamba 32B (coding-specialised) complete the Mistral portfolio.

Reasoning OSS - gpt-oss-120b (OpenAI, Apache 2.0): 117B parameters, MoE architecture, runs on a single H100 (80 GB). OpenAI’s first open-source model; gpt-oss-20b for edge scenarios.

Frontier OSS - DeepSeek V4-Flash and V4-Pro (MIT): DeepSeek V4-Flash (April 2026, 284B/13B active MoE) runs on a single H100 with quantization. V4-Pro (1.6T/49B) requires an 8x H100 cluster and delivers frontier-grade reasoning. DeepSeek R1 (Jan 2025) remains production-ready for mature deployments - V4 does not retire R1 overnight.

Long context - Llama 4 Scout (Meta License): 10M-token context window for document analysis across entire case files. Llama 4 Maverick handles shorter contexts with higher token throughput.

Coding OSS - Qwen 3 Coder 110B (Apache 2.0, Alibaba) and DeepSeek Coder V4 (MIT): Specialised for code generation and repository understanding. Codestral Mamba 32B (Mistral, EU-built) as the European alternative.

Proprietary models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) are not available for self-hosting but can be used via API with EU-based processing.

In a model-agnostic architecture, an agent can use multiple models: self-hosted for sensitive data, cloud API for non-critical tasks. The routing is rule-based and configured in the Decision Layer.

Deployment Options

Azure: LLMs can be deployed on Azure ML or operated on dedicated GPU VMs (NC-Series, ND-Series). Integration with Azure Entra ID for authentication and access control. Processing in EU data centers (West Europe, North Europe).

GCP: Deployment via Vertex AI or on dedicated GPU VMs (A2, G2). Integration with Google Cloud IAM. Processing in EU data centers (europe-west1, europe-west4).

On-Premise: Dedicated servers with NVIDIA GPUs (A100, H100, RTX 4000 Ada). Operation in certified data centers. Maximum control, no cloud dependency.

Hybrid: Combination of self-hosted and cloud. Sensitive workloads run locally, non-critical workloads in the cloud. Unified governance across both environments.

Criterion	Self-Hosted	Cloud API
Data Residency	Full control, data stays on-premise	Provider-dependent, EU regions available
Model Choice	Open-source only (Llama, Mistral, DeepSeek)	Proprietary + open-source via API
Cost at Scale	Lower (fixed GPU cost, no per-token fees)	Higher (per-token pricing scales linearly)
Operational Effort	High (GPU management, updates, HA)	Low (managed by provider)
Latency	Low (local network)	Variable (network-dependent)

Free eBook: AI Infrastructure

Build, Buy, Hybrid - EU AI Act (UK: UK AI regulatory framework)-compliant infrastructure with B/B/H Framework and 7-Layer Reference Architecture.

Download for free

Architecture Considerations

GPU Sizing: Model size determines GPU requirements. A 7B model runs on a single GPU. A 70B model requires multiple GPUs or quantization. The right sizing depends on the use case.

Inference Optimization: Techniques such as quantization (4-bit, 8-bit), batching, and KV cache optimization reduce resource requirements with acceptable quality trade-offs.

High Availability: For production systems: redundant GPU servers, load balancing, automatic failover. No single point of failure.

Model Updates: New model versions must be tested before going into production. A staging environment for model testing is part of the infrastructure.

TCO crossover - self-host vs Cloud API: The threshold sits at roughly 50-100M tokens/month sustained. Below that line, Cloud APIs win on cost; above it, a dedicated H100 amortises in 12-18 months. See Self-hosted Open-Source AI 2026 for the full model matrix and cost calculation.

AI Governance Briefing

Enterprise AI, regulation, and infrastructure - once a month, directly from me.