LLM Self-Hosting for Enterprise - Azure, GCP, On-Premise
Self-host language models: DeepSeek, Llama, Mistral in your own infrastructure. Deployment options: Azure, GCP, on-premise, hybrid
Why Self-Hosting?
For many enterprise clients, the question is not whether AI will be deployed, but where the data is processed. When using cloud APIs (OpenAI, Anthropic, Google), data leaves the organization’s own infrastructure. For regulated industries - finance, healthcare, public sector - data residency can be a disqualifying factor.
At a Glance - LLM Self-Hosting for Enterprise
- Self-hosting keeps all data within the corporate network - no third-party processing, full control over model, data, and inference.
- Open-source models (Llama, Mistral, DeepSeek, gpt-oss) can be deployed on Azure ML, GCP Vertex AI, on-premise GPU servers, or hybrid setups.
- GPU sizing is the primary cost driver: a 7B model runs on one GPU, a 70B model requires multiple GPUs or quantization.
- Model-agnostic routing allows agents to use self-hosted models for sensitive data and cloud APIs for non-critical tasks.
- Gartner (2024) finds that 45% of enterprise AI deployments in regulated industries will run on private infrastructure by 2027, up from 20% in 2023.
Self-hosting means: the language model runs in the client’s infrastructure. No data leaves the corporate network. No third party processes the requests. Full control over model, data, and processing.
Which Models Can You Self-Host?
Open-source models can be operated in your own infrastructure:
Volume workhorse - Mistral Small 3.2 (24B, Apache 2.0, EU-built): European model, runs on a single RTX 4090 with 4-bit quantization. Ideal for batch inference on non-critical workloads. Mixtral 8x22B and Codestral Mamba 32B (coding-specialised) complete the Mistral portfolio.
Reasoning OSS - gpt-oss-120b (OpenAI, Apache 2.0): 117B parameters, MoE architecture, runs on a single H100 (80 GB). OpenAI’s first open-source model; gpt-oss-20b for edge scenarios.
Frontier OSS - DeepSeek V4-Flash and V4-Pro (MIT): DeepSeek V4-Flash (April 2026, 284B/13B active MoE) runs on a single H100 with quantization. V4-Pro (1.6T/49B) requires an 8x H100 cluster and delivers frontier-grade reasoning. DeepSeek R1 (Jan 2025) remains production-ready for mature deployments - V4 does not retire R1 overnight.
Long context - Llama 4 Scout (Meta License): 10M-token context window for document analysis across entire case files. Llama 4 Maverick handles shorter contexts with higher token throughput.
Coding OSS - Qwen 3 Coder 110B (Apache 2.0, Alibaba) and DeepSeek Coder V4 (MIT): Specialised for code generation and repository understanding. Codestral Mamba 32B (Mistral, EU-built) as the European alternative.
Proprietary models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) are not available for self-hosting but can be used via API with EU-based processing.
In a model-agnostic architecture, an agent can use multiple models: self-hosted for sensitive data, cloud API for non-critical tasks. The routing is rule-based and configured in the Decision Layer.
Deployment Options
Azure: LLMs can be deployed on Azure ML or operated on dedicated GPU VMs (NC-Series, ND-Series). Integration with Azure Entra ID for authentication and access control. Processing in EU data centers (West Europe, North Europe).
GCP: Deployment via Vertex AI or on dedicated GPU VMs (A2, G2). Integration with Google Cloud IAM. Processing in EU data centers (europe-west1, europe-west4).
On-Premise: Dedicated servers with NVIDIA GPUs (A100, H100, RTX 4000 Ada). Operation in certified data centers. Maximum control, no cloud dependency.
Hybrid: Combination of self-hosted and cloud. Sensitive workloads run locally, non-critical workloads in the cloud. Unified governance across both environments.
| Criterion | Self-Hosted | Cloud API |
|---|---|---|
| Data Residency | Full control, data stays on-premise | Provider-dependent, EU regions available |
| Model Choice | Open-source only (Llama, Mistral, DeepSeek) | Proprietary + open-source via API |
| Cost at Scale | Lower (fixed GPU cost, no per-token fees) | Higher (per-token pricing scales linearly) |
| Operational Effort | High (GPU management, updates, HA) | Low (managed by provider) |
| Latency | Low (local network) | Variable (network-dependent) |
Free eBook: AI Infrastructure
Build, Buy, Hybrid - EU AI Act (UK: UK AI regulatory framework)-compliant infrastructure with B/B/H Framework and 7-Layer Reference Architecture.
Download for freeArchitecture Considerations
GPU Sizing: Model size determines GPU requirements. A 7B model runs on a single GPU. A 70B model requires multiple GPUs or quantization. The right sizing depends on the use case.
Inference Optimization: Techniques such as quantization (4-bit, 8-bit), batching, and KV cache optimization reduce resource requirements with acceptable quality trade-offs.
High Availability: For production systems: redundant GPU servers, load balancing, automatic failover. No single point of failure.
Model Updates: New model versions must be tested before going into production. A staging environment for model testing is part of the infrastructure.
TCO crossover - self-host vs Cloud API: The threshold sits at roughly 50-100M tokens/month sustained. Below that line, Cloud APIs win on cost; above it, a dedicated H100 amortises in 12-18 months. See Self-hosted Open-Source AI 2026 for the full model matrix and cost calculation.
More on this: AI Infrastructure
Book a consultation - We will show you the optimal hosting strategy for your requirements.

Bert Gogolin
CEO & Founder, Gosign
AI Governance Briefing
Enterprise AI, regulation, and infrastructure - once a month, directly from me.