Self-hosted Open-Source AI 2026: Mistral, gpt-oss, DeepSeek V4, Llama 4 in the Enterprise Stack
Self-hosting open-source LLMs in 2026: Mistral Small 3.2, gpt-oss-120b, DeepSeek V4-Pro/Flash, Llama 4. Hardware floor, TCO, EU GPU hosting providers, decision matrix per workload.
The model market gave EU-Enterprise procurement a choice it never had before. Open-weight models match proprietary models on most enterprise workloads. Three frontier-class open-source models shipped under Apache 2.0 in 2025 alone. EU GPU hosting providers offer H100 capacity at predictable hourly rates from Paris and Frankfurt data centers. The Schrems II ruling combined with the US CLOUD Act made self-hosting the only architecture with zero foreign-provider exposure.
And yet the conversation in procurement still treats “self-hosted open-source AI” as one product. It is not. It is a stack decision with five credible models across four model families, three deployment patterns, and a real total-cost-of-ownership problem. This article is the detailed companion to “When Mistral, When Claude Opus? Decision Routing for Agentic EU-Enterprise 2026”: if you have decided to self-host, here is how the model selection actually plays out.
At a Glance - Self-hosted Open-Source AI for EU-Enterprise 2026
- Five credible self-hostable models in 2026: Mistral Small 3.2 (Apache 2.0, 24B, single consumer GPU), gpt-oss-120b (Apache 2.0, MoE, single H100), DeepSeek V4-Flash (MIT, 284B/13B active MoE, April 2026 preview), DeepSeek V4-Pro (MIT, 1.6T/49B active, preview, cluster-grade), Llama 4 Scout (Meta License, 10M context).
- Mistral Small 3.2 wins the workhorse slot because it runs on consumer hardware (a single RTX 4090 with 4-bit quantization), ships with multilingual training, and has native vision capability for document workloads.
- DeepSeek V4-Pro (preview, April 24, 2026) approaches frontier closed-source performance under the MIT license but needs a multi-GPU cluster - realistic self-hosting starts with V4-Flash for most enterprises.
- EU GPU hosting is no longer the bottleneck: Scaleway offers H100 SXM at ~3.50 EUR/h, OVHcloud has a sovereign tier, Hetzner provides dedicated RTX servers, IONOS and T-Systems serve regulated industries.
- TCO crossover from cloud API to self-hosted: typically around 50-100 million tokens per month sustained. Below that, an EU cloud API beats self-hosting. Above that, dedicated hardware amortizes within 12-18 months.
- The architecture is multi-model, not single-model: Mistral Small for volume, gpt-oss-120b or DeepSeek V4-Flash for on-prem heavy reasoning, V4-Pro or R1 for specialist math/logic work, Llama 4 Scout for ultra-long context. Routing decides which model handles which decision.
You decided to self-host - the model question begins here
The choice to self-host an LLM stack is rarely a model decision. It is a compliance decision: data classified above a certain threshold cannot leave the company network. It is an architecture decision: the inference layer must be a controlled dependency, not an external API. It is a procurement decision: capital expenditure on hardware vs. operating expenditure on hosted GPU instances.
Once that decision is made, the model question opens. Which open-source model on which hardware floor for which workload mix? Five models have credible Q2 2026 production-readiness: Mistral Small 3.2, gpt-oss-120b, DeepSeek V4-Flash, DeepSeek V4-Pro (preview), and Llama 4 Scout. DeepSeek R1 from January 2025 is still production-ready but largely superseded by the V4 line for new deployments. Each model has a different cost-quality curve and a different operational profile.
This article skips the leaderboard discussion. Benchmark scores converge enough that workload fit matters more than nominal points on MMLU or HumanEval. The question is which model survives 18 months in your stack, which one earns its hardware, and which combination produces the audit trail that the EU AI Act requires.
The credible self-hosted models, side by side
| Model | Parameters | License | Hardware (CAPEX one-time / OPEX hosted) | Key strength | Key weakness |
|---|---|---|---|---|---|
| Mistral Small 3.2 | 24B dense, GQA (32Q/8KV) | Apache 2.0 | ~55 GB VRAM. CAPEX: 1× RTX 4090 ~1,500 EUR (pilot, 4-bit quant) or 1× H100 80GB ~30,000 EUR (production). OPEX: ~1,500-2,500 EUR/month on Scaleway/OVHcloud | Multilingual, vision, fast (~150 tok/s consumer GPU), volume-friendly | Not top-tier reasoning |
| gpt-oss-120b | 117B total / 5.1B active (MoE) | Apache 2.0 | 1× H100/A100 80GB. CAPEX: ~30,000 EUR. OPEX: ~1,200-2,500 EUR/month hosted | Reasoning at o4-mini level, MoE-efficient inference | No vision, datacenter-grade hardware only |
| DeepSeek V4-Flash (preview, Apr 2026) | 284B total / 13B active (MoE), 1M context | MIT | 1-2× H100/A100 80GB with quant, 4× H100 full precision. CAPEX: ~30,000-120,000 EUR. OPEX: ~1,500-5,000 EUR/month hosted | Frontier-class reasoning at moderate hardware cost, native multimodal, agent-optimized | Preview status - benchmarks should be re-verified before production |
| DeepSeek V4-Pro (preview, Apr 2026) | 1.6T total / 49B active (MoE), 1M context | MIT (open-source on Hugging Face) | 8× H100 minimum cluster. CAPEX: ~240,000 EUR. OPEX: ~10,000-12,000 EUR/month hosted. For DAX-listed corporations and the upper Mittelstand: feasible. For SMEs under 500 employees: an API or hosted variant (Together.ai, Fireworks, DeepSeek API) is more realistic | Approaches GPT-5.5 and Gemini 3.1 Pro performance under open license, agent-tool optimized (Claude Code, OpenClaw) | Preview status; for SMEs, the hardware floor pushes to the API path |
| DeepSeek R1 (Jan 2025, mature) | 671B total / 37B active (MoE) | MIT | 4-8× H100 minimum. CAPEX: ~120,000-240,000 EUR. OPEX: ~5,000-10,000 EUR/month hosted | Mature math/logic specialist, broad framework support | Largely superseded by V4-Flash for new deployments |
| Llama 4 Scout | 109B total / 17B active (MoE) | Meta Llama Community License | 1× H100 80GB (with Int4 quantization). CAPEX: ~30,000 EUR. OPEX: ~1,500 EUR/month hosted | 10 million token context window | License restriction at >700M MAU; license review needed |
Three clarifications matter here.
Mistral Small 3.2 hardware floor. The official Mistral guidance lists ~55 GB GPU RAM for bf16/fp16 inference, which puts it on an H100 or A100 80GB in production. With 4-bit quantization (GPTQ, AWQ), it runs on a single 24 GB RTX 4090 at slight quality cost. For pilot deployments or single-tenant inference, the RTX 4090 path is real. For multi-tenant production with concurrent requests, the H100 path is the correct sizing.
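The two sizing paths follow from simple weight-memory arithmetic. The sketch below is a rough estimate only: it assumes the weights dominate memory use and treats whatever is left on the card as headroom for KV cache, activations, and runtime buffers. The function name and the 1 GB = 10^9 bytes convention are illustrative choices, not Mistral's published sizing method.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


# Mistral Small 3.2: 24B dense parameters (from the comparison table).
bf16_weights = weight_memory_gb(24, 16)  # 16-bit weights
int4_weights = weight_memory_gb(24, 4)   # 4-bit quantized weights (GPTQ/AWQ class)

# Memory left for KV cache, activations and runtime buffers on each card.
headroom_h100 = 80 - bf16_weights
headroom_4090 = 24 - int4_weights

print(f"bf16 weights:  {bf16_weights:.0f} GB, headroom on an 80 GB H100: {headroom_h100:.0f} GB")
print(f"4-bit weights: {int4_weights:.0f} GB, headroom on a 24 GB RTX 4090: {headroom_4090:.0f} GB")
```

Roughly 48 GB of bf16 weights plus cache and buffers is consistent with the ~55 GB guidance. On the RTX 4090 path, the smaller headroom is what constrains concurrent requests and context length, which is why that path suits pilots rather than multi-tenant production.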
DeepSeek V4 preview status. DeepSeek-V4-Pro and V4-Flash launched as preview on April 24, 2026 under MIT license, both with a 1M token context window via the new Hybrid Attention architecture (Compressed Sparse Attention + Heavily Compressed Attention). In the 1M-token context setting, V4-Pro reportedly requires only 27% of single-token inference FLOPs and 10% of the KV cache compared with V3.2 - significant efficiency gains for long-context workloads. Both variants are optimized for agent tooling (Claude Code, OpenClaw integration). However: preview means benchmark claims are not independently verified at scale yet. For production decisions in regulated industries, wait for the general-availability release or run your own representative benchmarks before committing.
License review for Llama 4 Scout. The Meta Llama Community License permits commercial use but contains two restrictions enterprise procurement should review: a 700-million-MAU threshold above which a separate Meta license is required, and a restriction on using model outputs to train competing models. For most enterprises both are irrelevant in practice, but the procurement note should be made explicit.
TCO: when does self-hosted beat cloud API?
The economics flip at a token volume threshold. Below it, hosted APIs win because hardware idle time dominates. Above it, dedicated GPUs win because incremental token cost approaches the cost of electricity plus depreciation.
A representative calculation for Mistral Small 3.2 in EU hosting:
| Cost element | Value (EU hosting) |
|---|---|
| H100 80GB instance, EU provider (Scaleway-class) | ~2,500 EUR/month dedicated, or ~3.50 EUR/h on-demand |
| Mistral Small 3.2 throughput (single H100) | ~150 tokens/sec sustained, ~390M tokens/month at 100% utilization |
| Effective cost per 1M tokens at 60% utilization | ~10-12 EUR per 1M tokens |
| Mistral La Plateforme API equivalent (Mistral Small via API) | ~0.40 USD per 1M input tokens; volume-dependent |
| Claude Sonnet 4.6 API equivalent | ~3 USD per 1M input tokens; ~15 USD output |
| Claude Opus 4.7 API equivalent | ~5 USD per 1M input tokens; ~25 USD output |
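The self-hosted rows of the table can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a 30-day month and the table's own figures (150 tokens/sec, ~2,500 EUR/month, 60% utilization); the function names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24                  # assumption: 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def tokens_per_month(tokens_per_sec: float, utilization: float) -> float:
    """Tokens generated in a month at a given sustained utilization (0..1)."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


def cost_per_million_tokens(monthly_cost_eur: float,
                            tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective EUR per 1M tokens for a fixed-price GPU instance."""
    millions = tokens_per_month(tokens_per_sec, utilization) / 1_000_000
    return monthly_cost_eur / millions


def on_demand_monthly_eur(hourly_rate_eur: float) -> float:
    """Monthly cost of running an on-demand instance around the clock."""
    return hourly_rate_eur * HOURS_PER_MONTH


print(f"{tokens_per_month(150, 1.0) / 1e6:.0f}M tokens/month at 100% utilization")
print(f"{cost_per_million_tokens(2500, 150, 0.6):.2f} EUR per 1M tokens at 60% utilization")
print(f"{on_demand_monthly_eur(3.50):.0f} EUR/month at 3.50 EUR/h on-demand, 24/7")
```

The throughput input is the sensitive one: the per-token figure scales inversely with sustained tokens per second, so measuring throughput on your own request mix matters more than the instance price.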
The crossover for Mistral Small lands between 50 and 100 million tokens per month sustained, depending on whether the workload is input-heavy or output-heavy. A 24/7 enterprise pipeline running 5 to 10 worker nodes typically crosses that threshold within the first quarter.
For gpt-oss-120b the math is similar, but the per-token cost starts higher: a single H100 supports lower throughput than Mistral Small at the same hardware cost, so the per-token cost is roughly 2× that of Mistral Small. The crossover vs. Claude Opus 4.7 sits around 30-50 million tokens per month, which is exactly the range where heavy-reasoning workloads land in enterprise AI systems.
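The crossover logic itself is a one-line break-even: a fixed monthly GPU cost divided by the per-token price of the API it replaces. The sketch below shows the mechanics only. The blended prices in the example calls are hypothetical inputs chosen for illustration, not quoted rates, and `input_share` is an assumed parameter expressing how input-heavy the workload is.

```python
def blended_price(input_price: float, output_price: float, input_share: float) -> float:
    """Average price per 1M tokens for a workload with the given input share (0..1)."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


def break_even_million_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly volume (in millions of tokens) at which API spend equals the fixed GPU cost."""
    if api_price_per_million <= 0:
        raise ValueError("API price must be positive")
    return fixed_monthly_cost / api_price_per_million


# Hypothetical inputs: a 2,500 EUR/month instance against two blended API prices.
print(break_even_million_tokens(2500, 25))  # 100.0
print(break_even_million_tokens(2500, 50))  # 50.0
```

The break-even volume moves inversely with the API price being displaced, so the threshold should be recomputed per workload using the actual input/output split and the current price list of the API under comparison.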
A clarification on DeepSeek V4-Pro: the weights are open-source under the MIT license and available on Hugging Face, so the model is fully self-hostable. The question is enterprise size, not legality. V4-Pro’s 1.6T/49B-active architecture requires an 8× H100 cluster (~240,000 EUR CAPEX or ~10,000-12,000 EUR/month hosted). For DAX-listed corporations and the upper Mittelstand (typically 2,500+ employees with established AI infrastructure budgets), those numbers fit a standard IT capex line item. For SMEs under 500 employees, the same numbers push the realistic path to API access (DeepSeek API directly) or a hosted variant (Together.ai, Fireworks) at per-token economics.
V4-Flash (284B/13B active) sits in between: a 1-4× H100 footprint (30,000-120,000 EUR CAPEX), realistic for the upper Mittelstand from day one. Self-hosted TCO for V4-Flash is justified when frontier-class reasoning is a sustained workload at sovereignty-critical data classifications; for occasional reasoning, the V4-Flash API or Mistral La Plateforme is cheaper.
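The CAPEX and hosted OPEX figures quoted for each model can be compared with a simple payback calculation. This is a deliberately simplified sketch: by default it assumes owned hardware has no running cost, which ignores power, cooling, colocation, and staffing. The `own_opex_monthly_eur` parameter is an illustrative hook for adding those back in.

```python
def payback_months(capex_eur: float,
                   hosted_monthly_eur: float,
                   own_opex_monthly_eur: float = 0.0) -> float:
    """Months until buying hardware is cheaper than renting the same capacity."""
    monthly_saving = hosted_monthly_eur - own_opex_monthly_eur
    if monthly_saving <= 0:
        return float("inf")  # owning never pays back at these rates
    return capex_eur / monthly_saving


# Figures from the article: 8x H100 cluster for V4-Pro, single H100 for Mistral Small.
print(payback_months(240_000, 12_000))  # 20.0
print(payback_months(240_000, 10_000))  # 24.0
print(payback_months(30_000, 2_500))    # 12.0
```

Any realistic running cost for owned hardware lengthens these payback periods, so the results are best read as a lower bound.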
These numbers are based on public EU hosting pricing from Scaleway and OVHcloud and on public model throughput data. They are illustrative, not contractual.
EU GPU hosting in 2026: who actually has H100 capacity?
The EU GPU hosting market matured significantly in 2025-2026. Three providers cover most enterprise self-hosting use cases:
Scaleway (France, GDPR-native). The most aggressive on price-performance for AI workloads. H100 SXM at ~3.50 EUR/h, A100 at ~2.50 EUR/h, plus the newer NVIDIA Blackwell B300-SXM (288 GB VRAM) for frontier workloads. French data centers, full GDPR compliance, no CLOUD Act exposure. Reserved-instance contracts available for predictable workloads.
OVHcloud (France, sovereign tier). The largest European cloud provider, with a “Sovereign Cloud” tier explicitly built for government and regulated-industry use. Portfolio includes H100, RTX 5000, A10, plus an “AI Deploy” service for pay-as-you-go notebook and inference. Good fit when procurement requires a sovereign-cloud sign-off.
Hetzner (Germany). The cost-leader for dedicated GPU servers, not on-demand instances. Current GPU options include RTX 4000 SFF Ada and RTX 6000 Ada paired with modern CPUs. This is the path for Mistral Small 3.2 with quantization or for development environments; it is less suited to peak elastic scaling.
For regulated industries (financial services, healthcare, public sector) with strict sovereignty requirements:
IONOS (Germany). Sovereign-cloud-grade hosting with GPU instances. The compliance fit for German BaFin-regulated workloads.
T-Systems (Germany). Deutsche Telekom subsidiary. Sovereign cloud explicitly designed for public-sector and critical-infrastructure customers. The procurement-comfortable choice when board-level sovereignty is the requirement.
For an enterprise deciding on a self-hosted stack, the practical sequence is: pilot on Scaleway or Hetzner for cost-efficient validation, move to OVHcloud or T-Systems for production if regulatory sign-off requires sovereign-cloud certification, and retain reserved-instance contracts to keep costs predictable.
Deployment patterns: single worker, cluster, hybrid
Three deployment patterns cover almost all enterprise self-hosted scenarios.
Single-worker pattern. One model, one GPU instance, deployed behind a load balancer with health checks. Suitable for Mistral Small 3.2 on an RTX 4090 or H100 handling the high-volume share of decisions (around 70% in a typical mix), or for Llama 4 Scout on a single GPU for long-context document analysis. Operational complexity: low. Failure mode: single point of failure unless replicated.
Multi-model cluster pattern. Multiple models on multiple GPUs behind a routing layer. Suitable for: Mistral Small for volume + gpt-oss-120b or DeepSeek V4-Flash for heavy reasoning + (optional) DeepSeek V4-Pro on dedicated cluster for math-grade workloads, all behind a single routing layer. The routing layer decides per request which model handles it. Operational complexity: medium. Requires a model server (vLLM, TGI, llama.cpp-server) and a routing rules engine. This is the typical production pattern for agentic workloads with mixed decision complexity.
Hybrid edge-cloud pattern. Sensitive workloads (HR onboarding, contract review, customer data extraction) on self-hosted models; non-sensitive workloads (marketing copy generation, knowledge base Q&A on public information) on EU-cloud APIs like Mistral La Plateforme. The routing layer enforces the data classification before model selection. Operational complexity: high (two stacks to maintain) but the lowest sovereignty exposure and the best cost-per-decision ratio.
The pattern choice depends on the data classification taxonomy, not the model selection. If everything is classified as “internal” or higher, the multi-model cluster pattern dominates. If a meaningful percentage of work is on public-facing or non-sensitive data, the hybrid pattern is cheaper.
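The rule of thumb above can be written down as a classification gate that runs before any model is chosen. The sketch below is one possible shape under stated assumptions: the three-level taxonomy, the "internal or higher stays self-hosted" threshold, and the 20% cut-off for a "meaningful percentage" are all hypothetical values to be replaced by your own data classification policy.

```python
from enum import Enum, IntEnum


class DataClass(IntEnum):
    """Hypothetical three-level taxonomy; higher means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


class Target(Enum):
    SELF_HOSTED = "self-hosted"
    EU_CLOUD_API = "eu-cloud-api"


# Assumption: anything classified INTERNAL or higher must stay self-hosted.
SELF_HOST_THRESHOLD = DataClass.INTERNAL


def deployment_target(data_class: DataClass) -> Target:
    """Enforce the data classification before any model is selected."""
    if data_class >= SELF_HOST_THRESHOLD:
        return Target.SELF_HOSTED
    return Target.EU_CLOUD_API


def recommended_pattern(workload_classes: list, min_cloud_share: float = 0.2) -> str:
    """Pick a deployment pattern from the share of cloud-eligible workloads."""
    if not workload_classes:
        raise ValueError("at least one workload classification is required")
    cloud_eligible = sum(
        1 for c in workload_classes if deployment_target(c) is Target.EU_CLOUD_API
    )
    share = cloud_eligible / len(workload_classes)
    return "hybrid" if share >= min_cloud_share else "multi-model cluster"


print(recommended_pattern([DataClass.INTERNAL, DataClass.CONFIDENTIAL]))
print(recommended_pattern([DataClass.PUBLIC, DataClass.INTERNAL,
                           DataClass.CONFIDENTIAL, DataClass.PUBLIC]))
```

Keeping the gate separate from model selection means a reclassification of data changes where a request may run without touching the workload-to-model mapping.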
Decision matrix: which model for which workload
| Workload category | Recommended model | Why |
|---|---|---|
| Document classification, structured extraction, OCR field parsing | Mistral Small 3.2 (self-hosted) | Vision-capable, fast on consumer GPU, multilingual coverage |
| High-volume text generation (emails, notifications, templates) | Mistral Small 3.2 (self-hosted) | Throughput, template-friendly, lowest cost per token |
| Contract clause classification, vendor risk flags, anomaly detection | Mistral Small 3.2 or Mistral Medium 3.1 (La Plateforme) | Medium reasoning at moderate cost, EU-sovereign |
| Anti-discrimination analysis under AGG/Equality Act, cross-statute compliance reasoning | gpt-oss-120b (on-prem) or Claude Opus 4.7 (cloud) | Tier-1 reasoning, audit-grade output for HR/Legal use cases |
| Code generation, code review (cloud flagships) | Claude Opus 4.7 or GPT-5.5 | Both benchmark-leading; Claude Opus 4.7 ahead on long agentic loops (Claude Code), GPT-5.5 ahead in IDE integrations (Cursor, Copilot) |
| Code generation (self-hosted, sovereignty-critical) | Qwen 3 Coder 110B (Apache 2.0, Alibaba), DeepSeek Coder V4 (MIT), or Codestral Mamba 32B (Mistral, EU-built) | Tier-1 coding benchmarks on-prem; Qwen 3 Coder leads HumanEval/SWE-Bench among OSS, DeepSeek Coder V4 strongest on agentic multi-file tasks, Codestral Mamba lowest latency on consumer GPU |
| Microsoft 365 / Copilot deep integration | GPT-5.5 via Azure OpenAI | Native stack, lowest integration effort for organisations on Microsoft data plane |
| Agentic workflows with heavy function-calling / tool-use | GPT-5.5 or Claude Opus 4.7 | Both top-tier for structured outputs and tool orchestration; GPT-5.5 has broader ecosystem of pre-built tools |
| Financial risk modeling, stress testing, optimization | DeepSeek V4-Flash (current) or V4-Pro via API; R1 still production-ready | Top-tier math/logic; V4 line adds 1M context for cross-portfolio analysis |
| Document analysis of large corpora (entire contract portfolios, annual reports) | Llama 4 Scout | 10M token context window - unique in this band |
| Multimodal (image + text correlation, technical drawings, video segments) | Gemini 3.1 Pro (cloud, no self-hosted equivalent) | Native multimodal training, 1M context |
| Conversational AI / customer-facing chatbots | Mistral Small 3.2 (self-hosted) for volume; GPT-5.5 (Azure) when MS-stack-native | Production-grade quality at lowest hardware cost; GPT-5.5 if integrated into Dynamics/Copilot |
| SaaS feature gating (per-customer model tiers, per-region routing) | Hybrid pattern: Mistral Small + Claude Opus 4.7 / GPT-5.5 | Sensitive customer data on self-hosted, premium features on cloud flagship |
The matrix is not a prescription. It is a starting point that gets refined per organization. A finance-heavy enterprise weights DeepSeek V4 higher. A multimedia-heavy operation may need a cloud Gemini hop. A high-document-volume HR pipeline puts Mistral Small at 80% of decisions, not 70%.
The routing layer makes the matrix operational. Without it, every workload runs against whatever model is configured as the default, and the matrix becomes a slideware artifact.
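As a rough illustration of what "operational" means here, a subset of the matrix can be expressed as a plain mapping with per-organization overrides. The workload keys and model labels below are hypothetical identifiers, not the IDs a particular model server expects, and only a few self-hosted rows are shown.

```python
# Baseline routes: a subset of the decision matrix, expressed as configuration.
BASE_ROUTES = {
    "document_classification": "mistral-small-3.2",
    "high_volume_generation": "mistral-small-3.2",
    "compliance_reasoning": "gpt-oss-120b",
    "financial_risk_modeling": "deepseek-v4-flash",
    "large_corpus_analysis": "llama-4-scout",
}


def build_routes(overrides=None) -> dict:
    """Merge organization-specific overrides onto the baseline matrix."""
    routes = dict(BASE_ROUTES)
    routes.update(overrides or {})
    return routes


def route(workload: str, routes: dict) -> str:
    """Return the model for a workload; unmapped workloads fail loudly."""
    try:
        return routes[workload]
    except KeyError:
        raise KeyError(f"no routing rule for workload '{workload}'") from None


# Example: a finance-heavy organization weights DeepSeek V4 higher.
finance_routes = build_routes({"compliance_reasoning": "deepseek-v4-flash"})
print(route("compliance_reasoning", finance_routes))
print(route("document_classification", finance_routes))
```

Raising on an unmapped workload, rather than silently falling back to a default model, keeps gaps in the matrix visible instead of hiding them behind whatever model happens to be configured.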
Building the routing layer: where Decision Layer fits
Self-hosted multi-model architectures break down without a routing layer for a simple reason: no human operator wants to remember more than a dozen decision-to-model mappings while also writing the agent’s business logic. The routing has to be configuration, not code.
A Decision Layer holds:
- The data classification taxonomy (which data types require self-hosted? Which can route to EU-cloud-API? Which can route to US-cloud-API?)
- The decision-to-model routing rules per workflow step
- The fallback chain (if Mistral Small fails or saturates, route to which alternative?)
- The audit log: every decision recorded with input snapshot, rule version, model used, confidence score, reasoning chain, outcome, and human approver where applicable
- The Challenge button: any affected subject can contest an automated decision, triggering re-decision under human review - the mechanism required by GDPR Art. 22
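Two of the items above, the fallback chain and the audit log, can be sketched together. This is a minimal illustration rather than a reference design: `call_model` stands in for whatever client talks to your model server, the record carries only a subset of the listed fields, and treating `RuntimeError` as "model failed or saturated" is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple


@dataclass
class AuditRecord:
    """One logged decision; a subset of the audit fields listed above."""
    input_snapshot: str
    rule_version: str
    model_used: str
    outcome: str
    failed_models: Tuple[str, ...] = ()
    human_approver: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def decide_with_fallback(prompt: str,
                         chain: List[str],
                         call_model: Callable[[str, str], str],
                         rule_version: str,
                         audit_log: List[AuditRecord]) -> AuditRecord:
    """Try each model in order; log the one that answered and those that failed."""
    failed = []
    for model in chain:
        try:
            outcome = call_model(model, prompt)
        except RuntimeError:
            failed.append(model)  # assumed signal for failure or saturation
            continue
        record = AuditRecord(prompt, rule_version, model, outcome, tuple(failed))
        audit_log.append(record)
        return record
    raise RuntimeError("all models in the fallback chain failed")


def flaky_call(model: str, prompt: str) -> str:
    """Stand-in client: the first model is saturated, the second answers."""
    if model == "mistral-small-3.2":
        raise RuntimeError("saturated")
    return f"{model} handled: {prompt}"


log: List[AuditRecord] = []
record = decide_with_fallback(
    "classify clause 7", ["mistral-small-3.2", "gpt-oss-120b"], flaky_call, "rules-v1", log
)
print(record.model_used, record.failed_models)
```

Recording the rule version and the models that were skipped alongside the outcome is what lets a later reviewer reconstruct why a given model handled a given decision.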
This is the artifact that an EU AI Act Article 13 auditor inspects. It is the artifact that a works council (Betriebsrat) reviews when classifying which agents fall under the co-determination scope of Section 87(1) no. 6 BetrVG. It is the artifact that satisfies the procurement question “what happens when your AI vendor changes a model?” - because the routing rule changes, not the business logic.
Building this layer in-house is doable but rarely faster than 6-9 months for an enterprise team starting from scratch. Buying it as a configuration framework typically shortens the path to 4-6 weeks for the first production agent.
Bottom line
Self-hosted open-source AI is a credible production choice for EU-Enterprise in 2026 - but only as a multi-model architecture with a routing layer, not as a single-model bet. Mistral Small 3.2 covers the volume band. gpt-oss-120b or DeepSeek V4-Flash covers heavy reasoning on-prem. DeepSeek V4-Pro (currently in preview) approaches Claude Opus territory if you have hyperscaler-class hardware - or you wait for the GA release and use it via API in the meantime. Llama 4 Scout covers ultra-long context. The cloud-API tier (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) stays available for the workloads where the regulatory framework permits it.
The routing decision is the architecture. The TCO crossover (around 50-100 million tokens per month sustained) sets the economic threshold for self-hosting. The compliance taxonomy (which data classification cannot leave the network) sets the sovereignty threshold. Both thresholds shape the routing rules.
Others publish hardware buying guides. We size the GPU footprint against your actual decision-complexity mix and the TCO crossover that matters for your token volume. The model market reprices itself every quarter; an H100 cluster that pays back in 14 months under DeepSeek V4-Flash today is the same cluster that runs whichever open-weight model wins the next benchmark generation. Hardware investment outlives any single model release. The routing rules absorb the model churn so the CAPEX line item stays defensible across three procurement cycles.
If you want to know what your self-hosted stack should look like based on your actual workload mix and data classification, book a consultation.
Sources
Primary sources used in this article:
- Mistral Small 3.2 release and specifications - 24B dense, Apache 2.0, ~55 GB GPU RAM in bf16/fp16, June 2025 release. Mistral AI announcement (3.1 baseline) | VentureBeat coverage 3.1 to 3.2 | Hugging Face model page | Spec comparison
- gpt-oss-120b release - 117B parameters total, 5.1B active per token (MoE), Apache 2.0, August 5, 2025 release, single 80 GB GPU. OpenAI announcement | GitHub repository | Model card PDF
- DeepSeek V4-Pro and V4-Flash - Preview released April 24, 2026. V4-Pro: 1.6T total / 49B active MoE; V4-Flash: 284B / 13B active. MIT license. 1M token context via Hybrid Attention (Compressed Sparse Attention + Heavily Compressed Attention). DeepSeek API Docs - V4 Preview | Hugging Face V4-Pro | Hugging Face V4-Flash | Simon Willison analysis
- DeepSeek R1 (prior generation, Jan 2025) - MIT license, mature math/logic specialist, 671B total / 37B active MoE. Hugging Face model page
- Llama 4 Scout - 17B active MoE, 10M token context, Meta Llama Community License. Meta AI announcement
- Claude Opus 4.7 - Released April 16, 2026, $5/$25 per million tokens, available on Anthropic API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry. Anthropic announcement
- GPT-5.5 - Released April 23-24, 2026, latest GPT-5 series model. OpenAI announcement
- Gemini 3.1 Pro - Released February 19, 2026, current Gemini flagship. Google AI for Developers | Vertex AI documentation
- EU GPU hosting providers - Scaleway H100 SXM ~3.50 EUR/h, A100 ~2.50 EUR/h with full GDPR compliance. OVHcloud sovereign tier. Hetzner dedicated RTX servers. Cloud GPU Tracker Europe | Scaleway H100 page | OVHcloud H100 page
- EU AI Act Article 13 - Transparency obligations for high-risk AI systems. Regulation (EU) 2024/1689. artificialintelligenceact.eu/article/13 | EU AI Act Service Desk
- US CLOUD Act vs EU sovereignty - CLOUD Act follows provider control not data location, EU Tech Sovereignty Package expected Q2 2026. Kiteworks analysis

Bert Gogolin
CEO & Founder, Gosign