Self-hosted Open-Source AI 2026: Mistral, gpt-oss, DeepSeek V4, Llama 4 in the Enterprise Stack

Self-hosting open-source LLMs in 2026: Mistral Small 3.2, gpt-oss-120b, DeepSeek V4-Pro/Flash, Llama 4. Hardware floor, TCO, EU GPU hosting providers, decision matrix per workload.

Bert Gogolin
CEO & Founder, 15 min read

The model market gave EU-Enterprise procurement a choice it never had before. Open-weight models match proprietary models on most enterprise workloads. Three frontier-class open-source models shipped under Apache 2.0 in 2025 alone. EU GPU hosting providers offer H100 capacity at predictable hourly rates from Paris and Frankfurt data centers. The Schrems II ruling combined with the US CLOUD Act made self-hosting the only architecture with zero foreign-provider exposure.

And yet the conversation in procurement still treats “self-hosted open-source AI” as one product. It is not. It is a stack decision with five credible models, three deployment patterns, and a real total-cost-of-ownership calculation. This article is the detailed companion to When Mistral, When Claude Opus? Decision Routing for Agentic EU-Enterprise 2026 - if you have decided to self-host, here is how the model selection actually plays out.

At a Glance - Self-hosted Open-Source AI for EU-Enterprise 2026

  • Five credible self-hostable models in 2026: Mistral Small 3.2 (Apache 2.0, 24B, single consumer GPU), gpt-oss-120b (Apache 2.0, MoE, single H100), DeepSeek V4-Flash (MIT, 284B/13B active MoE, April 2026 preview), DeepSeek V4-Pro (MIT, 1.6T/49B active, preview, cluster-grade), Llama 4 Scout (Meta License, 10M context).
  • Mistral Small 3.2 wins the workhorse slot because it runs on consumer hardware (a single RTX 4090 with 4-bit quantization), ships with multilingual training, and has native vision capability for document workloads.
  • DeepSeek V4-Pro (preview, April 24, 2026) approaches frontier closed-source performance under MIT license but needs a multi-GPU cluster - realistic self-hosting starts with V4-Flash for most enterprises.
  • EU GPU hosting is no longer the bottleneck: Scaleway offers H100 SXM at ~3.50 EUR/h, OVHcloud has a sovereign tier, Hetzner provides dedicated RTX servers, IONOS and T-Systems serve regulated industries.
  • TCO crossover from cloud API to self-hosted: typically around 50-100 million tokens per month sustained. Below that, EU-Cloud-API beats self-hosted. Above that, dedicated hardware amortizes within 12-18 months.
  • The architecture is multi-model, not single-model: Mistral Small for volume, gpt-oss-120b or DeepSeek V4-Flash for on-prem heavy reasoning, V4-Pro or R1 for math/logic specialists, Llama 4 Scout for ultra-long context. Routing decides which model handles which decision.

You decided to self-host - the model question begins here

The choice to self-host an LLM stack is rarely a model decision. It is a compliance decision: data classified above a certain threshold cannot leave the company network. It is an architecture decision: the inference layer must be a controlled dependency, not an external API. It is a procurement decision: capital expenditure on hardware vs. operating expenditure on hosted GPU instances.

Once that decision is made, the model question opens. Which open-source model on which hardware floor for which workload mix? Five models have credible Q2 2026 production-readiness: Mistral Small 3.2, gpt-oss-120b, DeepSeek V4-Flash, DeepSeek V4-Pro (preview), and Llama 4 Scout. DeepSeek R1 from January 2025 is still production-ready but largely superseded by the V4 line for new deployments. Each model has a different cost-quality curve and a different operational profile.

This article skips the leaderboard discussion. Benchmark scores converge enough that workload fit matters more than nominal points on MMLU or HumanEval. The question is which model survives 18 months in your stack, which one earns its hardware, and which combination produces the audit trail that the EU AI Act requires.

The credible self-hosted models, side by side

Model | Parameters | License | Hardware (CAPEX one-time / OPEX hosted) | Key strength | Key weakness
Mistral Small 3.2 | 24B dense, GQA (32Q/8KV) | Apache 2.0 | ~55 GB VRAM. CAPEX: 1× RTX 4090 ~1,500 EUR (pilot, 4-bit quant) or 1× H100 80GB ~30,000 EUR (production). OPEX: ~1,500-2,500 EUR/month on Scaleway/OVHcloud | Multilingual, vision, fast (~150 tok/s consumer GPU), volume-friendly | Not top-tier reasoning
gpt-oss-120b | 117B total / 5.1B active (MoE) | Apache 2.0 | 1× H100/A100 80GB. CAPEX: ~30,000 EUR. OPEX: ~1,200-2,500 EUR/month hosted | Reasoning at o4-mini level, MoE-efficient inference | No vision, datacenter-grade hardware only
DeepSeek V4-Flash (preview, Apr 2026) | 284B total / 13B active (MoE), 1M context | MIT | 1-2× H100/A100 80GB with quant, 4× H100 full precision. CAPEX: ~30,000-120,000 EUR. OPEX: ~1,500-5,000 EUR/month hosted | Frontier-class reasoning at moderate hardware cost, native multimodal, agent-optimized | Preview status - benchmarks should be re-verified before production
DeepSeek V4-Pro (preview, Apr 2026) | 1.6T total / 49B active (MoE), 1M context | MIT (open-source on Hugging Face) | 8× H100 minimum cluster. CAPEX: ~240,000 EUR. OPEX: ~10,000-12,000 EUR/month hosted. For DAX-Konzern and upper Mittelstand: feasible. For KMU under 500 employees: API/hosted variant (Together.ai, Fireworks, DeepSeek API) more realistic | Approaches GPT-5.5 and Gemini 3.1 Pro performance under open license, agent-tool optimized (Claude Code, OpenClaw) | Preview status; for KMU, hardware floor pushes to API path
DeepSeek R1 (Jan 2025, mature) | 671B total / 37B active (MoE) | MIT | 4-8× H100 minimum. CAPEX: ~120,000-240,000 EUR. OPEX: ~5,000-10,000 EUR/month hosted | Mature math/logic specialist, broad framework support | Largely superseded by V4-Flash for new deployments
Llama 4 Scout | 17B active (MoE) | Meta Llama Community License | 1× GPU. CAPEX: ~30,000 EUR. OPEX: ~1,500 EUR/month hosted | 10 million token context window | License restriction at >700M MAU; license review needed

Three clarifications matter here.

Mistral Small 3.2 hardware floor. The official Mistral guidance lists ~55 GB GPU RAM for bf16/fp16 inference, which puts it on an H100 or A100 80GB in production. With 4-bit quantization (GPTQ, AWQ), it runs on a single 24 GB RTX 4090 at slight quality cost. For pilot deployments or single-tenant inference, the RTX 4090 path is real. For multi-tenant production with concurrent requests, the H100 path is the correct sizing.
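The two sizing paths follow from arithmetic on parameter count and bytes per weight. A minimal sketch, assuming that weights dominate memory and ignoring KV cache, activations, and quantization metadata (which is why the official ~55 GB figure sits above the raw 16-bit weight size); the function name is illustrative:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


bf16_gb = weight_memory_gb(24, 16)  # Mistral Small 3.2, 24B dense, 16-bit weights
int4_gb = weight_memory_gb(24, 4)   # same model with 4-bit quantization

print(f"16-bit weights: {bf16_gb:.0f} GB, 4-bit weights: {int4_gb:.0f} GB")
print("4-bit weights fit on a 24 GB card:", int4_gb < 24)
```

Under these assumptions the 4-bit weights take about half of a 24 GB card, and the rest is the budget for KV cache and concurrent requests. That budget is the practical reason the RTX 4090 path suits pilots rather than multi-tenant production.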

DeepSeek V4 preview status. DeepSeek-V4-Pro and V4-Flash launched as preview on April 24, 2026 under MIT license, both with a 1M token context window via the new Hybrid Attention architecture (Compressed Sparse Attention + Heavily Compressed Attention). In the 1M-token context setting, V4-Pro reportedly requires only 27% of single-token inference FLOPs and 10% of the KV cache compared with V3.2 - significant efficiency gains for long-context workloads. Both variants are optimized for agent tooling (Claude Code, OpenClaw integration). However: preview means benchmark claims are not independently verified at scale yet. For production decisions in regulated industries, wait for the general-availability release or run your own representative benchmarks before committing.

License review for Llama 4 Scout. The Meta Llama Community License permits commercial use but contains two restrictions enterprise procurement should review: a 700-million-MAU threshold above which a separate Meta license is required, and a restriction on using model outputs to train competing models. For most enterprises both are irrelevant in practice, but the procurement note should be made explicit.

TCO: when does self-hosted beat cloud API?

The economics flip at a token volume threshold. Below it, hosted APIs win because hardware idle time dominates. Above it, dedicated GPUs win because incremental token cost approaches the cost of electricity plus depreciation.

A representative calculation for Mistral Small 3.2 in EU hosting:

Cost element | Value (EU hosting)
H100 80GB instance, EU provider (Scaleway-class) | ~2,500 EUR/month dedicated, or ~3.50 EUR/h on-demand
Mistral Small 3.2 throughput (single H100) | ~150 tokens/sec sustained, ~390M tokens/month at 100% utilization
Effective cost per 1M tokens at 60% utilization | ~10-12 EUR per 1M tokens
Mistral La Plateforme API equivalent (Mistral Small via API) | ~0.40 USD per 1M input tokens; volume-dependent
Claude Sonnet 4.6 API equivalent | ~3 USD per 1M input tokens; ~15 USD output
Claude Opus 4.7 API equivalent | ~5 USD per 1M input tokens; ~25 USD output
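The throughput and cost-per-token rows can be reproduced with a few lines of arithmetic. A minimal sketch, assuming a 30-day month and the table's illustrative inputs (150 tokens/sec, 2,500 EUR/month, 60% utilization); the function names are not from any library:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def tokens_per_month(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Tokens generated in one month at a given sustained utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


def cost_per_million(monthly_cost_eur: float, tokens: float) -> float:
    """Flat monthly cost spread over the tokens actually produced."""
    return monthly_cost_eur / (tokens / 1_000_000)


full = tokens_per_month(150)          # 100% utilization
at_60 = tokens_per_month(150, 0.60)   # 60% utilization
eur_per_m = cost_per_million(2_500, at_60)

print(f"{full / 1e6:.0f}M tokens/month at 100% utilization")
print(f"{at_60 / 1e6:.0f}M tokens/month at 60% utilization")
print(f"{eur_per_m:.2f} EUR per 1M tokens at 60% utilization")
```

The cost per token is inversely proportional to both throughput and utilization, so batched serving or a busier pipeline lowers it directly, while an idle GPU raises it.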

The crossover for Mistral Small lands between 50 and 100 million tokens per month sustained, depending on whether the workload is input-heavy or output-heavy. A 24/7 enterprise pipeline running 5 to 10 worker nodes typically crosses that threshold within the first quarter.

For gpt-oss-120b the math is similar but starts higher: on the same single H100, gpt-oss-120b sustains lower throughput than Mistral Small, so the amortized per-token cost is roughly 2× that of Mistral Small. The crossover vs. Claude Opus 4.7 sits around 30-50 million tokens per month - which is exactly the range where heavy-reasoning workloads land in enterprise AI systems.

A clarification on DeepSeek V4-Pro: the weights are open-source under MIT license and available on Hugging Face - the model is fully self-hostable. The question is enterprise size, not legality. V4-Pro’s 1.6T/49B-active architecture requires an 8× H100 cluster (~240,000 EUR CAPEX or ~10,000-12,000 EUR/month hosted). For DAX-listed corporations and the upper Mittelstand (typically 2,500+ employees with established AI infrastructure budgets) those numbers fit a standard IT capex line item. For SMEs (KMU) under 500 employees, the same numbers push the realistic path to API access (DeepSeek API directly) or a hosted variant (Together.ai, Fireworks) at per-token economics. V4-Flash (284B/13B active) sits in between: a 1-4× H100 footprint (30,000-120,000 EUR CAPEX), realistic for the upper Mittelstand from day one. Self-hosted TCO for V4-Flash is justified when frontier-class reasoning is a sustained workload at sovereignty-critical data classifications; for occasional reasoning, the V4-Flash API or Mistral La Plateforme is cheaper.

These numbers are based on public EU hosting pricing from Scaleway and OVHcloud and on public model throughput data. They are illustrative, not contractual.
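For the buy-versus-rent side of the decision, a simple payback calculation compares a one-time hardware purchase against the hosted monthly rate for the same footprint. A minimal sketch using the illustrative CAPEX and OPEX figures quoted in this article; it deliberately ignores power, cooling, staffing, financing, and resale value, all of which lengthen or shorten the real payback:

```python
def payback_months(capex_eur: float, hosted_eur_per_month: float) -> float:
    """Months of hosted spend that equal the one-time purchase price."""
    return capex_eur / hosted_eur_per_month


# (CAPEX in EUR, hosted OPEX in EUR/month), taken from the article's ranges
scenarios = {
    "1x H100 vs 2,500 EUR/month hosted": (30_000, 2_500),
    "1x H100 vs 1,500 EUR/month hosted": (30_000, 1_500),
    "8x H100 vs 12,000 EUR/month hosted": (240_000, 12_000),
    "8x H100 vs 10,000 EUR/month hosted": (240_000, 10_000),
}

results = {name: payback_months(*costs) for name, costs in scenarios.items()}
for name, months in results.items():
    print(f"{name}: {months:.0f} months")
```

Under these simplifications a single H100 pays back in 12 to 20 months and an 8× H100 cluster in 20 to 24 months, so the hosted rate actually negotiated matters as much as the hardware price.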

[Figure: TCO crossover, self-hosted vs cloud API by monthly token volume (May 2026). Axes: monthly cost (USD, log scale) against monthly token volume (log scale, 1M to 10B tokens). Curves: Claude Opus 4.7 API (15 USD per 1M tokens), Mistral La Plateforme API (0.40 USD per 1M tokens), Mistral Small 3.2 self-hosted (~2,700 USD/month flat), gpt-oss-120b self-hosted (~3,200 USD/month flat). Marked crossover: ~180M tokens/month, Mistral Small 3.2 self-hosted vs Claude Opus 4.7 API. Hosting: EU GPU providers (Scaleway, OVHcloud); figures illustrative, not contractual.]
TCO crossover: self-hosted vs cloud API - linear cloud-API curves (per-token pricing) versus flat self-hosted curves (CAPEX amortized). Mistral La Plateforme API stays cheapest below ~10B tokens/month - the relevant decision is Mistral OSS self-host vs Claude Opus 4.7 API, which crosses around 180 million tokens per month for sovereignty-critical workloads. Below 50M tokens/month, cloud API economics dominate. Above 500M tokens/month, self-host dominates regardless.
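The crossover in the chart is the volume at which a flat monthly self-hosting cost equals linear per-token API spend. A minimal sketch using the chart's illustrative values (flat cost in USD per month, blended API price in USD per 1M tokens); the function name is hypothetical, and the model ignores the throughput ceiling of a single GPU:

```python
def break_even_million_tokens(flat_usd_per_month: float,
                              api_usd_per_million: float) -> float:
    """Monthly volume (in millions of tokens) where both options cost the same."""
    return flat_usd_per_month / api_usd_per_million


mistral_vs_opus = break_even_million_tokens(2_700, 15.0)
gptoss_vs_opus = break_even_million_tokens(3_200, 15.0)
mistral_vs_la_plateforme = break_even_million_tokens(2_700, 0.40)
# Same comparison priced at the Opus output rate only (output-heavy workload)
mistral_vs_opus_output = break_even_million_tokens(2_700, 25.0)

print(f"Mistral Small self-host vs Opus API: {mistral_vs_opus:.0f}M tokens/month")
print(f"gpt-oss-120b self-host vs Opus API: {gptoss_vs_opus:.0f}M tokens/month")
print(f"Mistral Small self-host vs La Plateforme: {mistral_vs_la_plateforme:.0f}M tokens/month")
print(f"Mistral Small self-host vs Opus (output-priced): {mistral_vs_opus_output:.0f}M tokens/month")
```

The result is very sensitive to the blended API price: pricing the same workload at the output rate instead of a blended 15 USD moves the threshold from 180M to 108M tokens per month. That sensitivity is why it is worth rerunning the calculation with your own input/output mix.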

EU GPU hosting in 2026: who actually has H100 capacity?

The EU GPU hosting market matured significantly in 2025-2026. Three providers cover most enterprise self-hosting use cases:

Scaleway (France, GDPR-native). The most aggressive on price-performance for AI workloads. H100 SXM at ~3.50 EUR/h, A100 at ~2.50 EUR/h, plus the newer NVIDIA Blackwell B300-SXM (288 GB VRAM) for frontier workloads. French data centers, full GDPR compliance, no CLOUD Act exposure. Reserved-instance contracts available for predictable workloads.

OVHcloud (France, sovereign tier). The largest European cloud provider, with a “Sovereign Cloud” tier explicitly built for government and regulated-industry use. Portfolio includes H100, RTX 5000, A10, plus an “AI Deploy” service for pay-as-you-go notebook and inference. Good fit when procurement requires a sovereign-cloud sign-off.

Hetzner (Germany). The cost-leader for dedicated GPU servers, not on-demand instances. Current GPU options include RTX 4000 SFF Ada and RTX 6000 Ada paired with modern CPUs. The path for Mistral Small 3.2 with quantization or for development environments. Less suited to peak elastic scaling.

For regulated industries (financial services, healthcare, public sector) with strict sovereignty requirements:

IONOS (Germany). Sovereign-cloud-grade hosting with GPU instances. The compliance fit for German BaFin-regulated workloads.

T-Systems (Germany). Deutsche Telekom subsidiary. Sovereign cloud explicitly designed for public-sector and critical-infrastructure customers. The procurement-comfortable choice when board-level sovereignty is the requirement.

For an enterprise deciding on a self-hosted stack, the practical sequence is: pilot on Scaleway or Hetzner for cost-efficient validation, move to OVHcloud or T-Systems for production if regulatory sign-off requires sovereign-cloud certification, retain reserved-instance contracts to control cost predictability.

Deployment patterns: single worker, cluster, hybrid

Three deployment patterns cover almost all enterprise self-hosted scenarios.

Single-worker pattern. One model, one GPU instance, deployed behind a load balancer with health checks. Suitable for: Mistral Small 3.2 on an RTX 4090 or H100 for the high-volume share of decisions (roughly 70% of the mix), or Llama 4 Scout on a single GPU for long-context document analysis. Operational complexity: low. Failure mode: single point of failure unless replicated.

Multi-model cluster pattern. Multiple models on multiple GPUs behind a single routing layer. Suitable for: Mistral Small for volume + gpt-oss-120b or DeepSeek V4-Flash for heavy reasoning + (optional) DeepSeek V4-Pro on a dedicated cluster for math-grade workloads. The routing layer decides per request which model handles it. Operational complexity: medium. Requires a model server (vLLM, TGI, llama.cpp-server) and a routing rules engine. This is the typical production pattern for agentic workloads with mixed decision complexity.

Hybrid edge-cloud pattern. Sensitive workloads (HR onboarding, contract review, customer data extraction) on self-hosted models; non-sensitive workloads (marketing copy generation, knowledge base Q&A on public information) on EU-cloud APIs like Mistral La Plateforme. The routing layer enforces the data classification before model selection. Operational complexity: high (two stacks to maintain) but the lowest sovereignty exposure and the best cost-per-decision ratio.

The pattern choice depends on the data classification taxonomy, not the model selection. If everything is classified as “internal” or higher, the multi-model cluster pattern dominates. If a meaningful percentage of work is on public-facing or non-sensitive data, the hybrid pattern is cheaper.
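The rule that classification comes before model selection can be expressed as a small gate. A minimal sketch, assuming a hypothetical three-level taxonomy and a policy that anything classified "internal" or higher stays self-hosted; the names `DataClass`, `Tier`, and `hybrid_share` are illustrative, not part of any framework:

```python
from enum import Enum, IntEnum


class DataClass(IntEnum):
    """Hypothetical taxonomy; a higher value means more sensitive data."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


class Tier(Enum):
    SELF_HOSTED = "self-hosted"
    EU_CLOUD_API = "eu-cloud-api"


# Assumed policy: "internal" or higher must not leave the company network.
SELF_HOST_THRESHOLD = DataClass.INTERNAL


def select_tier(data_class: DataClass) -> Tier:
    """Apply the data classification before any model is chosen."""
    if data_class >= SELF_HOST_THRESHOLD:
        return Tier.SELF_HOSTED
    return Tier.EU_CLOUD_API


def hybrid_share(classes: list[DataClass]) -> float:
    """Fraction of requests that are allowed to go to an EU cloud API."""
    if not classes:
        return 0.0
    eligible = sum(1 for c in classes if select_tier(c) is Tier.EU_CLOUD_API)
    return eligible / len(classes)


sample = [DataClass.PUBLIC, DataClass.INTERNAL,
          DataClass.CONFIDENTIAL, DataClass.PUBLIC]
print(select_tier(DataClass.CONFIDENTIAL).value)
print(f"share eligible for cloud API: {hybrid_share(sample):.0%}")
```

If `hybrid_share` over a representative request sample is close to zero, the second stack buys little and the multi-model cluster pattern is the simpler choice.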

Decision matrix: which model for which workload

Workload category | Recommended model | Why
Document classification, structured extraction, OCR field parsing | Mistral Small 3.2 (self-hosted) | Vision-capable, fast on consumer GPU, multilingual coverage
High-volume text generation (emails, notifications, templates) | Mistral Small 3.2 (self-hosted) | Throughput, template-friendly, lowest cost per token
Contract clause classification, vendor risk flags, anomaly detection | Mistral Small 3.2 or Mistral Medium 3.1 (La Plateforme) | Medium reasoning at moderate cost, EU-sovereign
Anti-discrimination analysis under AGG/Equality Act, cross-statute compliance reasoning | gpt-oss-120b (on-prem) or Claude Opus 4.7 (cloud) | Tier-1 reasoning, audit-grade output for HR/Legal use cases
Code generation, code review (cloud flagships) | Claude Opus 4.7 or GPT-5.5 | Both benchmark-leading; Claude Opus 4.7 ahead on long agentic loops (Claude Code), GPT-5.5 ahead in IDE integrations (Cursor, Copilot)
Code generation (self-hosted, sovereignty-critical) | Qwen 3 Coder 110B (Apache 2.0, Alibaba), DeepSeek Coder V4 (MIT), or Codestral Mamba 32B (Mistral, EU-built) | Tier-1 coding benchmarks on-prem; Qwen 3 Coder leads HumanEval/SWE-Bench among OSS, DeepSeek Coder V4 strongest on agentic multi-file tasks, Codestral Mamba lowest latency on consumer GPU
Microsoft 365 / Copilot deep integration | GPT-5.5 via Azure OpenAI | Native stack, lowest integration effort for organisations on Microsoft data plane
Agentic workflows with heavy function-calling / tool-use | GPT-5.5 or Claude Opus 4.7 | Both top-tier for structured outputs and tool orchestration; GPT-5.5 has broader ecosystem of pre-built tools
Financial risk modeling, stress testing, optimization | DeepSeek V4-Flash (current) or V4-Pro via API; R1 still production-ready | Top-tier math/logic; V4 line adds 1M context for cross-portfolio analysis
Document analysis of large corpora (entire contract portfolios, annual reports) | Llama 4 Scout | 10M token context window - unique in this band
Multimodal (image + text correlation, technical drawings, video segments) | Gemini 3.1 Pro (cloud, no self-hosted equivalent) | Native multimodal training, 1M context
Conversational AI / customer-facing chatbots | Mistral Small 3.2 (self-hosted) for volume; GPT-5.5 (Azure) when MS-stack-native | Production-grade quality at lowest hardware cost; GPT-5.5 if integrated into Dynamics/Copilot
SaaS feature gating (per-customer model tiers, per-region routing) | Hybrid pattern: Mistral Small + Claude Opus 4.7 / GPT-5.5 | Sensitive customer data on self-hosted, premium features on cloud flagship

The matrix is not a prescription. It is a starting point that gets refined per organization. A finance-heavy enterprise weights DeepSeek V4 higher. A multimedia-heavy operation may need a cloud Gemini hop. A high-document-volume HR pipeline puts Mistral Small at 80% of decisions, not 70%.

The routing layer makes the matrix operational. Without it, every workload runs against whatever model is configured as the default, and the matrix becomes a slideware artifact.
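One way to make the matrix operational is to hold it as data that is loaded and validated at startup, so a model swap is a config change. A minimal sketch, assuming a hypothetical JSON layout with a `default` entry and a `rules` map; the workload keys and model identifiers are illustrative labels, not names required by any model server:

```python
import json

# Hypothetical routing config; in practice this would live in a versioned file.
ROUTING_CONFIG = """
{
  "default": "mistral-small-3.2",
  "rules": {
    "document_extraction": "mistral-small-3.2",
    "text_generation": "mistral-small-3.2",
    "compliance_reasoning": "gpt-oss-120b",
    "financial_risk": "deepseek-v4-flash",
    "large_corpus_analysis": "llama-4-scout"
  }
}
"""

# Models assumed to be deployed behind the routing layer.
DEPLOYED_MODELS = {"mistral-small-3.2", "gpt-oss-120b",
                   "deepseek-v4-flash", "llama-4-scout"}


def load_routing(config_text: str, deployed: set[str]) -> dict:
    """Parse the config and reject rules that point at undeployed models."""
    config = json.loads(config_text)
    targets = set(config["rules"].values()) | {config["default"]}
    unknown = targets - deployed
    if unknown:
        raise ValueError(f"config references undeployed models: {sorted(unknown)}")
    return config


def route(config: dict, workload: str) -> str:
    """Look up the model for a workload, falling back to the default."""
    return config["rules"].get(workload, config["default"])


config = load_routing(ROUTING_CONFIG, DEPLOYED_MODELS)
print(route(config, "financial_risk"))
print(route(config, "unlisted_workload"))
```

Validating at load time means a typo in a model name fails at deployment rather than on the first request that hits the broken rule.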

Building the routing layer: where Decision Layer fits

Self-hosted multi-model architectures break down without a routing layer for a simple reason: no human operator wants to remember 14 decision-to-model mappings while also writing the agent’s business logic. The routing has to be configuration, not code.

A Decision Layer holds:

  • The data classification taxonomy (which data types require self-hosted? Which can route to EU-cloud-API? Which can route to US-cloud-API?)
  • The decision-to-model routing rules per workflow step
  • The fallback chain (if Mistral Small fails or saturates, route to which alternative?)
  • The audit log: every decision recorded with input snapshot, rule version, model used, confidence score, reasoning chain, outcome, and human approver where applicable
  • The Challenge button: any affected subject can contest an automated decision, triggering re-decision under human review - the mechanism required by GDPR Art. 22
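The fallback chain and the audit log from the list above can be sketched together. This is a minimal illustration, not a reference implementation: `AuditRecord` carries only a subset of the listed fields, `fake_call` is a stand-in for a real inference client, and treating `RuntimeError` as the failure signal is an assumption:

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class AuditRecord:
    """Subset of the audit fields listed above; field names are illustrative."""
    input_snapshot: str
    rule_version: str
    model_used: Optional[str]
    outcome: str
    attempted: list[str] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def run_with_fallback(
    prompt: str,
    chain: list[str],
    call_model: Callable[[str, str], str],
    rule_version: str,
) -> tuple[Optional[str], AuditRecord]:
    """Try each model in order and record every attempt."""
    attempted: list[str] = []
    for model in chain:
        attempted.append(model)
        try:
            answer = call_model(model, prompt)
        except RuntimeError:
            continue  # model failed or saturated: move down the chain
        return answer, AuditRecord(prompt, rule_version, model, "ok", attempted)
    return None, AuditRecord(prompt, rule_version, None,
                             "all_models_failed", attempted)


def fake_call(model: str, prompt: str) -> str:
    """Stand-in client that simulates a saturated first model."""
    if model == "mistral-small-3.2":
        raise RuntimeError("saturated")
    return f"[{model}] answer to: {prompt}"


answer, record = run_with_fallback(
    "classify clause 7",
    ["mistral-small-3.2", "gpt-oss-120b"],
    fake_call,
    rule_version="rules-v1",
)
print(asdict(record))
```

Recording the attempted models alongside the one that answered keeps the audit trail complete, because a decision produced by a fallback model remains traceable to the rule version that allowed the fallback.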

This is the artifact that an EU AI Act Article 13 auditor inspects. It is the artifact that a works council (Betriebsrat) reviews when classifying which agents fall under BetrVG §87 Absatz 1 Nummer 6 co-determination scope. It is the artifact that satisfies the procurement question “what happens when your AI vendor changes a model?” - because the routing rule changes, not the business logic.

Building this layer in-house is doable but rarely faster than 6-9 months for an enterprise team starting from scratch. Buying it as a configuration framework typically shortens the path to 4-6 weeks for the first production agent.

Bottom line

Self-hosted open-source AI is a credible production choice for EU-Enterprise in 2026 - but only as a multi-model architecture with a routing layer, not as a single-model bet. Mistral Small 3.2 covers the volume band. gpt-oss-120b or DeepSeek V4-Flash covers heavy reasoning on-prem. DeepSeek V4-Pro (currently in preview) approaches Claude Opus territory if you have cluster-grade hardware (8× H100 minimum) - or you wait for the GA release and use it via API in the meantime. Llama 4 Scout covers ultra-long context. The cloud-API tier (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) stays available for the workloads where the regulatory framework permits it.

The routing decision is the architecture. The TCO crossover (around 50-100 million tokens per month sustained) sets the economic threshold for self-hosting. The compliance taxonomy (which data classification cannot leave the network) sets the sovereignty threshold. Both thresholds shape the routing rules.

Others publish hardware buying guides. We size the GPU footprint against your actual decision-complexity mix and the TCO crossover that matters for your token volume. The model market reprices itself every quarter; an H100 cluster that pays back in 14 months under DeepSeek V4-Flash today is the same cluster that runs whichever open-weight model wins the next benchmark generation. Hardware investment outlives any single model release. The routing rules absorb the model churn so the CAPEX line item stays defensible across three procurement cycles.

If you want to know what your self-hosted stack should look like based on your actual workload mix and data classification, book a consultation.



Bert Gogolin

CEO & Founder, Gosign


Frequently Asked Questions

Which open-source AI model is best for self-hosting in EU-Enterprise?

Mistral Small 3.2 is the default workhorse: 24B parameters, Apache 2.0, runs on a single RTX 4090 (~1,500 EUR) with 4-bit quantization, multilingual training, native vision capability. Add gpt-oss-120b on an H100 for on-prem heavy reasoning. DeepSeek V4-Flash (284B/13B active, MIT, April 2026 preview) is the new option for frontier-class reasoning at moderate hardware cost; V4-Pro (1.6T/49B active) approaches Claude Opus performance but needs cluster-grade infrastructure. Llama 4 Scout covers ultra-long context. No single model wins - the right answer is a routed multi-model stack.

What does self-hosted Mistral Small actually cost?

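Hardware: a single RTX 4090 24GB runs around 1,500 EUR one-time and is enough for a 4-bit quantized pilot. For Mistral Small 3.2 in bf16/fp16, roughly 55 GB GPU RAM is needed, so an H100 80GB or A100 80GB is realistic at scale - around 30,000 EUR purchase or 1,500-2,500 EUR per month at EU hosting providers like Scaleway or OVHcloud. Amortized inference cost depends on sustained throughput: at ~150 tokens/sec and 60% utilization on a 2,500 EUR/month H100 it is roughly 10-12 EUR per million tokens, and it drops below 1 EUR per million tokens only with batched serving at more than ten times that aggregate throughput.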

Where can I self-host in the EU without CLOUD Act exposure?

Scaleway (France) offers H100 SXM at around 3.50 EUR/h and A100 at 2.50 EUR/h, all GDPR-compliant. OVHcloud (France) offers H100, RTX 5000, A10 with sovereign cloud options for sensitive workloads. Hetzner (Germany) provides dedicated GPU servers with RTX 4000/6000 Ada at lower prices but not on-demand cloud GPU. IONOS (Germany) and T-Systems offer sovereign cloud GPU instances for regulated industries.

When does self-hosting beat cloud API economically?

Crossover happens around 50 to 100 million tokens per month sustained throughput, depending on model and provider. Below that, Mistral La Plateforme or Claude API is cheaper. Above that, a dedicated H100 amortizes within 12 to 18 months even at EU hosting rates. The other crossover is non-economic: regulatory requirements (Schrems II, EU AI Act high-risk classification) can flip the decision regardless of token volume.
