Private LLM hosting: Azure, Anthropic, self-hosted

Why private hosting matters

Every conversation with a public LLM endpoint sends your data to infrastructure you don't control, for engineering teams working with proprietary code, customer data, or regulated information, that's a compliance liability, not a theoretical one. Enterprise customers ask about it during procurement. Auditors flag it during reviews, and once data has left your perimeter, there's no taking it back.

Private LLM hosting means deploying model access in a way that keeps data within boundaries you control. The options range from cloud-hosted private endpoints (same models, your infrastructure) to fully self-hosted open-source models running on your own hardware.

Each pattern makes different trade-offs between cost, capability, operational complexity, and compliance posture. This guide covers the three main patterns, their architecture, and how to choose between them.

The three deployment patterns

Pattern 1: Cloud-hosted private endpoints

What it is: You deploy commercial models (GPT-5.4, Claude) through dedicated instances within your existing cloud tenant. The model runs on the provider's infrastructure but within a private boundary. Your data doesn't touch shared endpoints or contribute to model training.

Primary options: Azure OpenAI Service (for OpenAI models), AWS Bedrock (for Anthropic and other models), Google Cloud Vertex AI.

Architecture: Your applications call the model through private endpoints within your cloud VPC. Traffic never traverses the public internet. The provider manages model serving, scaling, and updates. You control networking, access, and logging.

Architecture: Your applications call the model through private endpoints within your cloud VPC. Traffic never traverses the public internet. Data stays within your tenant boundary.

Strengths: Minimal operational overhead. The provider handles model serving. You get access to the latest commercial models (GPT-5.4, Claude Sonnet) without running GPU infrastructure. Data residency is controlled through your existing cloud tenant configuration. Integration with existing cloud IAM and networking.

Limitations: You're still dependent on the provider for model availability, pricing changes, and API stability. Cost can be significant at scale. Provisioned throughput pricing is less flexible than pay-per-token public APIs. You don't control the model weights or inference pipeline.

Pattern 2: Provider-managed enterprise access

What it is: You use a commercial model provider's enterprise tier, which provides contractual data isolation, dedicated capacity, and compliance commitments, but without deploying into your own cloud tenant.

Primary options: Anthropic Enterprise (Claude), OpenAI Enterprise, Cohere Enterprise.

Architecture: Your applications call the provider's API through dedicated enterprise endpoints. The provider guarantees data isolation, zero-retention policies, and contractual compliance commitments. You manage API keys, usage policies, and application-level controls.

Strengths: Simplest deployment model, no infrastructure to manage. Enterprise contracts provide the compliance documentation (DPAs, SOC 2 reports, data processing commitments) that auditors and customers require. Typically the fastest path to governed LLM access.

Limitations: Data still leaves your network perimeter, even though the provider commits to isolation. Some compliance regimes require data to remain within specific geographic or infrastructure boundaries that provider-managed endpoints can't satisfy. Less visibility into the inference pipeline.

Pattern 3: Self-hosted open-source models

What it is: You run open-source models (Llama 3, Mistral, Qwen) on your own GPU infrastructure. You control everything: the model weights, the inference server, the networking, the data flow.

Primary options: Ollama (development/small-scale), vLLM (production serving), TGI (Hugging Face's text generation inference), SGLang.

Architecture: You deploy a model serving stack on GPU-equipped infrastructure. Either on-premise servers or cloud GPU instances. Your applications call the inference server through internal networking, no data leaves your infrastructure.

python

# Example: vLLM serving setup
# Deploy Llama 3 70B on 4x A100 GPUs
vllm serve meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000 \
  --api-key $INTERNAL_API_KEY

Strengths: Complete data sovereignty. Nothing leaves your infrastructure, no per-token costs after initial hardware investment. Full control over model selection, fine-tuning, and inference parameters, no vendor dependency for model access.

Limitations: Significant operational overhead: GPU procurement, model serving, scaling, monitoring, model updates. Open-source models generally trail commercial models in capability, especially for complex reasoning and code generation. Requires ML engineering expertise to operate effectively.

Cost comparison

Costs vary significantly based on usage volume. Here's a representative comparison for a team generating approximately 10 million tokens per month (a mid-size engineering team using LLMs for code review, test generation, and documentation).

Pattern	Monthly cost estimate	Cost driver	Scaling model
Azure OpenAI (provisioned)	$2,000-5,000	Provisioned throughput units	Step function. Buy capacity blocks
Provider enterprise	$1,500-4,000	Per-seat + per-token	Linear with usage and headcount
Self-hosted (cloud GPU)	$3,000-8,000	GPU instance hours	Fixed infrastructure cost
Self-hosted (on-premise)	$500-1,500 (amortised)	Hardware depreciation + power	Front-loaded capital expenditure

The cloud-hosted private endpoint pattern typically offers the best cost-to-capability ratio for most engineering teams. Self-hosted becomes cost-effective at very high volumes (50M+ tokens/month) or when you already have GPU infrastructure.

Compliance comparison

Dimension	Cloud private endpoints	Provider enterprise	Self-hosted
Data residency	Within your cloud tenant	Provider-managed, contractual guarantees	Your infrastructure
Data retention	Provider-controlled, configurable	Zero-retention policies available	You control entirely
Audit logging	Cloud-native logging (CloudWatch, Azure Monitor)	Provider dashboards + API logs	Full control. Build your own
Access controls	Cloud IAM integration	API key + provider admin console	Your IAM stack
SOC 2 evidence	Provider SOC 2 report + your cloud controls	Provider SOC 2 report	Your responsibility entirely
Regulatory fit	Good for most regimes	Depends on data residency requirements	Best for strict sovereignty

How to choose

The decision comes down to three factors: data sensitivity, operational capacity, and model capability requirements.

Start with cloud-hosted private endpoints if you need commercial-grade model capability, your data fits within a major cloud provider's compliance boundary, and you want minimal operational overhead. This is the right choice for most enterprise engineering teams.

Choose provider-managed enterprise access if your compliance requirements are satisfied by contractual commitments (rather than infrastructure-level isolation), you want the simplest possible deployment, and you're primarily using one provider's models.

Go self-hosted only if you have strict data sovereignty requirements that cloud-hosted options can't satisfy, you have the GPU infrastructure and ML engineering capability to operate it, or your usage volume makes the economics compelling.

Hybrid patterns

In practice, most organisations end up with a hybrid approach. A common pattern we deploy:

Cloud-hosted private endpoint for the primary production workload (code review, enterprise chatbot, knowledge retrieval) where you need the best model quality
Self-hosted open-source model for high-volume, lower-complexity tasks (embeddings, classification, simple extraction) where cost matters more than peak capability
Provider enterprise tier for developer tooling (IDE agents, individual productivity) where deployment simplicity matters most

This hybrid approach optimises cost without compromising on capability or compliance for the workloads that matter most.

If you're evaluating private LLM deployment options and want help modelling the cost, compliance, and architecture trade-offs for your specific situation, book a diagnostic. We'll review your requirements and recommend the pattern that fits.

Private LLM hosting patterns: Azure, Anthropic, and self-hosted compared