EngineeringApr 202611 min read

Private LLM hosting patterns: Azure, Anthropic, and self-hosted compared

Architecture patterns and trade-offs for deploying LLMs without exposing sensitive data to public endpoints. Covers cost modelling, compliance implications, and operational complexity for each approach.

Why private hosting matters

Every conversation with a public LLM endpoint sends your data to infrastructure you don't control, for engineering teams working with proprietary code, customer data, or regulated information, that's a compliance liability, not a theoretical one. Enterprise customers ask about it during procurement. Auditors flag it during reviews, and once data has left your perimeter, there's no taking it back.

Private LLM hosting means deploying model access in a way that keeps data within boundaries you control. The options range from cloud-hosted private endpoints (same models, your infrastructure) to fully self-hosted open-source models running on your own hardware.

Each pattern makes different trade-offs between cost, capability, operational complexity, and compliance posture. This guide covers the three main patterns, their architecture, and how to choose between them.

The three deployment patterns

Pattern 1: Cloud-hosted private endpoints

What it is: You deploy commercial models (GPT-5.4, Claude) through dedicated instances within your existing cloud tenant. The model runs on the provider's infrastructure but within a private boundary. Your data doesn't touch shared endpoints or contribute to model training.

Primary options: Azure OpenAI Service (for OpenAI models), AWS Bedrock (for Anthropic and other models), Google Cloud Vertex AI.

Architecture: Your applications call the model through private endpoints within your cloud VPC. Traffic never traverses the public internet. The provider manages model serving, scaling, and updates. You control networking, access, and logging.

Architecture: Your applications call the model through private endpoints within your cloud VPC. Traffic never traverses the public internet. Data stays within your tenant boundary.

Strengths: Minimal operational overhead. The provider handles model serving. You get access to the latest commercial models (GPT-5.4, Claude Sonnet) without running GPU infrastructure. Data residency is controlled through your existing cloud tenant configuration. Integration with existing cloud IAM and networking.

Limitations: You're still dependent on the provider for model availability, pricing changes, and API stability. Cost can be significant at scale. Provisioned throughput pricing is less flexible than pay-per-token public APIs. You don't control the model weights or inference pipeline.

Pattern 2: Provider-managed enterprise access

What it is: You use a commercial model provider's enterprise tier, which provides contractual data isolation, dedicated capacity, and compliance commitments, but without deploying into your own cloud tenant.

Primary options: Anthropic Enterprise (Claude), OpenAI Enterprise, Cohere Enterprise.

Architecture: Your applications call the provider's API through dedicated enterprise endpoints. The provider guarantees data isolation, zero-retention policies, and contractual compliance commitments. You manage API keys, usage policies, and application-level controls.

Strengths: Simplest deployment model, no infrastructure to manage. Enterprise contracts provide the compliance documentation (DPAs, SOC 2 reports, data processing commitments) that auditors and customers require. Typically the fastest path to governed LLM access.

Limitations: Data still leaves your network perimeter, even though the provider commits to isolation. Some compliance regimes require data to remain within specific geographic or infrastructure boundaries that provider-managed endpoints can't satisfy. Less visibility into the inference pipeline.

Pattern 3: Self-hosted open-source models

What it is: You run open-source models (Llama 3, Mistral, Qwen) on your own GPU infrastructure. You control everything: the model weights, the inference server, the networking, the data flow.

Primary options: Ollama (development/small-scale), vLLM (production serving), TGI (Hugging Face's text generation inference), SGLang.

Architecture: You deploy a model serving stack on GPU-equipped infrastructure. Either on-premise servers or cloud GPU instances. Your applications call the inference server through internal networking, no data leaves your infrastructure.

python
# Example: vLLM serving setup
# Deploy Llama 3 70B on 4x A100 GPUs
vllm serve meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000 \
  --api-key $INTERNAL_API_KEY

Strengths: Complete data sovereignty. Nothing leaves your infrastructure, no per-token costs after initial hardware investment. Full control over model selection, fine-tuning, and inference parameters, no vendor dependency for model access.

Limitations: Significant operational overhead: GPU procurement, model serving, scaling, monitoring, model updates. Open-source models generally trail commercial models in capability, especially for complex reasoning and code generation. Requires ML engineering expertise to operate effectively.

Cost comparison

Costs vary significantly based on usage volume. Here's a representative comparison for a team generating approximately 10 million tokens per month (a mid-size engineering team using LLMs for code review, test generation, and documentation).

PatternMonthly cost estimateCost driverScaling model
Azure OpenAI (provisioned)$2,000-5,000Provisioned throughput unitsStep function. Buy capacity blocks
Provider enterprise$1,500-4,000Per-seat + per-tokenLinear with usage and headcount
Self-hosted (cloud GPU)$3,000-8,000GPU instance hoursFixed infrastructure cost
Self-hosted (on-premise)$500-1,500 (amortised)Hardware depreciation + powerFront-loaded capital expenditure

The cloud-hosted private endpoint pattern typically offers the best cost-to-capability ratio for most engineering teams. Self-hosted becomes cost-effective at very high volumes (50M+ tokens/month) or when you already have GPU infrastructure.

Compliance comparison

DimensionCloud private endpointsProvider enterpriseSelf-hosted
Data residencyWithin your cloud tenantProvider-managed, contractual guaranteesYour infrastructure
Data retentionProvider-controlled, configurableZero-retention policies availableYou control entirely
Audit loggingCloud-native logging (CloudWatch, Azure Monitor)Provider dashboards + API logsFull control. Build your own
Access controlsCloud IAM integrationAPI key + provider admin consoleYour IAM stack
SOC 2 evidenceProvider SOC 2 report + your cloud controlsProvider SOC 2 reportYour responsibility entirely
Regulatory fitGood for most regimesDepends on data residency requirementsBest for strict sovereignty

How to choose

The decision comes down to three factors: data sensitivity, operational capacity, and model capability requirements.

Start with cloud-hosted private endpoints if you need commercial-grade model capability, your data fits within a major cloud provider's compliance boundary, and you want minimal operational overhead. This is the right choice for most enterprise engineering teams.

Choose provider-managed enterprise access if your compliance requirements are satisfied by contractual commitments (rather than infrastructure-level isolation), you want the simplest possible deployment, and you're primarily using one provider's models.

Go self-hosted only if you have strict data sovereignty requirements that cloud-hosted options can't satisfy, you have the GPU infrastructure and ML engineering capability to operate it, or your usage volume makes the economics compelling.

Hybrid patterns

In practice, most organisations end up with a hybrid approach. A common pattern we deploy:

  • Cloud-hosted private endpoint for the primary production workload (code review, enterprise chatbot, knowledge retrieval) where you need the best model quality

  • Self-hosted open-source model for high-volume, lower-complexity tasks (embeddings, classification, simple extraction) where cost matters more than peak capability

  • Provider enterprise tier for developer tooling (IDE agents, individual productivity) where deployment simplicity matters most

This hybrid approach optimises cost without compromising on capability or compliance for the workloads that matter most.

If you're evaluating private LLM deployment options and want help modelling the cost, compliance, and architecture trade-offs for your specific situation, book a diagnostic. We'll review your requirements and recommend the pattern that fits.

Ready to put this into practice?