Guide · Jun 24, 2026 · 12 min read

Deploy DeepSeek-V4 in your own VPC

Frontier open-weight models are good enough to run your most sensitive workloads — but only if the inference happens where your data already lives. Here's a reference architecture for deploying DeepSeek-V4 on dedicated GPUs inside your own cloud account, so prompts never cross your security boundary.

Marcus Lindqvist
Solutions Engineer

Why in-VPC inference, not an API call

When you call a typical hosted inference API, your prompts leave your network, traverse the public internet, and land in a multi-tenant system you don't control. For most teams that's fine. For regulated workloads such as PHI, financial records, or privileged communications, it's a non-starter: the data leaves your compliance boundary the moment it hits the wire.

In-VPC deployment inverts the model: instead of sending data to the inference, you bring the inference to the data. The GPUs run inside your AWS, GCP or Azure account. Your application talks to them over private networking. Nothing sensitive ever leaves.

The reference architecture

A production HyperInfer VPC deployment has four moving parts, all inside your account:

A private gateway — terminates TLS 1.3, authenticates requests, and enforces rate limits. Reachable only over PrivateLink or VPC peering.
Dedicated GPU nodes — single-tenant B200 / B100 accelerators that load the model weights and run inference. Never shared with another tenant.
A control plane connection — an outbound-only link to HyperInfer for orchestration, autoscaling and updates. It carries metadata only — never prompts or completions.
Your KMS keys — any optional at-rest artifacts are encrypted with keys you own and can revoke.

What crosses the boundary, and what never does

This is the whole point, so it's worth stating precisely. Request and response content stays entirely within your VPC, in GPU memory, for the lifetime of a single request. What leaves your account is limited to operational metadata: timestamps, token counts for billing, model IDs and error classes. Prompt text, completions and embeddings are never transmitted out and never written to disk.
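
One way to picture that boundary is as an allowlist applied before anything is sent over the control plane link. The sketch below is illustrative only: the field names and the `to_control_plane_record` helper are hypothetical, since the article names the categories of metadata but not a schema.

```python
# Hypothetical field names covering the categories named above:
# timestamps, token counts for billing, model IDs and error classes.
ALLOWED_FIELDS = {
    "timestamp",
    "prompt_tokens",
    "completion_tokens",
    "model_id",
    "error_class",
}


def to_control_plane_record(event: dict) -> dict:
    """Keep only allowlisted operational metadata; drop every other key."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


# An in-VPC request event (illustrative values).
event = {
    "timestamp": "2026-06-24T12:00:00Z",
    "model_id": "deepseek-v4-flash",
    "prompt_tokens": 412,
    "completion_tokens": 96,
    "error_class": None,
    "prompt": "Summarize this record.",      # content: stays in the VPC
    "completion": "The record shows ...",    # content: stays in the VPC
}

record = to_control_plane_record(event)
```

An allowlist fails closed: a field added to the event later is dropped unless someone explicitly approves it, which is the property you want at a compliance boundary.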

Step 1 — Provision capacity

Your solutions engineer sizes the deployment with you based on model, concurrency and latency targets. DeepSeek-V4 Flash on a single 8×B200 node comfortably serves production traffic at 300+ tokens/sec; the heavier V4 Pro reasoning model is provisioned per your throughput needs.
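
For a rough first pass before that conversation, the arithmetic can be sketched as below. This assumes the 300 tokens/sec figure is aggregate throughput per 8×B200 node, which the article does not state explicitly. The `nodes_needed` helper and its 30% headroom default are hypothetical; real sizing also depends on concurrency and latency targets.

```python
import math


def nodes_needed(
    target_tokens_per_sec: float,
    per_node_tokens_per_sec: float = 300.0,  # assumption: aggregate per-node rate
    headroom: float = 0.3,                   # assumption: keep 30% capacity spare
) -> int:
    """Estimate node count for a target aggregate throughput."""
    if target_tokens_per_sec < 0:
        raise ValueError("target_tokens_per_sec must be non-negative")
    if per_node_tokens_per_sec <= 0:
        raise ValueError("per_node_tokens_per_sec must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = per_node_tokens_per_sec * (1 - headroom)
    # Always provision at least one node, even for tiny targets.
    return max(1, math.ceil(target_tokens_per_sec / usable))


estimate = nodes_needed(1000)
```

Treat the result as a starting point for discussion, not a quote: burst patterns and context lengths can move the real number in either direction.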

Step 2 — Connect networking

Establish private connectivity so your apps reach the gateway without touching the public internet:

terraform

resource "aws_vpc_endpoint" "hyperinfer" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.hyperinfer.inference"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  private_dns_enabled = true
}
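
Once the endpoint is up, a quick sanity check is to confirm that the gateway hostname resolves only to private addresses from inside your VPC. The sketch below uses only the Python standard library. The `resolve` and `all_private` helpers are illustrative names, and the hostname in the usage comment is the placeholder used in Step 3.

```python
import ipaddress
import socket


def resolve(hostname: str, port: int = 443) -> list[str]:
    """Return the distinct IP addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: list[str]) -> bool:
    """True only if there is at least one address and every one is private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


# Usage, run from a host inside the VPC:
#   addrs = resolve("inference.your-vpc.internal")
#   assert all_private(addrs), f"public address in {addrs}"
```

An empty result counts as a failure here, so a DNS misconfiguration does not pass silently.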

Step 3 — Point your SDK at the private endpoint

Because the API is OpenAI-compatible, your application code barely changes — just the base URL and the model name:

python

import os

from openai import OpenAI

# Point the standard OpenAI client at the private gateway inside your VPC.
client = OpenAI(
    base_url="https://inference.your-vpc.internal/v1",
    api_key=os.environ["HYPERINFER_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this record."}],
)
print(resp.choices[0].message.content)

Step 4 — Verify zero retention

Trust, then verify. Every response carries headers you can assert against in your own integration tests: X-HyperInfer-Retention: none, plus the region the request ran in. Pair that with the contractual guarantee in your DPA, and your auditors have both a technical and a legal control to point to.
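
A minimal sketch of such an assertion follows. The `X-HyperInfer-Retention` header comes from the text above. The region header name `X-HyperInfer-Region` is an assumption, since the article does not name it, so confirm the real name with your solutions engineer. The `check_retention_headers` helper is hypothetical and works on any mapping of response headers.

```python
from typing import Mapping, Optional


def check_retention_headers(
    headers: Mapping[str, str],
    expected_region: Optional[str] = None,
    region_header: str = "X-HyperInfer-Region",  # assumed name, not from the docs
) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    problems = []

    retention = normalized.get("x-hyperinfer-retention")
    if retention is None:
        problems.append("missing X-HyperInfer-Retention header")
    elif retention.lower() != "none":
        problems.append(f"unexpected retention policy: {retention!r}")

    if expected_region is not None:
        region = normalized.get(region_header.lower())
        if region != expected_region:
            problems.append(
                f"expected region {expected_region!r}, got {region!r}"
            )
    return problems
```

With the OpenAI Python SDK, response headers should be reachable through the raw-response wrapper, for example `raw = client.chat.completions.with_raw_response.create(...)` followed by `check_retention_headers(raw.headers, expected_region="us-east-1")`, where the region value is only an example. Returning a list of problems rather than raising lets a test report every failed check at once.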

Time to production

Most VPC deployments are serving real traffic within a week: a few days to provision and connect, then load testing and a joint security review. Compared to the months a from-scratch GPU-serving stack would take — and the compliance risk of a hosted API — it's the fastest path to private inference your security team will actually approve.
