GGUF Cloud

Deploy any open LLM as your own API

Paste a HuggingFace GGUF link, pick a GPU, and get a dedicated OpenAI- and Anthropic-compatible endpoint. Flat monthly price. No DevOps.

Build your endpoint

Configure your deployment

Pick a trending model — or paste any HuggingFace GGUF link — then choose a GPU and deploy. We pre-select the cheapest card that fits and re-check it before you pay.

01

Pick a trending model

Top GGUF models on HuggingFace, updated daily.

or paste any HuggingFace GGUF link
02

Pick a GPU

One flat monthly price — no per-token or idle charges.

You're deploying

Llama 3.1 8B Instruct on RTX 4090 · $499/mo

From link to live endpoint in three steps

No Dockerfiles, no vLLM configs, no GPU shopping. Paste, pick, deploy.

01

Paste a GGUF link

Drop in a HuggingFace GGUF repo or a direct .gguf URL — or pick one of the trending models. Choose your quant: Q4_K_M, Q5_K_M or Q8_0.

02

We recommend a GPU

GGUF Cloud reads the model size and pre-selects the cheapest dedicated GPU that fits, showing the flat monthly price before you commit.

03

Get your endpoint + key

One click spins up a dedicated llama.cpp container. You get an OpenAI- and Anthropic-compatible endpoint and your ModelsLab API key.

Why GGUF Cloud

Built for the models everyone else turns away

GGUF & llama.cpp native

Bring your own quantization: Q4, Q6 or Q8. Most managed LLM hosts require vLLM and safetensors and reject GGUF outright. We run GGUF as a first-class format.

Flat monthly per GPU

One predictable price per dedicated GPU. No per-token billing, no idle meter ticking at zero requests — the single most-complained-about thing in the category.

Dedicated single-tenant

Your model runs on its own GPU. No noisy neighbours, no shared-pool throttling, no preemptible boxes vanishing mid-run. Guaranteed throughput.

OpenAI and Anthropic

Every endpoint speaks both protocols natively. Point the OpenAI SDK, the Anthropic SDK, or Claude Code straight at your own model — no translation layer.

Automatic GPU sizing

We compute the VRAM your model needs from its parameters, quant and context, then recommend the cheapest card that fits. Nobody else does this.
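As a rough illustration of the sizing described above, the sketch below adds a KV-cache term and a fixed overhead to the size of the GGUF file. It is not GGUF Cloud's actual formula. The function names and the 1 GB overhead figure are assumptions made for this example. The architecture values (32 layers, 8 KV heads, head dimension 128) are those of Llama 3.1 8B, and the 4.92 GB file size is the Q4_K_M figure listed on this page.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def estimate_vram_gb(weights_gb: float, n_layers: int, n_kv_heads: int,
                     head_dim: int, context_tokens: int,
                     overhead_gb: float = 1.0) -> float:
    """Weights + KV cache + a flat allowance for buffers.

    overhead_gb is a placeholder assumption, not a measured value.
    """
    kv_gb = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens) / 1e9
    return weights_gb + kv_gb + overhead_gb


# Llama 3.1 8B Instruct, Q4_K_M (4.92 GB on disk), 8,192-token context.
estimate = estimate_vram_gb(4.92, n_layers=32, n_kv_heads=8,
                            head_dim=128, context_tokens=8192)
print(f"Estimated VRAM: {estimate:.2f} GB")
```

The KV cache grows linearly with context length, so the same weights can need a larger card at long contexts.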

Transparent pricing

Every GPU tier and its monthly price is on this page. No "contact sales" wall to find out what a dedicated endpoint actually costs.

Point any SDK at your own model

Your endpoint speaks both the OpenAI and Anthropic protocols natively — no shim, no rewrites. Swap the base URL and keep your existing code.

OpenAI SDK (Python)

Python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://modelslab.com/api/gguf/{id}/v1",  # replace {id} with your deployment ID
    api_key=os.environ["MODELSLAB_API_KEY"],  # read the key from the environment
)

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
    stream=True,
)
for chunk in resp:
    # Skip chunks that carry no choices; print each text delta as it arrives.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")

Anthropic SDK (Python)

Python
import os

from anthropic import Anthropic

# Point Claude Code or the Anthropic SDK at your own model.
client = Anthropic(
    base_url="https://modelslab.com/api/gguf/{id}",  # replace {id} with your deployment ID
    api_key=os.environ["MODELSLAB_API_KEY"],  # read the key from the environment
)

msg = client.messages.create(
    model="local",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
)
print(msg.content[0].text)

cURL

bash
curl https://modelslab.com/api/gguf/{id}/v1/chat/completions \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
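
If you would rather not install an SDK, the same request can be made with the Python standard library alone. The sketch below mirrors the cURL call above. The function names and the `your-deployment-id` placeholder are assumptions for illustration, and the response parsing assumes the standard OpenAI chat-completions response shape.

```python
import json
import urllib.request

GGUF_ROOT = "https://modelslab.com/api/gguf"  # base path as shown on this page


def build_chat_request(deployment_id: str, api_key: str,
                       prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions route."""
    body = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GGUF_ROOT}/{deployment_id}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage (requires a live deployment and a real key):
#   import os
#   req = build_chat_request("your-deployment-id", os.environ["MODELSLAB_API_KEY"], "Hello!")
#   print(send(req))
```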

Deploy a trending model

Llama, Qwen, Mistral, Mixtral, Phi, DeepSeek and more — pre-verified GGUF builds, or paste any HuggingFace repo of your own.

Llama 3.1 8B Instruct

8B · Q4_K_M · 4.92 GB

chat · popular

Fits RTX 4090 · $499/mo

Qwen2.5 7B Instruct

7B · Q4_K_M · 4.68 GB

chat

Fits RTX 4090 · $499/mo

Qwen2.5 Coder 32B

32B · Q4_K_M · 19.9 GB

code

Fits RTX A6000 · $899/mo

Mistral Nemo 12B

12B · Q4_K_M · 7.48 GB

chat

Fits RTX 4090 · $499/mo

Mixtral 8x7B Instruct

47B · Q4_K_M · 26.4 GB

chat · moe

Fits RTX A6000 · $899/mo

Phi-4 14B

14B · Q4_K_M · 9.05 GB

reasoning

Fits RTX 4090 · $499/mo

DeepSeek-R1-Distill-Qwen 32B

32B · Q4_K_M · 19.9 GB

reasoning

Fits RTX A6000 · $899/mo

Llama 3.3 70B Instruct

70B · Q4_K_M · 42.5 GB

chat · frontier

Fits A100 80GB · $1,499/mo

One flat price per dedicated GPU

Pick the card your model fits on. Billed monthly, cancel anytime — no per-token charges, no idle meter, no surprises.

RTX 3090

Entry

Standard

$249/mo
VRAM
24 GB
Usable for weights
~21 GB
Deploy on RTX 3090

RTX 4090

Fast

$499/mo
VRAM
24 GB
Usable for weights
~21 GB
Deploy on RTX 4090

RTX 5090

Fastest (consumer)

$749/mo
VRAM
32 GB
Usable for weights
~28 GB
Deploy on RTX 5090

RTX A6000

Big VRAM

$899/mo
VRAM
48 GB
Usable for weights
~43 GB
Deploy on RTX A6000

A100 80GB

Datacenter

$1,499/mo
VRAM
80 GB
Usable for weights
~72 GB
Deploy on A100 80GB

RTX 6000 PRO

Workstation (Blackwell)

$1,999/mo
VRAM
96 GB
Usable for weights
~86 GB
Deploy on RTX 6000 PRO

H100 80GB

Datacenter (max)

$2,499/mo
VRAM
80 GB
Usable for weights
~72 GB
Deploy on H100 80GB
  • Flat monthly billing
  • Cancel anytime
  • No per-token charges
  • Pause to stop compute billing
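
To make the tier table easier to reason about, here is a minimal sketch that picks the cheapest tier whose "usable for weights" figure covers a given requirement. The prices and capacities are copied from the tiers above. The selection rule and the names `GpuTier` and `cheapest_fit` are assumptions for illustration. It does not reproduce the recommendations shown on the model cards, which appear to leave extra headroom or apply other criteria.

```python
from typing import NamedTuple, Optional, Sequence


class GpuTier(NamedTuple):
    name: str
    monthly_usd: int
    vram_gb: int
    usable_gb: int  # "Usable for weights" figure from the pricing cards


TIERS = [
    GpuTier("RTX 3090", 249, 24, 21),
    GpuTier("RTX 4090", 499, 24, 21),
    GpuTier("RTX 5090", 749, 32, 28),
    GpuTier("RTX A6000", 899, 48, 43),
    GpuTier("A100 80GB", 1499, 80, 72),
    GpuTier("RTX 6000 PRO", 1999, 96, 86),
    GpuTier("H100 80GB", 2499, 80, 72),
]


def cheapest_fit(required_gb: float,
                 tiers: Sequence[GpuTier] = TIERS) -> Optional[GpuTier]:
    """Return the lowest-priced tier with enough usable memory, or None."""
    candidates = [t for t in tiers if t.usable_gb >= required_gb]
    return min(candidates, key=lambda t: t.monthly_usd, default=None)


for size_gb in (4.92, 26.4, 80.0, 100.0):
    tier = cheapest_fit(size_gb)
    label = f"{tier.name} at ${tier.monthly_usd}/mo" if tier else "no single GPU fits"
    print(f"{size_gb:>6.2f} GB -> {label}")
```

Pass a requirement that already includes context and runtime overhead, not just the file size, if you want a recommendation with headroom.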

Why teams pick GGUF Cloud

Most managed LLM hosts reject GGUF. Dedicated GPU hosts bill hourly and charge for idle time. GGUF Cloud combines native GGUF support with a dedicated GPU at a flat monthly price.

Capability | GGUF Cloud | HF Endpoints | Together / Fireworks | Featherless
Runs GGUF / llama.cpp | Yes | Yes | Rejects GGUF | FP8 only
Bring your own quant | Q4 / Q6 / Q8 | Limited | No | No
Dedicated single-tenant | Yes | Yes | Shared | Shared
Flat monthly price | Per GPU | Hourly | Per token | Yes
No idle / per-token charges | Yes | Idle billed | Per token | Yes
OpenAI + Anthropic API | Both | OpenAI only | Varies | OpenAI only
Automatic GPU recommendation | Yes | No | No | No

Comparison reflects each provider's standard offering as of June 2026. Based on publicly documented pricing and engine support.

GGUF Cloud FAQ

What is GGUF Cloud?

GGUF Cloud is a managed hosting product that turns any open large language model into your own dedicated API. You paste a HuggingFace GGUF link or pick a trending model, GGUF Cloud recommends a GPU that fits, and one click spins up a dedicated llama.cpp (llama-server) container. You get an OpenAI- and Anthropic-compatible endpoint plus your API key, for a flat monthly price per GPU.

How do I deploy a model?

Sign up for ModelsLab, open GGUF Cloud, and either paste a HuggingFace GGUF repo (for example bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) or select a trending model. Choose a quantization, accept the recommended GPU, and subscribe. Within a few minutes you receive a dedicated endpoint at https://modelslab.com/api/gguf/{id} that you authenticate with your ModelsLab API key.

Is the endpoint compatible with the OpenAI and Anthropic APIs?

Yes. Every GGUF Cloud endpoint exposes both the OpenAI API (/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models) and the Anthropic Messages API (/v1/messages) natively through llama-server. Point the OpenAI SDK at .../api/gguf/{id}/v1, point the Anthropic SDK or Claude Code at .../api/gguf/{id}, and use your ModelsLab API key as the key.
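
The two base URLs differ because the SDKs build request paths differently: the OpenAI SDK appends route names such as /chat/completions to a base URL that already ends in /v1, while the Anthropic SDK appends /v1/messages itself. The sketch below spells out the resulting routes. The helper names and the `abc123` deployment ID are hypothetical and used only for illustration.

```python
GGUF_ROOT = "https://modelslab.com/api/gguf"  # base path as shown on this page


def openai_base_url(deployment_id: str) -> str:
    """Base URL to pass to the OpenAI SDK (includes the /v1 prefix)."""
    return f"{GGUF_ROOT}/{deployment_id}/v1"


def anthropic_base_url(deployment_id: str) -> str:
    """Base URL to pass to the Anthropic SDK (no /v1; the SDK adds it)."""
    return f"{GGUF_ROOT}/{deployment_id}"


def resolved_routes(deployment_id: str) -> dict:
    """Full URLs each SDK ends up calling for a few common operations."""
    return {
        "openai_chat": openai_base_url(deployment_id) + "/chat/completions",
        "openai_embeddings": openai_base_url(deployment_id) + "/embeddings",
        "openai_models": openai_base_url(deployment_id) + "/models",
        "anthropic_messages": anthropic_base_url(deployment_id) + "/v1/messages",
    }


# "abc123" is a made-up deployment ID.
for name, url in resolved_routes("abc123").items():
    print(f"{name}: {url}")
```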

How does pricing work?

You pay one flat monthly price per dedicated GPU, with no per-token charges and no idle meter. Tiers run from an RTX 3090 (24 GB) for smaller models up to an H100 (80 GB) for 70B-class models. The exact monthly price for every GPU is listed in the pricing section of this page.
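
To judge whether a flat price suits your workload, you can compute the monthly token volume at which it matches per-token billing. The sketch below does that arithmetic. The $0.20 per million tokens rate is a made-up figure for illustration, not any provider's actual price, and the function names are likewise assumptions. The $499 figure is the RTX 4090 tier from this page.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # a 30-day month


def break_even_tokens(flat_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which flat and per-token billing cost the same."""
    return flat_monthly_usd / usd_per_million_tokens * 1_000_000


def sustained_tokens_per_second(tokens_per_month: float) -> float:
    """Average throughput needed to reach a monthly token volume."""
    return tokens_per_month / SECONDS_PER_MONTH


# RTX 4090 tier ($499/mo) against a hypothetical $0.20 per million tokens.
tokens = break_even_tokens(499, 0.20)
print(f"Break-even volume: {tokens / 1e9:.2f} billion tokens per month")
print(f"Average rate needed: {sustained_tokens_per_second(tokens):.0f} tokens per second")
```

Below the break-even volume per-token billing is cheaper on cost alone; above it the flat price wins. The comparison ignores the other properties of a dedicated GPU, such as isolation and choice of quant.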

Which models are supported?

Any GGUF model that fits on a single GPU works: Llama, Qwen, Mistral, Mixtral, Gemma, Phi, DeepSeek distills and more, in Q4_K_M, Q5_K_M or Q8_0. GGUF Cloud reads the model size and recommends the cheapest GPU that fits, from RTX 3090 / 4090 / 5090 and RTX A6000 up to A100 and H100 80 GB.

Why use a dedicated GPU instead of a shared pool?

Shared pools throttle under load, cap quality at FP8, and can drop requests mid-run. A dedicated single-tenant GPU gives you guaranteed throughput, your choice of GGUF quant, and no noisy neighbours, which matters for agents, coding assistants, roleplay and any latency-sensitive workload.

Can I pause, change or cancel a deployment?

Yes. You can pause an endpoint to stop compute billing while keeping the weights saved (it resumes warm in a minute or two), swap to a different GGUF or quant on the same GPU, resize the GPU tier, or delete the deployment and cancel the subscription at any time.

Deploy your first model in minutes

Bring your own GGUF or pick a trending one. Get a dedicated, single-tenant endpoint that speaks OpenAI and Anthropic — for a price you can predict.