GGUF Cloud is a managed hosting product that turns any open large language model into your own dedicated API. You paste a HuggingFace GGUF link or pick a trending model, GGUF Cloud recommends a GPU that fits, and one click spins up a dedicated llama.cpp (llama-server) container. You get an OpenAI- and Anthropic-compatible endpoint plus your API key, for a flat monthly price per GPU.

How do I deploy a GGUF model as an API?

Sign up for ModelsLab, open GGUF Cloud, and either paste a HuggingFace GGUF repo (for example bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) or select a trending model. Choose a quantization, accept the recommended GPU, and subscribe. Within a few minutes you receive a dedicated endpoint at https://modelslab.com/api/gguf/{id} that you authenticate with your ModelsLab API key.

Is the endpoint OpenAI and Anthropic compatible?

Yes. Every GGUF Cloud endpoint exposes both the OpenAI API (/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models) and the Anthropic Messages API (/v1/messages) natively through llama-server. Point the OpenAI SDK at .../api/gguf/{id}/v1, point the Anthropic SDK or Claude Code at .../api/gguf/{id}, and use your ModelsLab API key as the key.
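
The two base URLs differ only in the trailing /v1, which is easy to get wrong. A minimal sketch of deriving both from one deployment id follows; the helper name `endpoint_urls` and the id `abc123` are hypothetical, and only the URL pattern comes from this page.

```python
API_ROOT = "https://modelslab.com/api/gguf"  # URL pattern quoted on this page


def endpoint_urls(deployment_id: str) -> dict[str, str]:
    """Return the base URL to hand to each SDK for one deployment."""
    if not deployment_id or "/" in deployment_id:
        raise ValueError("deployment_id must be a single non-empty path segment")
    root = f"{API_ROOT}/{deployment_id}"
    return {
        # The Anthropic SDK appends /v1/messages to its base URL itself.
        "anthropic_base_url": root,
        # The OpenAI SDK appends /chat/completions etc., so it needs the /v1 suffix.
        "openai_base_url": f"{root}/v1",
    }


urls = endpoint_urls("abc123")  # "abc123" is a made-up id for illustration
print(urls["openai_base_url"])
print(urls["anthropic_base_url"])
```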

How much does it cost to host an LLM on GGUF Cloud?

You pay one flat monthly price per dedicated GPU — there are no per-token charges and no idle meter. Tiers run from an RTX 3090 (24 GB) for smaller models up to an H100 (80 GB) for 70B-class models. The exact monthly price for every GPU is listed on the pricing section of this page.

Which models and GPUs can I use?

Any GGUF model that fits in a single GPU works: Llama, Qwen, Mistral, Mixtral, Gemma, Phi, DeepSeek distills and more, in Q4_K_M, Q5_K_M or Q8_0. GGUF Cloud reads the model size and recommends the cheapest GPU that fits, from RTX 3090 / 4090 / 5090 and RTX A6000 up to A100 80 GB, RTX 6000 PRO and H100 80 GB.

Why dedicated instead of a shared LLM pool?

Shared pools can throttle under load, often serve a single fixed precision such as FP8, and can drop requests mid-run. A dedicated single-tenant GPU gives you guaranteed throughput, your choice of GGUF quant, and no noisy neighbours, which matters for agents, coding assistants, roleplay and any latency-sensitive workload.

Can I pause or swap a deployment?

Yes. You can pause an endpoint to stop compute billing while keeping the weights saved (it resumes warm in a minute or two), swap to a different GGUF or quant on the same GPU, resize the GPU tier, or delete the deployment and cancel the subscription at any time.

GGUF Cloud

Deploy any open LLM as your own API

Name: GGUF Cloud
Brand: ModelsLab
Rating: 4.6 (7 reviews)

Paste a HuggingFace GGUF link, pick a GPU, and get a dedicated OpenAI- and Anthropic-compatible endpoint. Flat monthly price. No DevOps.

Build your endpoint

Configure your deployment

Pick a trending model — or paste any HuggingFace GGUF link — then choose a GPU and deploy. We pre-select the cheapest card that fits and re-check it before you pay.

Pick a trending model

Top GGUF models on HuggingFace, updated daily.

Paste any HuggingFace GGUF link

Pick a GPU

One flat monthly price — no per-token or idle charges.

You're deploying

Llama 3.1 8B Instruct on RTX 4090 · $499/mo

From link to live endpoint in three steps

No Dockerfiles, no vLLM configs, no GPU shopping. Paste, pick, deploy.

Paste a GGUF link

Drop in a HuggingFace GGUF repo or a direct .gguf URL — or pick one of the trending models. Choose your quant: Q4_K_M, Q5_K_M or Q8_0.

We recommend a GPU

GGUF Cloud reads the model size and pre-selects the cheapest dedicated GPU that fits, showing the flat monthly price before you commit.

Get your endpoint + key

One click spins up a dedicated llama.cpp container. You get an OpenAI- and Anthropic-compatible endpoint and your ModelsLab API key.

Why GGUF Cloud

Built for the models everyone else turns away

GGUF & llama.cpp native

Bring your own quantization: Q4, Q6 or Q8. Most managed LLM hosts force vLLM and safetensors and either reject GGUF outright or limit your quant choice. We run it as a first-class primitive.

Flat monthly per GPU

One predictable price per dedicated GPU. No per-token billing and no idle meter ticking at zero requests, a common complaint about hourly and per-token hosts.

Dedicated single-tenant

Your model runs on its own GPU. No noisy neighbours, no shared-pool throttling, no preemptible boxes vanishing mid-run. Guaranteed throughput.

OpenAI and Anthropic

Every endpoint speaks both protocols natively. Point the OpenAI SDK, the Anthropic SDK, or Claude Code straight at your own model — no translation layer.

Automatic GPU sizing

We compute the VRAM your model needs from its parameters, quant and context, then recommend the cheapest card that fits. None of the hosts in the comparison table on this page does this.
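
As a rough illustration of "cheapest card that fits", the sketch below picks the lowest-priced GPU whose "Usable for weights" figure covers the model file plus an optional context headroom. The GPU figures and prices are copied from the pricing cards on this page; the selection rule, the `headroom_gb` parameter and the function name are assumptions, not GGUF Cloud's actual sizing logic, and the result will not necessarily match the card shown on each model tile.

```python
from typing import NamedTuple, Optional


class Gpu(NamedTuple):
    name: str
    usable_gb: float      # "Usable for weights" figure from the pricing cards
    usd_per_month: int    # flat monthly price from the pricing cards


GPUS = [
    Gpu("RTX 3090", 21, 249),
    Gpu("RTX 4090", 21, 499),
    Gpu("RTX 5090", 28, 749),
    Gpu("RTX A6000", 43, 899),
    Gpu("A100 80GB", 72, 1499),
    Gpu("RTX 6000 PRO", 86, 1999),
    Gpu("H100 80GB", 72, 2499),
]


def cheapest_fit(weights_gb: float, headroom_gb: float = 0.0) -> Optional[Gpu]:
    """Return the cheapest GPU that holds weights plus headroom, or None.

    headroom_gb is a hypothetical allowance for KV cache and context;
    the real service derives this from the model and context length.
    """
    needed = weights_gb + headroom_gb
    candidates = [gpu for gpu in GPUS if gpu.usable_gb >= needed]
    if not candidates:
        return None
    return min(candidates, key=lambda gpu: gpu.usd_per_month)


# Llama 3.3 70B at Q4_K_M is listed at 42.5 GB on this page.
print(cheapest_fit(42.5))
print(cheapest_fit(42.5, headroom_gb=8))
```

Varying `headroom_gb` shows how much the context allowance can move a model from one tier to the next.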

Transparent pricing

Every GPU tier and its monthly price is on this page. No "contact sales" wall to find out what a dedicated endpoint actually costs.

Point any SDK at your own model

Your endpoint speaks both the OpenAI and Anthropic protocols natively — no shim, no rewrites. Swap the base URL and keep your existing code.

OpenAI SDK (Python)

Python

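import os

from openai import OpenAI

# Replace {id} with your deployment id; the key is read from the environment.
client = OpenAI(
    base_url="https://modelslab.com/api/gguf/{id}/v1",
    api_key=os.environ["MODELSLAB_API_KEY"],
)

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
    stream=True,
)
# Print tokens as they arrive; some chunks carry no choices or no content.
for chunk in resp:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()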
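
If you consume the stream without the SDK, OpenAI-style streaming arrives as server-sent events: one `data:` line per JSON chunk, ending with `data: [DONE]`. The sketch below extracts the text deltas from such lines. It assumes the endpoint follows that convention; the sample lines are hand-written for illustration, not captured from a real deployment, and `iter_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "GGUF is "}}]}',
    'data: {"choices": [{"delta": {"content": "a model file format."}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))
```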

Anthropic SDK (Python)

Python

import os

from anthropic import Anthropic

# Point Claude Code or the Anthropic SDK at your own model.
# Replace {id} with your deployment id; the key is read from the environment.
client = Anthropic(
    base_url="https://modelslab.com/api/gguf/{id}",
    api_key=os.environ["MODELSLAB_API_KEY"],
)

msg = client.messages.create(
    model="local",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
)
# The reply is a list of content blocks; the first one holds the text.
print(msg.content[0].text)

cURL

bash

# Replace {id} with your deployment id.
curl "https://modelslab.com/api/gguf/{id}/v1/chat/completions" \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
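
The same request can be built with nothing but the Python standard library, which is handy where installing an SDK is not an option. This is a sketch that mirrors the cURL call above; `build_chat_request`, the id `abc123` and the key `test-key` are placeholders for illustration, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_chat_request(deployment_id: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for one deployment."""
    url = f"https://modelslab.com/api/gguf/{deployment_id}/v1/chat/completions"
    body = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("abc123", "test-key", "Hello!")  # placeholder id and key
print(req.get_method(), req.full_url)

# To actually send it against a live deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```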

Deploy a trending model

Llama, Qwen, Mistral, Mixtral, Phi, DeepSeek and more — pre-verified GGUF builds, or paste any HuggingFace repo of your own.

Llama 3.1 8B Instruct

8B · Q4_K_M · 4.92 GB

chat · popular

Fits RTX 4090 · $499/mo

Qwen2.5 7B Instruct

7B · Q4_K_M · 4.68 GB

chat

Fits RTX 4090 · $499/mo

Qwen2.5 Coder 32B

32B · Q4_K_M · 19.9 GB

code

Fits RTX A6000 · $899/mo

Mistral Nemo 12B

12B · Q4_K_M · 7.48 GB

chat

Fits RTX 4090 · $499/mo

Mixtral 8x7B Instruct

47B · Q4_K_M · 26.4 GB

chat · moe

Fits RTX A6000 · $899/mo

Phi-4 14B

14B · Q4_K_M · 9.05 GB

reasoning

Fits RTX 4090 · $499/mo

DeepSeek-R1-Distill-Qwen 32B

32B · Q4_K_M · 19.9 GB

reasoning

Fits RTX A6000 · $899/mo

Llama 3.3 70B Instruct

70B · Q4_K_M · 42.5 GB

chat · frontier

Fits A100 80GB · $1,499/mo
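
The sizes listed above give a feel for how quantization maps parameters to file size. The sketch below backs out an effective bits-per-weight figure from the nominal parameter count and file size on each card. It assumes GB means 10^9 bytes and treats the rounded parameter counts as exact, so the results are rough estimates and not properties of the quant format itself.

```python
# (nominal parameters in billions, listed file size in GB), taken from the cards above
MODELS = {
    "Llama 3.1 8B Instruct": (8, 4.92),
    "Qwen2.5 7B Instruct": (7, 4.68),
    "Qwen2.5 Coder 32B": (32, 19.9),
    "Mistral Nemo 12B": (12, 7.48),
    "Mixtral 8x7B Instruct": (47, 26.4),
    "Phi-4 14B": (14, 9.05),
    "DeepSeek-R1-Distill-Qwen 32B": (32, 19.9),
    "Llama 3.3 70B Instruct": (70, 42.5),
}


def bits_per_weight(params_billion: float, file_gb: float) -> float:
    """Effective bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return file_gb * 8 / params_billion


def estimate_file_gb(params_billion: float, bpw: float) -> float:
    """Inverse of bits_per_weight: rough file size for a given bit budget."""
    return params_billion * bpw / 8


for name, (params, size) in MODELS.items():
    print(f"{name}: ~{bits_per_weight(params, size):.2f} bits/weight")
```

Running the inverse, `estimate_file_gb`, with a figure from a similar model gives a quick first guess at whether an unlisted model will fit a given tier.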

One flat price per dedicated GPU

Pick the card your model fits on. Billed monthly, cancel anytime — no per-token charges, no idle meter, no surprises.

RTX 3090

Entry

Standard

$249/mo

VRAM: 24 GB
Usable for weights: ~21 GB

RTX 4090

Fast

$499/mo

VRAM: 24 GB
Usable for weights: ~21 GB

RTX 5090

Fastest (consumer)

$749/mo

VRAM: 32 GB
Usable for weights: ~28 GB

RTX A6000

Big VRAM

$899/mo

VRAM: 48 GB
Usable for weights: ~43 GB

A100 80GB

Datacenter

$1,499/mo

VRAM: 80 GB
Usable for weights: ~72 GB

RTX 6000 PRO

Workstation (Blackwell)

$1,999/mo

VRAM: 96 GB
Usable for weights: ~86 GB

H100 80GB

Datacenter (max)

$2,499/mo

VRAM: 80 GB
Usable for weights: ~72 GB

Flat monthly billing
Cancel anytime
No per-token charges
Pause to stop compute billing
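
To compare a flat monthly price with per-token billing, it helps to convert it into an effective cost per million tokens. The sketch below does that arithmetic. The throughput and utilization values are hypothetical inputs you would measure on your own deployment; only the $499/mo figure comes from the pricing cards above, and a 30-day month is assumed.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def usd_per_million_tokens(
    usd_per_month: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Effective cost per million generated tokens on a flat-priced GPU.

    tokens_per_second is the sustained throughput you measure yourself;
    utilization is the fraction of the month the GPU is actually busy.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * utilization * SECONDS_PER_MONTH
    return usd_per_month / tokens_per_month * 1_000_000


# RTX 4090 tier at $499/mo with a hypothetical 100 tokens/s, busy 25% of the time.
print(f"${usd_per_million_tokens(499, 100, utilization=0.25):.2f} per million tokens")
```

The lower the utilization, the higher the effective per-token cost, so the flat price pays off most for steady workloads.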

Why teams pick GGUF Cloud

Most managed LLM hosts reject GGUF or restrict your quant choice. The dedicated GPU hosts bill hourly and charge for idle time. GGUF Cloud combines GGUF support, a dedicated GPU and a flat monthly price.

Capability	GGUF Cloud	HF Endpoints	Together / Fireworks	Featherless
Runs GGUF / llama.cpp	Yes	Yes	Rejects GGUF	FP8 only
Bring your own quant	Q4 / Q6 / Q8	Limited	No	No
Dedicated single-tenant	Yes	Yes	Shared	Shared
Flat monthly price	Per GPU	Hourly	Per token	Yes
No idle / per-token charges	Yes	Idle billed	Per token	Yes
OpenAI + Anthropic API	Both	OpenAI only	Varies	OpenAI only
Automatic GPU recommendation	Yes	No	No	No