Deploy any open LLM as your own API
Paste a HuggingFace GGUF link, pick a GPU, and get a dedicated OpenAI- and Anthropic-compatible endpoint. Flat monthly price. No DevOps.
Build your endpoint
Configure your deployment
Pick a trending model — or paste any HuggingFace GGUF link — then choose a GPU and deploy. We pre-select the cheapest card that fits and re-check it before you pay.
Pick a trending model
Top GGUF models on HuggingFace, updated daily.
Pick a GPU
One flat monthly price — no per-token or idle charges.
You're deploying
Llama 3.1 8B Instruct on RTX 4090 · $499/mo
From link to live endpoint in three steps
No Dockerfiles, no vLLM configs, no GPU shopping. Paste, pick, deploy.
Paste a GGUF link
Drop in a HuggingFace GGUF repo or a direct .gguf URL — or pick one of the trending models. Choose your quant: Q4_K_M, Q5_K_M or Q8_0.
We recommend a GPU
GGUF Cloud reads the model size and pre-selects the cheapest dedicated GPU that fits, showing the flat monthly price before you commit.
Get your endpoint + key
One click spins up a dedicated llama.cpp container. You get an OpenAI- and Anthropic-compatible endpoint and your ModelsLab API key.
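Step one accepts either a repo or a direct .gguf URL. As a rough illustration of how the two differ, the sketch below classifies a link and reads the quant tag from a filename. It is a client-side sketch only: `describe_gguf_link` is a hypothetical helper, the filename pattern is an assumption based on common HuggingFace naming, and none of this is GGUF Cloud's own parsing logic.

```python
import re
from urllib.parse import urlparse

# Quant tags named on this page; real GGUF repos often ship more variants.
SUPPORTED_QUANTS = ("Q4_K_M", "Q5_K_M", "Q8_0")


def describe_gguf_link(link: str) -> dict:
    """Classify a link as a repo or a direct .gguf file and detect its quant tag."""
    path = urlparse(link).path
    filename = path.rsplit("/", 1)[-1]
    if not filename.lower().endswith(".gguf"):
        # No .gguf filename: treat it as a repo, where the quant is chosen later.
        return {"kind": "repo", "quant": None}
    match = re.search("|".join(SUPPORTED_QUANTS), filename, flags=re.IGNORECASE)
    return {"kind": "file", "quant": match.group(0).upper() if match else None}
```

A repo link returns `kind` "repo" with no quant, so the quant still has to be picked by hand; a direct file link usually carries the quant in its name.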
Why GGUF Cloud
Built for the models everyone else turns away
GGUF & llama.cpp native
Bring your own quantization — Q4, Q6 or Q8. The managed LLM hosts force vLLM and safetensors and reject GGUF outright. We run it as a first-class primitive.
Flat monthly per GPU
One predictable price per dedicated GPU. No per-token billing and no idle meter ticking at zero requests, a common complaint about hourly and per-token hosts.
Dedicated single-tenant
Your model runs on its own GPU. No noisy neighbours, no shared-pool throttling, no preemptible boxes vanishing mid-run. Guaranteed throughput.
OpenAI and Anthropic
Every endpoint speaks both protocols natively. Point the OpenAI SDK, the Anthropic SDK, or Claude Code straight at your own model — no translation layer.
Automatic GPU sizing
We compute the VRAM your model needs from its parameters, quant and context, then recommend the cheapest card that fits. None of the providers in the comparison table on this page offers this.
Transparent pricing
Every GPU tier and its monthly price is on this page. No "contact sales" wall to find out what a dedicated endpoint actually costs.
Point any SDK at your own model
Your endpoint speaks both the OpenAI and Anthropic protocols natively — no shim, no rewrites. Swap the base URL and keep your existing code.
OpenAI SDK (Python)
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://modelslab.com/api/gguf/{id}/v1",  # replace {id} with your endpoint ID
    api_key=os.environ["MODELSLAB_API_KEY"],  # read the key from the environment
)

# Stream the reply and print each piece as it arrives.
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
```
Anthropic SDK (Python)
```python
import os

from anthropic import Anthropic

# Point Claude Code or the Anthropic SDK at your own model.
client = Anthropic(
    base_url="https://modelslab.com/api/gguf/{id}",  # replace {id} with your endpoint ID
    api_key=os.environ["MODELSLAB_API_KEY"],  # read the key from the environment
)

msg = client.messages.create(
    model="local",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain GGUF in one line."}],
)
print(msg.content[0].text)
```
cURL
```bash
curl https://modelslab.com/api/gguf/{id}/v1/chat/completions \
  -H "Authorization: Bearer $MODELSLAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
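The snippets above differ mainly in their base URL: OpenAI-style clients take the endpoint with a /v1 suffix, while the Anthropic SDK takes it without one. A minimal sketch that derives both from one endpoint ID follows. `endpoint_config` and its dictionary keys are hypothetical names used for illustration, not part of any ModelsLab SDK.

```python
import os

API_ROOT = "https://modelslab.com/api/gguf"


def endpoint_config(endpoint_id: str, api_key: str) -> dict:
    """Build the URLs and headers for one deployment (illustrative helper)."""
    root = f"{API_ROOT}/{endpoint_id}"
    return {
        "openai_base_url": f"{root}/v1",   # pass to OpenAI(base_url=...)
        "anthropic_base_url": root,        # pass to Anthropic(base_url=...)
        "chat_completions_url": f"{root}/v1/chat/completions",  # raw HTTP / cURL
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    }


if __name__ == "__main__":
    # "your-endpoint-id" is a placeholder for the ID shown after deployment.
    cfg = endpoint_config("your-endpoint-id", os.environ.get("MODELSLAB_API_KEY", ""))
    print(cfg["openai_base_url"])
    print(cfg["anthropic_base_url"])
```

Keeping both URLs in one place makes it harder to pass the /v1 form to the Anthropic client by mistake.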
Deploy a trending model
Llama, Qwen, Mistral, Mixtral, Phi, DeepSeek and more — pre-verified GGUF builds, or paste any HuggingFace repo of your own.
Llama 3.1 8B Instruct
8B · Q4_K_M · 4.92 GB
Fits RTX 4090 · $499/mo
Qwen2.5 7B Instruct
7B · Q4_K_M · 4.68 GB
Fits RTX 4090 · $499/mo
Qwen2.5 Coder 32B
32B · Q4_K_M · 19.9 GB
Fits RTX A6000 · $899/mo
Mistral Nemo 12B
12B · Q4_K_M · 7.48 GB
Fits RTX 4090 · $499/mo
Mixtral 8x7B Instruct
47B · Q4_K_M · 26.4 GB
Fits RTX A6000 · $899/mo
Phi-4 14B
14B · Q4_K_M · 9.05 GB
Fits RTX 4090 · $499/mo
DeepSeek-R1-Distill-Qwen 32B
32B · Q4_K_M · 19.9 GB
Fits RTX A6000 · $899/mo
Llama 3.3 70B Instruct
70B · Q4_K_M · 42.5 GB
Fits A100 80GB · $1,499/mo
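The "Fits" lines above can be approximated with a simple rule: add headroom to the GGUF file size for context and runtime buffers, then take the cheapest card whose memory covers it. The sketch below uses the three prices shown in this list and the standard memory sizes of those cards (24, 48 and 80 GB). The 25% headroom factor is an assumption, picked only because it reproduces the pairings listed here; it is not GGUF Cloud's actual sizing formula, which also accounts for context length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuTier:
    name: str
    vram_gb: float
    monthly_usd: int


# Only the tiers whose prices appear in the model list above.
TIERS = [
    GpuTier("RTX 4090", 24, 499),
    GpuTier("RTX A6000", 48, 899),
    GpuTier("A100 80GB", 80, 1499),
]

HEADROOM = 1.25  # assumed allowance for KV cache and buffers, not an official figure


def recommend_gpu(gguf_size_gb: float, tiers=TIERS, headroom: float = HEADROOM) -> Optional[GpuTier]:
    """Return the cheapest tier whose VRAM covers the model plus headroom, or None."""
    required_gb = gguf_size_gb * headroom
    candidates = [t for t in tiers if t.vram_gb >= required_gb]
    return min(candidates, key=lambda t: t.monthly_usd, default=None)


if __name__ == "__main__":
    for label, size_gb in [("Llama 3.1 8B Instruct", 4.92), ("Llama 3.3 70B Instruct", 42.5)]:
        tier = recommend_gpu(size_gb)
        print(label, "->", tier.name if tier else "no single GPU fits")
```

A longer context window needs more memory than this fixed factor allows for, so treat the result as a starting point.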
One flat price per dedicated GPU
Pick the card your model fits on. Billed monthly, cancel anytime — no per-token charges, no idle meter, no surprises.
- Flat monthly billing
- Cancel anytime
- No per-token charges
- Pause to stop compute billing
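To compare a flat monthly price with hourly or per-token billing, two small calculations help: the effective hourly rate of a tier, and the monthly token volume at which a per-token plan would cost the same. The sketch below assumes an average month of 730 hours. The per-token price passed in the usage example is a placeholder, not a quote from any provider.

```python
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12


def effective_hourly_rate(monthly_usd: float) -> float:
    """Flat monthly price expressed per hour of an always-on endpoint."""
    return monthly_usd / HOURS_PER_MONTH


def breakeven_tokens_per_month(monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which per-token billing equals the flat price."""
    return monthly_usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # $499/mo is the RTX 4090 price shown on this page.
    print(f"${effective_hourly_rate(499):.2f} per hour")
    # 0.50 USD per million tokens is a hypothetical comparison price.
    print(f"{breakeven_tokens_per_month(499, 0.50):,.0f} tokens per month")
```

Above the break-even volume the flat price is cheaper; below it, per-token billing costs less.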
Why teams pick GGUF Cloud
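Token-billed hosts such as Together and Fireworks reject GGUF. Dedicated hosts such as HF Endpoints bill hourly and charge for idle time. GGUF Cloud combines native GGUF support with one flat monthly price per dedicated GPU.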
| Capability | GGUF Cloud | HF Endpoints | Together / Fireworks | Featherless |
|---|---|---|---|---|
| Runs GGUF / llama.cpp | Yes | Yes | Rejects GGUF | FP8 only |
| Bring your own quant | Q4 / Q6 / Q8 | Limited | No | No |
| Dedicated single-tenant | Yes | Yes | Shared | Shared |
| Flat monthly price | Per GPU | Hourly | Per token | Yes |
| No idle / per-token charges | Yes | Idle billed | Per token | Yes |
| OpenAI + Anthropic API | Both | OpenAI only | Varies | OpenAI only |
| Automatic GPU recommendation | Yes | No | No | No |
Comparison reflects each provider's standard offering as of June 2026. Based on publicly documented pricing and engine support.
GGUF Cloud FAQ
What is GGUF Cloud?
GGUF Cloud is a managed hosting product that turns any open large language model into your own dedicated API. You paste a HuggingFace GGUF link or pick a trending model, GGUF Cloud recommends a GPU that fits, and one click spins up a dedicated llama.cpp (llama-server) container. You get an OpenAI- and Anthropic-compatible endpoint plus your API key, for a flat monthly price per GPU.
How do I deploy a GGUF model?
Sign up for ModelsLab, open GGUF Cloud, and either paste a HuggingFace GGUF repo (for example bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) or select a trending model. Choose a quantization, accept the recommended GPU, and subscribe. Within a few minutes you receive a dedicated endpoint at https://modelslab.com/api/gguf/{id} that you authenticate with your ModelsLab API key.
Are endpoints compatible with the OpenAI and Anthropic APIs?
Yes. Every GGUF Cloud endpoint exposes both the OpenAI API (/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models) and the Anthropic Messages API (/v1/messages) natively through llama-server. Point the OpenAI SDK at .../api/gguf/{id}/v1, point the Anthropic SDK or Claude Code at .../api/gguf/{id}, and use your ModelsLab API key as the key.
How does pricing work?
You pay one flat monthly price per dedicated GPU, with no per-token charges and no idle meter. Tiers run from an RTX 3090 (24 GB) for smaller models up to an H100 (80 GB) for 70B-class models. The exact monthly price for every GPU is listed in the pricing section of this page.
Which models and quantizations are supported?
Any GGUF model that fits on a single GPU works: Llama, Qwen, Mistral, Mixtral, Gemma, Phi, DeepSeek distills and more, in Q4_K_M, Q5_K_M or Q8_0. GGUF Cloud reads the model size and recommends the cheapest GPU that fits, from RTX 3090 / 4090 / 5090 and RTX A6000 up to A100 and H100 80 GB.
Why use a dedicated GPU instead of a shared pool?
Shared pools throttle under load, cap quality at FP8, and can drop requests mid-run. A dedicated single-tenant GPU gives you guaranteed throughput, your choice of GGUF quant, and no noisy neighbours, which matters for agents, coding assistants, roleplay and any latency-sensitive workload.
Can I pause, change or cancel a deployment?
Yes. You can pause an endpoint to stop compute billing while keeping the weights saved (it resumes warm in a minute or two), swap to a different GGUF or quant on the same GPU, resize the GPU tier, or delete the deployment and cancel the subscription at any time.
Deploy your first model in minutes
Bring your own GGUF or pick a trending one. Get a dedicated, single-tenant endpoint that speaks OpenAI and Anthropic — for a price you can predict.