GPU Compute
Stackpad lets you run AI and ML workloads on European GPU infrastructure. Add a GPU service from the service catalog, pick a model, and it’s running — no CUDA setup, no drivers, no DevOps.
Available GPU templates
| Template | Use case | Example models |
|---|---|---|
| vLLM Inference | Run any Hugging Face language model | Llama, Mistral, Qwen, Phi |
| Text Embeddings | Generate vector embeddings for RAG and search | BGE, E5, GTE |
| Whisper | Speech-to-text transcription | OpenAI Whisper |
Adding a GPU service
- Open your project and click Add Service
- Search for the GPU template you want (e.g. “vLLM”)
- Configure the model and GPU type
- Click Add
Stackpad deploys the GPU service and automatically wires it to your other services. Your web app gets the inference URL and API key injected as environment variables.
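Because the connection details arrive as environment variables, it can help to fail fast at startup if one is missing. The sketch below assumes the variable names `INFERENCE_URL` and `INFERENCE_API_KEY` that the vLLM example in this guide uses. `readInferenceConfig` is a hypothetical helper, not part of Stackpad.

```javascript
// Hypothetical helper: reads the injected connection details and
// throws a descriptive error if any of them is missing.
// The variable names are assumed from the vLLM example in this guide;
// confirm the actual names in your own project.
function readInferenceConfig(env = process.env) {
  const required = ['INFERENCE_URL', 'INFERENCE_API_KEY'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    // Strip trailing slashes so paths like '/v1' can be appended safely.
    baseURL: env.INFERENCE_URL.replace(/\/+$/, ''),
    apiKey: env.INFERENCE_API_KEY,
  };
}
```

Calling `readInferenceConfig()` once during startup surfaces a missing variable immediately instead of at the first inference request.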
Connecting from your application
When you add a GPU service, Stackpad auto-injects connection details into your web services:
```javascript
// Call your vLLM inference endpoint
const response = await fetch(process.env.INFERENCE_URL + '/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'your-model-name',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});

const data = await response.json();
```
The API is OpenAI-compatible, so you can use the OpenAI SDK:
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.INFERENCE_URL + '/v1',
  apiKey: process.env.INFERENCE_API_KEY,
});

const completion = await client.chat.completions.create({
  model: 'your-model-name',
  messages: [{ role: 'user', content: 'Hello!' }],
});
```
Text embeddings
```javascript
const response = await fetch(process.env.EMBEDDINGS_URL + '/embed', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    inputs: 'Text to embed',
  }),
});

const embeddings = await response.json();
```
Whisper (speech-to-text)
```javascript
// audioFile must be a File or Blob holding the audio to transcribe
const formData = new FormData();
formData.append('file', audioFile);

const response = await fetch(process.env.WHISPER_URL + '/v1/audio/transcriptions', {
  method: 'POST',
  body: formData,
});

const transcription = await response.json();
```
Pricing
GPU services are billed pay-per-hour — you only pay while the GPU is running. The cost is displayed upfront in the service catalog when you select a GPU type.
There are no idle charges when your GPU service is paused.
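As a rough illustration of the billing model above, the cost is the hourly rate multiplied by the hours the service actually ran. The sketch assumes billing is proportional to running time with no rounding, which this page does not specify. The rate of 2.00 per hour is a made-up placeholder, and `estimateGpuCost` is a hypothetical helper.

```javascript
// Estimate GPU cost from the hours the service was actually running.
// hourlyRate is a placeholder: use the price shown in the service catalog.
function estimateGpuCost(hourlyRate, runningHours) {
  if (hourlyRate < 0 || runningHours < 0) {
    throw new RangeError('hourlyRate and runningHours must be non-negative');
  }
  // Paused time is left out because paused services are not billed.
  return hourlyRate * runningHours;
}

// Example: a service that runs 8 hours a day on 22 days of a month,
// at a made-up rate of 2.00 per hour.
const monthlyCost = estimateGpuCost(2.0, 8 * 22);
console.log(monthlyCost.toFixed(2)); // prints 352.00
```

Running the same service around the clock for a 30-day month would be 720 hours, so pausing outside working hours cuts the bill to roughly a quarter in this example.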
Managing GPU services
Pause and resume
GPU services can be paused to stop billing when not in use:
- Pause — stops the GPU container, stops billing
- Resume — restarts the GPU container, billing resumes
Manage this from the service detail page.
Live progress
GPU deployments show live progress logs in the dashboard. Large model images can take several minutes to pull, and the progress indicator shows which stage the deployment has reached.
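Since a deployment or resume can take several minutes, an application may want to wait for the endpoint before sending traffic. The following is a minimal sketch of polling with exponential backoff. `waitUntilReady` and `inferenceIsReady` are hypothetical names, and probing `/v1/models` is an assumption: it is a standard route on OpenAI-compatible servers, but this page does not confirm that it is exposed.

```javascript
// Hypothetical helper: repeatedly runs `check` until it resolves to true
// or the attempts run out, doubling the delay between attempts.
async function waitUntilReady(
  check,
  { attempts = 10, initialDelayMs = 1000, maxDelayMs = 30000 } = {}
) {
  let delay = initialDelayMs;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await check()) return attempt; // ready: report how many tries it took
    } catch {
      // Network errors are expected while the container is still starting.
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
  throw new Error(`Service not ready after ${attempts} attempts`);
}

// Assumed readiness check: request the model list and treat any
// successful response as "ready".
const inferenceIsReady = async () => {
  const res = await fetch(process.env.INFERENCE_URL + '/v1/models', {
    headers: { Authorization: `Bearer ${process.env.INFERENCE_API_KEY}` },
  });
  return res.ok;
};
```

A caller would run `await waitUntilReady(inferenceIsReady)` once before the first inference request. Passing the check as a function keeps the retry logic independent of any particular endpoint.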
Infrastructure
All GPU workloads run on European infrastructure via Verda serverless containers. Your models and data stay in the EU.
What’s next?
- Connecting services — how auto-injected variables work
- Environment variables — manage GPU service configuration
- Templates — browse all available templates