GPU Compute

Stackpad lets you run AI and ML workloads on European GPU infrastructure. Add a GPU service from the service catalog, pick a model, and it’s running — no CUDA setup, no drivers, no DevOps.

Available GPU templates

Template	Use case	Example models
vLLM Inference	Run any HuggingFace language model	Llama, Mistral, Qwen, Phi
Text Embeddings	Generate vector embeddings for RAG and search	BGE, E5, GTE
Whisper	Speech-to-text transcription	OpenAI Whisper

Adding a GPU service

1. Open your project and click Add Service.
2. Search for the GPU template you want (e.g. “vLLM”).
3. Configure the model and GPU type.
4. Click Add.

Stackpad deploys the GPU service and automatically wires it to your other services. Your web app gets the inference URL and API key injected as environment variables.
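
Because the connection details arrive as environment variables, it can help to check them once at startup instead of failing on the first request. A minimal sketch, assuming the variable names INFERENCE_URL and INFERENCE_API_KEY used in the examples on this page; loadInferenceConfig is a hypothetical helper, not part of Stackpad:

```javascript
// Hypothetical helper: reads the injected connection details and fails fast
// if either one is missing. The variable names are assumed from the examples
// on this page; adjust them if your service uses different ones.
function loadInferenceConfig(env = process.env) {
  const missing = ['INFERENCE_URL', 'INFERENCE_API_KEY'].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    // Strip trailing slashes so that baseURL + '/v1/...' never contains '//'.
    baseURL: env.INFERENCE_URL.replace(/\/+$/, ''),
    apiKey: env.INFERENCE_API_KEY,
  };
}
```

Calling loadInferenceConfig() with no arguments reads from process.env; passing a plain object instead makes the helper easy to exercise in tests.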

Connecting from your application

When you add a GPU service, Stackpad auto-injects its connection details into your web services as environment variables. Read them from process.env to call the endpoint:

// Call your vLLM inference endpoint
const response = await fetch(process.env.INFERENCE_URL + '/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'your-model-name',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});

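// fetch does not reject on HTTP errors, so check the status explicitly
if (!response.ok) {
  throw new Error(`Inference request failed with status ${response.status}`);
}

// OpenAI-compatible responses put the reply text in choices[0].message.content
const data = await response.json();
console.log(data.choices[0].message.content);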

The API is OpenAI-compatible, so you can use the OpenAI SDK:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.INFERENCE_URL + '/v1',
  apiKey: process.env.INFERENCE_API_KEY,
});

const completion = await client.chat.completions.create({
  model: 'your-model-name',
  messages: [{ role: 'user', content: 'Hello!' }],
});
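
For chat interfaces it is often preferable to show tokens as they are generated. The sketch below uses the OpenAI SDK's stream option; it assumes the deployed server honors streaming requests (vLLM's OpenAI-compatible server does), and streamReply is a hypothetical helper name. The client is passed in as a parameter so the function does not depend on how it was constructed:

```javascript
// Hypothetical helper: streams a chat completion and returns the full text.
// `client` is an OpenAI SDK client; `onToken` is called for each text fragment.
async function streamReply(client, model, prompt, onToken) {
  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true, // the SDK then returns an async iterable of chunks
  });

  let text = '';
  for await (const chunk of stream) {
    // Some chunks carry no text (for example a role-only or final chunk).
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) {
      text += token;
      onToken(token);
    }
  }
  return text;
}
```

With the client from the example above, streamReply(client, 'your-model-name', 'Hello!', (t) => process.stdout.write(t)) would print the reply piece by piece.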

Text embeddings

const response = await fetch(process.env.EMBEDDINGS_URL + '/embed', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    inputs: 'Text to embed',
  }),
});

const embeddings = await response.json();
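
For search and RAG, embedding vectors are usually compared by cosine similarity. A minimal sketch, assuming the endpoint returns one array of numbers per input (the shape Hugging Face's Text Embeddings Inference uses for /embed); check your template's actual response before relying on this. cosineSimilarity and rankBySimilarity are illustrative helpers:

```javascript
// Cosine similarity between two equal-length vectors:
// 1 means same direction, 0 means orthogonal.
// A zero vector has no direction, so the result is NaN in that case.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored documents ({ id, vector }) against a query vector, best match first.
function rankBySimilarity(queryVector, documents) {
  return documents
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

A linear scan like this is fine for a few thousand documents; larger collections are normally served from a vector index instead.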

Whisper (speech-to-text)

// audioFile is a File or Blob, for example from an <input type="file"> element
const formData = new FormData();
formData.append('file', audioFile);

const response = await fetch(process.env.WHISPER_URL + '/v1/audio/transcriptions', {
  method: 'POST',
  body: formData,
});

const transcription = await response.json();
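
On the server there is no file input, so the audio has to be wrapped by hand. A minimal sketch, assuming Node 18 or later, where FormData and Blob are globals; buildTranscriptionForm is a hypothetical helper, and the default MIME type is only an example:

```javascript
// Hypothetical helper: wraps raw audio bytes (Buffer or Uint8Array) in
// multipart form data under the 'file' field used in the example above.
function buildTranscriptionForm(audioBytes, filename, mimeType = 'audio/wav') {
  const form = new FormData();
  // The third argument sets the filename sent in the multipart body.
  form.append('file', new Blob([audioBytes], { type: mimeType }), filename);
  return form;
}
```

In Node, the bytes could come from readFile in node:fs/promises, and the returned form is passed as the body of the fetch call. Leave the Content-Type header unset: fetch adds the multipart boundary itself.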

Pricing

GPU services are billed by the hour, and you pay only while the GPU is running. The hourly cost is displayed upfront in the service catalog when you select a GPU type.

There are no idle charges when your GPU service is paused.
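
Since billing follows running time, a rough estimate is the hourly rate multiplied by the hours the service was not paused. A small worked sketch: the rate of 2.50 per hour is a made-up figure, not a Stackpad price, and because this page does not say how partial hours are rounded, the calculation assumes no rounding:

```javascript
// Estimate cost from the periods during which the GPU service was running.
// Each period is { start, end } as ISO 8601 strings; paused time is not listed.
function estimateGpuCost(hourlyRate, runningPeriods) {
  const msPerHour = 60 * 60 * 1000;
  const hours = runningPeriods.reduce(
    (total, { start, end }) => total + (new Date(end) - new Date(start)) / msPerHour,
    0,
  );
  return { hours, cost: hours * hourlyRate };
}

// Two sessions with a pause in between: 3 h + 1.5 h = 4.5 h of running time.
const estimate = estimateGpuCost(2.5, [
  { start: '2025-01-06T09:00:00Z', end: '2025-01-06T12:00:00Z' },
  { start: '2025-01-06T14:00:00Z', end: '2025-01-06T15:30:00Z' },
]);
console.log(estimate); // { hours: 4.5, cost: 11.25 }
```

The two hours between the sessions do not appear in the total, which is the effect of pausing.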

Managing GPU services

Pause and resume

GPU services can be paused to stop billing when not in use:

Pause — stops the GPU container, stops billing
Resume — restarts the GPU container, billing resumes

Manage this from the service detail page.

Live progress

GPU deployments show live progress logs in the dashboard. Large model images can take several minutes to pull, and the progress indicator shows which step the deployment has reached.
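
While a deployment is still pulling its image, requests from your application cannot succeed yet. One way to handle this in application code is to poll until the service answers. A hedged sketch: waitUntilReady is a hypothetical helper, and the choice of URL to poll is an assumption (an OpenAI-compatible server normally answers GET /v1/models, but this page does not confirm a health endpoint):

```javascript
// Hypothetical helper: polls a URL until it returns a 2xx status.
// Resolves with the number of attempts used, or rejects after `attempts` tries.
async function waitUntilReady(url, options = {}) {
  const { attempts = 30, delayMs = 10000, headers = {}, fetchFn = fetch } = options;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await fetchFn(url, { headers });
      if (res.ok) return attempt;
    } catch {
      // Network errors are expected while the container is still starting.
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Service not ready after ${attempts} attempts`);
}
```

The defaults wait up to roughly five minutes in total. The fetchFn option exists only so the helper can be tested without a network.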

Infrastructure

All GPU workloads run on European infrastructure via Verda serverless containers. Your models and data stay in the EU.

What’s next?

Connecting services — how auto-injected variables work
Environment variables — manage GPU service configuration
Templates — browse all available templates