May 20, 2026·15 min read

Claude Code with Replicate: Hosted Model Inference Done Right


Why Replicate without CLAUDE.md leaks money and produces flaky inference

Replicate is the hosted inference platform for open-source models. You pick a model from the catalogue, hit the API with inputs, and a prediction runs on Replicate's GPUs. The pricing model is per-second of compute time, which means a single misconfigured loop or unhandled webhook retry can rack up bills in minutes. The latency profile depends on whether the model is warm or cold: warm models start inferring within hundreds of milliseconds, cold models take 30 seconds to several minutes to boot.

Claude Code without explicit constraints generates Replicate code that ignores the cold start, polls the prediction status in a tight loop (consuming connection slots and API rate limit), retries failed predictions without checking the error class, and references model versions by name only (which means the model can change underneath you when the maintainer pushes an update). None of these surface in development because the dev model is usually warm, the polling looks fast on a single request, and the model version is whatever was current the day you wrote the code.

This guide covers the CLAUDE.md template that locks Claude Code into the correct Replicate usage patterns: pinned model versions, the run() vs predictions.create() decision, webhook-based completion handling instead of polling for any prediction over five seconds, file output retrieval that handles both URL and stream cases, and cold start mitigation via the wait parameter and pre-warming. For broader AI inference patterns, Claude Code with Hugging Face covers the open-source ecosystem more generally, and Claude Code with the Vercel AI SDK covers the abstraction layer for switching between providers.

The Replicate CLAUDE.md template

The CLAUDE.md at your project root needs to declare: the SDK version, the API token environment variable, the model versioning policy, the prediction lifecycle pattern, the webhook setup, the file output handling, and the hard rules that block the mistakes Claude makes most often.

# Replicate rules

## Stack
- replicate ^1.x (Node.js / TypeScript) or replicate ^0.30.x (Python)
- TypeScript 5.x strict
- REPLICATE_API_TOKEN in .env.local (never hardcode)
- REPLICATE_WEBHOOK_SECRET in .env.local (for webhook signing)

## Project structure
- src/lib/replicate.ts: Replicate client singleton
- src/lib/models.ts: model identifier + version constants
- src/app/api/predictions/: route handlers that create predictions
- src/app/api/webhooks/replicate/: webhook handler for prediction completion
- src/lib/storage.ts: output file persistence (S3 / R2)

## Model version pinning (NON-NEGOTIABLE)
ALWAYS pin model versions with the full version hash:
- BAD:  replicate.run('stability-ai/sdxl', { input })
- GOOD: replicate.run('stability-ai/sdxl:7762fd07cf82c948538e41f63f77d685e02b063e37e496e96eefd46c929f9bdc', { input })

The version hash locks the exact model weights, schema, and behaviour.
Without it, the maintainer pushing an update changes your production output silently.

## Prediction lifecycle (MANDATORY)
For predictions estimated to take <5 seconds: use replicate.run() (blocks until done)
For predictions estimated to take >5 seconds: use predictions.create() + webhook
NEVER poll predictions.get() in a tight loop; it burns API rate limit and connection slots without making the prediction finish sooner.

## Hard rules
- NEVER use a model reference without a pinned version hash
- NEVER poll predictions.get() more than once per 2 seconds
- NEVER ignore the canceled status: cancelled predictions still bill for elapsed time
- NEVER trust the prediction.output field without checking prediction.status === 'succeeded'
- NEVER process a webhook without verifying the signature
- NEVER expose REPLICATE_API_TOKEN to client-side JavaScript
- ALWAYS persist output files to your own storage: Replicate URLs are short-lived (about 1 hour)
- ALWAYS handle the failed and cancelled cases on every prediction
- ALWAYS log the prediction ID for billing reconciliation

The version pinning rule is the policy that prevents the most insidious class of production bug: silent model regression. When you reference stability-ai/sdxl without a version, your code runs against whatever version is currently the "latest" tag on the model page. The maintainer can push a new version that uses a different scheduler, different default parameters, or different output post-processing, and your application output changes without any deploy on your side. The fix is to always pin: read the version hash once when you adopt the model, store it as a constant, and only update the hash through an explicit PR with a tested change.

The polling rule prevents a quieter class of bug. Polling predictions.get() in a tight loop does not change what the inference itself costs (that is billed on the model's compute time on the GPU), but it does consume your account's API rate limit and your application's connection slots, and it never makes the prediction finish sooner. Replicate provides webhooks specifically to avoid this. Use them.
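
Where a webhook endpoint is not reachable (a local script, for example), the "no more than once per 2 seconds" rule can still be enforced in code. The following is a minimal sketch, not an SDK feature: pollDelayMs and pollUntilTerminal are hypothetical helpers, the backoff factor and cap are assumptions, and the prediction getter is injected so that no particular client call is assumed.

```typescript
// Hypothetical types and helpers; nothing here is part of the Replicate SDK.
type PredictionStatus = 'starting' | 'processing' | 'succeeded' | 'failed' | 'canceled';

interface PredictionSnapshot {
  id: string;
  status: PredictionStatus;
}

const TERMINAL = new Set<PredictionStatus>(['succeeded', 'failed', 'canceled']);

// Delay grows by 1.5x per attempt from minMs and is capped at maxMs.
function pollDelayMs(attempt: number, minMs = 2_000, maxMs = 10_000): number {
  return Math.min(maxMs, Math.round(minMs * 1.5 ** attempt));
}

async function pollUntilTerminal<T extends PredictionSnapshot>(
  getPrediction: () => Promise<T>,
  opts: { maxWaitMs: number; minMs?: number; maxMs?: number },
): Promise<T> {
  const deadline = Date.now() + opts.maxWaitMs;
  for (let attempt = 0; ; attempt++) {
    const prediction = await getPrediction();
    if (TERMINAL.has(prediction.status)) return prediction;

    const delay = pollDelayMs(attempt, opts.minMs, opts.maxMs);
    if (Date.now() + delay > deadline) {
      throw new Error(
        `Prediction ${prediction.id} still ${prediction.status} after ${opts.maxWaitMs}ms`,
      );
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

In a script, the getter would wrap something like replicate.predictions.get(id). The loop returns on failed and canceled as well as succeeded, so the caller still has to branch on the final status.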

Install and client setup

npm i replicate

Add the API token:

# .env.local
REPLICATE_API_TOKEN=r8_your_token_here
REPLICATE_WEBHOOK_SECRET=whsec_your_secret_here

Create the singleton client:

// src/lib/replicate.ts
import Replicate from 'replicate';

if (!process.env.REPLICATE_API_TOKEN) {
  throw new Error('REPLICATE_API_TOKEN is not defined');
}

export const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Define the model versions as constants:

// src/lib/models.ts
export const MODELS = {
  sdxl: 'stability-ai/sdxl:7762fd07cf82c948538e41f63f77d685e02b063e37e496e96eefd46c929f9bdc',
  flux: 'black-forest-labs/flux-1.1-pro:80a09d66bdbc9cdfe7c4eb6daa18527e07ad2b6e2e88ec5b3a35bf7e89e0fa11',
  whisper: 'openai/whisper:8099696689d249cf8b122d833c36ac3f75505c666a395ca40ef26f68e7d3d16e',
  llama: 'meta/llama-3.1-70b-instruct:7f0d1d40c5e54b7d2c10f5a3f5e7c3a8b1a8f9d4e1c2b3a4f5e6d7c8b9a0e1f2',
} as const;

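The constants give you a single audit point for every model in use. When a new version is released, you update the hash here and the change flows to every call site. Without these constants, model versions get scattered across handlers and become impossible to track. Treat the hashes shown here as illustrative: copy the current hash from each model's versions page on Replicate before relying on it.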
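
The pinning rule can also be checked at runtime, so an unpinned reference fails loudly instead of silently running the latest version. This is a sketch under one assumption: that version hashes are 64 lowercase hex characters, as in the constants above. parsePinnedRef is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper: validates "owner/name:<64-hex-hash>" and splits it.
interface PinnedModelRef {
  owner: string;
  name: string;
  version: string; // assumed to be a 64-character lowercase hex hash
}

const PINNED_REF = /^([\w.-]+)\/([\w.-]+):([0-9a-f]{64})$/;

function parsePinnedRef(ref: string): PinnedModelRef {
  const match = PINNED_REF.exec(ref);
  if (!match) {
    throw new Error(`Model reference is not pinned to a full version hash: ${ref}`);
  }
  const [, owner, name, version] = match;
  return { owner, name, version };
}
```

Running every entry of MODELS through a check like this at startup turns a forgotten hash into a boot failure, and the parsed version field can stand in for the split(':')[1] calls used later in this guide.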

Short predictions with run()

For predictions estimated to complete in under five seconds (small image generation, short audio transcription, fast LLM completions), the run() method blocks until the prediction completes and returns the output directly.

// src/app/api/generate-image/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { replicate } from '@/lib/replicate';
import { MODELS } from '@/lib/models';

// Must be lower than your platform's request timeout, or the platform kills the request first
const TIMEOUT_MS = 60_000;

export async function POST(req: NextRequest) {
  const { prompt } = await req.json();

  if (typeof prompt !== 'string' || prompt.length === 0) {
    return NextResponse.json({ error: 'Invalid prompt' }, { status: 400 });
  }

  try {
    const output = await Promise.race([
      replicate.run(MODELS.flux, {
        input: {
          prompt,
          aspect_ratio: '1:1',
          output_format: 'webp',
          safety_tolerance: 2,
        },
      }),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Inference timeout')), TIMEOUT_MS),
      ),
    ]);

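    // Output shape varies by model: a single item or an array of items.
    // With replicate ^1.x, file outputs are typically FileOutput objects exposing url()
    // rather than plain strings, so convert both cases to URL strings before serialising.
    const items: unknown[] = Array.isArray(output) ? output : [output];
    const urls = items.map((item) =>
      typeof item === 'string' ? item : String((item as { url(): URL }).url()),
    );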

    return NextResponse.json({ images: urls });
  } catch (e) {
    console.error('[Replicate] Generation error:', e);
    return NextResponse.json({ error: 'Generation failed' }, { status: 500 });
  }
}

Three details Claude misses by default.

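The Promise.race with a manual timeout wraps the SDK call because the Replicate SDK does not expose a simple timeout option on run(). Without the race, a cold model that takes longer than your platform's request timeout (often tens of seconds on serverless hosts, depending on plan and configuration) is killed by the platform with no graceful error, so TIMEOUT_MS has to sit below that limit. The race only stops waiting; it does not stop the prediction, which keeps running and billing. Recent versions of the JavaScript SDK document a signal option on run() that accepts an AbortSignal, so check whether your pinned version supports it.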
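
The inline race above leaves its timer running after a fast success and gives the underlying call no way to learn that the caller gave up. A reusable wrapper can address both. This is a sketch: withTimeout is a hypothetical helper, and whether the wrapped call honours the AbortSignal depends on the library being called.

```typescript
// Hypothetical helper: rejects after `ms`, aborts the signal it handed out,
// and always clears its timer so nothing lingers after the call settles.
async function withTimeout<T>(
  start: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the wrapped call to stop, if it listens
      reject(new Error(`Timed out after ${ms}ms`));
    }, ms);
  });

  try {
    return await Promise.race([start(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

Assuming your SDK version accepts a signal option, usage would look like withTimeout((signal) => replicate.run(MODELS.flux, { input, signal }), 25_000). If it does not, the signal is simply ignored and the wrapper behaves like the plain race.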

The Array.isArray(output) normalisation matters because Replicate model outputs vary in shape. Some return a single URL, some return an array of URLs, some return an object with multiple fields, and with replicate ^1.x file outputs typically arrive as FileOutput objects rather than plain strings. The SDK does not normalise this. Check the model's schema on Replicate's page and normalise to a consistent shape at the boundary.
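
One way to normalise at the boundary is a single function that flattens whatever shape arrives into a list of URL strings. This is a sketch for file-producing models only (a text model's array of tokens would be misread as URLs). FileLike and toOutputUrls are hypothetical names, and the url() method mirrors what replicate ^1.x file outputs are assumed to expose.

```typescript
// Hypothetical helper. FileLike is an assumption about the file-output shape;
// adjust it if your SDK version returns something different.
interface FileLike {
  url(): URL | string;
}

function isFileLike(value: unknown): value is FileLike {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { url?: unknown }).url === 'function'
  );
}

function toOutputUrls(output: unknown): string[] {
  if (output === null || output === undefined) return [];
  if (typeof output === 'string') return [output];
  if (isFileLike(output)) return [String(output.url())];
  // Arrays and plain objects are walked recursively, depth-first
  if (Array.isArray(output)) return output.flatMap((item) => toOutputUrls(item));
  if (typeof output === 'object') {
    return Object.values(output).flatMap((item) => toOutputUrls(item));
  }
  return []; // numbers and booleans are not file references
}
```

Route handlers can then return toOutputUrls(output) regardless of which model produced it, and the per-model quirks stay in one place.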

The fact that run() is correct here at all depends on the prediction being fast. For FLUX 1.1 Pro on a warm endpoint, it usually completes in 2 to 4 seconds. For SDXL with high resolution, it can take 10 to 20 seconds. For a cold model, add 30 seconds to either. The five-second threshold is approximate. The real question: can your platform tolerate the maximum latency? If not, use the webhook pattern.

Long predictions with webhooks

For predictions that exceed five seconds (high-resolution image generation, video synthesis, long-form audio transcription, batch LLM jobs), the predictions.create() method starts the prediction and returns immediately with a prediction ID. Replicate fires a webhook when the prediction completes.

// src/app/api/start-prediction/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { replicate } from '@/lib/replicate';
import { MODELS } from '@/lib/models';

export async function POST(req: NextRequest) {
  const { prompt, userId } = await req.json();

  const prediction = await replicate.predictions.create({
    version: MODELS.sdxl.split(':')[1],
    input: {
      prompt,
      width: 1024,
      height: 1024,
      num_outputs: 4,
    },
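    // Replace yourdomain.com with your deployed origin; encode the ID so it cannot break the query string
    webhook: `https://yourdomain.com/api/webhooks/replicate?user=${encodeURIComponent(userId)}`,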
    webhook_events_filter: ['completed'],
  });

  // Store the prediction ID in your DB, associated with the user
  await storePredictionRequest({
    predictionId: prediction.id,
    userId,
    model: MODELS.sdxl,
    status: 'pending',
  });

  return NextResponse.json({
    predictionId: prediction.id,
    status: prediction.status,
  });
}

async function storePredictionRequest(record: {
  predictionId: string;
  userId: string;
  model: string;
  status: string;
}) {
  // DB write here
}

The webhook handler:

// src/app/api/webhooks/replicate/route.ts
import { NextRequest, NextResponse } from 'next/server';
import crypto from 'node:crypto';

const WEBHOOK_SECRET = process.env.REPLICATE_WEBHOOK_SECRET;

interface PredictionWebhookPayload {
  id: string;
  status: 'starting' | 'processing' | 'succeeded' | 'failed' | 'canceled';
  output: unknown;
  error: string | null;
  metrics: { predict_time?: number };
  completed_at: string | null;
}

export async function POST(req: NextRequest) {
  if (!WEBHOOK_SECRET) {
    return NextResponse.json({ error: 'Webhook secret not configured' }, { status: 500 });
  }

  const body = await req.text();
  const webhookId = req.headers.get('webhook-id');
  const webhookTimestamp = req.headers.get('webhook-timestamp');
  const webhookSignature = req.headers.get('webhook-signature');

  if (!webhookId || !webhookTimestamp || !webhookSignature) {
    return NextResponse.json({ error: 'Missing webhook headers' }, { status: 400 });
  }

  // Replicate signs as: webhook-id.webhook-timestamp.body
  const signedContent = `${webhookId}.${webhookTimestamp}.${body}`;
  const secretBytes = Buffer.from(WEBHOOK_SECRET.split('_')[1], 'base64');

  const expectedSignature = crypto
    .createHmac('sha256', secretBytes)
    .update(signedContent)
    .digest('base64');

  // webhookSignature header format: "v1,base64sig v1,base64sig" (space-separated)
  const expected = Buffer.from(expectedSignature);
  const isValid = webhookSignature.split(' ').some((entry) => {
    const candidate = Buffer.from(entry.split(',')[1] ?? '');
    // timingSafeEqual throws if the lengths differ, so check the length first
    return candidate.length === expected.length && crypto.timingSafeEqual(candidate, expected);
  });

  if (!isValid) {
    return NextResponse.json({ error: 'Invalid signature' }, { status: 401 });
  }

  const payload: PredictionWebhookPayload = JSON.parse(body);
  const userId = req.nextUrl.searchParams.get('user');

  switch (payload.status) {
    case 'succeeded':
      await handleSuccess(payload, userId);
      break;
    case 'failed':
      console.error('[Replicate webhook] Prediction failed:', payload.id, payload.error);
      await handleFailure(payload, userId);
      break;
    case 'canceled':
      console.warn('[Replicate webhook] Prediction cancelled:', payload.id);
      await handleCancellation(payload, userId);
      break;
    default:
      // 'starting' or 'processing' if you subscribed to those events
      break;
  }

  return NextResponse.json({ ok: true });
}

async function handleSuccess(payload: PredictionWebhookPayload, userId: string | null) {
  // Persist output to your own storage (S3, R2)
  // Update DB record to succeeded
  // Notify user (websocket, email, push notification)
}

async function handleFailure(payload: PredictionWebhookPayload, userId: string | null) {
  // Update DB record to failed
  // Surface error to user
}

async function handleCancellation(payload: PredictionWebhookPayload, userId: string | null) {
  // Update DB record to cancelled
}

The signature verification is the part Claude most often gets wrong. Replicate uses the Standard Webhooks signature format, which signs the concatenation of webhook-id.webhook-timestamp.body with the webhook secret. The header can carry several space-separated signatures (for key rotation), each written as a version prefix, a comma, and the base64 signature. The check must use constant-time comparison to prevent timing attacks.

The webhook secret is base64-encoded after the whsec_ prefix. Most implementations forget to decode it, which results in a signature mismatch and a 401 on every webhook. The Buffer.from(WEBHOOK_SECRET.split('_')[1], 'base64') line handles this correctly.
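
The same verification can be pulled into a pure function, which makes it unit-testable and leaves room for a timestamp check that rejects replayed deliveries. This is a sketch: verifyWebhook is a hypothetical helper, and the five-minute tolerance is an assumption in the spirit of the Standard Webhooks format, not a documented Replicate requirement.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

interface WebhookHeaders {
  id: string; // webhook-id
  timestamp: string; // webhook-timestamp, Unix seconds
  signature: string; // webhook-signature: "v1,<base64> v1,<base64>"
}

// Hypothetical helper; toleranceSeconds is an assumed replay window.
function verifyWebhook(
  secret: string,
  headers: WebhookHeaders,
  body: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
  toleranceSeconds = 300,
): boolean {
  const timestamp = Number(headers.timestamp);
  if (!Number.isFinite(timestamp) || Math.abs(nowSeconds - timestamp) > toleranceSeconds) {
    return false; // stale or malformed timestamp: treat as a replay
  }

  // The key is the base64-decoded part of the secret after the whsec_ prefix
  const key = Buffer.from(secret.replace(/^whsec_/, ''), 'base64');
  const expected = Buffer.from(
    createHmac('sha256', key)
      .update(`${headers.id}.${headers.timestamp}.${body}`)
      .digest('base64'),
  );

  return headers.signature.split(' ').some((entry) => {
    const [version, value = ''] = entry.split(',');
    const candidate = Buffer.from(value);
    // Length check first: timingSafeEqual throws on unequal lengths
    return (
      version === 'v1' &&
      candidate.length === expected.length &&
      timingSafeEqual(candidate, expected)
    );
  });
}
```

The route handler then reduces to reading the three headers and the raw body, calling the function, and returning 401 when it yields false.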

The query string user ID pattern (?user=${userId}) is one way to correlate the webhook to the application user. The webhook payload itself does not include arbitrary metadata, so identifiers have to travel in the webhook URL or be looked up on your side. Because the signature covers only the ID, timestamp, and body, the query string is not authenticated. The more robust alternative is to store the prediction ID in your DB at create time and look up the user by prediction ID in the webhook handler.
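
The lookup-by-prediction-ID alternative can be sketched as follows. The Map is a stand-in for a database table, and recordCreated and resolveOwner are hypothetical names. The point is only that the webhook handler derives the user from data your server wrote, not from the request URL.

```typescript
// Hypothetical in-memory store; a real application would use its database.
interface PredictionRecord {
  predictionId: string;
  userId: string;
  status: 'pending' | 'succeeded' | 'failed' | 'canceled';
}

const predictionRecords = new Map<string, PredictionRecord>();

// Called right after predictions.create() returns an ID
function recordCreated(predictionId: string, userId: string): void {
  predictionRecords.set(predictionId, { predictionId, userId, status: 'pending' });
}

// Called from the webhook handler with payload.id.
// Returns null for a prediction this application never created.
function resolveOwner(predictionId: string): string | null {
  return predictionRecords.get(predictionId)?.userId ?? null;
}
```

A null result is a useful signal in itself: a correctly signed webhook for an unknown prediction usually means the create-time write failed or the webhook URL is shared across environments.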

Add a webhook section to CLAUDE.md:

## Replicate webhooks
- Endpoint: src/app/api/webhooks/replicate/route.ts
- ALWAYS verify Standard Webhooks signature (webhook-id.timestamp.body)
- ALWAYS decode REPLICATE_WEBHOOK_SECRET from base64 (after the whsec_ prefix)
- ALWAYS use timingSafeEqual for the comparison
- Subscribe to: completed (or completed + start for progress indicators)
- Pass user/job IDs via query string: webhook URL is your correlation key
- Return 200 for all valid signatures, even on unknown statuses
- Return 401 for invalid signatures
- Idempotent: Replicate retries on non-2xx responses
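
The idempotency rule above can be sketched as a small guard keyed on the webhook-id header. Two assumptions are baked in: a retried delivery carries the same webhook-id as the original, and the in-memory Set stands in for a unique-constrained table (a Set does not survive restarts or span multiple instances). processOnce is a hypothetical helper.

```typescript
// Hypothetical dedupe guard; replace the Set with a durable store in production.
const seenWebhookIds = new Set<string>();

async function processOnce(
  webhookId: string,
  handler: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (seenWebhookIds.has(webhookId)) return 'duplicate';

  seenWebhookIds.add(webhookId);
  try {
    await handler();
    return 'processed';
  } catch (err) {
    // Forget the ID so the provider's retry can run the handler again
    seenWebhookIds.delete(webhookId);
    throw err;
  }
}
```

Both outcomes should map to a 2xx response. Only a thrown error should surface as a non-2xx, which is what triggers the retry.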

Persisting output files

Replicate output URLs (image URLs, video URLs, audio URLs) are short-lived. They expire approximately one hour after the prediction completes. Production code MUST download and re-upload to your own storage immediately.

// src/lib/persist-output.ts
import { put } from '@vercel/blob';
// or: import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

export async function persistReplicateOutput(
  outputUrl: string,
  predictionId: string,
): Promise<string> {
  const response = await fetch(outputUrl, {
    signal: AbortSignal.timeout(60_000),
  });

  if (!response.ok) {
    throw new Error(`Failed to fetch Replicate output: ${response.status}`);
  }

  const contentType = response.headers.get('content-type') ?? 'application/octet-stream';
  const extension = guessExtension(contentType);
  const filename = `predictions/${predictionId}${extension}`;

  const blob = await response.blob();
  const { url } = await put(filename, blob, {
    access: 'public',
    contentType,
  });

  return url;
}

function guessExtension(contentType: string): string {
  if (contentType.includes('image/png')) return '.png';
  if (contentType.includes('image/jpeg')) return '.jpg';
  if (contentType.includes('image/webp')) return '.webp';
  if (contentType.includes('video/mp4')) return '.mp4';
  if (contentType.includes('audio/wav')) return '.wav';
  if (contentType.includes('audio/mpeg')) return '.mp3';
  return '';
}

Add an output persistence section to CLAUDE.md:

## Output persistence
- Replicate output URLs expire in ~1 hour: NEVER store them long-term
- ALWAYS download and re-upload to your own storage on webhook success
- Use Vercel Blob, Cloudflare R2, or S3 (whichever your stack already has)
- Persist by prediction ID to keep the audit trail traceable
- Set the correct content-type on upload (Replicate sends it via headers)
- For video/audio outputs over 100 MB, consider streaming the body instead of buffering
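
One gap worth noting: a prediction created with num_outputs: 4 yields four files, and keying storage on the prediction ID alone would make them overwrite one another. The sketch below gives each output its own ID suffix and reports partial failures instead of aborting on the first one. persistAll is a hypothetical helper, and persistOne stands in for a function with the same signature as persistReplicateOutput above.

```typescript
interface PersistSummary {
  stored: string[]; // durable URLs, in output order
  failed: number[]; // indexes of outputs that could not be persisted
}

// Hypothetical helper: persists every output under "<predictionId>-<index>".
async function persistAll(
  urls: string[],
  predictionId: string,
  persistOne: (url: string, id: string) => Promise<string>,
): Promise<PersistSummary> {
  const results = await Promise.allSettled(
    urls.map((url, index) => persistOne(url, `${predictionId}-${index}`)),
  );

  const summary: PersistSummary = { stored: [], failed: [] };
  results.forEach((result, index) => {
    if (result.status === 'fulfilled') summary.stored.push(result.value);
    else summary.failed.push(index);
  });
  return summary;
}
```

Because the source URLs expire, a non-empty failed list is worth retrying promptly, not leaving for a nightly job.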

For S3/R2 storage patterns, Claude Code with Cloudflare R2 covers the upload patterns.

Cancelling predictions

A prediction that is running can be cancelled via the API. Cancellation stops the inference and bills only for the elapsed time. This matters for user-facing applications where the user can abort: image generation that takes too long, video synthesis that the user no longer wants.

// src/app/api/cancel-prediction/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { replicate } from '@/lib/replicate';

export async function POST(req: NextRequest) {
  const { predictionId } = await req.json();

  if (typeof predictionId !== 'string') {
    return NextResponse.json({ error: 'Invalid prediction ID' }, { status: 400 });
  }

  // Verify the user owns this prediction first
  const owns = await userOwnsPrediction(predictionId);
  if (!owns) {
    return NextResponse.json({ error: 'Forbidden' }, { status: 403 });
  }

  try {
    await replicate.predictions.cancel(predictionId);
    return NextResponse.json({ ok: true });
  } catch (e) {
    console.error('[Replicate] Cancel failed:', e);
    return NextResponse.json({ error: 'Cancel failed' }, { status: 500 });
  }
}

async function userOwnsPrediction(predictionId: string): Promise<boolean> {
  // DB check
  return true;
}

The ownership check is critical. Without it, a malicious user can cancel another user's prediction by supplying its ID. Prediction IDs are not secret (the start-prediction route returns them to the browser, and they end up in logs), so the application must enforce ownership at the API boundary.
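
The stub above always returns true, so here is a sketch of the decision it should make. The record shape and decideCancel are hypothetical, and sessionUserId is assumed to come from the server-side session, never from the request body.

```typescript
// Hypothetical decision helper; only the logic is sketched, not the DB access.
type CancelDecision = 'cancel' | 'forbidden' | 'not_found' | 'already_finished';

interface OwnedPrediction {
  userId: string;
  status: 'pending' | 'succeeded' | 'failed' | 'canceled';
}

function decideCancel(
  record: OwnedPrediction | undefined,
  sessionUserId: string,
): CancelDecision {
  if (!record) return 'not_found';
  if (record.userId !== sessionUserId) return 'forbidden';
  // Nothing to cancel once the prediction has reached a terminal status
  if (record.status !== 'pending') return 'already_finished';
  return 'cancel';
}
```

Some teams prefer to return the same 404 for not_found and forbidden so the endpoint does not reveal which prediction IDs exist. That is a policy choice on top of this sketch.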

Cold start mitigation

Models in Replicate's public catalogue run on shared GPUs. When a model has not been called recently, its container scales to zero. The next call has to boot the container, load the model weights into VRAM, and warm up the inference path. This is the cold start. For popular models and for dedicated deployments, cold starts are rare. For less-used models in the standard catalogue, they happen frequently.

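The wait parameter in predictions.create() lets you choose whether the API call holds the connection open until the prediction completes (up to a timeout) or returns immediately:

const prediction = await replicate.predictions.create({
  version: MODELS.flux.split(':')[1],
  input: { /* ... */ },
  // Assumption: in replicate ^1.x, wait on predictions.create() is a boolean or a number of
  // seconds (the { mode, timeout } object form belongs to run()); confirm against your SDK version
  wait: 30, // hold the request open for up to 30s waiting for completion
});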

For cold-start-sensitive applications, the pattern is: warm the model with a small "prime" call when the user begins their flow, then make the real call once the user submits. The prime call costs a few seconds of compute (which is real money for high-traffic flows) but eliminates the cold start latency from the user's perception.

// On user starts flow
await replicate.predictions.create({
  version: MODELS.flux.split(':')[1],
  input: { prompt: 'warmup', width: 64, height: 64, num_outputs: 1 },
});

// User can take 10 to 30 seconds to compose their actual prompt
// Model stays warm during this window

// On user submits
const realPrediction = await replicate.predictions.create({
  version: MODELS.flux.split(':')[1],
  input: { prompt: userPrompt, /* full params */ },
});

Add a cold start section to CLAUDE.md:

## Cold start handling
- Standard catalogue models scale to zero after ~10 minutes idle
- Cold starts add roughly 30s to several minutes of latency depending on the model
- For user-facing flows: prime the model on flow-start with a small input
- For deterministic latency at scale: use Replicate's dedicated deployments
- NEVER let cold start latency cause a platform timeout: set client timeouts higher
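
Because every prime call costs compute, high-traffic flows benefit from a throttle so that only the first user in a window pays for the warm-up. This is a sketch with assumed values: shouldPrime is a hypothetical helper, the five-minute window is chosen to sit inside the ~10 minute idle figure above, and the Map is per-process, so serverless instances would each keep their own copy unless it is moved to a shared store.

```typescript
// Hypothetical throttle: at most one prime call per model per window.
const lastPrimedAt = new Map<string, number>();

function shouldPrime(model: string, nowMs: number, windowMs = 5 * 60_000): boolean {
  const last = lastPrimedAt.get(model);
  if (last !== undefined && nowMs - last < windowMs) {
    return false; // primed recently: the model is assumed to still be warm
  }
  lastPrimedAt.set(model, nowMs);
  return true;
}
```

The flow-start handler would call shouldPrime(MODELS.flux, Date.now()) and issue the warm-up prediction only when it returns true. Real traffic to the model does not update the map in this sketch. Recording successful real predictions as well would cut unnecessary primes further.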

Permission hooks for Replicate scripts

A Replicate project accumulates scripts: model evaluators, batch processors, dataset annotators, cost reporters. Some are read-only or use minimal compute. Others can trigger thousands of dollars of GPU usage.

In .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "Bash(npx tsx scripts/list-models.ts*)",
      "Bash(npx tsx scripts/check-usage.ts*)",
      "Bash(npx tsx scripts/preview-model-schema.ts*)"
    ],
    "deny": [
      "Bash(npx tsx scripts/batch-generate.ts*)",
      "Bash(npx tsx scripts/run-evaluation.ts*)",
      "Bash(npx tsx scripts/train-cog-model.ts*)"
    ]
  }
}

Listing models and checking usage are safe operations, so Claude may run them without asking. Batch generation, evaluation runs, and training jobs trigger paid compute, so they sit in the deny list: Claude cannot launch them at all, and a human runs them deliberately. For more on permission hooks, Claude Code permissions covers the full configuration model.

Common Claude Code mistakes with Replicate

Six patterns Claude generates incorrectly without CLAUDE.md constraints, with the correct replacement for each.

1. Unpinned model version

Claude generates: replicate.run('stability-ai/sdxl', { input }).

Correct pattern: replicate.run('stability-ai/sdxl:7762fd07cf82c948...', { input }) with the full hash.

2. Polling instead of webhook

Claude generates: a while (prediction.status !== 'succeeded') { await sleep(1000); prediction = await replicate.predictions.get(prediction.id); } loop.

Correct pattern: predictions.create() with a webhook URL, handle the completion via the webhook route.

3. No timeout on run()

Claude generates: await replicate.run(model, { input }) without any timeout.

Correct pattern: Promise.race([replicate.run(...), timeoutPromise]) with a sane upper bound.

4. Trusting output without status check

Claude generates: return prediction.output immediately after predictions.get().

Correct pattern: check prediction.status === 'succeeded' first, handle 'failed' and 'canceled' explicitly.

5. Storing Replicate URLs long-term

Claude generates: await db.insert(images).values({ url: prediction.output }) for a Replicate URL.

Correct pattern: download the output, re-upload to your own storage, store the durable URL.

6. Webhook signature comparison with ===

Claude generates: if (req.headers['webhook-signature'] === expected) for webhook verification.

Correct pattern: parse the multi-signature header, compare each candidate with timingSafeEqual.

Add these six pairs to CLAUDE.md as before/after examples. Claude reproduces concrete patterns faster than abstract rules.
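
As an illustration of what one such "after" example might look like, here is mistake 4 (trusting output without a status check) turned into a helper. It is a sketch: requireOutput and PredictionNotReadyError are hypothetical names, and the status values mirror the ones used in the webhook payload type earlier in this guide.

```typescript
// Hypothetical helper: returns output only for a succeeded prediction.
interface PredictionLike {
  id: string;
  status: 'starting' | 'processing' | 'succeeded' | 'failed' | 'canceled';
  output?: unknown;
  error?: unknown;
}

class PredictionNotReadyError extends Error {}

function requireOutput(prediction: PredictionLike): unknown {
  switch (prediction.status) {
    case 'succeeded':
      return prediction.output;
    case 'failed':
      throw new Error(`Prediction ${prediction.id} failed: ${String(prediction.error)}`);
    case 'canceled':
      throw new Error(`Prediction ${prediction.id} was canceled`);
    default:
      // 'starting' or 'processing': the caller asked too early
      throw new PredictionNotReadyError(
        `Prediction ${prediction.id} is still ${prediction.status}`,
      );
  }
}
```

The distinct error class lets callers tell "not finished yet" apart from a terminal failure, which matters when deciding whether a retry makes sense.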

When Replicate is the right choice

Replicate is the right choice when you need open-source models that are not available through closed-source providers, when you want a single platform for image, video, audio, and text inference, and when you accept per-second pricing as the trade-off for not running GPUs yourself. The platform handles the GPU infrastructure, the model serving, the autoscaling, and the version management.

The alternatives: Hugging Face Inference API for the same set of open-source models with a different platform philosophy. Self-hosted inference on your own GPUs (cheaper at very high volume, much more operational complexity). Provider-specific APIs for closed-source models (OpenAI for GPT, Anthropic for Claude, Google for Gemini). For most teams shipping production AI features in 2026, the answer is a mix: Replicate for open-source-only models, hosted providers for closed-source, with the orchestration layer hiding the details.

The CLAUDE.md template in this guide produces Replicate integrations where versions are pinned, predictions over five seconds use webhooks, outputs are persisted to durable storage, webhook signatures are verified with constant-time comparison, and cold starts are mitigated with a prime call. The underlying principle: Replicate without explicit CLAUDE.md constraints produces code that works in development and bills surprisingly large amounts in production, and the template removes each failure mode by making the correct pattern the only pattern Claude can generate.
