May 20, 2026·15 min read

Claude Code with Hugging Face: Models, Inference, and Datasets

Tags: Claude Code, Hugging Face, Machine Learning, AI

Why Hugging Face without CLAUDE.md burns inference credits

Hugging Face is the open-source machine learning hub: over a million models, the Inference API for hosted endpoints, the transformers Python library, the transformers.js browser runtime, the datasets library for training corpora, and the Spaces platform for model demos. Each surface has its own authentication pattern, its own rate limit, its own cache directory, and its own way of failing silently.

Claude Code without explicit constraints generates Hugging Face integration code that works locally and falls apart in production. Common failure modes: HF tokens with the wrong scope (read-only when you need write, or write when read-only is sufficient), inference calls that hit the cold-start latency on every request because the warm-keep pattern is missing, model downloads that bypass the cache and re-download multi-gigabyte files on every server restart, and dataset loads that pull entire corpora into memory when streaming would solve the problem in three lines.

This guide covers the CLAUDE.md template that locks Claude Code into Hugging Face's correct model: the Inference API client with proper auth and timeout handling, the transformers.js setup for browser-based inference, the cache directory configuration that survives deploys, the dataset streaming pattern for large training data, and the token scoping that prevents production credentials from leaking into local scripts. For the broader AI tooling context, the guides on Claude Code with the Vercel AI SDK and Claude Code with LangChain cover orchestration layers that often sit above Hugging Face inference calls.

The Hugging Face CLAUDE.md template

The CLAUDE.md at your project root needs to declare: the HF client library and version, the token environment variable, the cache directory, the inference endpoint policy, the model selection rules, and the hard rules that block the mistakes Claude makes most often.

# Hugging Face rules

## Stack
- @huggingface/inference ^2.x (Node.js) or huggingface_hub 0.20.x (Python, pinned in requirements.txt)
- @huggingface/transformers ^3.x (Transformers.js, browser inference, optional)
- HF_TOKEN in .env.local (never hardcode, never commit)
- Cache: HF_HOME=/var/cache/huggingface (or ~/.cache/huggingface in dev)

## Project structure
- src/lib/hf.ts: Hugging Face client singleton
- src/lib/models.ts: model ID constants and metadata
- src/app/api/inference/*: route handlers for hosted inference
- src/lib/transformers/: transformers.js model loaders (browser)
- scripts/: dataset processing, fine-tuning launchers

## Token scoping (NON-NEGOTIABLE)
Generate three separate HF tokens for different contexts:
- Local development: read-only, low rate limit
- CI/CD: read-write, scoped to specific repos
- Production: read-only for inference, scoped to specific models

NEVER use a write-scoped token in production inference code.
NEVER commit HF_TOKEN to source control under any naming.

## Inference API pattern (MANDATORY)
- ALWAYS construct the HfInference client with the token from env
- ALWAYS set a per-call timeout (default 60s is too long for user-facing requests)
- ALWAYS handle the cold-start case with the wait_for_model option for models that may be idle
- NEVER call the raw fetch(/api/inference/...) endpoint; use the SDK

## Hard rules
- NEVER hardcode HF_TOKEN in source files
- NEVER use a single HF_TOKEN across dev, CI, and production
- NEVER download models to the project directory; ALWAYS use HF_HOME
- NEVER load full datasets into memory; use streaming
- NEVER skip the cache mount in Docker production builds
- NEVER expose HF_TOKEN to client-side JavaScript
- ALWAYS log the model ID and revision being used for inference

The three-token rule is the policy that prevents the largest class of production incidents. A single token reused across environments means a leaked dev token gives an attacker write access to your production models. Three separate tokens limit the blast radius of any single leak. The token scoping UI is on the Hugging Face access tokens page and supports fine-grained permissions per repo.

The cache directory rule is the policy that keeps deploys fast. Models on Hugging Face are often several gigabytes. Without HF_HOME set to a persistent directory, every server restart re-downloads them. With it set to a mounted volume in Docker or a persistent EBS volume on EC2, the first download caches and subsequent boots are instant.

Install and client setup

For Node.js inference against the hosted API:

npm i @huggingface/inference

Add the token to your environment file:

# .env.local
HF_TOKEN=hf_your_token_here

Create the singleton client:

// src/lib/hf.ts
import { HfInference } from '@huggingface/inference';

if (!process.env.HF_TOKEN) {
  throw new Error('HF_TOKEN is not defined');
}

export const hf = new HfInference(process.env.HF_TOKEN);

The startup check on HF_TOKEN makes a missing token surface at boot time rather than at the first inference call. Claude omits this check by default. Add the singleton pattern to CLAUDE.md so Claude does not instantiate a new client inline in every route handler.
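The same fail-fast idea can be factored into a small helper when more than one variable has to be present. This is a minimal sketch, not part of any SDK: `requireEnv` and `requireHfToken` are hypothetical names, and the prefix check assumes current user access tokens, which are issued with an `hf_` prefix.

```typescript
// Hypothetical helpers for illustration; neither is part of @huggingface/inference.
type Env = Record<string, string | undefined>;

// Returns the trimmed value of a required variable or throws at startup.
function requireEnv(name: string, env: Env): string {
  const raw = env[name];
  const value = raw === undefined ? '' : raw.trim();
  if (value.length === 0) {
    throw new Error(`${name} is not defined`);
  }
  return value;
}

// Assumption: user access tokens start with "hf_". Anything else is most
// likely a copy-paste mistake, so it is rejected before the first API call.
function requireHfToken(env: Env): string {
  const token = requireEnv('HF_TOKEN', env);
  if (!token.startsWith('hf_')) {
    throw new Error('HF_TOKEN does not look like a Hugging Face token');
  }
  return token;
}
```

In src/lib/hf.ts this would be called once as `requireHfToken(process.env)` and the result passed to the client constructor.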

The Inference API call pattern

The basic text generation call against a hosted model:

// src/app/api/inference/text/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { hf } from '@/lib/hf';

const MODEL_ID = 'mistralai/Mistral-7B-Instruct-v0.3';
const TIMEOUT_MS = 30_000;

export async function POST(req: NextRequest) {
  const { prompt } = await req.json();

  if (typeof prompt !== 'string' || prompt.length === 0) {
    return NextResponse.json({ error: 'Invalid prompt' }, { status: 400 });
  }

  try {
    const result = await hf.textGeneration({
      model: MODEL_ID,
      inputs: prompt,
      parameters: {
        max_new_tokens: 512,
        temperature: 0.7,
        top_p: 0.95,
        return_full_text: false,
      },
    }, {
      // Request options go in the second argument, alongside the abort signal
      wait_for_model: true,
      use_cache: true,
      signal: AbortSignal.timeout(TIMEOUT_MS),
    });

    return NextResponse.json({
      text: result.generated_text,
      model: MODEL_ID,
    });
  } catch (e) {
    if (e instanceof Error && e.name === 'TimeoutError') {
      return NextResponse.json(
        { error: 'Inference timeout', model: MODEL_ID },
        { status: 504 },
      );
    }
    console.error('[HF] Inference error:', e);
    return NextResponse.json(
      { error: 'Inference failed' },
      { status: 500 },
    );
  }
}

Three details Claude misses without CLAUDE.md instruction.

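The wait_for_model: true option tells Hugging Face to wait for the model to load if it is currently cold. Without it, the first request after a model has been idle returns a 503 with a "loading" message. Most production code should set it to true. The trade-off is latency: the first request after a cold model can take 30 to 60 seconds, which can exceed the 30-second timeout used in this handler. In that case the caller receives the 504, and a retry shortly afterwards is more likely to find the model loaded.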

The use_cache: true option lets the Inference API return cached results for identical inputs. For deterministic-ish prompts (temperature 0 or very low), the cache hit avoids a full inference. For high-temperature generation, the cache rarely hits and the flag is harmless.

The AbortSignal.timeout(30_000) is the per-call timeout. The default HfInference timeout is generous (more than 60 seconds). For user-facing endpoints, 30 seconds is usually the longest you want to keep a connection open. Beyond that, the user has navigated away and the work is wasted.
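To see how the timeout surfaces in the catch block, the following sketch replaces the inference call with a stand-in. `slowCall` and `statusFor` are hypothetical names used only for illustration; the one platform behaviour relied on is that a signal created by `AbortSignal.timeout` aborts with a reason whose name is `TimeoutError`.

```typescript
// Stand-in for a slow inference call: resolves after `ms` unless the signal fires first.
function slowCall(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    const timer = setTimeout(() => resolve('done'), ms);
    signal.addEventListener(
      'abort',
      () => {
        clearTimeout(timer);
        reject(signal.reason);
      },
      { once: true },
    );
  });
}

// Maps a thrown value to the HTTP status the route handler should return.
// Reads `name` defensively because the thrown value is not guaranteed to be an Error.
function statusFor(e: unknown): 504 | 500 {
  const name =
    typeof e === 'object' && e !== null && 'name' in e
      ? String((e as { name: unknown }).name)
      : '';
  return name === 'TimeoutError' ? 504 : 500;
}
```

Keeping the mapping in one function means every route handler reports timeouts the same way instead of re-implementing the name check.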

Model selection and the model ID rule

Hugging Face hosts over a million models. The choice between them depends on the task, the latency budget, and the compute cost. Hardcoding the model ID in a single constant file makes the choice explicit and auditable. Scattering model IDs across handlers makes it impossible to know what is running in production.

// src/lib/models.ts
export const MODELS = {
  textGeneration: {
    fast: 'mistralai/Mistral-7B-Instruct-v0.3',
    quality: 'meta-llama/Llama-3.1-70B-Instruct',
    cheap: 'google/gemma-2-9b-it',
  },
  embedding: {
    default: 'sentence-transformers/all-MiniLM-L6-v2',
    multilingual: 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
  },
  imageClassification: {
    default: 'google/vit-base-patch16-224',
  },
  speechToText: {
    default: 'openai/whisper-large-v3',
  },
} as const;

Add a model selection section to CLAUDE.md:

## Model selection
- ALL model IDs MUST be referenced from src/lib/models.ts
- NEVER hardcode a model ID inline in a route handler or library file
- New models go through: add to MODELS object, update relevant CLAUDE.md section, deploy
- Model changes are PR-reviewed like any other config change

This rule pays back when a model is deprecated or you need to swap to a faster variant for cost reasons. A single file change updates every call site.
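One way to make the rule hard to bypass is a typed lookup, so a misspelled task or tier fails at compile time rather than in production. The sketch below uses a trimmed copy of the MODELS object so it stands alone; `modelFor` and `allModelIds` are hypothetical helper names, not part of any library.

```typescript
// Trimmed copy of the MODELS constant so the sketch is self-contained.
const MODELS = {
  textGeneration: {
    fast: 'mistralai/Mistral-7B-Instruct-v0.3',
    quality: 'meta-llama/Llama-3.1-70B-Instruct',
  },
  embedding: {
    default: 'sentence-transformers/all-MiniLM-L6-v2',
  },
} as const;

type Models = typeof MODELS;
type Task = keyof Models;

// Typed lookup: modelFor('embedding', 'fast') is rejected by the compiler
// because 'fast' is not a tier of the embedding task.
function modelFor<T extends Task, K extends keyof Models[T]>(task: T, tier: K): Models[T][K] {
  return MODELS[task][tier];
}

// Flat list of every configured model ID, useful for startup logging or audits.
function allModelIds(): string[] {
  const groups: Record<string, Record<string, string>> = MODELS;
  const ids: string[] = [];
  for (const task of Object.keys(groups)) {
    for (const tier of Object.keys(groups[task])) {
      ids.push(groups[task][tier]);
    }
  }
  return ids;
}
```

Logging `allModelIds()` once at boot gives a single line in the logs that answers what is running in production.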

Embeddings and vector search

Embeddings are the most common production use of Hugging Face: convert text into vectors, store them in a vector database, query by cosine similarity. The sentence-transformers models are the dominant choice because they are small, fast, and produce dense vectors that work well for retrieval.

// src/lib/embeddings.ts
import { hf } from '@/lib/hf';
import { MODELS } from '@/lib/models';

export async function embed(text: string | string[]): Promise<number[][]> {
  const inputs = Array.isArray(text) ? text : [text];

  const result = await hf.featureExtraction({
    model: MODELS.embedding.default,
    inputs,
  }, {
    signal: AbortSignal.timeout(15_000),
  });

  // sentence-transformers returns number[][] for batch input
  // and number[] for single input, normalise to number[][]
  if (Array.isArray(result[0])) {
    return result as number[][];
  }
  return [result as number[]];
}

The shape-normalisation step at the bottom of the function is the detail Claude omits. The featureExtraction API returns number[] for a single string input and number[][] for an array input. Code that always expects number[][] breaks on single inputs. Code that always expects number[] breaks on batch inputs. The check at the bottom of the function handles both.
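Once every result is in number[][] form, the similarity query itself is short. A vector database would normally do this for you; the sketch below is an in-memory illustration only, and `cosineSimilarity` and `topK` are hypothetical helper names.

```typescript
// Cosine similarity between two vectors of equal length.
// Returns 0 when either vector has zero magnitude, to avoid dividing by zero.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbours: score every row, sort descending, keep k.
function topK(
  query: readonly number[],
  corpus: readonly number[][],
  k: number,
): { index: number; score: number }[] {
  return corpus
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Brute force is fine for a few thousand vectors; beyond that an indexed store is the better fit.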

For storing the resulting vectors, the most common pairing in 2026 is Postgres with pgvector. Claude Code with Postgres covers the schema and indexing patterns for vector storage at scale.

Batch inference and rate limits

The Inference API rate limits depend on your subscription tier. The free tier allows a few thousand requests per hour. The Pro and Enterprise tiers allow significantly more, with per-model throughput guarantees.

For batch processing, the featureExtraction endpoint accepts arrays directly, which is more efficient than looping individual calls:

// Good: single request for 100 strings
const vectors = await hf.featureExtraction({
  model: MODELS.embedding.default,
  inputs: documents,  // string[] of 100 items
});

// Bad: 100 separate requests
const vectorsOneByOne = await Promise.all(
  documents.map(doc => hf.featureExtraction({
    model: MODELS.embedding.default,
    inputs: doc,
  })),
);
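For corpora much larger than one request should carry, the two approaches combine: split the input into fixed-size batches and send one request per batch. In this sketch `chunk` and `embedInChunks` are hypothetical helpers, `embedBatch` stands in for a wrapper around the featureExtraction call, and the batch size of 100 is an illustrative figure rather than a documented limit.

```typescript
// Splits an array into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embeds documents batch by batch, preserving input order.
// `embedBatch` is an assumption: any function that maps string[] to number[][].
async function embedInChunks(
  documents: readonly string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  size = 100,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(documents, size)) {
    const result = await embedBatch(batch);
    vectors.push(...result);
  }
  return vectors;
}
```

Batches run sequentially here on purpose, which keeps the request rate predictable on a rate-limited tier.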

For text generation, the textGeneration endpoint does not natively batch in the same way. Multiple calls are required, and rate limits apply per call. The pattern is to use Promise.all with a concurrency limit:

import pLimit from 'p-limit';

const limit = pLimit(5);  // max 5 concurrent requests

async function batchGenerate(prompts: string[]) {
  return Promise.all(
    prompts.map(prompt =>
      limit(() => hf.textGeneration({
        model: MODELS.textGeneration.fast,
        inputs: prompt,
        parameters: { max_new_tokens: 256 },
      })),
    ),
  );
}

Add a batch and rate limit section to CLAUDE.md:

## Batch inference
- featureExtraction accepts string[] for native batching, prefer it over loops
- textGeneration does not batch, use p-limit with concurrency 3-5
- 429 responses indicate rate limit, retry with exponential backoff
- Log rate limit hits, they indicate the volume needs a plan upgrade
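The backoff rule can be sketched as a generic wrapper. Everything here is illustrative: `withRetry`, `backoffDelay`, and `isRateLimited` are hypothetical names, and how the client surfaces the HTTP status is an assumption (a numeric `status` property on the thrown value), so adapt `isRateLimited` to whatever your SDK version actually throws.

```typescript
interface RetryOptions {
  retries: number;      // retries after the first attempt
  baseDelayMs: number;  // delay before the first retry
  maxDelayMs: number;   // upper bound on any single delay
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Delay doubles with each attempt (0-based) and is capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
}

// Assumption: the thrown value exposes the HTTP status as `status`.
function isRateLimited(e: unknown): boolean {
  return typeof e === 'object' && e !== null && (e as { status?: unknown }).status === 429;
}

// Retries `fn` only while `isRetryable` says so and the retry budget lasts.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (e: unknown) => boolean,
  opts: RetryOptions,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      if (attempt >= opts.retries || !isRetryable(e)) throw e;
      await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs));
    }
  }
}
```

The delays are deterministic to keep the sketch readable; adding random jitter to each delay is a common refinement when many workers retry at once.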

transformers.js for browser-side inference

The @huggingface/transformers package (Transformers.js, published as @xenova/transformers before v3) runs models directly in the browser via ONNX Runtime. For small models (embeddings, classification, summarisation), this avoids the round-trip to a hosted API and the associated latency and cost.

npm i @huggingface/transformers

Loading a model in a Next.js client component:

// src/lib/transformers/embed.ts
'use client';

import { pipeline, FeatureExtractionPipeline } from '@huggingface/transformers';

let cachedPipeline: FeatureExtractionPipeline | null = null;

export async function getEmbeddingPipeline() {
  if (cachedPipeline) return cachedPipeline;

  // Pick the backend explicitly: use WebGPU when the browser exposes it, WASM otherwise
  const device = typeof navigator !== 'undefined' && 'gpu' in navigator
    ? 'webgpu'
    : 'wasm';

  cachedPipeline = await pipeline(
    'feature-extraction',
    'Xenova/all-MiniLM-L6-v2',
    { device },
  );

  return cachedPipeline;
}

export async function embedInBrowser(text: string): Promise<number[]> {
  const extractor = await getEmbeddingPipeline();
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

Two things matter here that Claude misses by default.

The cachedPipeline singleton ensures the model is loaded once per page. Pipeline construction downloads the model files (typically 20 to 100 MB for small models) and initialises the ONNX Runtime session. Doing that on every call would re-download nothing (the browser caches the files) but would spend several hundred milliseconds re-initialising the runtime.

The device option selects the execution backend. On browsers with WebGPU support (Chrome, Edge, recent Safari), 'webgpu' can be 5 to 10 times faster than the WASM backend. Do not rely on the library to fall back on its own: requesting 'webgpu' in a browser that lacks it may be rejected as an unsupported device rather than silently downgraded, so check navigator.gpu first and pass 'wasm' explicitly for older browsers.

Add a transformers.js section to CLAUDE.md:

## transformers.js (browser inference)
- Use for small models only: embeddings, classification, summarisation
- Model files are downloaded to browser cache, first load is slow
- ALWAYS cache the pipeline in a module-level variable
- ALWAYS feature-detect WebGPU (navigator.gpu) and pass device 'webgpu' or 'wasm' explicitly
- NEVER use transformers.js for models larger than 500 MB
- NEVER bundle the model into the Next.js build, let the browser cache it

Python transformers and the cache directory

For server-side Python workloads (training, fine-tuning, batch inference), the transformers library is the default. The cache directory rule applies here even more strictly because Python's default cache location varies by platform and version.

# src/inference.py
import os

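# Set BEFORE importing transformers
os.environ['HF_HOME'] = '/var/cache/huggingface'
# Deprecated in recent transformers releases (HF_HOME alone is enough there);
# kept for projects pinned to older versions
os.environ['TRANSFORMERS_CACHE'] = '/var/cache/huggingface/hub'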

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = 'mistralai/Mistral-7B-Instruct-v0.3'

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )
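    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)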

The two os.environ lines must execute before the from transformers import line. The transformers library reads the cache environment variables at import time. Setting them after the import has no effect, and Claude often writes them after the imports because that is the typical Python import-first style.

Add a Python section to CLAUDE.md:

## Python transformers
- Set HF_HOME (plus TRANSFORMERS_CACHE only on older transformers versions, where it is not yet deprecated) BEFORE importing transformers
- Cache directory MUST be a persistent volume in Docker (not /tmp)
- Use torch_dtype=torch.bfloat16 on supported GPUs, halves memory vs float32
- Use device_map='auto' for automatic GPU placement
- For inference-only, set model.eval() and wrap calls in with torch.no_grad():
- Pin the transformers version in requirements.txt with ==, not >= or ~= (^ is npm syntax and is not valid for pip)

Dataset streaming for large corpora

The datasets library handles training corpora that often exceed available memory. The streaming mode loads examples on demand instead of materialising the full dataset.

# src/data.py
from datasets import load_dataset

# Streaming mode: examples loaded one at a time
ds = load_dataset(
    'wikimedia/wikipedia',
    '20231101.en',
    split='train',
    streaming=True,
)

# Iterate without loading the full dataset into memory
# (process and tokenize below stand in for your own functions)
for example in ds.take(1000):
    process(example['text'])

# Filtering and mapping work on the stream
filtered = ds.filter(lambda x: len(x['text']) > 100)
mapped = filtered.map(lambda x: {'tokens': tokenize(x['text'])})

Add a datasets section to CLAUDE.md:

## Dataset loading
- ALWAYS use streaming=True for datasets larger than 1 GB
- NEVER call load_dataset(...) without streaming on Wikipedia, C4, OSCAR, or similar
- For training: shuffle the streaming dataset with .shuffle(buffer_size=10_000)
- For evaluation: .take(N) for a fixed sample size
- Pin the dataset version with revision= (commit hash from HF Hub) and document it in code comments

The shuffling note matters because streaming datasets cannot do a true random shuffle (that would require loading the full dataset). The buffer_size controls a reservoir-sample-style shuffle that fills a buffer of size N and samples randomly from it. A buffer of 10,000 is a reasonable default for most training runs.
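The buffer mechanism is easier to reason about in code. The generator below is a language-neutral illustration of the fill-then-sample idea described above, written in TypeScript to match the rest of this guide. It is not the datasets library's implementation, and `bufferShuffle` is a hypothetical name.

```typescript
// Approximate shuffle over a stream using a fixed-size buffer.
// 1. Fill the buffer with the first `bufferSize` items.
// 2. For each further item, emit a random buffered item and put the new one in its slot.
// 3. When the stream ends, emit what is left in random order.
function* bufferShuffle<T>(
  source: Iterable<T>,
  bufferSize: number,
  random: () => number = Math.random,
): Generator<T> {
  if (!Number.isInteger(bufferSize) || bufferSize < 1) {
    throw new RangeError('bufferSize must be a positive integer');
  }
  const buffer: T[] = [];
  for (const item of source) {
    if (buffer.length < bufferSize) {
      buffer.push(item);
      continue;
    }
    const slot = Math.floor(random() * bufferSize);
    yield buffer[slot];
    buffer[slot] = item;
  }
  while (buffer.length > 0) {
    const slot = Math.floor(random() * buffer.length);
    const [picked] = buffer.splice(slot, 1);
    yield picked;
  }
}
```

Memory use is bounded by the buffer, and an item can only move forward in the output by at most the buffer size. That is why a small buffer on a source sorted by topic or date still yields poorly mixed batches.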

Docker production setup

The Docker build for a Hugging Face workload has three concerns: the cache volume mount, the token at runtime (not build time), and the user that runs the inference process.

# Dockerfile
FROM python:3.12-slim AS base

# Cache directory must be writable by the non-root runtime user
ENV HF_HOME=/var/cache/huggingface
RUN useradd --create-home --uid 1000 app \
    && mkdir -p /var/cache/huggingface \
    && chown -R app:app /var/cache/huggingface

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ src/

# Run the inference process as the unprivileged user created above
USER app

# DO NOT bake HF_TOKEN into the image, pass at runtime
CMD ["python", "-m", "src.server"]

# docker-compose.yml
services:
  inference:
    build: .
    volumes:
      - hf-cache:/var/cache/huggingface  # persistent across container restarts
    environment:
      HF_TOKEN: ${HF_TOKEN}              # from .env, not the image
      HF_HOME: /var/cache/huggingface
volumes:
  hf-cache:

The cache volume is the line that makes deploys fast. Without it, every container restart re-downloads the multi-gigabyte model. With it, the volume persists across container lifecycles, so only the first run that populates the volume is slow.

The HF_TOKEN is passed at runtime, never baked into the image. A token in the image layers means anyone with image pull access has the token. The runtime env var pattern keeps the token out of the registry. For more on the Docker side of things, Claude Code with Docker covers the broader patterns.

Permission rules for HF scripts

A Hugging Face project accumulates scripts: dataset downloaders, fine-tuning launchers, model card generators, evaluation runners. Some are read-only. Some upload models, push datasets, or trigger paid compute. Permission rules gate the destructive ones.

In .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "Bash(python scripts/eval-model.py*)",
      "Bash(python scripts/list-models.py*)",
      "Bash(python scripts/download-dataset.py --dry-run*)"
    ],
    "deny": [
      "Bash(python scripts/upload-model.py*)",
      "Bash(python scripts/finetune.py*)",
      "Bash(python scripts/push-dataset.py*)",
      "Bash(huggingface-cli upload*)"
    ]
  }
}

Evaluating models and listing what is available are safe operations. Uploading a fine-tuned model, pushing a dataset, or launching a fine-tune (which consumes paid compute) should never run unattended. Rules in the deny list block those commands outright, so Claude cannot run them in an automated workflow. If you would rather be prompted for confirmation each time, list them under ask instead of deny.

Common Claude Code mistakes with Hugging Face

Six patterns Claude generates incorrectly without CLAUDE.md constraints, with the correct replacement for each.

1. Inline HfInference instantiation

Claude generates: const hf = new HfInference(process.env.HF_TOKEN); at the top of every file.

Correct pattern: one singleton at src/lib/hf.ts, imported everywhere.

2. Missing wait_for_model

Claude generates: await hf.textGeneration({ model, inputs }) with no options.

Correct pattern: options: { wait_for_model: true } for production endpoints that may hit cold models.

3. No per-call timeout

Claude generates: a call without AbortSignal, relying on the SDK default.

Correct pattern: { signal: AbortSignal.timeout(30_000) } so failed requests do not hang the request lifecycle.

4. Hardcoded model IDs scattered everywhere

Claude generates: model: 'mistralai/Mistral-7B-Instruct-v0.3' inline in every handler.

Correct pattern: model: MODELS.textGeneration.fast from src/lib/models.ts.

5. Cache directory unset

Claude generates: Python code that imports transformers without setting HF_HOME.

Correct pattern: os.environ['HF_HOME'] = '/var/cache/huggingface' before any HF import.

6. Full dataset load

Claude generates: ds = load_dataset('wikimedia/wikipedia', '20231101.en') for a dataset in the tens of gigabytes.

Correct pattern: ds = load_dataset('wikimedia/wikipedia', '20231101.en', split='train', streaming=True) with .take(N) for samples.

Add these six pairs to CLAUDE.md as before/after examples. Claude reproduces concrete patterns faster than abstract rules.

When to use Hugging Face vs a hosted alternative

Hugging Face is the right choice when you need open-source models, when you want to fine-tune on your own data, when you need to run inference on-premise for data residency reasons, or when the model you need is not available through Anthropic, OpenAI, or other closed-source providers.

For most chat and reasoning workloads in 2026, the closed-source providers offer better latency, lower cost per token, and higher quality on common benchmarks. Claude Code with the Vercel AI SDK covers the multi-provider orchestration layer that lets you swap between Hugging Face and closed providers based on the task. The decision is workload-specific: embeddings and small classification models almost always belong on Hugging Face, large reasoning models almost always belong on Anthropic or OpenAI, and the middle ground depends on volume, latency budget, and data sensitivity.

The CLAUDE.md template in this guide produces Hugging Face integrations where tokens are properly scoped, cache directories survive deploys, batch operations use native batching where available, and the cold-start case is handled with wait_for_model. The underlying principle: Hugging Face without explicit CLAUDE.md constraints produces code that works in the happy path and fails in expensive ways at scale, and the template removes each failure mode by making the correct pattern the only pattern Claude can generate.

Get Claudify. The bundle includes a Hugging Face CLAUDE.md template with the singleton client, model selection table, cache directory rules, and all six common-mistake rules pre-configured.