AssertionError: No Inf Checks Were Recorded For This Optimizer | Fix Mixed Precision Training Error

The error ‘AssertionError: No Inf Checks Were Recorded For This Optimizer’ means PyTorch’s grad scaler never saw gradients for that optimizer step.

This assertion usually appears in PyTorch training runs that use automatic mixed precision. Training stops, the stack trace points into grad_scaler.py, and it is not clear what actually went wrong.

This guide walks through what the assertion means in plain terms, where it tends to show up, and practical steps that fix it in real projects. The focus is PyTorch users who train neural networks with AMP, LoRA adapters, or Stable Diffusion style tools.

What This Optimizer Assertion Actually Means

Automatic mixed precision pairs your optimizer with a GradScaler helper. The scaler multiplies the loss by a scale factor before the backward pass. Later, when you call scaler.unscale_(optimizer) or scaler.step(optimizer), it divides the gradients back down, checks them for inf and NaN values, and records per device whether any were found. The assertion fires when scaler.step runs that check and has nothing to record, because none of the parameters held by the optimizer you passed in carry a gradient.

Put plainly, the training loop told the scaler to unscale and step an optimizer that has no recorded gradient health data for the current step. That mismatch usually comes from one of three things.

  • Wrong optimizer object — the call to scaler.step uses a different optimizer instance than the one that saw gradients.
  • No trainable parameters — every parameter under that optimizer is frozen, so no gradients were produced for it.
  • Skipped backward pass — the code never reached loss.backward() for this step, or the part of the forward pass that uses this optimizer's parameters ran under torch.no_grad().

Mixed precision tries to keep your run safe from NaNs and overflow. When its bookkeeping cannot match optimizers to recorded checks, it raises this assertion instead of silently stepping with stale or invalid gradients.

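During each iteration the scaler keeps a small state record for every optimizer it sees. Each record tracks which stage the optimizer has reached (ready, unscaled, or stepped) and whether any device found inf or NaN gradients. The scale factor itself belongs to the scaler and is shared across optimizers. Calling scaler.update() clears these records at the end of the iteration. If the inf-check record for a given optimizer is still empty when scaler.step runs, the scaler reports that no inf checks were recorded and the assertion stops the run.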

When you see AssertionError: No Inf Checks Were Recorded For This Optimizer in the log, treat it as a signal that the scaler and the optimizer fell out of sync rather than a sign that your entire model is broken.
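
The condition behind the assertion can be approximated without touching scaler internals. The sketch below is a diagnostic aid, not PyTorch's own check: it counts how many parameters held by an optimizer have a populated .grad. The helper name count_params_with_grad and the tiny nn.Linear model are illustrative assumptions.

```python
import torch
from torch import nn


def count_params_with_grad(optimizer: torch.optim.Optimizer) -> int:
    """Count parameters owned by the optimizer whose .grad is populated."""
    return sum(
        1
        for group in optimizer.param_groups
        for param in group["params"]
        if param.grad is not None
    )


# Toy model used only for illustration.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Before any backward pass no parameter has a gradient. Calling
# scaler.step(optimizer) in this state is what triggers the assertion.
before = count_params_with_grad(optimizer)

loss = model(torch.randn(3, 4)).sum()
loss.backward()

# After backward, both the weight and the bias carry gradients.
after = count_params_with_grad(optimizer)
print(before, after)
```

Calling this helper right before scaler.step(optimizer) in a failing run shows whether the optimizer really has gradients to work with.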

Where This Error Message Usually Shows Up

PyTorch Mixed Precision Training With GradScaler

The most common setting for this message is a standard training loop that uses torch.cuda.amp.GradScaler. The loop follows a pattern with a context manager, a forward pass, a loss, a backward pass, and a scaler step. If any part of that pattern drifts out of sync, the assertion is likely.

  • Two optimizers, one scaler — stepping an optimizer through the scaler on an iteration where no loss was backpropagated into that optimizer's parameters.
  • Optimizer or model re-created mid run — building a new optimizer or model object partway through training, so that scaler.step receives an optimizer whose parameters are not the ones that received gradients.
  • Conditional backward — skipping the backward call on some batches while still calling scaler.step.

Keeping that pattern stable makes it less likely that scaler bookkeeping will drift away from the optimizers you expect.
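
The conditional-backward case can be handled by skipping the scaler calls whenever backward is skipped. The following is a minimal sketch under stated assumptions: the batch list and the empty-batch condition are hypothetical, and AMP is enabled only when CUDA is available so the loop also runs on a CPU-only machine. Newer PyTorch releases also offer these classes under torch.amp; the torch.cuda.amp spelling used in this article is kept here.

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# Hypothetical data: the second batch is empty and must be skipped.
batches = [torch.randn(8, 4), torch.empty(0, 4), torch.randn(8, 4)]
steps_taken = 0

for inputs in batches:
    optimizer.zero_grad(set_to_none=True)
    if inputs.shape[0] == 0:
        # No forward and no backward, so skip scaler.step and
        # scaler.update as well instead of falling through to them.
        continue
    inputs = inputs.to(device)
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = model(inputs).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    steps_taken += 1

print(steps_taken)
```

The key design choice is that the skip happens before any scaler call, so the scaler never sees an optimizer whose gradients are all None.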

LoRA And Adapter Fine Tuning

Users often hit the assertion while fine tuning large language models with LoRA adapters. Libraries such as PEFT configure adapters through a LoraConfig object. When the config has inference_mode=True, adapters are frozen. The optimizer then tracks parameters that never produce gradients, and the scaler sees no inf checks for that optimizer.

Switching the setting to inference_mode=False turns the adapters back into trainable modules. After that change, new gradients flow, the scaler finds inf checks, and training can move past the assertion.

The same pattern appears in other adapter systems and quantized training stacks. Once all tuned layers are frozen, the training loop still calls into AMP, yet the scaler has nothing real to watch for that optimizer.
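
A quick way to confirm a frozen-adapter setup is to list which parameters are trainable. The sketch below uses a toy module rather than a real PEFT model; the class ToyAdapterLayer, both helper functions, and the "lora_" name filter are illustrative assumptions. In a real PEFT project the cleaner fix is the config change described above.

```python
from typing import List

import torch
from torch import nn


def trainable_parameter_names(model: nn.Module) -> List[str]:
    """Return the names of all parameters with requires_grad=True."""
    return [name for name, p in model.named_parameters() if p.requires_grad]


def unfreeze_matching(model: nn.Module, keyword: str) -> int:
    """Set requires_grad=True where the parameter name contains keyword."""
    changed = 0
    for name, param in model.named_parameters():
        if keyword in name and not param.requires_grad:
            param.requires_grad_(True)
            changed += 1
    return changed


class ToyAdapterLayer(nn.Module):
    """Hypothetical stand-in: a frozen base layer plus a low-rank adapter."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.lora_A = nn.Linear(8, 2, bias=False)
        self.lora_B = nn.Linear(2, 8, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_B(self.lora_A(x))


layer = ToyAdapterLayer()
for param in layer.parameters():
    param.requires_grad_(False)  # simulates an inference-only setup

frozen_names = trainable_parameter_names(layer)
changed = unfreeze_matching(layer, "lora_")
trainable_names = trainable_parameter_names(layer)
print(frozen_names, changed, trainable_names)
```

An empty list from trainable_parameter_names is the signature of the frozen setup that leads to the assertion.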

Stable Diffusion And Training Extensions

Stable Diffusion packages sometimes expose training panels for embeddings or hypernetworks. Under the hood they rely on PyTorch AMP in a similar way, with a scaler tied to an optimizer. After an update, a mismatch between training code and scaler setup can trigger the same assertion during the first training step.

Even if the interface looks different from a plain Python script, the cause remains the same. The scaler expects gradient health records for the optimizer and receives none.

Quick Checks Before You Start Refactoring

Before you rewrite your full training loop, run a series of fast checks. They often reveal the mismatch that leads to the assertion.

  • Confirm scaler wiring — make sure the optimizer passed into scaler.step is the same instance that was built from the model you are training, and that scaler.update() runs once per iteration after every scaler.step call.
  • Print trainable parameter counts — log how many parameters have requires_grad=True for each optimizer so you can spot a frozen setup.
  • Watch for early returns — look for branches that skip loss.backward() but still call scaler.step at the end of the loop.
  • Check for empty batches — ensure the data loader never feeds an empty batch that might make the loss undefined.

If your logs show AssertionError: No Inf Checks Were Recorded For This Optimizer only on certain batches, compare those batches to normal ones. Data with a different shape, missing fields, or masking bugs can change which branches run inside the training loop. That quick rewind often reveals the mismatch.
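
The parameter-count check from the list above might look like the following sketch. The helper trainable_counts and the two toy layers are assumptions made for illustration; the backbone is frozen on purpose to show what a problematic optimizer looks like in the log.

```python
import torch
from torch import nn


def trainable_counts(optimizer: torch.optim.Optimizer):
    """Return (trainable, total) element counts for the optimizer's params."""
    trainable = 0
    total = 0
    for group in optimizer.param_groups:
        for param in group["params"]:
            total += param.numel()
            if param.requires_grad:
                trainable += param.numel()
    return trainable, total


backbone = nn.Linear(16, 8)
head = nn.Linear(8, 2)
for param in backbone.parameters():
    param.requires_grad_(False)  # frozen on purpose for the demonstration

optimizers = {
    "backbone": torch.optim.SGD(backbone.parameters(), lr=0.01),
    "head": torch.optim.SGD(head.parameters(), lr=0.01),
}

report = {name: trainable_counts(opt) for name, opt in optimizers.items()}
for name, (trainable, total) in report.items():
    print(f"{name}: {trainable}/{total} trainable")
```

An optimizer that reports zero trainable elements will never receive gradients, so stepping it through the scaler raises the assertion.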

Symptom | Clue | First Fix To Try
Assertion on first step | Optimizer was replaced or built from a different model object | Keep a single optimizer instance that holds the live model's parameters
Assertion after refactor | Backward moved into a branch | Call loss.backward() whenever you call scaler.step
Assertion in LoRA runs | Adapters marked for inference only | Set adapter configs so trainable layers stay unfrozen

Fixing The No Inf Checks Recorded Optimizer Error In PyTorch

Once the quick checks narrow things down, step through the training loop and align the scaler with each optimizer. The goal is simple. Every time you call scaler.step(optimizer), that optimizer must have at least one live parameter with gradients for this step.

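  1. Match scaler and optimizer creation — initialize your optimizer once from the model you train and create a single GradScaler alongside it. The scaler is not bound to an optimizer at construction, so creation order does not matter, but avoid rebuilding the optimizer inside the epoch loop.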
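  2. Handle multiple optimizers deliberately — if you truly need two separate optimizers, such as one for the generator and one for the discriminator, make sure each one has received gradients before its scaler.step call. One scaler can serve both as long as scaler.update() runs once after all steps. Giving each optimizer its own scaler instance also works.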
  3. Check the training order — follow the pattern: zero gradients, enter AMP context, run forward pass, compute loss, call scaler.scale(loss).backward(), unscale if needed, then call scaler.step(optimizer) and scaler.update().
  4. Remove accidental no grad scopes — search for torch.no_grad() or inference helpers that might wrap the forward pass and block gradients through the model.
  5. Unfreeze the layers you want to tune — in LoRA or adapter setups, inspect the config and set fields such as inference_mode so that trainable modules keep requires_grad=True.
  6. Guard against NaN losses — log the loss every few steps, clamp gradient norms, and lower the learning rate if you see spikes that might lead to invalid values.
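
The "unscale if needed" part of step 3 matters most when you clip gradients. The sketch below shows one way to combine scaler.unscale_ with torch.nn.utils.clip_grad_norm_. The toy model, the random data, and the max_norm value are assumptions, and AMP is switched on only when CUDA is available.

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
max_norm = 1.0  # hypothetical clipping threshold

inputs = torch.randn(8, 4, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()

# Unscale once for this optimizer so clipping sees true gradient sizes.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# step() notices the gradients are already unscaled and does not
# unscale them a second time.
scaler.step(optimizer)
scaler.update()

# Recompute the gradient norm to confirm clipping took effect.
clipped_norm = torch.sqrt(
    sum(p.grad.float().pow(2).sum() for p in model.parameters())
)
print(float(clipped_norm))
```

Clipping before unscaling would compare scaled gradients against max_norm, which is why the unscale call comes first.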

In more advanced stacks you may also have gradient accumulation, multiple devices, and custom backward hooks. Line those up with the same pattern before you worry about batch size or new features. A clean, small loop that follows the AMP recipe is the baseline that confirms your hardware and libraries behave as expected.

This is also a good moment to confirm that the optimizer really owns the parameters you expect. Printing the first parameter tensor shape, or checking a short identifier set on the module, helps catch a case where the wrong model object was passed into the optimizer.
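
One way to run that ownership check is to compare parameter identities directly. In this sketch the helper shared_parameter_count and the stale_model scenario are illustrative assumptions: the stale optimizer was built from a different model object than the one being trained.

```python
import torch
from torch import nn


def shared_parameter_count(model: nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Count parameters that appear both in the model and in the optimizer."""
    model_ids = {id(p) for p in model.parameters()}
    return sum(
        1
        for group in optimizer.param_groups
        for param in group["params"]
        if id(param) in model_ids
    )


model = nn.Linear(4, 2)
stale_model = nn.Linear(4, 2)  # e.g. an object left over from an earlier setup step

good_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
stale_optimizer = torch.optim.SGD(stale_model.parameters(), lr=0.1)

matched = shared_parameter_count(model, good_optimizer)
mismatched = shared_parameter_count(model, stale_optimizer)
print(matched, mismatched)
```

A count of zero means the optimizer holds none of the live model's parameters, so no backward pass through that model can give it gradients.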

Sample Mixed Precision Training Pattern That Avoids The Error

Seeing the full pattern in one place makes it easier to map your own loop to a safe shape. The following sketch shows a minimal training step that uses GradScaler in a stable way.

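import torch
from torch.cuda.amp import GradScaler, autocast

# model, loader, loss_fn, and learning_rate are assumed to be defined elsewhere.
scaler = GradScaler()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for batch in loader:
    optimizer.zero_grad()
    # Run the forward pass and the loss in mixed precision.
    with autocast():
        outputs = model(batch["inputs"])
        loss = loss_fn(outputs, batch["targets"])
    # Scale the loss so small gradients survive float16.
    scaler.scale(loss).backward()
    # Unscale gradients, record inf checks, and step if they are finite.
    scaler.step(optimizer)
    # Adjust the scale factor and reset per-optimizer state.
    scaler.update()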

Your real code may include schedulers, logging, validation, or gradient clipping. As long as the core pattern stays intact, the scaler will record inf checks for the optimizer and the assertion stays quiet.

Preventing AssertionError: No Inf Checks Were Recorded For This Optimizer In Future Runs

Once you have fixed the immediate training issue, you can add small safety habits that keep the error from returning during later experiments or library upgrades.

  • Wrap AMP usage in helpers — write a small utility function that runs the forward, backward, and scaler step in one consistent place instead of spreading AMP calls across the code base.
  • Add sanity checks at start up — assert that each optimizer has at least one parameter with requires_grad=True before training begins.
  • Log gradient norms — sampling gradient norms for a few layers per epoch makes it easier to spot runs where gradients vanish or blow up.
  • Pin library versions — keep a record of the PyTorch, CUDA, and extension versions that work for your training setup before upgrading to a new stack.
  • Keep a small repro script — store a tiny training script in your repo that you can run after upgrades to confirm that AMP, the optimizer, and device setup all behave.

These habits add a small amount of code, yet they save time when a refactor, new dataset, or updated dependency brings the assertion back. Instead of guessing, you already have signals that point to the problem layer or optimizer.
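
The first two habits can be combined in a few lines. The sketch below is one possible shape, not a prescribed API: the function names amp_train_step and assert_has_trainable_params, the toy model, and the random data are all assumptions. The use_amp flag also gives a simple switch for running the same step in full precision.

```python
import torch
from torch import nn


def assert_has_trainable_params(optimizer: torch.optim.Optimizer) -> None:
    """Fail early if the optimizer holds no parameter that can get gradients."""
    if not any(p.requires_grad for g in optimizer.param_groups for p in g["params"]):
        raise ValueError("optimizer has no parameters with requires_grad=True")


def amp_train_step(model, optimizer, scaler, loss_fn, inputs, targets, use_amp):
    """Run forward, backward, and the scaler step in one consistent place."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=inputs.device.type, enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()


use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")
model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# Startup sanity check before the first training step.
assert_has_trainable_params(optimizer)

inputs = torch.randn(16, 4, device=device)
targets = torch.randn(16, 1, device=device)
losses = [
    amp_train_step(
        model, optimizer, scaler, nn.functional.mse_loss, inputs, targets, use_amp
    ).item()
    for _ in range(3)
]
print(losses)
```

Because every AMP call lives inside amp_train_step, a later refactor cannot separate backward from scaler.step by accident.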

When The Assertion Still Refuses To Go Away

In rare cases the root cause sits deep inside a training tool or wrapper library rather than your own loop. If you have pared the script down and still hit the assertion, treat the run as a minimal reproduction.

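  • Try full precision first — run the same script without AMP. If the loss goes down, the bug is tied to mixed precision handling rather than the model itself. If the run finishes but the loss never moves, suspect frozen parameters, because a plain optimizer.step() silently does nothing when no gradients exist.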
  • Strip the run down — remove callbacks, logging hooks, custom schedulers, and anything that touches the optimizer until only the core loop remains.
  • Search existing issues — look through the tracker for your training library using the full message text, since others often report the same pattern.
  • Open a concise bug report — share the minimal script, environment details, and full stack trace so maintainers can match your report to known fixes.
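
For the environment details in a bug report, a short script can collect the basics. This is a minimal sketch; the function name environment_report and the choice of fields are assumptions, and maintainers may ask for more.

```python
import platform

import torch


def environment_report() -> dict:
    """Collect version details that are useful in an AMP bug report."""
    cuda_ok = torch.cuda.is_available()
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": cuda_ok,
        "gpu": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Pasting this output next to the full stack trace saves a round of questions from maintainers.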

If nothing else works and you need results, you can fall back to a plain float32 training run. Each step will be slower and use more memory, yet a stable baseline is often better than chasing one mixed precision glitch through a complex stack.