Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI has published a guide arguing that power-limiting and undervolting high-power GPUs can reduce heat and noise in local AI inference with limited throughput loss. The article cites RTX 4090 measurements showing a 70% power limit kept about 93.4% of tokens-per-second speed while cutting power draw from 390W to 300W.

Thorsten Meyer AI has published a GPU tuning guide that says local AI inference users can often cut heat, power draw and fan noise through power-limiting or undervolting, while keeping most of their tokens-per-second performance. The guide matters for people running high-power desktop AI workstations, where thermal limits, noise and electricity use can affect day-to-day model use.

The guide’s central claim is that local inference is often memory-bandwidth-bound rather than compute-bound. According to Thorsten Meyer AI, that means the GPU core may not need peak voltage and clocks to maintain most inference throughput, because the workload is waiting on VRAM bandwidth much of the time.
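A back-of-envelope calculation can illustrate why a bandwidth-bound workload tolerates lower core clocks. The sketch below is not from the guide. It assumes that generating each token requires streaming all model weights from VRAM once, so memory bandwidth divided by model size gives a rough ceiling on single-stream tokens per second. The 1,008 GB/s figure is the RTX 4090's published memory bandwidth; the 8 GB model size and the function name are hypothetical.

```python
def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on single-stream decode speed.

    Assumes every generated token streams all weights from VRAM once and
    ignores KV-cache reads, batching and kernel overhead.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# RTX 4090 published memory bandwidth; the model size is a hypothetical example.
RTX_4090_BANDWIDTH_GB_S = 1008.0
HYPOTHETICAL_WEIGHTS_GB = 8.0

ceiling = bandwidth_bound_tokens_per_sec(RTX_4090_BANDWIDTH_GB_S, HYPOTHETICAL_WEIGHTS_GB)
print(f"Bandwidth-bound ceiling: about {ceiling:.0f} tokens/sec")
```

Core clock speed does not appear anywhere in this estimate, which is the intuition behind the guide's claim: as long as the core keeps pace with the memory system, lowering its voltage and clocks should cost little throughput.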

The source recommends starting with a power limit rather than a direct voltage-curve edit. It describes power limiting as a one-slider change that restricts the card instead of pushing it beyond factory settings. For Linux users, the guide gives an example command, sudo nvidia-smi -pl 300, while Windows users are pointed to MSI Afterburner.
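For readers who prefer to script the Linux step, the following is a minimal sketch that wraps the guide's nvidia-smi command in Python. The function names are hypothetical, and the 300 W value simply mirrors the guide's example. The sketch also builds a query for the minimum, maximum and default limits, since the driver rejects values outside the range a given card supports.

```python
import subprocess


def power_limit_command(watts: int, gpu_index: int = 0) -> list[str]:
    """Build the nvidia-smi command that caps one GPU's board power in watts."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def power_range_query_command(gpu_index: int = 0) -> list[str]:
    """Build a query for the power limits the driver reports for this card."""
    fields = "power.min_limit,power.max_limit,power.default_limit,power.limit"
    return ["nvidia-smi", "-i", str(gpu_index),
            f"--query-gpu={fields}", "--format=csv,noheader,nounits"]


def apply_power_limit(watts: int, gpu_index: int = 0) -> None:
    """Run the cap command; needs root privileges and an NVIDIA driver."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)


if __name__ == "__main__":
    # Print rather than execute, so nothing changes until the user opts in.
    print(" ".join(power_range_query_command()))
    print(" ".join(power_limit_command(300)))
```

A limit set this way generally does not persist across a reboot, so restoring stock behavior is a matter of restarting or setting the limit back to the reported default.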

The guide cites measured RTX 4090 data from a sustained workload: at stock settings, the card drew 390 W, ran at 72°C and kept 100% of baseline speed. At a 70% power limit, it drew 300 W, ran at 67°C and kept 93.4% of speed. At 60%, it drew 260 W, ran at 62°C and kept 91.5% of speed. The guide labels the 70% setting as a recommended power-efficiency area, while warning that lower caps can eventually reduce throughput more sharply.
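The trade-off in those figures is easier to see as work done per unit of energy. The sketch below uses only the numbers cited above and computes, for each setting, the power saved and the tokens per joule relative to stock. The function name and the tokens-per-joule framing are illustrative additions, not the guide's own metric.

```python
# Figures as cited in the guide for a sustained RTX 4090 workload:
# (label, power draw in watts, temperature in degrees C, fraction of baseline speed)
CITED_RUNS = [
    ("stock", 390, 72, 1.000),
    ("70% limit", 300, 67, 0.934),
    ("60% limit", 260, 62, 0.915),
]


def relative_efficiency(watts: float, speed: float, base_watts: float) -> float:
    """Tokens per joule relative to stock: speed ratio divided by power ratio."""
    return speed / (watts / base_watts)


base_watts = CITED_RUNS[0][1]
for label, watts, temp_c, speed in CITED_RUNS:
    power_saved = 1 - watts / base_watts
    gain = relative_efficiency(watts, speed, base_watts)
    print(f"{label:>10}: {watts} W, {temp_c} C, "
          f"{power_saved:.1%} less power, {speed:.1%} speed, "
          f"{gain:.2f}x tokens per joule")
```

On the cited data, the 70% setting gives up 6.6% of speed for roughly 23% less power, or about 1.21 times the tokens per joule of stock. The 60% setting reaches about 1.37 times, at the cost of a further 1.9 percentage points of speed.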

Why It Matters

The report is aimed at a growing group of users running large language models and other AI workloads on local GPUs rather than only through cloud services. For those users, the cost of performance is not limited to the GPU purchase price. Sustained inference can add heat to a room, raise fan noise and push workstation cooling systems harder.

If the cited measurements carry over to a user’s own model and hardware, a power cap could offer a practical gain: less heat and noise with limited loss in tokens per second. That may delay or reduce the need for new cooling hardware, case changes or fan changes. The source frames this as the first tuning step for a high-power AI workstation because it costs nothing and is reversible.

Background

GPU undervolting and power-limiting are established tuning practices, but the guide narrows the case to local inference rather than gaming. Gaming workloads are often more sensitive to core clock reductions, while the guide says inference can be less affected when memory bandwidth is the limiting factor.

The source also distinguishes between two methods. Power limiting lets the GPU manage its own voltage and clocks under a lower wattage ceiling. Undervolting changes the voltage-frequency curve directly, which the guide says may preserve more performance for the same heat reduction but requires more care and longer stability testing.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound – the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

What Remains Unclear

The source says the figures are illustrative and vary by card, model, quantization and workload. It is not yet clear from the supplied material which exact model, software stack and inference settings produced the RTX 4090 numbers. Users would need to measure their own tokens-per-second rate, power draw and temperatures under sustained workloads before treating the results as transferable.
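One way to take that measurement is a small timing harness around the user's own inference call. The sketch below is a minimal example under stated assumptions: generate is a hypothetical stand-in for whatever function runs one request and returns the number of tokens produced, and the 600-second default is an arbitrary choice meant to let temperatures and clocks settle, not a figure from the guide.

```python
import time
from typing import Callable


def measure_tokens_per_sec(generate: Callable[[], int],
                           min_seconds: float = 600.0,
                           clock: Callable[[], float] = time.perf_counter) -> float:
    """Call generate() repeatedly for at least min_seconds; return tokens/sec.

    generate is a hypothetical stand-in for the user's own inference call and
    must return the number of tokens it produced.
    """
    if min_seconds <= 0:
        raise ValueError("min_seconds must be positive")
    total_tokens = 0
    start = clock()
    while True:
        total_tokens += generate()
        elapsed = clock() - start
        if elapsed >= min_seconds:
            # Average over the whole run so early boost clocks do not dominate.
            return total_tokens / elapsed
```

Running the same harness once at stock settings and once under the power cap, with the same model, prompt and quantization, gives the two rates needed to compute the fraction of speed retained.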

What’s Next

The next step for readers is practical validation on their own systems: set a moderate power cap, run the real inference workload for longer than a short benchmark, and compare temperature, wattage, held clocks and tokens per second. The guide says users who want finer tuning can later test direct undervolting, starting around 0.9V to 0.95V, with longer stability checks.
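The comparison step can be sketched as a small script that averages nvidia-smi telemetry logged during each run. This is an illustration only: the function names are hypothetical, and the sample log lines and tokens-per-second values are made-up placeholders, not measurements.

```python
import csv
import io
from statistics import mean

# Logs are assumed to come from a command such as:
#   nvidia-smi --query-gpu=power.draw,temperature.gpu,clocks.sm --format=csv,noheader,nounits -l 1


def summarize_log(csv_text: str) -> dict[str, float]:
    """Average power (W), temperature (C) and SM clock (MHz) over a log."""
    rows = [[float(cell) for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("log is empty")
    watts, temps, clocks = zip(*rows)
    return {"watts": mean(watts), "temp_c": mean(temps), "sm_mhz": mean(clocks)}


def compare(baseline: dict, capped: dict, base_tps: float, capped_tps: float) -> dict:
    """Report what the cap saved and what fraction of speed it kept."""
    return {
        "watts_saved": baseline["watts"] - capped["watts"],
        "temp_drop_c": baseline["temp_c"] - capped["temp_c"],
        "sm_mhz_drop": baseline["sm_mhz"] - capped["sm_mhz"],
        "speed_kept": capped_tps / base_tps,
    }


# Hypothetical sample data, for illustration only.
BASELINE_LOG = "390.0, 72, 2700\n388.0, 72, 2690\n"
CAPPED_LOG = "300.0, 67, 2400\n298.0, 66, 2380\n"

result = compare(summarize_log(BASELINE_LOG), summarize_log(CAPPED_LOG),
                 base_tps=100.0, capped_tps=93.0)
print(result)
```

Averaging over the whole run, rather than reading a single sample, matters because power draw and clocks fluctuate from second to second under load.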

Key Questions

What is the actual news development?

Thorsten Meyer AI has published a guide and interactive infographic arguing that undervolting and power-limiting can reduce heat and noise for local GPU inference with limited speed loss.

Is the performance loss confirmed for every GPU?

No. The supplied figures are tied to cited measurements and examples, including RTX 4090 data. The guide says results vary by GPU, model, quantization and workload.

As a starting point, the guide recommends a power limit, such as 70%, because the card manages voltage and clocks automatically under the lower ceiling.

What remains unclear?

The source material does not provide every test condition behind the cited numbers. It also does not confirm how the same settings perform across all GPUs, models and inference engines.

Source: Thorsten Meyer AI

Author

The Liberty Portfolio Team
