TL;DR

Thorsten Meyer AI has published a guide arguing that power-limiting and undervolting high-power GPUs can reduce heat and noise in local AI inference with limited throughput loss. The guide cites RTX 4090 measurements in which a 70% power limit retained 93.4% of tokens-per-second throughput while cutting power draw from 390W to 300W.

Thorsten Meyer AI has published a GPU tuning guide that says local AI inference users can often cut heat, power draw and fan noise through power-limiting or undervolting, while keeping most of their tokens-per-second performance. The guide matters for people running high-power desktop AI workstations, where thermal limits, noise and electricity use can affect day-to-day model use.

The guide’s central claim is that local inference is often memory-bandwidth-bound rather than compute-bound. According to Thorsten Meyer AI, that means the GPU core may not need peak voltage and clocks to maintain most inference throughput, because the workload is waiting on VRAM bandwidth much of the time.
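
The memory-bound argument can be illustrated with a rough ceiling calculation. The sketch below is not from the guide: it assumes single-stream decoding in which every generated token reads all model weights from VRAM once, and it ignores caches and compute time. The 1008 GB/s figure is the RTX 4090's spec-sheet memory bandwidth; the 18 GB model size is a hypothetical example.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Spec-sheet memory bandwidth of the RTX 4090.
RTX_4090_BANDWIDTH_GB_S = 1008.0
# Hypothetical size of a quantized model's weights held in VRAM.
HYPOTHETICAL_WEIGHTS_GB = 18.0

ceiling = max_tokens_per_second(RTX_4090_BANDWIDTH_GB_S, HYPOTHETICAL_WEIGHTS_GB)
print(f"Bandwidth-bound ceiling: {ceiling:.0f} tokens/s")
```

Under this simplification the ceiling depends only on memory bandwidth and model size, not on core clocks, which is consistent with the guide's claim that lowering core power costs relatively little throughput.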

The source recommends starting with a power limit rather than a direct voltage-curve edit. It describes power limiting as a one-slider change that restricts the card instead of pushing it beyond factory settings. For Linux users, the guide gives an example command, sudo nvidia-smi -pl 300, while Windows users are pointed to MSI Afterburner.
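
For readers who script the Linux step, the following is a minimal sketch of how the guide's command could be assembled safely. The helper name and the 150W and 450W bounds are hypothetical placeholders, not values from the guide; the limits a given card actually accepts are listed by `nvidia-smi -q -d POWER`. The sketch only builds the command and does not run it.

```python
from typing import List


def power_limit_command(target_watts: float, min_watts: float,
                        max_watts: float, gpu_index: int = 0) -> List[str]:
    """Build an nvidia-smi power-limit command, clamped to the card's range."""
    if min_watts > max_watts:
        raise ValueError("min_watts must not exceed max_watts")
    # Keep the request inside the range the driver will accept.
    clamped = max(min_watts, min(target_watts, max_watts))
    return ["sudo", "nvidia-smi", "-i", str(gpu_index),
            "-pl", str(int(round(clamped)))]


# Placeholder bounds for illustration; read the real ones from your own card.
HYPOTHETICAL_MIN_W = 150.0
HYPOTHETICAL_MAX_W = 450.0

command = power_limit_command(300, HYPOTHETICAL_MIN_W, HYPOTHETICAL_MAX_W)
print(" ".join(command))
```

Passing the resulting list to `subprocess.run` would apply the cap. Restoring the card's default limit with the same flag undoes the change.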

The guide cites measured RTX 4090 data from a sustained workload. At stock settings, the card drew 390W and ran at 72°C, which serves as the 100% speed baseline. At a 70% power limit, it drew 300W, ran at 67°C and kept 93.4% of speed. At 60%, it drew 260W, ran at 62°C and kept 91.5% of speed. The guide labels the 70% setting as a recommended power-efficiency area, while warning that lower caps can eventually reduce throughput more sharply.
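
The trade-off in those figures is easier to see as arithmetic. The sketch below uses only the three data points cited above and computes, for each setting, the share of power saved and the throughput per watt relative to stock. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    label: str
    watts: float  # power draw cited in the guide
    speed: float  # throughput relative to stock (1.0 = 100%)


# The three RTX 4090 data points cited from the guide.
SETTINGS = [
    Setting("stock", 390.0, 1.000),
    Setting("70% limit", 300.0, 0.934),
    Setting("60% limit", 260.0, 0.915),
]


def power_saved(s: Setting, stock: Setting) -> float:
    """Fraction of stock power draw that the setting saves."""
    return 1.0 - s.watts / stock.watts


def relative_efficiency(s: Setting, stock: Setting) -> float:
    """Throughput per watt, as a multiple of the stock value."""
    return (s.speed / s.watts) / (stock.speed / stock.watts)


stock = SETTINGS[0]
for s in SETTINGS:
    print(f"{s.label:>10}: {power_saved(s, stock):6.1%} less power, "
          f"{1.0 - s.speed:5.1%} slower, "
          f"{relative_efficiency(s, stock):.2f}x tokens per watt")
```

On these numbers, the 70% cap trades roughly 23% less power for a 6.6% slowdown, or about 1.21 times the tokens per watt of stock settings.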

Why It Matters

The guide is aimed at a growing group of users running large language models and other AI workloads on local GPUs rather than only through cloud services. For those users, the cost of performance goes beyond the GPU purchase price: sustained inference adds heat to a room, raises fan noise and pushes workstation cooling systems harder.

If the cited measurements carry over to a user’s own model and hardware, a power cap could offer a practical gain: less heat and noise with limited loss in tokens per second. That may delay or reduce the need for new cooling hardware, case changes or fan changes. The source frames this as the first tuning step for a high-power AI workstation because it costs nothing and is reversible.

Background

GPU undervolting and power-limiting are established tuning practices, but the guide narrows the case to local inference rather than gaming. Gaming workloads are often more sensitive to core clock reductions, while the guide says inference can be less affected when memory bandwidth is the limiting factor.

The source also distinguishes between two methods. Power limiting lets the GPU manage its own voltage and clocks under a lower wattage ceiling. Undervolting changes the voltage-frequency curve directly, which the guide says may preserve more performance for the same heat reduction but requires more care and longer stability testing.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound – the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

What Remains Unclear

The source says the figures are illustrative and vary by card, model, quantization and workload. It is not yet clear from the supplied material which exact model, software stack and inference settings produced the RTX 4090 numbers. Users would need to measure their own tokens-per-second rate, power draw and temperatures under sustained workloads before treating the results as transferable.

What’s Next

The next step for readers is practical validation on their own systems: set a moderate power cap, run the real inference workload for longer than a short benchmark, and compare temperature, wattage, held clocks and tokens per second. The guide says users who want finer tuning can later test direct undervolting, starting around 0.9V to 0.95V, with longer stability checks.
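
One way to structure that comparison is sketched below, assuming readings are taken with nvidia-smi's CSV query output while the real workload is running. The function names are illustrative. In the sample lines, the power and temperature values are the guide's cited figures, the clock values are placeholders rather than measurements, and tokens per second is normalised so that stock equals 100; real values would come from the user's own inference software.

```python
import subprocess
from typing import Dict, NamedTuple

# Query power draw (W), temperature (C) and SM clock (MHz) as bare CSV.
QUERY = ["nvidia-smi",
         "--query-gpu=power.draw,temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]


class Sample(NamedTuple):
    watts: float
    temp_c: float
    sm_clock_mhz: float


def parse_sample(csv_line: str) -> Sample:
    """Parse one line of the QUERY output, e.g. '300.00, 67, 2500'."""
    watts, temp_c, clock = (float(field) for field in csv_line.split(","))
    return Sample(watts, temp_c, clock)


def read_sample() -> Sample:
    """Read the first GPU's current values (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_sample(out.stdout.strip().splitlines()[0])


def compare(baseline: Sample, capped: Sample,
            baseline_tps: float, capped_tps: float) -> Dict[str, float]:
    """Summarise what the power cap changed relative to the baseline run."""
    return {
        "watts_saved": baseline.watts - capped.watts,
        "temp_drop_c": baseline.temp_c - capped.temp_c,
        "clock_drop_mhz": baseline.sm_clock_mhz - capped.sm_clock_mhz,
        "speed_retained": capped_tps / baseline_tps,
    }


# Sample lines: power and temperature from the guide, clocks are placeholders.
stock_run = parse_sample("390.00, 72, 2700")
capped_run = parse_sample("300.00, 67, 2500")
report = compare(stock_run, capped_run, baseline_tps=100.0, capped_tps=93.4)
for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

Calling `read_sample()` repeatedly during a long run, rather than once, gives a better picture of sustained behaviour than a short benchmark.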

Key Questions

What is the actual news development?

Thorsten Meyer AI has published a guide and interactive infographic arguing that undervolting and power-limiting can reduce heat and noise for local GPU inference with limited speed loss.

Is the performance loss confirmed for every GPU?

No. The figures come from specific cited measurements and examples, including the RTX 4090 data, not from tests across all cards. The guide says results vary by GPU, model, quantization and workload.

What does the guide recommend first?

It recommends starting with a power limit, such as 70%, because the card manages voltage and clocks automatically under the lower ceiling.

What remains unclear?

The source material does not provide every test condition behind the cited numbers. It also does not confirm how the same settings perform across all GPUs, models and inference engines.

Source: Thorsten Meyer AI
