Explaining GPU Computing Power in Simple Terms

05/13/2025

Hello everyone!

In our previous article, we delved into various GPU parameters. For those interested, you can check out "Easy to Understand: A Fun Guide to GPU Core Parameters and Specifications!" for more details.

Recently, many of you have asked how the GPU computing power data presented in our tables is calculated. Why are there different expressions for FP32 and FP16?

Below, let's explore the method for calculating computing power and understand how GPU computing power is determined. Feel free to like and share this article.

I. What Exactly is Computing Power?

GPU computing power is commonly expressed in FLOPS (Floating-point Operations Per Second), which reflects the efficiency of GPUs in performing complex computational tasks.

In simple terms, GPU computing power measures how many mathematical problems a GPU can solve per second. These aren't just basic arithmetic operations like addition, subtraction, multiplication, or division, but more complex floating-point operations (calculations with decimals) and integer operations (counting with whole numbers).

(Figure: Example of FLOPS)

For instance:

Floating-point operations: When GPUs perform scientific calculations (like weather forecasting) or AI training, it's akin to solving complex calculus problems, where speed is crucial.

Integer operations: During AI inference (such as image recognition), GPUs need to quickly count pixel points or determine classification results. GPUs excel in integer calculations, offering higher performance and efficiency, especially when handling large datasets and complex algorithms.

II. Formula for Calculating Computing Power

Before diving into the computing power formula, let's clarify two key terms: TFLOPS (Tera Floating-point Operations Per Second) and TOPS (Tera Operations Per Second).

TFLOPS: Measures the number of trillion floating-point operations a piece of hardware (like a CPU or GPU) can complete in one second. It's essential for tasks requiring high-precision calculations, such as scientific research and graphics rendering.

TOPS: Measures the number of trillion operations per second, encompassing various types of computations, including integer operations and logical operations. It's particularly relevant in AI, where efficient integer operations are vital for tasks like inference and image recognition.

In summary, TFLOPS focuses on high-precision floating-point operations, while TOPS is broader, encompassing various types of computations. TFLOPS is commonly used to evaluate GPU performance, whereas TOPS is more relevant for NPU or dedicated AI chips.

Here's the core formula for calculating GPU computing power:

Computing Power (FLOPS) = CUDA Core Count × Boost Frequency × Floating-point Calculation Coefficient per Core per Cycle

  • CUDA Core Count: The number of CUDA cores in each GPU, reflecting the number of computing units and a critical factor in determining computing power.
  • Boost Frequency: The operating clock speed of the CUDA cores, measured in GHz. A higher frequency means more operations per second.
  • Floating-point Calculation Coefficient per Core per Cycle: Determines the number of floating-point operations each core can perform in each clock cycle, a key parameter for evaluating GPU performance.

For example, let's calculate the theoretical peak computing power of the NVIDIA A100 GPU:

  • CUDA Core Count: 6912 (108 SMs, each with 64 CUDA cores).
  • Core Operating Frequency: 1.41 GHz.
  • Floating-point Calculation Coefficient per Core per Cycle: 2 (each CUDA core can issue one fused multiply-add, or FMA, per cycle, which counts as two floating-point operations).

Applying the formula: A100's computing power (FP32 single-precision) = 6912 × 1.41 × 2 = 19491.84 GFLOPS ≈ 19.5 TFLOPS.
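The arithmetic above can be checked in a few lines of Python. This is a minimal sketch of the formula, not an NVIDIA tool; the function name and defaults are illustrative, and the inputs are the published A100 figures quoted above:

```python
def peak_tflops(cuda_cores: int, boost_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak = cores x clock (GHz) x FLOPs per core per cycle.

    GHz already encodes 10^9 cycles per second, so the raw product is in
    GFLOPS; dividing by 1000 converts it to TFLOPS.
    """
    gflops = cuda_cores * boost_ghz * flops_per_core_per_cycle
    return gflops / 1000.0

# NVIDIA A100: 6912 CUDA cores at 1.41 GHz, 2 FLOPs per core per cycle (FMA)
print(round(peak_tflops(6912, 1.41), 2))  # 19.49 (TFLOPS)
```

Swapping in another card's core count and boost clock gives a quick theoretical-peak estimate for that GPU.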

Another method to estimate GPU computing power is the peak computation method, based on the per-SM instruction throughput per clock cycle (F_clk), the operating frequency (F_req), and the number of SMs (N_SM).

Calculation formula: Peak computing power = F_clk × F_req × N_SM

Application example (using NVIDIA A100):

  • Single-precision FP32 throughput: 64 FLOPs per cycle per SM (one per FP32 CUDA core).
  • Core operating frequency: 1.41 GHz.
  • Number of SMs: 108.

Since each FP32 core executes a fused multiply-add (two floating-point operations) per cycle, A100's peak computing power = 64 FLOPs/cycle × 1.41 GHz × 108 SMs × 2 = 19491.84 GFLOPS ≈ 19.5 TFLOPS.
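The per-SM variant gives the same number, as a quick Python sketch shows. The function name is illustrative, and the inputs are the A100 figures quoted above (64 FP32 FLOPs per cycle per SM, 1.41 GHz, 108 SMs, doubled for FMA):

```python
def peak_tflops_per_sm(flops_per_cycle: int, freq_ghz: float,
                       num_sms: int, fma_factor: int = 2) -> float:
    """Peak = per-SM per-cycle throughput x clock (GHz) x SM count x FMA factor.

    The product is in GFLOPS (GHz = 10^9 cycles/s); divide by 1000 for TFLOPS.
    """
    return flops_per_cycle * freq_ghz * num_sms * fma_factor / 1000.0

# NVIDIA A100: 64 FLOPs/cycle/SM, 1.41 GHz, 108 SMs, x2 for fused multiply-add
print(round(peak_tflops_per_sm(64, 1.41, 108), 2))  # 19.49 (TFLOPS)
```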

III. Differences in Computing Power Across Architectures

NVIDIA GPU architecture upgrades are akin to upgrading mobile phone chips, each generation optimizing computational efficiency:

(Figure: GPU Architecture Upgrades)

  • Older Architectures (Volta/Turing): Represented by the Titan V (Volta) and RTX 2080 Ti (Turing). These GPUs are geared primarily toward single-precision (FP32) throughput, well suited to traditional graphics rendering and scientific computing.
  • Newer Architectures (Ampere/Hopper): Represented by the A100 (Ampere) and H100 (Hopper). These GPUs support FP32/FP16 mixed precision, handling both high-precision training and low-precision inference, roughly doubling throughput at the lower precision.

IV. The Bottleneck of Memory Bandwidth

Regardless of computing power, if data transmission can't keep up, it's like having too many cars on a narrow highway. Memory bandwidth determines the speed at which the GPU processes data:

(Figure: Memory Bandwidth Comparison)

For example, RTX 4090's bandwidth of 1008 GB/s is equivalent to 10 trucks transporting data simultaneously, whereas A100's bandwidth of 2039 GB/s is like 20 trucks.
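One way to see whether bandwidth or compute is the bottleneck is the ratio of peak FLOPS to memory bandwidth: a kernel must perform at least that many floating-point operations per byte moved, or the cores will sit idle waiting on memory. A minimal sketch (the function name is illustrative; the inputs are the A100 figures from this article, 19.5 FP32 TFLOPS and 2039 GB/s):

```python
def flops_per_byte(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Minimum arithmetic intensity (FLOPs per byte moved) for a kernel
    to be limited by compute rather than by memory bandwidth."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)

# NVIDIA A100: 19.5 FP32 TFLOPS, 2039 GB/s memory bandwidth
print(round(flops_per_byte(19.5, 2039), 1))  # 9.6 (FLOPs per byte)
```

A kernel doing fewer than roughly 10 FP32 operations per byte of data it reads or writes on the A100 is bandwidth-bound, which is why memory bandwidth matters as much as raw compute.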

V. Considerations in Practical Applications

When evaluating GPU computing power, in addition to considering theoretical peak computing power and the peak computation method, the following points are crucial:

  • Computing Power vs. Actual Performance: Actual GPU performance can be affected by factors like algorithm parallelism, memory bandwidth, and memory access patterns. Testing in real-world scenarios is essential for accurate evaluation. Software optimization (like CUDA programming) and power consumption are also key considerations.
  • Technological Updates: GPU architectures and performance continuously improve. Staying updated with the latest technological trends and hardware specifications is vital.
  • Multi-GPU Interconnection: Multiple GPUs can work together via NVLink or SLI, aggregating their computing power, though real-world scaling is rarely perfectly linear.

Conclusion

GPU computing power is akin to a car's horsepower, determining its speed. However, the overall experience also hinges on factors like memory bandwidth (road width) and software optimization (driving skills). When selecting a GPU, consider your task requirements (gaming, training, inference) and budget comprehensively.

Disclaimer: The copyright of this article belongs to the original author. It is reprinted here solely to share information. If the author's information is marked incorrectly, please contact us immediately so we can correct or remove it. Thank you.