Processor | Computer | Memory Bandwidth [GB/s] | Enough data for DP HPCG [GFlop/s] | Peak Performance [GFlop/s] | Maximum Efficiency Bound [%] |
---|---|---|---|---|---|
Nvidia Tesla-100 | Summit | 900 | 210 | 7800 | 2.69 |
Intel Xeon Phi "KNL" | Cori | 480+120 | 140 | 3000 | 4.67 |
Fujitsu SPARC VIIIfx | K | 64 | 15 | 128 | 11.71 |
NEC SX/ACE | SX/ACE | 256 | 60 | 256 | 23.44 |
This page explains the main reason for the poor HPCG performance of recent supercomputers.
Four computers have been selected as representatives of high-end processors: Tesla-100 of Nvidia, Xeon Phi "KNL" of Intel, and two Japanese processors, the Fujitsu SPARC VIIIfx and the NEC SX-ACE (used in Japanese supercomputers with exceptionally good HPCG performance).
The principal characteristic of the HPCG benchmark is its very low arithmetic intensity (flop/byte ratio). Here we use an estimate of 0.233 flop/byte, based on the SpMV product: for each 8-byte matrix element we perform just 2 operations (which alone would give an F/B ratio of 2/8 = 0.25), but we also have to load the vector elements, each of which is used (in most cases) 27 times, which decreases the F/B ratio to 0.233; see, e.g., here or here. An exact measurement for HPCG with preconditioning remains to be done (planned within LOWAIN).
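One way to arrive at a value close to 0.233 is to count flops and bytes per matrix row of a 27-point stencil. The sketch below is only an illustration under stated assumptions: 27 nonzeros per row, 8-byte values, one vector element loaded and one result element stored per row, and no index traffic. These counting rules and the function name `spmv_flops_per_byte` are assumptions made for this example, not taken from the HPCG specification.

```python
def spmv_flops_per_byte(nnz_per_row=27, value_bytes=8, vector_bytes_per_row=16):
    """Estimate the flop/byte ratio of one SpMV row (hypothetical counting model)."""
    # One multiply and one add per nonzero matrix element.
    flops = 2 * nnz_per_row
    # Matrix values plus the assumed vector traffic
    # (one 8-byte load of x and one 8-byte store of y per row).
    bytes_moved = nnz_per_row * value_bytes + vector_bytes_per_row
    return flops / bytes_moved


# Matrix elements only: 2 flops per 8 bytes.
matrix_only = spmv_flops_per_byte(vector_bytes_per_row=0)

# Including the assumed vector traffic.
with_vectors = spmv_flops_per_byte()

print(f"matrix only:  {matrix_only:.3f} flop/byte")
print(f"with vectors: {with_vectors:.3f} flop/byte")
```

Under these assumptions the ratio drops from 0.25 to about 0.233; counting column indices as well would lower it further.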
Multiplying the memory bandwidth of a processor (the 3rd column) by the estimated flop/byte ratio of 0.233 gives an upper bound on the number of operations per second (the 4th column) that can be performed when running HPCG at that memory bandwidth.
Comparing this bound with the theoretical peak performance of the processor (the 5th column) gives an upper bound on the HPCG efficiency of each computer (the 6th column). We can see that insufficient memory bandwidth severely limits the HPCG efficiency of both Tesla-100 and KNL.
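The arithmetic behind the table can be reproduced with a few lines of Python. This is a minimal sketch: the bandwidth and peak figures are copied from the table, the helper name `hpcg_bound` is hypothetical, and rounding the bound to a whole GFlop/s before computing the efficiency is an assumption that appears to match the tabulated values.

```python
FLOPS_PER_BYTE = 0.233  # estimated arithmetic intensity of HPCG (from the text)

# (processor, memory bandwidth [GB/s], peak performance [GFlop/s]) from the table
PROCESSORS = [
    ("Tesla-100", 900, 7800),
    ("Xeon Phi KNL", 480 + 120, 3000),
    ("SPARC VIIIfx", 64, 128),
    ("SX-ACE", 256, 256),
]


def hpcg_bound(bandwidth_gbs, peak_gflops, flops_per_byte=FLOPS_PER_BYTE):
    """Return (bandwidth-limited GFlop/s, efficiency bound in percent)."""
    # Bandwidth times flop/byte gives the flop rate the memory system can feed;
    # rounded to a whole GFlop/s (assumption, to mirror the table).
    bound = round(bandwidth_gbs * flops_per_byte)
    # Efficiency bound relative to the theoretical peak.
    efficiency = 100.0 * bound / peak_gflops
    return bound, efficiency


for name, bandwidth, peak in PROCESSORS:
    bound, efficiency = hpcg_bound(bandwidth, peak)
    print(f"{name:14s} {bound:4d} GFlop/s  {efficiency:6.2f} %")
```

The printed efficiencies may differ from the table in the last digit, depending on how the original values were rounded.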
The higher values in the 6th column for the two Japanese processors are due to their lower ratio of peak Flop/s performance to memory bandwidth: 1 Flop/byte for the SX-ACE processor and 2 Flop/byte for the K computer processor, compared with 8.7 Flop/byte for Tesla-100.
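The same relationship can be expressed through the machine balance, i.e. the peak flop rate divided by the memory bandwidth: the efficiency bound is simply the HPCG flop/byte ratio divided by the machine balance. The sketch below illustrates this with the table values; the function names are hypothetical, and no intermediate rounding is applied, so the results differ slightly from the table.

```python
FLOPS_PER_BYTE = 0.233  # estimated arithmetic intensity of HPCG (from the text)


def machine_balance(peak_gflops, bandwidth_gbs):
    """Peak flops the processor can execute per byte of memory traffic."""
    return peak_gflops / bandwidth_gbs


def efficiency_bound(peak_gflops, bandwidth_gbs, flops_per_byte=FLOPS_PER_BYTE):
    """Upper bound on HPCG efficiency in percent, without intermediate rounding."""
    return 100.0 * flops_per_byte / machine_balance(peak_gflops, bandwidth_gbs)


# (processor, peak performance [GFlop/s], memory bandwidth [GB/s]) from the table
for name, peak, bandwidth in [
    ("SX-ACE", 256, 256),
    ("SPARC VIIIfx", 128, 64),
    ("Xeon Phi KNL", 3000, 480 + 120),
    ("Tesla-100", 7800, 900),
]:
    balance = machine_balance(peak, bandwidth)
    bound = efficiency_bound(peak, bandwidth)
    print(f"{name:14s} balance {balance:5.2f} Flop/byte, bound {bound:5.2f} %")
```

A processor with a balance of 1 Flop/byte can reach at most about 23 % of its peak on HPCG under this estimate, while every doubling of the balance halves that bound.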