FP16 Throughput on GP104: Good for Compatibility (and Not Much Else)

Speaking of architectural details, I know that the question of FP16 (half precision) compute performance has been of significant interest. FP16 performance has been a focus area for NVIDIA for both their server-side and client-side deep learning efforts, leading to the company turning FP16 performance into a feature in and of itself.

Starting with the Tegra X1 – and then carried forward for Pascal – NVIDIA added native FP16 compute support to their architectures. Prior to these parts, any use of FP16 data would require that it be promoted to FP32 for both computational and storage purposes, which meant that using FP16 did not offer any meaningful improvement in performance or storage needs. In practice this meant that if a developer only needed the precision offered by FP16 compute (and deep learning is quickly becoming the textbook example here), then at an architectural level power was being wasted computing that extra precision.

Pascal, in turn, brings with it native FP16 support for both storage and compute. On the storage side, Pascal supports FP16 datatypes, which relative to the previous use of FP32 means that FP16 values take up less space at every level of the memory hierarchy (registers, cache, and DRAM). On the compute side, Pascal introduces a new type of FP32 CUDA core that supports a form of FP16 execution where two FP16 operations are run through the CUDA core at once (vec2).

This core, which for clarity I'm going to call an FP16x2 core, allows the GPU to process 1 FP32 or 2 FP16 operations per clock cycle, essentially doubling FP16 performance relative to an identically configured Maxwell or Kepler GPU. Now there are several special cases here due to the use of vec2 – packing together operations is not the same as having native FP16 CUDA cores – but in a nutshell NVIDIA can pack together FP16 operations as long as they're the same operation, e.g. both FP16s are undergoing addition, multiplication, etc. Fused multiply-add (FMA/MADD) is also a supported operation here, which is important for how frequently it is used and is necessary to extract the maximum throughput out of the CUDA cores.
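To put the vec2 packing in more concrete terms, here's a minimal CUDA sketch of the idea; the kernel and its names are my own illustration rather than NVIDIA code, and it assumes a target with native FP16 arithmetic (compute capability 5.3 or later). Both halves of a packed `__half2` value go through the same fused multiply-add at once:

```cuda
// Minimal sketch of vec2 FP16 math, using the standard cuda_fp16.h
// intrinsics on a native-FP16 target (e.g. nvcc -arch=sm_61 for GP104).
#include <cuda_fp16.h>

__global__ void fmaHalf2(const __half2* a, const __half2* b,
                         const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One __half2 register holds two FP16 values; __hfma2 applies the
        // same fused multiply-add to both halves in a single operation.
        // Mixing operations (adding one half while multiplying the other)
        // is not possible -- hence the "same operation" restriction above.
        out[i] = __hfma2(a[i], b[i], c[i]);
    }
}
```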
Low precision operations are in turn seen by NVIDIA as one of the keys to further growing their increasingly important datacenter market, as deep learning and certain other tasks are themselves rapidly growing fields. Pascal isn't just faster than Maxwell overall, but when it comes to FP16 operations on the FP16x2 core, Pascal is a lot faster, with theoretical throughput over similar Maxwell GPUs increasing by over three-fold thanks to the combination of overall speed improvements and double speed FP16 execution.

GeForce GTX 1080, on the other hand, is not faster at FP16. For their consumer cards, NVIDIA has severely limited FP16 CUDA performance. GTX 1080's FP16 instruction rate is 1/128th its FP32 instruction rate, or after you factor in vec2 packing, the resulting theoretical performance (in FLOPs) is 1/64th the FP32 rate, or about 138 GFLOPs.
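For reference, that 138 GFLOPs figure falls right out of the card's official specifications. A quick back-of-the-envelope calculation, using GTX 1080's 2560 CUDA cores and 1733MHz boost clock:

```cuda
// Back-of-the-envelope check of the "about 138 GFLOPs" figure, using the
// GTX 1080's spec-sheet numbers: 2560 CUDA cores at a 1733MHz boost clock.
#include <stdio.h>

int main(void)
{
    double fp32_gflops = 2560 * 2 * 1.733;   // FMA = 2 FLOPs/core/clock -> ~8873 GFLOPs
    double fp16_gflops = fp32_gflops / 64.0; // 1/64th the FP32 rate     -> ~138.6 GFLOPs
    printf("FP32: %.0f GFLOPs, FP16: %.1f GFLOPs\n", fp32_gflops, fp16_gflops);
    return 0;
}
```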
After initially testing FP16 performance with SiSoft Sandra – one of a handful of programs with an FP16 benchmark built against CUDA 7.5 – I reached out to NVIDIA to confirm whether my results were correct, and if they had any further explanation for what I was seeing. NVIDIA was able to confirm my findings, and furthermore that the FP16 instruction rate and throughput rates were different, confirming in a roundabout manner that GTX 1080 was using vec2 packing for FP16.

As it turns out, when it comes to FP16 NVIDIA has made another significant divergence between the HPC-focused GP100 and the consumer-focused GP104. On GP100, these FP16x2 cores are used throughout the GPU as both the GPU's primary FP32 core and primary FP16 core. However on GP104, NVIDIA has retained the old FP32 cores. The FP32 core count as we know it is for these pure FP32 cores. What isn't seen in NVIDIA's published core counts is that the company has built in the FP16x2 cores separately.

To get right to the point then, each SM on GP104 only contains a single FP16x2 core. This core is in turn only used for executing native FP16 code (i.e. CUDA code). It's not used for FP32, and it's not used for FP16 on APIs that can't access the FP16x2 cores (and as such promote FP16 ops to FP32). The lack of a significant number of FP16x2 cores is why GP104's FP16 CUDA performance is so low, as listed above. There is only 1 FP16x2 core for every 128 FP32 cores.
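The practical upshot is that on GP104, FP16 is best treated as a storage format rather than a compute format. The sketch below shows that pattern – FP16 in memory for the bandwidth and footprint savings, FP32 math on the plentiful FP32 cores – using the standard cuda_fp16.h conversion intrinsics; the kernel itself is hypothetical:

```cuda
// Sketch of the FP16-storage / FP32-compute pattern that suits GP104:
// values live in memory as FP16, but the arithmetic runs at FP32.
// The kernel and its names are illustrative, not NVIDIA's code.
#include <cuda_fp16.h>

__global__ void scaleHalfStored(const __half* in, __half* out,
                                float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);     // promote FP16 -> FP32
        out[i] = __float2half(x * scale);  // compute in FP32, store as FP16
    }
}
```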
Limiting the performance of compute-centric features in consumer parts is nothing new for NVIDIA.