What are the computational performance challenges?

Ole Saastad, Chief Engineer at USIT, shares his thoughts on computational performance challenges.

Photo captured with a mobile phone at Jack Dongarra’s Turing award lecture at Supercomputing 22 in Dallas, November 2022. Note the warning sign for entries 2 and 3, performing at only 0.8% of theoretical performance. Fugaku, at the top of the HPCG list, performs somewhat better at 3%.

Supercomputers from the 1950s to the mid-1990s were built around superfast processing units (CPUs) with memory to match. Due to the dramatic improvement in microprocessors through the 1980s and early 1990s, most of these balanced supercomputers were replaced by clusters of microprocessor-based nodes. This continues to the present day, where all supercomputers on the TOP500 list are driven by microprocessors.

As almost all real-life scientific applications handle data, some shortcomings of this microprocessor-based architecture are becoming more and more apparent. 

We are all happy with the steady increase in microprocessor performance; it seems to follow Moore’s law so far (Moore’s law stated that the number of transistors doubles every 18 months). With more transistors, it is possible to build more computational units and more sophisticated control logic, yielding more performance. Presently this amounts to several teraflops (floating-point operations per second) per processor (the package you can hold in your palm).

If we did not have large amounts of input data to process, this would be excellent. However, even though compute performance appears to grow almost without limit, the bandwidth at which data travels to and from the microprocessor does not! This is where performance is lost and where the question of reaching real Exascale arises. The problem is often called the memory wall.
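
To see the memory wall on your own machine, a minimal sketch like the one below can help. It is a simplified STREAM-triad-style loop (not the official STREAM benchmark) that measures sustained memory bandwidth for arrays far larger than any cache and compares the achieved flop rate with an assumed peak; the PEAK_GFLOPS value is a placeholder assumption that must be adjusted for your own processor.

```c
/*
 * Simplified STREAM-triad-style sketch (not the official STREAM benchmark).
 * Measures sustained memory bandwidth for arrays far larger than cache and
 * compares the achieved flop rate with an ASSUMED peak rate.
 * PEAK_GFLOPS is a placeholder; set it to your own processor's peak.
 *
 * Build: cc -O2 triad.c -o triad
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L            /* 20 M doubles per array, ~160 MB each        */
#define PEAK_GFLOPS 1000.0     /* assumed peak flop rate, adjust for your CPU */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = now();
    for (long i = 0; i < N; i++)       /* 2 flops per element, 3 words moved */
        a[i] = b[i] + 3.0 * c[i];
    double t = now() - t0;

    double gbytes = 3.0 * N * sizeof(double) / 1e9;  /* at least 24 bytes per element */
    double gflops = 2.0 * N / 1e9;

    printf("check value        : %.1f\n", a[N / 2]); /* keeps the loop from being optimized away */
    printf("sustained bandwidth: %.1f GB/s\n", gbytes / t);
    printf("triad compute rate : %.1f Gflop/s  (%.2f%% of assumed %.0f Gflop/s peak)\n",
           gflops / t, 100.0 * (gflops / t) / PEAK_GFLOPS, PEAK_GFLOPS);

    free(a); free(b); free(c);
    return 0;
}
```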

In the old days, a supercomputer could read one word from memory per clock cycle (real computers use 64-bit words, which correspond to 64-bit double-precision floating-point numbers). Even with core memory (small rings of ferrite material) this could be achieved by using many memory banks and accessing them in a round-robin fashion. A computer like the STAR-100 in 1969 could compute one result per clock cycle and could theoretically do a respectable 100 Mflops. This would be called a balanced system, where the rate of floating-point operations matched the rate of floating-point numbers (words) delivered by the memory. As semiconductor development progressed, computer performance in floating-point operations per second followed Moore’s law, doubling every 18 months, while memory bandwidth did not follow the same steep ramp.
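
As a back-of-the-envelope illustration of this balance, the small sketch below computes peak flops per 64-bit word of memory bandwidth for a STAR-100-like machine and for a hypothetical modern node. The modern-node figures (a few Tflops of peak, a couple of hundred GB/s of memory bandwidth) are illustrative assumptions, not the specification of any particular system.

```c
/*
 * Back-of-the-envelope machine balance: peak flop/s divided by the number
 * of 64-bit words per second the memory can deliver.  The "modern node"
 * figures are illustrative assumptions, not the specs of any real system.
 */
#include <stdio.h>

static double balance(double peak_flops, double mem_bytes_per_s)
{
    return peak_flops / (mem_bytes_per_s / 8.0);   /* flops per 64-bit word */
}

int main(void)
{
    /* STAR-100-like machine: ~100 Mflops fed by roughly one word per cycle */
    printf("balanced 1970s machine : %6.1f flops per word\n",
           balance(100e6, 100e6 * 8.0));

    /* hypothetical modern node: a few Tflops of peak, a few hundred GB/s of bandwidth */
    printf("modern node (assumed)  : %6.1f flops per word\n",
           balance(3e12, 200e9));

    return 0;
}
```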

Over the decades, systems have become more unbalanced as microprocessors grew progressively faster while memory performance followed a less steep ramp. Where the old balanced computer performed one floating-point operation per word of memory bandwidth, the ratio today is close to 100. The developers have not been sleeping in class, and many tricks like caches and prefetching have been introduced to mitigate the gap. When running from cache, which has far higher bandwidth than main memory, the balance is much better, in some cases down to a single-digit number. However, cache is limited in size and cannot bridge the large bandwidth gap to main memory. For small, selected problems that fit inside the cache, outstanding performance can be reached. The trick is to do all your computation while the data is in the cache.
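
The sketch below illustrates this cache-blocking idea with a tiled matrix multiply: each tile is loaded from main memory once and then reused many times from cache. The tile size B is an assumption that would need tuning for a real cache hierarchy; production libraries do this far more carefully.

```c
/*
 * Minimal cache-blocked (tiled) matrix multiply: the BxB tiles are reused
 * from cache many times before the next tile is fetched from main memory.
 * B is a tuning assumption; real libraries pick it per cache level.
 */
#include <stdio.h>

#define N 512
#define B 64                  /* tile size: three 64x64 double tiles ~ 96 KB */

static double A[N][N], Bm[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 2.0; C[i][j] = 0.0; }

    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)          /* loop over tiles      */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)  /* work inside one tile */
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}
```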

At the time of writing, December 2022, the imbalance manifests itself in a simple fact: when running a benchmark representative of real scientific codes, the TOP500 HPCG (High Performance Conjugate Gradients, https://www.top500.org/lists/hpcg/), the machines at the top of the "normal" TOP500 HPL benchmark list (no. 1 & 3) perform at only 0.8% of their theoretical maximum. The top of the HPCG list performs slightly better at 3%. Still, even for a well-designed system like Fugaku, 97% of the possible floating-point performance is wasted.
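
The percentages above follow directly from dividing the measured HPCG rate by the theoretical peak (Rpeak). The small sketch below reproduces them using approximate, rounded figures read from the November 2022 lists; the values are included only for illustration.

```c
/*
 * Efficiency = measured HPCG rate / theoretical peak (Rpeak).
 * The figures below are approximate, rounded readings from the November 2022
 * TOP500/HPCG lists and are used only to reproduce the 0.8% and 3% numbers.
 */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double rpeak_pflops, hpcg_pflops; } sys[] = {
        { "Frontier", 1685.0, 14.1 },   /* approx. Pflop/s, Nov. 2022 lists */
        { "Fugaku",    537.0, 16.0 },
    };

    for (int i = 0; i < 2; i++)
        printf("%-8s : HPCG %5.1f of %6.0f Pflop/s peak  =  %.1f%%\n",
               sys[i].name, sys[i].hpcg_pflops, sys[i].rpeak_pflops,
               100.0 * sys[i].hpcg_pflops / sys[i].rpeak_pflops);

    return 0;
}
```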

When systems perform at only about 0.8-3% of their theoretical maximum, an Exascale system for real scientific applications is still far in the future.

By Ole Saastad, USIT
Published Feb. 6, 2023 1:46 PM - Last modified Feb. 6, 2023 1:46 PM