How to maximise memory bandwidth with Vitis and Xilinx UltraScale+ HBM devices
08 September 2020
Many of today's workloads and applications such as artificial intelligence (AI), data analytics, live video transcoding, and genomic analytics require an increasing amount of bandwidth.
Traditional DDR memory solutions have not been able to keep up with these growing compute and memory bandwidth requirements, creating data bottlenecks. This is visible in Figure 1, which illustrates compute capacity growth vs traditional DDR bandwidth growth.
Fortunately, high-bandwidth memory (HBM) can alleviate these bottlenecks by providing more storage capacity and data bandwidth. It uses system-in-package (SiP) memory technology to stack DRAM chips vertically and connects to them over a wide (1024-bit) interface.
For example, the Virtex UltraScale+ HBM enabled devices (VU+ HBM) close the bandwidth gap with improved bandwidth capabilities up to 460GB/s delivered by two HBM2 stacks. These devices also include up to 2.85 million logic cells and up to 9,024 DSP slices capable of delivering 28.1 peak INT8 TOPs. In VU+ HBM, there is a hardened AXI Switch that enables access from any of the 32 AXI channels to any of the HBM pseudo channels and addressable memory.
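As a rough check on the 460GB/s figure, the peak can be derived from the interface width and the per-pin data rate. The sketch below is illustrative only: the function name is ours, and the 1,800Mb/s per-pin rate is an assumption, chosen because it reproduces the quoted figure for two 1024-bit stacks.

```c
#include <assert.h>

/* Peak bandwidth in GB/s for `stacks` HBM2 stacks, each `width_bits` wide,
 * running at `mbps_per_pin` megabits per second on every data pin.
 * The per-pin rate is an input here; 1800.0 is our assumed value. */
double hbm_peak_gbs(int stacks, int width_bits, double mbps_per_pin)
{
    assert(stacks > 0 && width_bits > 0 && mbps_per_pin > 0.0);
    /* bits -> bytes per transfer, then Mb/s -> GB/s */
    return stacks * (width_bits / 8.0) * mbps_per_pin / 1000.0;
}
```

Calling hbm_peak_gbs(2, 1024, 1800.0) gives roughly 460.8, in line with the 460GB/s quoted above.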
This article explores the design aspects that can negatively impact memory bandwidth, the options available to improve it, and one way to profile HBM bandwidth to illustrate the trade-offs. The same techniques can be used to profile HBM bandwidth on the Alveo U280, the VCU128, and any Xilinx UltraScale+ HBM device. They can also be used on any accelerated application that uses a pre-existing or custom Domain Specific Architecture (DSA). We’ll explain the process for creating a custom DSA in Vivado and how to use the Xilinx Vitis unified software platform to create C/C++ kernels and memory traffic to profile the HBM stacks.
What can impact memory bandwidth?
Anyone who’s worked with external DRAM interfaces knows achieving theoretical bandwidth is not possible. In fact, depending on several different factors, it can be difficult to even come close.
The traffic pattern is a major contributor to poor bandwidth. DRAM requires opening (ACT) and closing (PRE) rows within a bank, and random accesses require more of this maintenance, during which no data can be transferred. Additionally, some DRAM architectures (e.g. DDR4, HBM) have overhead associated with consecutive accesses to the same Bank Group. Finally, because the DQ bits are bidirectional, short bursts of reads and writes, or alternating reads and writes, incur a bus turnaround time every time the direction switches.
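To see why the access pattern matters so much, a toy open-row model is enough. The sketch below is not a model of the HBM controller: the bank count, page size and address split are assumed values, and it only counts how many accesses find their row already open (a miss would cost a PRE and an ACT).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANKS 16u    /* assumed bank count for this toy model */
#define ROW_BYTES 1024u  /* assumed page (row) size in bytes */

/* Return how many of the n accesses hit a row that is already open.
 * Assumed address split: consecutive pages rotate across banks. */
size_t count_row_hits(const uint32_t *addrs, size_t n)
{
    uint32_t open_row[NUM_BANKS] = {0};
    int      is_open[NUM_BANKS]  = {0};
    size_t   hits = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t page = addrs[i] / ROW_BYTES;
        uint32_t bank = page % NUM_BANKS;
        uint32_t row  = page / NUM_BANKS;

        if (is_open[bank] && open_row[bank] == row) {
            hits++;                 /* page hit: no PRE/ACT needed */
        } else {
            open_row[bank] = row;   /* page miss: close old row, open new one */
            is_open[bank]  = 1;
        }
    }
    return hits;
}
```

With these assumed values, 64 sequential 32-byte accesses miss only twice, while 64 accesses spaced ROW_BYTES * NUM_BANKS bytes apart all land in the same bank on different rows and miss every time.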
As noted earlier, VU+ HBM includes a hardened AXI switch that lets any of the 32 AXI channels reach any of the HBM pseudo channels and addressable memory. A hardened switch has many advantages, such as flexible addressing and reduced design complexity and routing congestion. To enable flexible addressing across both HBM stacks, the hardened AXI switch is built from switch boxes, each with four masters and four slaves.
This facilitates flexible addressing, but there is a limitation that can impact memory bandwidth. As there are only four horizontal paths available, arbitration for those paths can limit your achievable bandwidth.
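One way to reason about this limitation is to count how many switch boxes a transaction has to cross. The sketch below assumes a simple linear numbering in which AXI channel i and pseudo channel i sit in the same four-by-four switch box; that numbering and the function names are our assumptions for illustration, not taken from the IP documentation.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_AXI_CHANNELS     32  /* from the text above */
#define PORTS_PER_SWITCH_BOX 4   /* four masters x four slaves per box */

/* Index of the switch box that an AXI channel or pseudo channel belongs to,
 * under the assumed linear numbering. */
int switch_box_of(int index)
{
    assert(index >= 0 && index < NUM_AXI_CHANNELS);
    return index / PORTS_PER_SWITCH_BOX;
}

/* Number of box-to-box (horizontal) hops between a master and its target.
 * Zero means the access stays inside one switch box. */
int lateral_hops(int axi_channel, int pseudo_channel)
{
    return abs(switch_box_of(axi_channel) - switch_box_of(pseudo_channel));
}
```

Under this model, traffic where lateral_hops() returns 0 never competes for the horizontal paths, which is why keeping each AXI channel on nearby pseudo channels is a sensible starting point when profiling.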
How to maximise memory bandwidth
When it comes to maximising memory bandwidth, consider changing your command and addressing patterns. Since random accesses and short bursts of read/write transactions result in the worst bandwidth, see if you can alter this in the user application. This will get you the biggest bang for your buck.
If you’re unable to change your traffic pattern, the HBM Memory Controller IP has several options available that may help:
• Custom Address Mapping: As mentioned previously, random accesses require higher rates of ACT and PRE commands. With a custom address map, you can define how AXI addresses map to HBM memory addresses, which can increase the number of page hits and improve bandwidth.
• Bank Group Interleave: Enables sequential address operation to alternate between even and odd bank groups to maximise bandwidth efficiency.
• Enable Request Re-Ordering: Enables the controller to re-order commands (e.g. coalesce commands to reduce bus turnaround times).
• Enable Close Page Reorder: Enables the controller to close a page after an instruction has completed. If disabled, the page remains open until a higher priority operation is requested for another page in the same bank. Which setting is advantageous depends on whether you are using a random, linear, or custom addressing pattern.
• Enable Look Ahead Pre-Charge: Enables the controller to re-order commands to minimise PRE commands.
• Enable Look Ahead Activate: Enables the controller to re-order commands to minimise ACT commands.
• Enable Lookahead Single Bank Refresh: Enables the controller to insert refresh operations based on pending operations to maximise efficiency.
• Single Bank Refresh: Instructs the controller to refresh banks individually instead of all at once.
• Enable Refresh Period Temperature Compensation: This enables the controller to dynamically adjust the refresh rate based on the temperature of the memory stacks.
• Hold Off Refresh for Read/Write: This allows the controller to delay a refresh to permit operations to complete first.
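Custom address mapping and bank group interleave both come down to which address bits select which DRAM field. The sketch below uses two hypothetical bit layouts (they are not the HBM IP defaults, and the bit positions are assumptions) to show how the position of the bank-group bit decides whether back-to-back accesses stay in the same bank group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout A: the bank-group bit is address bit 5, directly above
 * an assumed 32-byte access granule, so consecutive granules alternate groups. */
unsigned bank_group_interleaved(uint32_t addr) { return (addr >> 5) & 0x1u; }

/* Hypothetical layout B: the bank-group bit is address bit 13, so long
 * sequential runs stay in one bank group. */
unsigned bank_group_linear(uint32_t addr) { return (addr >> 13) & 0x1u; }

/* Count back-to-back accesses that fall in the same bank group for a
 * stream of n accesses starting at base and advancing by stride bytes. */
size_t same_group_pairs(unsigned (*bank_group)(uint32_t),
                        uint32_t base, size_t n, uint32_t stride)
{
    size_t pairs = 0;
    for (size_t i = 1; i < n; i++) {
        uint32_t prev = base + (uint32_t)(i - 1) * stride;
        uint32_t cur  = base + (uint32_t)i * stride;
        if (bank_group(prev) == bank_group(cur))
            pairs++;
    }
    return pairs;
}
```

For 64 sequential 32-byte accesses, layout A never hits the same bank group twice in a row, whereas layout B does so on all 63 pairs. Note that the same layout A stream with a 64-byte stride also stays in one group, which is why the map has to be chosen with the application's real access pattern in mind.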
New to Vivado is the HBM monitor which, similar to SysMon, can display the die temperature of each HBM2 die stack individually. It can also display the bandwidth on a per MC or Pseudo Channel (PC) basis.
To profile your hardware design and HBM configuration properly, start with the default HBM settings and capture the read/write throughput as your baseline. Then generate new .bit files using each of the HBM MC options discussed earlier, and combinations of them, to determine which provides the highest throughput. Note that the way the AXI Switch is configured can also impact the HBM bandwidth and throughput, and should be considered when profiling as well.
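The comparison against the baseline is simple arithmetic, sketched below. The function names are ours, and the byte counts and timings are assumed to come from whatever measurement method you use (for example the HBM monitor or a timer in the test application).

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in GB/s for `bytes` transferred in `seconds`. */
double throughput_gbs(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1e9;
}

/* Measured throughput as a percentage of the theoretical peak. */
double efficiency_pct(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return 100.0 * measured_gbs / peak_gbs;
}

/* Percentage change of a candidate configuration relative to the baseline.
 * Positive means the candidate is faster. */
double gain_over_baseline_pct(double baseline_gbs, double candidate_gbs)
{
    assert(baseline_gbs > 0.0);
    return 100.0 * (candidate_gbs - baseline_gbs) / baseline_gbs;
}
```

Recording gain_over_baseline_pct() for every bitstream makes it easy to see which option, or combination of options, actually helps for a given traffic pattern.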
If you’re using a pre-existing design and the Vitis tool, you will need to modify the hardware platform design using a custom DSA flow.
To profile the HBM bandwidth, create a design or application, or use an existing one. To profile different HBM configurations, you will need access to the hardware design in order to modify the HBM IP core and then generate new bitstreams and the new .xsa/.dsa files that are used in the Vitis tool for software development.
For background, Vitis is a unified software tool developed by Xilinx that provides a framework for developing and delivering FPGA accelerated data centre applications using standard programming languages and for creating software platforms targeting embedded processors.
For existing designs, refer to GitHub, the SDAccel example repositories, the U280 product page, and the VCU128 product page, which contains targeted reference designs (TRDs). If you are targeting a custom platform, or even the U280 or VCU128, and need to create a custom hardware platform design, this can also be done.
Why do I need to create a custom hardware platform for the Alveo U280 if DSAs already exist? As workload algorithms evolve, reconfigurable hardware enables Alveo to adapt faster than fixed-function accelerator card product cycles. Having the flexibility to customise and reconfigure the hardware gives Alveo a unique advantage over competition. In the context of this tutorial, we want to customise and generate several new hardware platforms using different HBM IP core configurations to profile the impacts on memory bandwidth to determine which provides the best results.
There are several ways to build a custom hardware platform, but the quickest is to use Vivado IP Integrator (IPI). Demonstrated below is one way to do this, using MicroBlaze to generate the HBM memory traffic in software. This could also be done in HLS, SDAccel, or in the Vitis tool with hardware accelerated memory traffic. Using MicroBlaze as the traffic generator makes it easy to control the traffic pattern, including memory address locations. We can also take a default memory test template and modify it to create loops and various patterns that help profile the HBM bandwidth effectively.
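The kind of loop you might add to such a test is sketched below. This is a minimal illustration rather than the Xilinx memory test template: the function names and the data pattern are ours, and on the target the base pointer would be an HBM address taken from your design's address map, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write an index-derived pattern to every stride_words-th 32-bit word.
 * A stride of 1 gives linear traffic; larger strides spread accesses out. */
void hbm_write_pattern(volatile uint32_t *base, size_t words, size_t stride_words)
{
    assert(stride_words > 0);
    for (size_t i = 0; i < words; i += stride_words)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;
}

/* Read the same words back and return the number of mismatches. */
size_t hbm_verify_pattern(const volatile uint32_t *base, size_t words,
                          size_t stride_words)
{
    assert(stride_words > 0);
    size_t errors = 0;
    for (size_t i = 0; i < words; i += stride_words) {
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            errors++;
    }
    return errors;
}
```

Timing each call with a hardware timer and dividing the bytes moved by the elapsed time gives a throughput figure for that stride, and sweeping the stride and base address exercises different banks and pseudo channels.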
The steps to build a design in the Vitis tool or SDK are similar and will include something like this:
1. Open Vivado
2. Create new or open existing Vivado design
   • Target the U280, VCU128 or whichever US+ HBM device is being used
3. Create Block Design
   • Add HBM IP core
   • Add MicroBlaze, UART and any additional peripheral IP needed
   Note: Ensure the project contains an IP Integrator (IPI) block design that includes HBM and MicroBlaze. This block design is what we refer to as the hardware design. To achieve near maximum theoretical bandwidth (460GB/s) for both HBM2 stacks, you'll need to drive continuous traffic to all 16 available Memory Controllers (MC) via the AXI channels.
4. Validate design and generate output products
   • generate_target all [get_files <>.bd]
5. Create HDL wrapper for .bd
   • make_wrapper -files [get_files <>.bd] -top
6. Run synthesis
7. Run implementation
8. Generate bitstream
9. Export hardware
   • File=>Export Hardware
   • If using the Vitis tool you may need to follow these instructions:
      • (If using 2019.2) write_hw_platform -fixed <>.xsa
      • (If using 2019.1) write_dsa -fixed <>.dsa
10. Launch the Vitis tool
11. Select workspace
12. Create new application project and Board Support Package
13. Click Next, select Create from hardware, click “+” and point to .xsa
14. Click Next, select CPU MicroBlaze, Language C
15. Click Next, select “Memory Tests” and click Finish
16. Build and run memory test on target
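The note in the steps above says all 16 Memory Controllers must carry traffic to approach the peak. A quick way to set expectations for a partial test is sketched below; it assumes every MC contributes an equal share of the total, which is our simplification, and the function name is hypothetical.

```c
#include <assert.h>

#define TOTAL_MCS 16  /* Memory Controllers across both stacks, per the note above */

/* Upper bound in GB/s when only active_mcs controllers carry traffic,
 * assuming each controller contributes an equal share of the peak. */
double peak_for_active_mcs(double total_peak_gbs, int active_mcs)
{
    assert(total_peak_gbs > 0.0);
    assert(active_mcs >= 0 && active_mcs <= TOTAL_MCS);
    return total_peak_gbs * active_mcs / TOTAL_MCS;
}
```

With the 460GB/s figure, a test that exercises one MC is bounded at 28.75GB/s and one that exercises four is bounded at 115GB/s, so measured results should be judged against the bound for the MCs actually in use.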
This article has explained why HBM is needed to keep up with growing memory bandwidth demand, what can impact DRAM bandwidth, the options available to maximise your bandwidth, and how to monitor and profile your results.
Using Vitis technology to generate and accelerate HBM traffic is a quick and easy way to verify your bandwidth requirements and ensure these are met, whilst also profiling different HBM configurations to determine which is optimal for your system.