Hardware Accelerated RISC-V Vector Extension for High Performance Embedded Computing

The contemporary computational landscape is increasingly defined by the intensive data-processing requirements of Artificial Intelligence (AI) and Machine Learning (ML). To address these demands, Single Instruction, Multiple Data (SIMD) strategies have emerged as a critical approach for accelerating data-intensive workloads through the parallelisation of operations. Within this context, the open-source RISC-V Instruction Set Architecture (ISA) facilitates efficient SIMD computation through its RISC-V Vector Extension (RVV). This thesis presents the design and validation of a vector accelerator tailored for high-performance tasks within embedded systems. The architecture is developed with a modular foundation to support future scalability and implements the Zve32x vector sub-extension, providing support for 32-bit integer operations. Integrated as a coprocessor to the CV32E20 core within the eXtendable Heterogeneous Energy-efficient Platform (X-HEEP) ecosystem, the accelerator extends the scalar core's capabilities via the Core-V eXtension Interface (CV-X-IF) 1.0. Data memory accesses for load/store operations are handled through the Open Bus Interface (OBI) v1.0 protocol to ensure efficient data throughput. A first implementation of the accelerator, constrained to a Vector Register Length (VLEN) of 128 bits, was validated via simulation and on a Xilinx PYNQ-Z2 Field-Programmable Gate Array (FPGA) board. Performance was evaluated using standard data-parallel kernels: SAXPY and Indexed Arithmetic, both based on scalar-vector multiplication, and Matmul, which performs matrix multiplication. Furthermore, this research evaluates the RISC-V GNU Compiler Toolchain, specifically investigating its auto-vectorisation capabilities in C-based applications. A comparative analysis was performed between standard C implementations and those utilising the RISC-V Vector C Intrinsics.
Results from simulation and FPGA execution demonstrate that the proposed accelerator achieves a maximum speed-up of 3.83× for the Indexed Arithmetic algorithm when employing the C Intrinsics library, highlighting the performance advantages of manual vectorisation in specialised embedded hardware.
