Robustness against Faults in Configuration Memories of FPGA-based LLMs


Large Language Models (LLMs) pose significant challenges in terms of the speed and energy dissipation of AI systems. Dependability is a further important issue for LLM implementations; this is especially relevant for FPGAs, which are vulnerable to soft errors in the configuration memory. Moreover, as current GPU-based implementations are not energy efficient, there is interest in running LLMs on other technology platforms, such as FlightLLM (an FPGA-based accelerator designed to run LLMs energy-efficiently). In this paper, we analyze and evaluate the robustness of FPGA-based LLMs against faults/errors in the configuration memories. For the evaluation, we first propose a PyTorch-based fault injection simulator built on an analysis of FlightLLM, and we study its robustness against stuck-at faults in the configuration memory. Furthermore, we propose an efficient error detection technique based on a concurrent classifier. Evaluation results show that stuck-at errors on the high bits of the logic units can dramatically degrade LLM performance, while the proposed concurrent classifier can effectively detect errors with negligible complexity and overhead. Finally, a low-cost fault location scheme is proposed so that the fault can easily be recovered by dynamic partial reconfiguration. The combination of concurrent-classifier error detection and fault location can be used to efficiently improve the robustness of an FPGA-based LLM such as FlightLLM.
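The paper's own simulator is not reproduced here, but the kind of stuck-at fault injection it describes can be sketched in a few lines of PyTorch. The sketch below is a hypothetical illustration (the function name, bit width, and fixed-point format are assumptions, not the paper's implementation): weights are treated as 16-bit fixed-point values with 8 fractional bits, and a single bit position is forced to 0 or 1, mimicking a faulty output line of an FPGA logic unit.

```python
import torch

def inject_stuck_at(weights: torch.Tensor, bit: int,
                    stuck_value: int, frac_bits: int = 8) -> torch.Tensor:
    """Force one bit of every fixed-point weight to a stuck value (0 or 1).

    Hypothetical sketch: weights are quantized to fixed point with
    `frac_bits` fractional bits; `bit` selects the faulty bit line.
    """
    scale = 1 << frac_bits
    q = torch.round(weights * scale).to(torch.int32)  # quantize to fixed point
    mask = 1 << bit
    if stuck_value:
        q = q | mask       # stuck-at-1: bit forced high
    else:
        q = q & ~mask      # stuck-at-0: bit forced low
    return q.to(torch.float32) / scale  # dequantize back to float

# A stuck-at fault on a high bit perturbs the value far more than on a low
# bit, which matches the abstract's observation that errors on high bits
# degrade LLM performance dramatically.
w = torch.tensor([0.50, -0.25, 0.10])
print(inject_stuck_at(w, bit=3, stuck_value=1))   # low bit: small deviation
print(inject_stuck_at(w, bit=12, stuck_value=1))  # high bit: large deviation
```

Running such a sweep over bit positions across a model's weight tensors is one way to reproduce the qualitative result above: faults in low-order bits are largely tolerated, while high-order stuck-at faults can corrupt activations enough to break generation.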
