Perturbation-based error detection and correction (PBEDC) in dependable large-scale machine learning systems


Conventional error-tolerant schemes for Neural Networks (NNs) usually require either redundancy or changes to normal operation, leading to considerable overheads. They are not feasible for large-scale Machine Learning (ML) systems, which typically employ several complex networks. This paper proposes a Perturbation-Based Error Detection and Correction (PBEDC) scheme designed to perform error detection and correction by reutilizing the inference process. Dependable performance, defined as the ability to operate correctly in the presence of errors, is the key characteristic under consideration. PBEDC employs a compact set of representative samples that are selected to monitor a few check nodes carrying intermediate signals. The effectiveness of PBEDC is evaluated by taking Contrastive Language-Image Pre-Training (CLIP) networks as a case study. Compared with traditional schemes that use the final prediction as the check node, PBEDC achieves a superior error detection rate (> 95%) and can handle single bit-flip errors in the weights (which cannot be captured by existing schemes). This also enables the correction of errors when the proposed scheme is combined with the use of parity codes. Furthermore, analysis and simulation results show that the number of PBEDC samples required to achieve satisfactory error tolerance is very small; the complexity of the proposed scheme does not scale with network size, an advantage that is especially pronounced in large-scale ML systems.
