A 3-D Multi-Precision Scalable Systolic FMA Architecture
View/ Open
Published version
Embargoed until: 5555-01-01
Embargoed until: 5555-01-01
Publisher
DOI
10.1109/TCSI.2024.3497724
Journal
IEEE Transactions on Circuits and Systems I: Regular Papers
ISSN
1549-8328
Metadata
Show full item recordAbstract
— Artificial Intelligence (AI) has almost become the default approach in a wide range of applications, such as computer vision, chatbots, and natural language processing. These AI-based applications require computing large-scale data with sufficient precision, typically in floating-point numbers, within a limited time window. A primary target for AI acceleration is matrix multiplication, mainly involving dot products through Multiply-Accumulate (MAC) operations. Current research employs the Fused Multiply-Add (FMA) operation, based on IEEE-754 Floating Point (FP) standard, to meet these requirements. However, current research focuses more on simplifying the internal digital circuits of the Processing Elements (PEs) performing FMA operations, rather than optimizing the FMA process specifically for MAC tasks. Current PE arrays often use a two-dimensional (2-D) systolic array design, without specific optimization for MAC operations, thus their parallelism is not fully utilized. Additionally, these designs lack reconfigurability and flexibility, leading to suboptimal performance on Field-Programmable Gate Arrays (FPGAs). Moreover, some designs adopt lower precision computing in AI inference for higher performance. However, some AI models still rely on high-precision computing to maintain the accuracy. Thus, multi-precision computing is commonly used in AI accelerators. To address these challenges, this paper proposes a novel Multi-Fused Multiply-Accumulate (MFMA) scheme and a corresponding three-dimensional (3-D) scalable systolic FP computing architecture. The MFMA scheme addresses the problem of the classical FMA scheme. It optimizes FMA for MAC operations with the Fused Multiply-Accumulate (FMAC) operation. Also, it combines multi-precision and mixed-precision FP computing methods for higher accuracy and lower overflow error. The proposed architecture integrates two 2-D systolic arrays into the PE for a 3-D systolic array, achieving higher parallelism and flexibility. The proposed scalable architecture can be customized to suit various FMAC operations. Compared with existing state-of-the-art FP architectures on FPGAs, our proposed architecture achieves 47%, 10%, and 159% energy efficiency improvements in FP32, FP16, and INT8 operations, respectively. Furthermore, our proposed architecture achieves energy efficiency improvements of 105%, 54%, and 262% under efficiency saturation conditions, outperforming the existing state-of-the-art design.