Presentation
22 February 2021
Ensuring reliable computation with unreliable devices in the era of AI/ML
Divya Prasad, Rob Aitken
Abstract
Highly reliable technology and design are vital to the future of the semiconductor industry for three reasons: technology scaling, new materials, and new workloads. First, geometric scaling has exacerbated multiple reliability phenomena, and managing the reliability of these devices has itself limited performance and other target design metrics. For example, interconnects at sub-7nm nodes are heavily resistive and are a key bottleneck in high-performance designs. One of the main reasons is that copper wires require thick barrier/liner layers, which consume useful conductor area, to maintain wire reliability. Worsening back-end-of-line (BEOL) electromigration (EM) at advanced nodes also forces designers to limit the use of high-performance gates, thereby limiting peak design performance. Second, the rapid increase in the number of new materials introduced to extend Moore’s law scaling requires designers to work with devices whose reliability is little understood or unknown, potentially leading to excessive pessimism in guard-banding. Understanding the underlying failure mechanisms and quantifying their impact is key to determining the right design practices. The introduction of new wire materials such as cobalt and ruthenium, after almost two decades of copper wires, is one such example, with non-trivial implications for how power delivery is designed and how designs are implemented today. Finally, the rapid growth in computing demand in the era of AI/ML has produced new workloads that stress the underlying devices in unique ways and demand different levels of guarantee; design-for-reliability is imperative for “always-on” applications such as high-performance computing (HPC) and for mission-critical applications such as autonomous driving.
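To make the barrier/liner point concrete, the following sketch estimates how much of a rectangular trench is left for copper when a conformal liner of thickness t coats both sidewalls and the bottom. It is a simplified, hypothetical model: the function names and the 20 nm × 40 nm trench with 1 nm and 3 nm liners are illustrative assumptions rather than figures from the talk, the liner is treated as non-conducting, and the resistivity increase of very narrow copper lines is ignored, both of which affect real numbers.

```python
# Hypothetical geometric model of liner overhead in a rectangular damascene trench.
# Assumptions (illustrative only): liner on both sidewalls and the bottom,
# liner carries no current, copper resistivity is the same with and without liner.

def effective_copper_area(width_nm: float, height_nm: float, liner_nm: float) -> float:
    """Cross-sectional area (nm^2) left for copper after the liner is deposited."""
    w_cu = width_nm - 2 * liner_nm   # liner on the left and right sidewalls
    h_cu = height_nm - liner_nm      # liner on the trench bottom only
    if w_cu <= 0 or h_cu <= 0:
        raise ValueError("liner consumes the entire trench")
    return w_cu * h_cu

def conductor_fraction(width_nm: float, height_nm: float, liner_nm: float) -> float:
    """Fraction of the trench cross-section that is copper."""
    return effective_copper_area(width_nm, height_nm, liner_nm) / (width_nm * height_nm)

def resistance_penalty(width_nm: float, height_nm: float, liner_nm: float) -> float:
    """R_with_liner / R_without_liner; with R = rho * L / A this is the inverse area ratio."""
    return 1.0 / conductor_fraction(width_nm, height_nm, liner_nm)

if __name__ == "__main__":
    print(f"{conductor_fraction(20, 40, 3):.4f}")   # 0.6475
    print(f"{resistance_penalty(20, 40, 3):.2f}")   # 1.54
    print(f"{conductor_fraction(20, 40, 1):.4f}")   # 0.8775
```

Under these assumptions, a 3 nm liner leaves about 65% of the trench for copper (roughly a 1.54× resistance increase), whereas a 1 nm liner leaves about 88%, which illustrates why liner thickness matters so much as wire dimensions shrink.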
Device-level understanding and faithful modeling of both physical effects, such as aging and time-dependent dielectric breakdown (TDDB), and the electrical mechanisms that cause transient errors in designs are paramount. Aging effects at the device level are typically combated by guard-banding at the design level; however, bias temperature instability (BTI) aging and wire electromigration, both of which have “healing” capabilities, can be offset by balancing bias states in the design. Effects such as hot-carrier injection (HCI), which cause damage at the drain of the transistor, cannot be compensated for at the design level; in such cases, the time-to-failure is modeled instead. For transient (soft) errors, in which particle strikes corrupt stored data, novel circuit design techniques are used to reduce their probability. For example, a “popular vote” scheme can replicate logic and strategically space the copies apart, but this increases design area. Because such faults are heavily workload dependent, it is key to determine which parts of the design are most affected. Additionally, memory blocks, flip-flops, and logic blocks are each affected differently by such faults and require different compensating techniques. The talk will give a brief overview of the physical and electrical failure mechanisms at advanced nodes. Popular modeling and design practices for handling the reliability of modern designs will be discussed, and trends will be reviewed, highlighting the importance of design-technology-reliability co-optimization techniques for enabling future designs.
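The “popular vote” scheme described above is commonly realized as triple modular redundancy (TMR): three copies of a logic block compute the same result, and a 2-of-3 majority voter selects the output, so an upset in a single copy is outvoted. The sketch below is a behavioral model in Python, not a circuit implementation; the names `majority3`, `flip_bit`, `tmr_evaluate`, and `example_logic` are hypothetical. It also shows why the copies are spaced apart: if one strike flips the same bit in two copies, the voter outputs the wrong value.

```python
from typing import Callable, Tuple

def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a word."""
    return word ^ (1 << bit)

def tmr_evaluate(logic: Callable[[int], int], x: int,
                 faulty_copies: Tuple[int, ...] = (), bit: int = 0) -> int:
    """Run three replicas of `logic` on `x`, upset `bit` in the listed replicas, then vote."""
    outs = [logic(x) for _ in range(3)]
    for i in faulty_copies:
        outs[i] = flip_bit(outs[i], bit)
    return majority3(*outs)

def example_logic(v: int) -> int:
    """Hypothetical 8-bit logic block used only for illustration."""
    return (3 * v) & 0xFF

if __name__ == "__main__":
    print(tmr_evaluate(example_logic, 5))                               # 15: no upset
    print(tmr_evaluate(example_logic, 5, faulty_copies=(1,), bit=2))    # 15: single upset outvoted
    print(tmr_evaluate(example_logic, 5, faulty_copies=(0, 1), bit=2))  # 11: two copies hit, vote fails
```

The area cost is visible in the structure itself: the protected logic is instantiated three times and a voter is added, which is the design-area penalty the abstract mentions and why it pays to apply such protection only where the workload makes faults most damaging.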
Conference Presentation
© (2021) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Divya Prasad and Rob Aitken "Ensuring reliable computation with unreliable devices in the era of AI/ML", Proc. SPIE 11614, Design-Process-Technology Co-optimization XV, 1161404 (22 February 2021); https://doi.org/10.1117/12.2584774
KEYWORDS
Reliability

Copper

Logic

Electrical breakdown

Human-computer interaction

Instrument modeling

Particles
