PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters

  • We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters.
  • PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models.
  • We’re sharing the results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods for identifying SDCs in model parameters.

Reliability is a critical aspect of any successful AI implementation. But the growing complexity and diversity of AI hardware systems also brings an increased risk of hardware faults such as bit flips. Manufacturing defects, aging components, or environmental factors can lead to data corruptions – errors or alterations in data that can occur during storage, transmission, or processing and result in unintended changes in information.

Silent data corruptions (SDCs), where an undetected hardware fault results in erroneous application behavior, have become increasingly prevalent and difficult to detect. Within AI systems, an SDC can create what’s called parameter corruption, where AI model parameters are corrupted and their original values are altered.

When this occurs during AI inference/serving, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services.

Figure 1 shows an example of this, where a single bit flip can drastically alter the output of a ResNet model.

Figure 1: Flipping a random bit of one parameter in the first convolution (conv) layer in ResNet-18 drastically alters the model’s output.

With this escalating threat in mind, there are two crucial questions: How vulnerable are AI models to parameter corruptions? And how do different components (such as modules and layers) of the models exhibit different levels of vulnerability to parameter corruptions?

Answering these questions is a crucial part of delivering reliable AI systems and services, and it offers valuable insights for guiding AI hardware system design, such as when assigning AI model parameters or software variables to hardware blocks with differing fault protection capabilities. Moreover, it can provide crucial information for formulating strategies to detect and mitigate SDCs in AI systems in an efficient and effective manner.

Parameter vulnerability factor (PVF) is a novel metric we’ve introduced with the goal of standardizing the quantification of AI model vulnerability against parameter corruptions. PVF is a versatile metric that can be tailored to different AI models/tasks and is also adaptable to different hardware fault models. Moreover, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability.

What is PVF?

PVF is inspired by the architectural vulnerability factor (AVF) metric used within the computer architecture community. We define a model parameter’s PVF as the probability that a corruption in that particular model parameter will lead to an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments.
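
Concretely, this definition lends itself to a simple Monte Carlo estimate. The sketch below is our own illustration rather than code from the paper; `run_inference_with_fault` and `is_incorrect` are hypothetical callables supplied per model and task:

```python
from typing import Any, Callable

def estimate_pvf(
    run_inference_with_fault: Callable[[], Any],  # injects a fault into the target parameter, then runs inference
    is_incorrect: Callable[[Any], bool],          # task-specific definition of an "incorrect output"
    n_trials: int = 10_000,
) -> float:
    """PVF of a parameter ~ fraction of fault-injection trials yielding an incorrect output."""
    failures = sum(is_incorrect(run_inference_with_fault()) for _ in range(n_trials))
    return failures / n_trials
```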

PVF has several features:

Parameter-level quantitative analysis

As a quantitative metric, PVF concentrates on parameter-level vulnerability, calculating the probability that a corruption in a specific model parameter will lead to an incorrect model output. This “parameter” can be defined at different scales and granularities, such as an individual parameter or a group of parameters.

Scalability across AI models/tasks

PVF is scalable and applicable across a wide range of AI models, tasks, and hardware fault models.

Provides insights for guiding AI system design

PVF can provide valuable insights for AI system designers, guiding them in making informed decisions about balancing fault protection with performance and efficiency. For example, engineers can leverage PVF to map the most vulnerable parameters to better-protected hardware blocks and to explore tradeoffs among latency, power, and reliability, enabling a surgical approach to fault tolerance at selective locations instead of an all-or-nothing approach.

Can be used as a standard metric for AI vulnerability/resilience evaluation

PVF has the potential to unify and standardize such practices, making it easier to compare the reliability of different AI systems/parameters and fostering open collaboration and progress in the industry and research community.

How PVF works

Like AVF, PVF is a statistical concept and needs to be derived through a large number of FI experiments that are statistically meaningful. Figure 2 shows the overall flow to compute PVF through an FI process. We’ve presented a case study on open-source DLRM inference, with more details and example case studies available in our paper.

Figure 2: Computing PVF through FI.
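
To make that flow concrete, here is a minimal PyTorch sketch of single-bit-flip FI on one named parameter. It is our own simplification, assuming float32 parameters and treating a changed top-1 prediction as an “incorrect output”; the DLRM study in the paper defines these details differently:

```python
import random
import struct

import torch

def flip_random_bit_(t: torch.Tensor) -> None:
    """Flip one random bit of one randomly chosen float32 element, in place."""
    flat = t.view(-1)
    i = random.randrange(flat.numel())
    (bits,) = struct.unpack("<I", struct.pack("<f", float(flat[i])))
    bits ^= 1 << random.randrange(32)  # each of the 32 bit positions is equally likely
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    flat[i] = corrupted

@torch.no_grad()
def pvf_single_bit_flip(model: torch.nn.Module, param_name: str, batches, n_trials: int = 1000) -> float:
    """Estimate a parameter's PVF as the fraction of bit-flip trials that change any top-1 prediction."""
    model.eval()
    param = dict(model.named_parameters())[param_name]
    golden = [model(x).argmax(dim=-1) for x in batches]  # fault-free reference outputs
    incorrect = 0
    for _ in range(n_trials):
        backup = param.detach().clone()
        flip_random_bit_(param.data)
        if any(not torch.equal(model(x).argmax(dim=-1), g) for x, g in zip(batches, golden)):
            incorrect += 1
        param.data.copy_(backup)  # restore the clean value before the next trial
    return incorrect / n_trials
```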

Figure 3 illustrates the PVF of three DLRM parameter components – the embedding table, bottom MLP (bot-MLP), and top MLP (top-MLP) – under 1, 2, 4, 8, 16, 32, 64, and 128 bit flips during each inference. We observe different levels of vulnerability across different components of DLRM. For example, under a single bit flip, the embedding table has a relatively low PVF; this is attributed to embedding tables being highly sparse, so a parameter corruption only takes effect when the corrupted parameter is activated by the corresponding sparse feature. The top-MLP, however, can reach a PVF of 0.4% under even a single bit flip. That is significant – for every 1,000 inferences, four will be incorrect. This highlights the importance of protecting the specific vulnerable parameters of a given model based on the PVF measurement.

Figure 3: The PVF of DLRM parameters under random bit flips.

We observe that with 128 bit flips during each inference, PVF increases to 40% for the top-MLP component and 10% for the bot-MLP component, and we also observe several NaN values. The top-MLP component has a higher PVF than the bot-MLP. This is attributed to the top-MLP being closer to the final model output, leaving less of a chance for a corruption to be masked by the inherent error-masking behavior of the neural layers in between.

The applicability of PVF

PVF is a versatile metric: the definition of an “incorrect output” (which may vary based on the model/task) can be adapted to suit user requirements. To adapt PVF to various hardware fault models, the method for calculating PVF remains the same as depicted in Figure 2; the only modification required is the manner in which the fault is injected, based on the assumed fault model.
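
For instance (these particular fault models are illustrative assumptions on our part, not ones prescribed by the paper), the bit-flip injector from the sketch above could be swapped for a multi-bit-flip or stuck-at fault model without touching the rest of the flow:

```python
import random
import struct

import torch

def multi_bit_flip_(t: torch.Tensor, n_bits: int = 2) -> None:
    """Fault model: flip n_bits distinct random bits of one randomly chosen float32 element."""
    flat = t.view(-1)
    i = random.randrange(flat.numel())
    (bits,) = struct.unpack("<I", struct.pack("<f", float(flat[i])))
    for pos in random.sample(range(32), n_bits):
        bits ^= 1 << pos
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    flat[i] = corrupted

def stuck_at_zero_(t: torch.Tensor) -> None:
    """Fault model: one randomly chosen element reads back as all zero bits."""
    flat = t.view(-1)
    flat[random.randrange(flat.numel())] = 0.0
```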

Additionally, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability. During training, a model’s parameters are iteratively updated to minimize a loss function. A corruption in a parameter could potentially disrupt this learning process, preventing the model from converging to an optimal solution. By applying the PVF concept during training, we could quantify the probability that a corruption in each parameter would result in such a convergence failure.
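
One hypothetical way to operationalize this (a sketch under our own assumptions, not an experiment from the paper) is to inject a parameter corruption partway through training and count the runs that fail to reach a target loss:

```python
from typing import Callable

import torch.nn as nn

def training_pvf(
    make_model: Callable[[], nn.Module],             # builds a fresh model for each trial
    train_with_fault: Callable[[nn.Module], float],  # trains the model, injecting one parameter
                                                     # corruption at a chosen step; returns final loss
    loss_threshold: float,                           # "converged" means final loss <= loss_threshold
    n_trials: int = 100,
) -> float:
    """Training-phase PVF: fraction of faulty training runs that fail to converge."""
    failures = sum(train_with_fault(make_model()) > loss_threshold for _ in range(n_trials))
    return failures / n_trials
```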

Dr. DNA and further avenues of exploration for PVF

The logical progression after understanding AI vulnerability to SDCs is to identify and minimize their impact on AI systems. To initiate this, we’ve introduced Dr. DNA, a method designed to detect and mitigate SDCs that occur during deep learning model inference. Specifically, we formulate and extract a set of unique SDC signatures from the distribution of neuron activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference.
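
As a toy illustration of the underlying idea (a simplified reconstruction on our part; Dr. DNA’s actual signatures and thresholds are described in the paper), one can calibrate per-layer activation statistics on fault-free inputs and then flag inferences whose statistics fall outside the calibrated range:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
bands: dict[str, list[float]] = {}  # calibrated [min, max] of each layer's activation mean
current: dict[str, float] = {}      # activation means observed on the current inference

def hook(name):
    def fn(module, inputs, output):
        current[name] = output.detach().float().mean().item()
    return fn

for name, module in model.named_modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        module.register_forward_hook(hook(name))

with torch.no_grad():
    # Calibration: fault-free inputs define a reference band per layer.
    for _ in range(16):
        model(torch.randn(1, 3, 224, 224))
        for name, v in current.items():
            lo, hi = bands.get(name, (v, v))
            bands[name] = [min(lo, v), max(hi, v)]

    # Detection: any layer whose activation mean leaves its band flags a suspected SDC.
    model(torch.randn(1, 3, 224, 224))
    flagged = [n for n, v in current.items() if not bands[n][0] <= v <= bands[n][1]]
    print("suspected SDC in layers:", flagged)
```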

We performed an extensive evaluation across 10 representative DNN models used in three common tasks (vision, GenAI, and segmentation) – including ResNet, Vision Transformer, EfficientNet, and YOLO – under four different error models. Results show that Dr. DNA achieves a 100% SDC detection rate in most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20-70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.

Read the research papers

PVF (Parameter Vulnerability Factor): A Novel Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations