Why discuss Adversarial Robustness?
Machine learning models have been shown to be vulnerable to adversarial attacks: perturbations added to inputs that are designed to fool the model and are often imperceptible to humans. In this document, we highlight several methods of generating adversarial examples and of evaluating adversarial robustness.
History of Adversarial Attacks
Adversarial examples are inputs to machine learning models that are intentionally designed to fool them into making incorrect predictions. The canonical example is the one from Ian Goodfellow’s paper below.
While adversarial machine learning is still a very young field (less than 10 years old), there has been an explosion of papers and work on attacking such models and finding their vulnerabilities, turning into a veritable arms race between defenders and attackers. Attackers essentially have the upper hand because breaking things is easier than fixing them. A useful analogy to adversarial ML is the early history of modern cryptography: researchers kept devising convoluted ways of securing systems, and other researchers kept breaking them, until they arrived at an algorithm that was, at the time, considered too computationally expensive to break (DES).
To that end, let us define what an adversarial example for a model looks like. Mathematically, assume that we have a model f that maps an input x to a prediction y. Then an adversarial perturbation δ for the model f and input x, which yields the adversarial example x + δ, can be defined such that:
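A common way to write this condition is sketched here, under two assumptions that are ours rather than the original text's: the perturbation size is measured with an Lp norm, and ε denotes the perturbation budget.

```latex
f(x + \delta) \neq y
\quad \text{subject to} \quad
\|\delta\|_p \leq \epsilon,
\qquad \text{where } y = f(x) \text{ and } p \in \{0, 1, 2, \infty\}.
```

In words: the perturbed input x + δ receives a different prediction than x, while δ stays small under the chosen norm.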
Based on the above parameters, there is a large family of algorithms that can be used to generate such perturbations. Broadly, they can be split up as follows:
At a very high level we can model the threat of adversaries as follows:
At a very high level, if we use the Lp norm as a robustness metric, there are 8 different kinds of attacks (2 x 4): two levels of gradient access (white box or black box) times four norm bounds (L0, L1, L2, L∞), as highlighted below. There are several other domain-specific ways to quantify the magnitude of the perturbation δ, but the Lp norm generalizes across all input types. Note that the attacks cited below were developed for images, but the general principles can be applied to any model f.
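To make the four norm bounds concrete, here is a minimal sketch of how the magnitude of a perturbation δ could be measured under each of them. It assumes NumPy; the helper name `perturbation_norms` and the example values are hypothetical and not taken from any of the cited papers.

```python
import numpy as np

def perturbation_norms(delta):
    """Return the L0, L1, L2 and L-infinity norms of a perturbation."""
    flat = np.asarray(delta, dtype=float).ravel()
    return {
        "L0": int(np.count_nonzero(flat)),        # number of changed entries
        "L1": float(np.abs(flat).sum()),          # total absolute change
        "L2": float(np.sqrt((flat ** 2).sum())),  # Euclidean size of the change
        "Linf": float(np.abs(flat).max()),        # largest single-entry change
    }

# Hypothetical 2x2 "image" perturbation that touches two pixels.
delta = np.array([[0.0, 0.3], [-0.4, 0.0]])
print(perturbation_norms(delta))
```

An L0-bounded attack limits how many entries change, while an L∞-bounded attack limits how much any single entry changes. This is why the same perturbation can be small under one norm and large under another.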
| Access to compute gradients? (rows) / Norm bound? (columns) | L0 norm | L1 norm | L2 norm | L∞ norm |
|---|---|---|---|---|
| Y - White Box | SparseFool, JSMA | Elastic-net attacks | Carlini-Wagner | PGD, Carlini-Wagner |
| N - Black Box | Adversarial Scratches, Sparse-RS | - | GenAttack, sim | GenAttack, SIMBA |

Table 1: Taxonomy of different adversarial attack types. This is not an exhaustive list.
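The white-box versus black-box distinction can be illustrated with a toy sketch. It assumes NumPy and a hypothetical three-feature logistic-regression model. The white-box function takes a single signed-gradient step in the style of the fast gradient sign method from Goodfellow et al. The black-box function is a simple coordinate-wise random search that only queries the model's output; it is loosely inspired by query-based attacks and is not a faithful implementation of any attack in the table. All names and values are illustrative.

```python
import numpy as np

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model f."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm_perturbation(w, b, x, y, eps):
    """White box: one signed-gradient step, so ||delta||_inf <= eps."""
    # Gradient of the cross-entropy loss w.r.t. the input for logistic regression.
    grad_x = (predict_proba(w, b, x) - y) * w
    return eps * np.sign(grad_x)

def random_search_perturbation(query, x, y, eps, rng):
    """Black box: uses only query(x); tries +/- eps per coordinate in random order."""
    def true_prob(inp):
        # Probability the model assigns to the true label y (0 or 1).
        p = query(inp)
        return p if y == 1 else 1.0 - p

    delta = np.zeros_like(x)
    best = true_prob(x)
    for i in rng.permutation(x.size):
        for step in (eps, -eps):
            candidate = delta.copy()
            candidate[i] = step
            score = true_prob(x + candidate)
            if score < best:  # keep the step only if it hurts the true class
                delta, best = candidate, score
                break
    return delta

# Hypothetical model parameters and input.
rng = np.random.default_rng(0)
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([0.2, -0.1, 0.4])
y = 1
eps = 0.3

clean_pred = int(predict_proba(w, b, x) >= 0.5)
delta_wb = fgsm_perturbation(w, b, x, y, eps)
delta_bb = random_search_perturbation(
    lambda inp: predict_proba(w, b, inp), x, y, eps, rng
)
adv_pred_wb = int(predict_proba(w, b, x + delta_wb) >= 0.5)
adv_pred_bb = int(predict_proba(w, b, x + delta_bb) >= 0.5)
print(clean_pred, adv_pred_wb, adv_pred_bb)
```

In this toy setup both perturbations stay within the same L∞ budget and both flip the prediction. The difference is cost: the white-box attack needs one gradient computation, whereas the black-box attack spends up to two model queries per input coordinate, which grows quickly for image-sized inputs.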
As machine learning models become increasingly embedded in products and services all around us, their security vulnerabilities become ever more important, and monitoring becomes even more critical. We’ve highlighted different methods by which an adversary might launch an adversarial attack against a pre-trained model. As we find more examples, we will write another post to share the findings. In the meantime, check out our other data science blogs.
Modas et al., SparseFool: A Few Pixels Make a Big Difference, CVPR 2019
Papernot et al., Practical Black-Box Attacks against Machine Learning, ASIA CCS 2017
Sharma et al., EAD: Elastic-Net Attacks to Deep Neural Networks, AAAI 2018
Carlini et al., Towards Evaluating the Robustness of Neural Networks, IEEE Security & Privacy 2017
Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR 2018
Goodfellow et al., Explaining and Harnessing Adversarial Examples, ICLR 2015
Jere et al., Scratch that! An Evolution-based Adversarial Attack against Neural Networks, arXiv preprint: https://arxiv.org/abs/1912.02316
Croce et al., Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks, ECCV'20 Workshop on Adversarial Robustness in the Real World
Alzantot et al., GenAttack: Practical Black-box Attacks with Gradient-Free Optimization, GECCO '19
Guo et al., Simple Black-box Adversarial Attacks, ICML 2019