Adversarial images #
Classifiers based on neural networks are fickle things. In the diagram below, an image of what is clearly a panda was classified by a trained neural network as such with roughly 60% confidence. The authors then found that adding a little bit of noise to each pixel in the image, represented below as \(\epsilon\) times a pixelated grid, yielded an image that to the human eye still looks like the same panda, but the neural network was happy to pronounce it as a gibbon with almost 100% certainty!
Examples like this one are known as adversarial examples, and the field of adversarial machine learning has received substantial interest from both academia and industry.
In this homework problem, you will attempt to create your own example of a manipulated image, but because it is just homework and I don’t want to do anything risky, we will yet again work with the MNIST handwritten digit data set.
Background #
Let us start with a formal definition: a classifier on \(k\) classes is a function \(f: \mathbb{R}^n \rightarrow \{1, 2, \ldots, k\}\) that assigns one of the \(k\) class labels to each feature vector.
There are several standard ways to build a classifier with the tools we have. If we are interested in only two classes, we can build a regression model that produces a single real number from an input feature vector. The class is then determined by setting a threshold that separates the two classes. To produce a classifier for \(k\) classes, a regression model can be asked to produce a vector in \(\mathbb{R}^k\), and the class is determined by the coordinate with the largest entry.
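As a concrete illustration (a minimal sketch, not part of the assignment), here is how these two recipes look in code; the threshold value and the label conventions are arbitrary choices made for the example.

```python
import numpy as np

# Two classes: a regression model produces a single real number, and a threshold
# separates the classes.  (The threshold 0.5 and the labels 1 and 2 are arbitrary
# choices for this sketch.)
def classify_binary(score, threshold=0.5):
    return 2 if score > threshold else 1

# k classes: a regression model produces a vector in R^k, and the class is the
# coordinate with the largest entry.
def classify_k(output_vector):
    return int(np.argmax(output_vector)) + 1   # labels 1, ..., k

print(classify_binary(0.8))            # -> 2
print(classify_k([0.1, 2.3, -0.4]))    # -> 2
```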
Experimental setup #
In prior work, we have already trained a neural network classifier \(\mathcal{N} : \mathbb{R}^{784} \rightarrow \mathbb{R}\) that distinguishes between the digits 4 and 9 in the MNIST dataset. The class of an image is determined by examining the network's prediction: if it is above some threshold, the image belongs to one class, and if not, it belongs to the other. Our goal in this problem is to perturb an image so that the network misclassifies a 4 as the digit 9.
The key lies in the Jacobian
\[\frac{\partial \mathcal{N}}{\partial \vec{x}} = \Big( \frac{\partial \mathcal{N}}{\partial x_1}, \ldots, \frac{\partial \mathcal{N}}{\partial x_{784}} \Big).\]
When evaluated at a specific input vector, the entries of the Jacobian describe the influence that a perturbation of each input coordinate has on the output of \(\mathcal{N}.\)
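As a small sketch of how this Jacobian can be computed in practice, the snippet below uses automatic differentiation. The two-layer model is a stand-in: the assumption here is that the trained classifier from the companion notebook is (or can be wrapped as) a PyTorch model; any framework with automatic differentiation works the same way.

```python
import torch
import torch.nn as nn

# Stand-in for the trained classifier N: R^784 -> R; in practice, load the model
# from the companion notebook instead.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.rand(784, requires_grad=True)   # stand-in for a flattened MNIST image
output = model(x).squeeze()               # scalar prediction N(x)
output.backward()                         # backpropagate through the network

jacobian = x.grad                         # shape (784,): dN/dx_i for each pixel i
print(jacobian.shape)
```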
Use the companion notebook as you follow the outline below.
- Begin by training a neural network classifier that distinguishes between the digits 4 and 9.
- Choose a particular image \(\vec{x}\) of a 4 from the testing set that your model classifies correctly.
- Use the Jacobian to find a sparse adversarial perturbation \(\vec{s}\) for this image; a minimal sketch of one possible approach appears after this list. Recall that a vector is sparse if it consists mostly of zeros. How sparse can you make your adversarial perturbation? Explain your reasoning and submit your original and modified images. Can one tell from your modified image whether an adversarial attack has been carried out?
- A sparse perturbation usually requires profound changes to a small number of coordinates of \(\vec{x}\). What if instead you could only make small changes to any individual pixel, but were permitted to modify as many pixels as you wanted? I am looking for a very specific mathematical answer. Carry this out, examine the resulting images, and report your results.
- Suppose that we have a neural network \(\mathcal{N} : \mathbb{R}^n \rightarrow \mathbb{R}\) that determines a classifier \(f: \mathbb{R}^n \rightarrow \{1,2\}\) as we have done above. Let \[\mathcal{S} = \{ \vec{x}_i\}_{i=1}^k \] be a collection of feature vectors. Describe a process you could use to find an adversarial perturbation \(\vec{s}\) that works for all vectors in \(\mathcal{S}\) at the same time; that is, \(f(\vec{x}_i) \neq f(\vec{x}_i + \vec{s})\) for all \(i\).
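For the sparse-perturbation bullet above, here is a minimal sketch of one possible approach (not necessarily the intended one): rank the pixels by the magnitude of their partial derivatives and change only the most influential few. The model and image are stand-ins, and `n_pixels` and `step` are arbitrary placeholder values; in the actual experiment you would use the trained network and a correctly classified 4 from the test set.

```python
import torch
import torch.nn as nn

# Stand-in model and image; replace with the trained classifier and a real
# flattened MNIST image of a 4.  (Assumption: PyTorch.)
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 1))
x = torch.rand(784)

def jacobian_at(model, x):
    """Gradient of the scalar network output with respect to the input pixels."""
    x = x.clone().requires_grad_(True)
    model(x).squeeze().backward()
    return x.grad

def sparse_perturbation(model, x, n_pixels=10, step=1.0):
    """Change only the n_pixels most influential pixels, pushing each in the
    direction that increases the network output.  (Flip the sign of `step`
    if the attack needs to push the output below the threshold instead.)"""
    grad = jacobian_at(model, x)
    idx = grad.abs().topk(n_pixels).indices   # pixels with the largest |dN/dx_i|
    s = torch.zeros_like(x)
    s[idx] = step * grad[idx].sign()
    return s

s = sparse_perturbation(model, x)
print(model(x).item(), model(x + s).item())   # the output should move upward
```

A natural refinement, not shown here, is to clamp the perturbed pixels back into the valid intensity range so the result is still a legitimate image.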
Why is the AI researcher in the following picture wearing such stylish glasses?