
Adversarial images #

Classifiers based on neural networks are fickle things. In the diagram below, an image of what is clearly a panda was classified by a trained neural network as such with roughly 60% confidence. The authors then found that adding a little bit of noise to each pixel in the image, represented below as \(\epsilon\) times a pixelated grid, yielded an image that to the human eye still looks like the same panda, but the neural network was happy to pronounce it as a gibbon with almost 100% certainty!

The canonical example of an adversarial perturbation.
Image from 'Explaining and Harnessing Adversarial Examples' by Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy

Examples like this one are known as adversarial examples, and the field of adversarial machine learning has received substantial interest from both academia and industry.

A modified stop sign which eludes image recognition via an applied physical perturbation
Image from 'Robust Physical-World Attacks on Deep Learning Visual Classification' by Kevin Eykholt et al.

In this homework problem, you will attempt to create your own example of a manipulated image, but because it is just homework and I don’t want to do anything risky, we will once again work with the MNIST handwritten digit data set.

Background #

Let us start with a formal definition.

Definition: Suppose that \(\vec{x} \in \mathbb{R}^n \) is an input vector to a classifier \[ f: \mathbb{R}^n \rightarrow \{1,2,3, \ldots, k\}.\] Then \(\vec{s} \in \mathbb{R}^n \) is an adversarial perturbation if \(f(\vec{x}) \neq f(\vec{x}+\vec{s})\); that is, adding \(\vec{s}\) changes the class assigned to the feature vector \(\vec{x}.\)

There are several standard ways to build a classifier with the tools we have. If we are interested in only two classes, we can build a regression model that produces a single real number from an input feature vector; the class is then determined by a threshold that separates the two classes. To produce a classifier for \(k\) classes, the regression model can instead be asked to produce a vector in \(\mathbb{R}^k\), and the class is determined by the coordinate with the largest entry.
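As a concrete illustration, here is a minimal sketch of both decision rules in Python; the function names and the threshold value are ours, chosen only for this example.

```python
import numpy as np

# Two classes: compare the single regression output against a threshold.
def classify_binary(score, threshold=0.0):
    return 1 if score <= threshold else 2

# k classes: the model outputs a vector in R^k; the class is the index
# of the largest entry.
def classify_multiclass(scores):
    return int(np.argmax(scores)) + 1   # classes labeled 1, ..., k

print(classify_binary(0.7))                   # -> 2
print(classify_multiclass([0.1, 2.3, -0.7]))  # -> 2
```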

Experimental setup #

In prior work, we have already trained a neural network classifier \(\mathcal{N} : \mathbb{R}^{784} \rightarrow \mathbb{R}\) that distinguishes between the digits 4 and 9 in the MNIST dataset. The class of an image is determined by examining the network's prediction; if this is above some threshold, the image belongs to one class, and if not, it belongs to the other.
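The sketch below shows what such a setup might look like in PyTorch. The architecture, the label encoding (a 4 scores near 0, a 9 near 1), and the 0.5 threshold are assumptions made purely for illustration; they are not necessarily what the companion notebook uses.

```python
import torch
import torch.nn as nn

# A hypothetical stand-in for the trained network N: R^784 -> R.
N = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

def classify(x, threshold=0.5):
    """Return 4 or 9 for a flattened 28x28 image x (assumed encoding: 4 -> 0, 9 -> 1)."""
    with torch.no_grad():
        return 9 if N(x).item() > threshold else 4

x = torch.rand(784)   # placeholder for an actual MNIST image
print(classify(x))
```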

Goal: Compute adversarial perturbations for a neural network digit classifier that misclassifies certain images of the digit 4 as the digit 9.

The key lies in the Jacobian

\[\frac{\partial \mathcal{N}}{\partial \vec{x}} = \Big( \frac{\partial \mathcal{N}}{\partial x_1}, \ldots, \frac{\partial \mathcal{N}}{\partial x_{784}} \Big).\]

When evaluated at a specific input vector, the entries of the Jacobian describe how strongly a perturbation of each input coordinate influences the output of \(\mathcal{N}.\)
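If the network is implemented in a framework with automatic differentiation, this Jacobian can be evaluated directly. The sketch below assumes a differentiable PyTorch model such as the hypothetical N from the previous sketch.

```python
import torch

def jacobian_at(N, x):
    """Evaluate dN/dx at the input x via autograd."""
    x = x.clone().detach().requires_grad_(True)
    y = N(x)           # single-entry output, since N maps R^784 to R
    y.backward()       # fills x.grad with the 784 partial derivatives
    return x.grad.detach()

# Example (with the hypothetical N and x from the previous sketch):
# grad = jacobian_at(N, x)
# grad[i] measures how sensitive the output is to a change in pixel i.
```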

Homework exercise:

Use the companion notebook as you follow the outline below.

  • Begin by training a neural network classifier that distinguishes between the digits 4 and 9.
  • Choose a particular image \(\vec{x}\) of a 4 from the testing set that your model classifies correctly.
  • Use the Jacobian to find a sparse adversarial perturbation \(\vec{s}\) for this image. Recall that a vector is sparse if it consists mostly of zeros. How sparse can you make your adversarial perturbation? Explain your reasoning and submit your original and modified images. Can one tell from your modified image whether an adversarial attack has been carried out? (One possible starting point for this item and the next is sketched in the code after this list.)
  • A sparse perturbation usually requires profound changes to a small number of coordinates of \(\vec{x}\). What if instead you could only make small changes to any individual pixel, but were permitted to modify as many pixels as you wanted? I am looking for a very specific mathematical answer. Carry this out, examine the resulting images, and report your results.
  • Suppose that we have a neural network \(\mathcal{N} : \mathbb{R}^n \rightarrow \mathbb{R}\) that determines a classifier \(f: \mathbb{R}^n \rightarrow \{1,2\}\) as we have done above. Let \[\mathcal{S} = \{ \vec{x}_i\}_{i=1}^k \] be a collection of feature vectors. Describe a process you could use to find an adversarial perturbation \(\vec{s}\) that works for all vectors in \(\mathcal{S}\) at the same time; that is, \(f(\vec{x}_i) \neq f(\vec{x}_i + \vec{s})\) for all \(i\).
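As a point of reference for the third and fourth items, here is a sketch of two perturbation strategies built from the gradient, reusing the hypothetical model N and the jacobian_at helper from the earlier sketches. Treat it as one possible starting point under those assumptions, not as the intended solution.

```python
import torch

# One possible starting point (not the only approach, and not necessarily
# the intended answer): both perturbations are built from the gradient of
# the hypothetical network N at the image x.
def sparse_perturbation(N, x, k=10, magnitude=1.0):
    """Change only the k pixels to which the output is most sensitive."""
    grad = jacobian_at(N, x)
    idx = grad.abs().topk(k).indices          # the k most influential pixels
    s = torch.zeros_like(x)
    s[idx] = magnitude * grad[idx].sign()     # push the output toward the other class
    return s

def small_dense_perturbation(N, x, eps=0.1):
    """Change every pixel a little, following the sign of the gradient."""
    grad = jacobian_at(N, x)
    return eps * grad.sign()

# Usage idea: s = sparse_perturbation(N, x), then compare the classes
# assigned to x and to x + s.
```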
Homework exercise:

Why is the AI researcher in the following picture wearing such stylish glasses?


Image from 'A General Framework for Adversarial Examples with Objectives' by Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter