Machine learning: ReLU activation function

The ReLU activation function

The ReLU activation function is ubiquitous. Some attribute the success of modern neural networks to its simplicity. The goal of this project is to investigate some of its basic properties. The following is one possible outline of how to proceed:

  • Universal approximation. We have shown that dense neural networks whose activation functions are squashing functions are universal approximators. But \(\text{ReLU}(x)\) is not a squashing function: \[\lim_{x \rightarrow \infty} \text{ReLU}(x) \neq 1.\] Does the universal approximation property still hold for ReLU neural networks? There are two ways to proceed:

    • Examine our proof of the Universal Approximation Theorem. Is the squashing property absolutely necessary, or can our proof be adapted to work for other activation functions as well? (The sketch after this list explores one such adaptation.)
    • Or examine how the existing literature handles universal approximation for ReLU networks.
  • Speed. The ReLU has a simple derivative, so, at least in principle, computing Jacobians and gradient vectors should be faster for ReLU-based networks. Design an experiment to investigate whether this is indeed true: does the time required to train a neural network depend appreciably on its activation function? You can time how long a notebook cell takes to run with the ipython-autotime extension; a minimal timing sketch is given after this list.

  • Other characteristics. There are other intuitive explanations for ReLU's success. Investigate these, either by examining the existing literature or by designing your own experiments.
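
Although ReLU itself is not a squashing function, a difference of two shifted ReLUs is: \(\text{ReLU}(x) - \text{ReLU}(x-1)\) equals 0 for \(x \le 0\), \(x\) on \([0,1]\), and 1 for \(x \ge 1\). The sketch below is a small numerical check of this fact in plain NumPy (the helper names `relu` and `ramp` are ours, not from any library); it is meant only as a hint for how the squashing-function argument might be adapted to ReLU networks.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(x, 0.0)

def ramp(x):
    """Difference of two shifted ReLUs: 0 for x <= 0, x on [0, 1], 1 for x >= 1."""
    return relu(x) - relu(x - 1.0)

# Spot-check that ramp is nondecreasing with limits 0 and 1 (a squashing function).
xs = np.array([-100.0, -1.0, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 100.0])
print(np.column_stack([xs, ramp(xs)]))
```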

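As a starting point for the timing experiment, here is a minimal sketch that trains the same small network with each of several activation functions and records the wall-clock training time. It uses scikit-learn's `MLPClassifier` so that it also runs outside a notebook (where ipython-autotime is unavailable); the dataset size, layer widths, and iteration budget are arbitrary placeholder choices, not a recommended experimental design.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# A synthetic classification problem; the sizes here are placeholders.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for activation in ["relu", "tanh", "logistic"]:
    clf = MLPClassifier(hidden_layer_sizes=(64, 64),
                        activation=activation,
                        max_iter=50,
                        random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)  # may warn about non-convergence at this small iteration budget
    elapsed = time.perf_counter() - start
    print(f"{activation:>8s}: {elapsed:.2f} s for {clf.n_iter_} iterations")
```

Keep in mind that equal iteration counts do not guarantee equal work per iteration or equal final accuracy, so a fuller experiment should fix a target loss or accuracy rather than an iteration count, and should repeat each timing several times.
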
You have quite a bit of freedom in how to proceed. At the end of the project, my goal is for you to be a wiser consumer and advocate of this trendy activation function.