Activation Functions - When to use them and how do they perform?

Phuc Truong
9 min read · Apr 16, 2018

Notice: First, this is NOT my own paper; the sources of this article are Analytics Vidhya and The Theory Of Everything. I have only collected the material on my Medium and added some notes of my own, to summarize these great ideas for study purposes in a course at my company, for non-commercial use. If the authors of those posts do not allow me to do this, I will take it down, forever!

What is an Activation Function?

Activation functions are an extremely important feature of artificial neural networks. They basically decide whether a neuron should be activated or not: whether the information the neuron is receiving is relevant for the given input, or should be ignored.

The activation function is the non-linear transformation that we apply over the input signal. This transformed output is then sent to the next layer of neurons as input. (image source: The Theory of Everything)

Can we do without an activation function?

Now the question arises: if the activation function adds so much complexity, can we do without one?

When we do not have an activation function, the weights and bias simply perform a linear transformation. A linear equation is simple to solve, but it is limited in its capacity to solve complex problems: a neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. We want our neural networks to work on complicated tasks like language translation or image classification, and linear transformations alone would never be able to perform such tasks.

Linear functions can not take it all (image source: https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php)

Popular types of activation functions and when to use them

Binary Step Function

The first thing that comes to mind when talking about an activation function is a threshold-based classifier: if the value Y is above a given threshold value, activate the neuron, else leave it deactivated.

f(x) = 1, x >= 0
     = 0, x < 0
f'(x) = 0, for all x
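To make this concrete, here is a minimal NumPy sketch (my own addition, not from the original sources) of the step function and its gradient:

```python
import numpy as np

def binary_step(x):
    # output 1 where x >= 0, 0 otherwise
    return np.where(x >= 0, 1.0, 0.0)

def binary_step_grad(x):
    # the gradient is zero everywhere, which is what breaks back-propagation
    return np.zeros_like(x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(binary_step(x))       # [0. 0. 1. 1.]
print(binary_step_grad(x))  # [0. 0. 0. 0.]
```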

Pros:

It is extremely simple, and it can be used while creating a binary classifier. When we simply need to say yes or no for a single class, the step function is a natural choice.

Cons:

The function is more theoretical than practical, since in most cases we are classifying the data into multiple classes rather than a single class, and the step function cannot do that.
Moreover, the gradient of the step function is zero. This makes the step function not very useful, because during back-propagation the gradients of the activation functions are used in the error calculations that improve and optimize the results. The zero gradient of the step function reduces everything to zero, so the model does not really improve.

Actually, I am not sure whether these images are suitable as a sample of a binary classifier, and I might be judged for it later. Let them stand for the pre-processing procedures applied to images before putting them into the magic of ML.

Linear Function

We saw the problem with the step function: with the gradient being zero, it was impossible to update the weights during back-propagation. Instead of a simple step function, we can try using a linear function. And we have this one:

f(x) = a*x, where a belongs to R
if a = 4, we have…

Pros:

The input x is simply scaled to a*x. The output is no longer binary, so this can be applied to various neurons, and multiple neurons can be activated at the same time.

f'(x) = a

Cons:

As you saw in the image, the derivative of a linear function is constant: it does not depend on the input value x. This means that every time we do back-propagation the gradient is the same, which is a big problem: we are not really reducing the error, since the gradient barely changes. And not just that: suppose we are trying to perform a complicated task for which we need multiple layers in our network. If each layer applies only a linear transformation, then no matter how many layers we have, the final output is nothing but a linear transformation of the input.
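To see concretely that stacked linear layers collapse into a single linear transformation, here is a small NumPy sketch (the sizes and random weights are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # a batch of 4 inputs with 3 features

W1 = rng.normal(size=(3, 5))         # first "layer", no activation
W2 = rng.normal(size=(5, 2))         # second "layer", no activation

two_layers = x @ W1 @ W2             # output of the two stacked linear layers
one_layer  = x @ (W1 @ W2)           # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: the stack is just one linear map
```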

Sigmoid

And now, this is the time for our superstar. The sigmoid is a widely used activation function, and it is of the form:

f(x)=1/(1+exp(-x))
This is a smooth function and is continuously differentiable.

Pros:

The biggest advantage it has over the step and linear functions is that it is non-linear, which essentially means the output is non-linear as well. The function ranges from 0 to 1 and has an S shape. Around the origin, a small change in x brings about a large change in the value of Y, so the function essentially tries to push the Y values towards the extremes. This is a very desirable quality when we are trying to classify values into a particular class.

As for its derivative, it is smooth and depends on x. This means that during back-propagation we can easily use this function: the error can be back-propagated and the weights updated accordingly.
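As a small NumPy sketch (my addition), here are the sigmoid and its derivative, using the identity f'(x) = f(x)*(1 - f(x)) that follows from the formula above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # largest at x = 0 (0.25), tiny for |x| > 3

x = np.array([-6.0, -3.0, 0.0, 3.0, 6.0])
print(sigmoid(x))
print(sigmoid_grad(x))            # the gradient has almost vanished at +/-6
```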

Cons:

Although sigmoids are widely used even today, we still have some problems to address. As we saw on the plot, the function is pretty flat beyond the -3 to +3 region on the x axis. This means that once the input falls in that region the gradients become very small, approaching zero, and the network is not really learning, and we are all dead.

Another problem is that the values only range from 0 to 1. This means that the sigmoid function is not symmetric around the origin and the values it produces are all positive, and we do not always want the values going to the next neuron to all have the same sign. This can be addressed by scaling the sigmoid function, which gives us the tanh function.

Tanh

The tanh function is very similar to the sigmoid function. It is actually just a scaled version of the sigmoid function.

tanh(x) = 2sigmoid(2x)-1 = 2/(1+exp(-2x)) -1
Tanh works similarly to the sigmoid function, but it is symmetric about the origin and ranges from -1 to 1.
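A quick NumPy check (my addition) of the scaling relation above; it also shows the gradient comparison discussed in the Pros below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True

tanh_grad    = 1 - np.tanh(x) ** 2            # derivative of tanh
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # derivative of sigmoid
print(tanh_grad[3], sigmoid_grad[3])          # 1.0 vs 0.25 at x = 0: tanh is steeper
```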

Pros:

All other properties are the same as those of the sigmoid function. It is continuous and differentiable at all points. As you can see, the function is non-linear, so we can easily back-propagate the errors.

The gradient of the tanh function is steeper as compared to the sigmoid function.

Cons:

But the problem is similar to that of the sigmoid function: we still have the vanishing gradient problem. Towards the extremes, the graph of the tanh function is flat and the gradients are very low.

ReLU

The ReLU function is the Rectified Linear Unit. It is the most widely used activation function.

f(x) = max(0,x)

Pros:

ReLU is the most widely used activation function when designing networks today. We can easily back-propagate the errors and have multiple layers of neurons, and ReLU does not activate all the neurons at the same time: when the input is negative it is converted to zero and the neuron does not get activated. This means that at any given time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.

But the magic does not stop there. A smooth approximation to the rectifier is the analytic function

f(x) = log(1+exp(x))

And this is called the softplus function. The derivative of softplus is

f'(x) = exp(x)/(1+exp(x)) = 1/(1+exp(-x))
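A minimal NumPy sketch (my addition) of ReLU, its gradient, and the softplus approximation, whose derivative is exactly the sigmoid as in the formula above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, 0 where x < 0 (undefined at exactly 0; we use 0 there)
    return (x > 0).astype(float)

def softplus(x):
    return np.log1p(np.exp(x))         # log(1 + exp(x))

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))    # the sigmoid function

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))          # [0.  0.  0.  1.5]
print(relu_grad(x))     # [0. 0. 0. 1.]
print(softplus(x))      # smooth and always positive
```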

Back to the gradient of ReLU itself.

Cons:

If you look at the negative side of the graph, the gradient is zero, so the weights are not updated during back-propagation. This can create dead neurons which never get activated.

Leaky ReLU

Leaky ReLU is an improved version of the ReLU function. As we saw above, the gradient of ReLU is 0 when x < 0, which can make neurons die. Leaky ReLU is defined to address this problem: instead of defining the function as 0 for x less than 0, we define it as a small linear component of x.

f(x) = a*x, x < 0
     = x,   x >= 0
What we have done here is simply replace the horizontal line with a non-zero, non-horizontal line. Here a is a small value, say a = 0.01 or thereabouts.
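A minimal NumPy sketch (my addition), using a = 0.01 as suggested above:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    # the gradient on the negative side is a, not zero
    return np.where(x >= 0, 1.0, a)

x = np.array([-10.0, -1.0, 0.5, 2.0])
print(leaky_relu(x))       # [-0.1  -0.01  0.5   2.  ]
print(leaky_relu_grad(x))  # [ 0.01  0.01  1.    1.  ]
```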

Pros:

The reason we replace the horizontal line like this is to remove the zero gradient. In this case the gradient on the left side of the graph is non-zero, so we no longer encounter dead neurons in that region.

Not really cons but:

In the case of a parameterised ReLU (PReLU) function, a is also a trainable parameter: the network learns the value of a for faster and more optimal convergence. The parameterised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
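As a hedged illustration (my addition, assuming PyTorch is available), torch.nn.PReLU implements exactly this idea of a learnable slope a:

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()                 # one learnable slope for the negative side
x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)

y = prelu(x).sum()
y.backward()

print(prelu.weight)                # current value of the slope a (trainable)
print(prelu.weight.grad)           # its gradient: here the sum of the negative inputs
```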

And how about these others?

When updating the curve, we need to know in which direction and by how much to change it, and that depends on the slope. That is why we use differentiation in almost every part of Machine Learning and Deep Learning.

Activation Functions: Neural Networks, SAGAR SHARMA

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

How to choose the right Activation Function

Now that we have seen so many activation functions, we need to know which one should be used in which situation, depending on your purposes. Good or bad, there is no rule of thumb. However, as we discussed in previous classes of Andrew Ng on Coursera, depending on the properties of the problem we might be able to make a better choice for easier and quicker convergence of the network.

ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. That is a good point to consider when we are designing deep neural nets.

Here are some of the small notes that I collected:

  • When you know the function you are trying to approximate has certain characteristics, you can choose an activation function which will approximate the function faster leading to faster training process.
  • Sigmoid functions and their combinations generally work better in the case of classifiers
  • Sigmoids and tanh functions are sometimes avoided due to the vanishing gradient problem
  • ReLU function is a general activation function and is used in most cases these days
  • If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice
  • Always keep in mind that the ReLU function should only be used in the hidden layers. At the current time, ReLU works most of the time as a general approximator (as sketched below)
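As a hedged PyTorch sketch of these rules of thumb (my addition; the layer sizes are made up): ReLU in the hidden layers, and a sigmoid only at the output of a binary classifier:

```python
import torch
import torch.nn as nn

# hidden layers use ReLU; the output layer of a binary classifier uses sigmoid
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),  nn.Sigmoid(),
)

x = torch.randn(8, 20)          # a batch of 8 made-up examples
probs = model(x)                # values in (0, 1), one probability per example
print(probs.shape)              # torch.Size([8, 1])
```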

My note

You can create your own function!!!
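For example, here is a minimal NumPy sketch (my addition) of a custom activation, x * sigmoid(x) (often called swish/SiLU), together with its derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def my_activation(x):
    # a custom smooth activation: x * sigmoid(x)
    return x * sigmoid(x)

def my_activation_grad(x):
    s = sigmoid(x)
    # product rule: d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    return s + x * s * (1.0 - s)

x = np.array([-3.0, 0.0, 3.0])
print(my_activation(x), my_activation_grad(x))
```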

Many thanks to Vũ Lê, who supported me in completing (most of) this article!
