Loss functions: Why, what, where or when?
Introduction
In machine learning (ML), the ultimate goal relies on minimizing or maximizing a function called the “objective function”. The group of functions that are minimized are called “loss functions”. A loss function is a measurement of how good a prediction model is at predicting the expected outcome. As I mentioned in the previous article, Activation Functions — When to use them and how could they perform?, they are used in neurons or nodes. However, in supervised learning, these results become meaningless if we do not tune the parameters or optimize the values we get.
For example, consider a data set used for detecting cats: hundreds of images, a mix of cats and dogs. We label every cat picture as 1 and everything else as 0. Our mission is simple: put an image into the network and it returns a floating-point number that is used to predict which class the image belongs to. If the output were exactly 1, there would be no denying that it is a cat, and the contrary if it were 0. However, a Neural Network (NN) doesn't work this way; in practice the result we get is a real number like 0.1, 0.5 or 0.8. From this we have to decide whether 0.5 or 0.8 means cat or not.
Obviously, we may conclude that 0.8 is nearer to 1, so if an image returns 0.8, the probability that it is a cat is higher than if it had returned 0.5. But ML is not that simple; the ugly truth is that there will be cases that return 0.5, or even 0.1, and still show a cat. On the contrary, 0.8, or even 0.85 or 0.999, doesn't guarantee that it is a cat. You have probably heard about back-propagation, or something like it, for tuning parameters. But before that, the evaluation procedure needs a way to measure how far the received result is from the value it should be. And so, a star is born: the Loss Function.
So, what...?
As I mentioned above, the output of a network or of its activation functions sometimes becomes meaningless without a proper evaluation. Before listing some functions that might become your ‘fiancé’ in the future, I want to point out that there is no single loss function that works for all kinds of data. The choice depends on a number of factors, including the presence of outliers, the algorithm, the time efficiency of gradient descent, the ease of finding the derivatives and the confidence of predictions… or maybe an accident.
Loss functions can be categorized into two groups: classification and regression losses. And of course, they are different: classification is the task of predicting a discrete class label, while regression is the task of predicting a continuous quantity. However, they have some overlap.
A classification algorithm may predict a continuous value, but that continuous value is in the form of a probability for a class label. A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.
Some algorithms can be used for both tasks with small modifications, such as decision trees and artificial neural networks. Others cannot, or cannot be adapted easily.
Importantly, the way that we evaluate classification and regression predictions varies and does not overlap, for example:
- Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
- Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
Firstly, classification loss
Classification predictive modeling is the task of approximating a mapping function (f) from input variables (x) to discrete output variables (y). The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
Let's come back to the previous example: put an image into the network and it can be classified as belonging to one of two types, “cat” or “not”. That's called two-class classification, or binary classification. When a problem has more than two classes, for instance when we want to classify into “cat”, “dog” or any other group, it is called a multi-class classification problem. The predicted probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability.
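As a tiny illustration (pure NumPy, with made-up probabilities), picking the class with the highest predicted probability is just an argmax:

```python
import numpy as np

# Hypothetical predicted probabilities for one image over three classes.
class_names = ["cat", "dog", "other"]
probs = np.array([0.7, 0.2, 0.1])

predicted = class_names[int(np.argmax(probs))]
print(predicted)  # "cat", because 0.7 is the highest probability
```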
Now, let's walk through some of them.
Cross-Entropy Loss (or Log Loss)
It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
Cross-entropy and log loss are slightly different depending on the context, but in machine learning, when calculating error rates between 0 and 1, they resolve to the same thing. Now, let's dig into its formula.
- M: number of classes (dog, cat, fish)
- log: the natural log
- y: binary indicator (0 or 1) of whether class label c is the correct classification for observation o
- p: predicted probability that observation o is of class c
If M = 2 (binary classification), the loss for one observation reduces to −(y·log(p) + (1 − y)·log(1 − p)).
If M > 2 (i.e. multi-class classification), we calculate a separate loss for each class label per observation and sum the result: −Σ_c y_{o,c}·log(p_{o,c}), with c running from 1 to M.
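As a quick sanity check, here is a minimal NumPy sketch of both cases (the function names, array shapes and clipping constant are my own choices, not taken from any particular library):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary case (M = 2): y is 0/1, p is the predicted probability of class 1."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y, p, eps=1e-12):
    """Multi-class case (M > 2): y is one-hot (n, M), each row of p sums to 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# A confident correct prediction gives a small loss...
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
# ...while the "0.012 when the label is 1" case above gives a large one.
print(binary_cross_entropy(np.array([1]), np.array([0.012])))        # ~4.42

# Multi-class toy example: the true class is "cat" (index 0) out of three classes.
print(categorical_cross_entropy(np.array([[1, 0, 0]]),
                                np.array([[0.7, 0.2, 0.1]])))        # -log(0.7) ≈ 0.357
```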
Hinge loss
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.
It is intended for use with binary classification where the target values are in the set {-1, 1}.
The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values.
Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.
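Here is a minimal NumPy sketch of the hinge loss, assuming labels in {-1, 1} and a raw (unsquashed) model score; the toy numbers are mine:

```python
import numpy as np

def hinge_loss(y_true, y_score):
    """Mean hinge loss; y_true in {-1, 1}, y_score is the raw model output."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_score))

y_true = np.array([1, -1, 1])
y_score = np.array([0.8, -2.0, -0.3])  # the last prediction has the wrong sign
print(hinge_loss(y_true, y_score))     # (0.2 + 0.0 + 1.3) / 3 = 0.5
```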
Squared hinge loss
The hinge loss function has many extensions, often the subject of investigation with SVM models.
A popular extension is called the squared hinge loss, which simply calculates the square of the hinge loss score. It has the effect of smoothing the surface of the error function and making it numerically easier to work with.
If using a hinge loss does result in better performance on a given binary classification problem, it is likely that a squared hinge loss may be appropriate.
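The squared variant only squares the per-example hinge term; a sketch under the same assumptions as above:

```python
import numpy as np

def squared_hinge_loss(y_true, y_score):
    """Mean squared hinge loss; y_true in {-1, 1}."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_score) ** 2)

print(squared_hinge_loss(np.array([1, -1, 1]), np.array([0.8, -2.0, -0.3])))
# (0.2**2 + 0.0 + 1.3**2) / 3 ≈ 0.577
```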
There are other classification loss functions as well, such as:
- Focal loss
- Logistic loss
- Exponential loss
Regression loss
1. Mean Squared Error, Quadratic Loss, L2 Loss
It is the most commonly used regression loss function. As the name suggests, mean squared error is measured as the average of the squared differences between predictions and actual observations. It is only concerned with the average magnitude of the error, irrespective of its direction. However, due to the squaring, predictions that are far away from the actual values are penalized heavily in comparison with less deviated predictions.
Consider a plot of the MSE where the true target value is 100 and the predicted values range from -10,000 to 10,000: the MSE loss (y-axis) reaches its minimum value at a prediction (x-axis) of 100. The range of the loss is 0 to ∞.
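In symbols, MSE = (1/n) Σ (y_i − ŷ_i)². A minimal NumPy sketch with toy numbers of my own:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([100.0, 100.0])
print(mse(y_true, np.array([110.0, 90.0])))   # 100.0
print(mse(y_true, np.array([200.0, 100.0])))  # 5000.0: one far-off prediction dominates
```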
2. Mean Absolute Error, L1 Loss
It is another loss function used for regression models. MAE is the average of the absolute differences between our target and predicted variables, so it measures the average magnitude of the errors in a set of predictions without considering their direction. (If we considered direction as well, we would get the Mean Bias Error (MBE), the mean of the residuals/errors.) The range is also 0 to ∞.
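In symbols, MAE = (1/n) Σ |y_i − ŷ_i|. A matching sketch, with the MBE variant that simply keeps the sign:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors, direction ignored."""
    return np.mean(np.abs(y_true - y_pred))

def mbe(y_true, y_pred):
    """Mean bias error: keeps the sign, so opposite errors can cancel out."""
    return np.mean(y_true - y_pred)

y_true = np.array([100.0, 100.0])
y_pred = np.array([110.0, 90.0])
print(mae(y_true, y_pred))  # 10.0
print(mbe(y_true, y_pred))  # 0.0: the -10 and +10 errors cancel
```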
For a comparison of these two functions, when to use which, and more, see the references at the end of this article.
3. Huber or Smooth Mean Absolute Error
Huber loss is less sensitive to outliers in the data than the squared error loss, and it is also differentiable at 0. It is basically absolute error, which becomes quadratic when the error is small. How small that error has to be to make it quadratic depends on a hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large values).
One big problem with using MAE for training neural nets is its constantly large gradient, which can lead to missing the minimum at the end of training with gradient descent. For MSE, the gradient decreases as the loss approaches its minimum, making the final steps more precise.
Huber loss can be really helpful in such cases, as it curves around the minimum, which decreases the gradient, and it is more robust to outliers than MSE. It therefore combines good properties of both MSE and MAE. However, the problem with Huber loss is that we might need to tune the hyperparameter delta, which is an iterative process.
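A sketch of the Huber loss with a tunable delta (the default value and the toy data are my own choices):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small errors (|error| <= delta), linear for large ones."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(small, squared, linear))

y_true = np.array([100.0, 100.0, 100.0])
y_pred = np.array([100.5, 99.0, 150.0])  # the last point behaves like an outlier
print(huber_loss(y_true, y_pred, delta=1.0))
# the outlier contributes ~49.5 instead of growing quadratically as it would under MSE
```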
4. Log-Cosh Loss
Log-cosh is another function used in regression tasks that’s smoother than L2. Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.
More detail on log-cosh can be found in the references below.
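A sketch of log-cosh; the logaddexp rewrite is just one way to keep cosh from overflowing on large errors:

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Mean of log(cosh(error)): close to error**2 / 2 for small errors,
    close to |error| - log(2) for large ones."""
    error = y_pred - y_true
    # log(cosh(e)) = log(exp(e) + exp(-e)) - log(2), computed stably with logaddexp
    return np.mean(np.logaddexp(error, -error) - np.log(2.0))

print(log_cosh_loss(np.array([100.0, 100.0]), np.array([100.1, 150.0])))
# the small error behaves like MSE / 2, the large one grows roughly like MAE
```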
5. Quantile Loss
Quantile loss functions turn out to be useful when we are interested in predicting an interval instead of only a point prediction. The prediction interval from least squares regression is based on the assumption that the residuals (y − y_hat) have constant variance across the values of the independent variables. We cannot trust a linear regression model that violates this assumption. Nor can we simply throw away the idea of fitting a linear regression model as a baseline by claiming that such situations would always be better modeled with non-linear functions or tree-based models. This is where quantile loss and quantile regression come to the rescue, because regression based on quantile loss provides sensible prediction intervals even for residuals with non-constant variance or a non-normal distribution.
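A sketch of the quantile (pinball) loss, where gamma is the target quantile; under- and over-predictions are weighted differently:

```python
import numpy as np

def quantile_loss(y_true, y_pred, gamma=0.9):
    """Pinball loss for quantile gamma: under-predictions are weighted by gamma,
    over-predictions by (1 - gamma)."""
    error = y_true - y_pred
    return np.mean(np.maximum(gamma * error, (gamma - 1.0) * error))

y_true = np.array([100.0, 100.0])
y_pred = np.array([90.0, 105.0])           # one under-, one over-prediction
print(quantile_loss(y_true, y_pred, 0.9))  # (0.9 * 10 + 0.1 * 5) / 2 = 4.75
print(quantile_loss(y_true, y_pred, 0.5))  # (0.5 * 10 + 0.5 * 5) / 2 = 3.75, i.e. MAE / 2
```

Fitting the same model twice, once with gamma = 0.1 and once with gamma = 0.9, gives a lower and an upper bound, which is how quantile regression produces a prediction interval.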
Conclusion
Comparison Study
All the loss functions discussed above, shown together in a single plot.
References
[1] Machine Learning cơ bản, Bài 10: Logistic Regression (in Vietnamese: Basic Machine Learning, Lesson 10: Logistic Regression)
[2] Common Loss functions in machine learning
[3] 5 Regression Loss Functions All Machine Learners Should Know
[4] Difference Between Classification and Regression in Machine Learning
[5] ML Cheatsheet Loss Functions
[6] How to Choose Loss Functions When Training Deep Learning Neural Networks