# Softmax Activation Function

In neural network models that predict a multinomial probability distribution, the softmax function is used as the activation function in the output layer. Softmax is the standard choice for multi-class classification problems, where each example must be assigned to one of more than two class labels.

By definition, the softmax activation outputs one value for each node in the output layer. The outputs are probabilities (or can be interpreted as such), and they sum to 1.0.
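As a minimal sketch (the function name and sample scores are illustrative, not from the original), softmax can be computed with NumPy by exponentiating each score and dividing by the sum of exponentials:

```python
import numpy as np

def softmax(x):
    # subtract the max score for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

# hypothetical raw output-layer scores for three classes
scores = np.array([1.0, 3.0, 2.0])
probs = softmax(scores)
print(probs)        # one probability per class
print(probs.sum())  # the probabilities sum to 1.0
```

Subtracting the maximum score before exponentiating is a common trick to avoid overflow for large inputs; it does not change the resulting probabilities.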

Before modelling a multi-class classification problem, the data must be prepared. The class labels in the target variable are first label encoded: each class label is assigned an integer from 0 to N-1, where N is the number of class labels.

The label-encoded (or integer-encoded) target variables are then one-hot encoded. Like the softmax output, this is a probabilistic representation of the class label. Each class label is assigned a position in a vector; all values are set to 0 (impossible), and a 1 (certain) marks the position of the example's class label.

For example, three class labels will be integer encoded as 0, 1, and 2. Then encoded to vectors as follows:

- Class 0: [1, 0, 0]
- Class 1: [0, 1, 0]
- Class 2: [0, 0, 1]

This is called a one-hot encoding.
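The encoding above can be sketched in NumPy by indexing an identity matrix with the integer-encoded labels (the variable names here are illustrative):

```python
import numpy as np

labels = np.array([0, 1, 2])  # integer-encoded class labels
num_classes = 3

# row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels produces one one-hot row per example
one_hot = np.eye(num_classes)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Libraries such as scikit-learn (`OneHotEncoder`) or Keras (`to_categorical`) provide the same transformation as a ready-made utility.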

Under supervised learning, the target represents the expected multinomial probability distribution for each class and is used to correct the model.

The softmax function returns a probability of class membership for each class label, and training encourages these probabilities to approximate the expected target for a given input as closely as possible.

For example, if class 1 was intended for one example, the target vector would be:

[0, 1, 0]

The softmax output might look like this, with class 1 receiving the most weight and the other classes receiving less:

[0.09003057, 0.66524096, 0.24472847]
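To turn such a probability vector into a single predicted class label, a common step (not stated in the original, but standard practice) is to take the argmax, i.e. the index of the largest probability:

```python
import numpy as np

# softmax output for three classes, with class 1 receiving the most weight
probs = np.array([0.09003057, 0.66524096, 0.24472847])

# the predicted class is the index of the highest probability
predicted_class = int(np.argmax(probs))
print(predicted_class)  # 1
```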

Cross-entropy is frequently used to calculate the error between the expected and predicted multinomial probability distributions, and this error is then used to update the model. This is known as the cross-entropy loss function.
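A minimal sketch of the cross-entropy calculation for a single example, using the one-hot target and softmax output from above (the function name and the epsilon clipping are illustrative choices):

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    # clip predictions away from 0 and 1 to avoid log(0)
    predicted = np.clip(predicted, eps, 1.0 - eps)
    # sum of -target * log(predicted); with a one-hot target this reduces
    # to -log of the probability assigned to the true class
    return -np.sum(target * np.log(predicted))

target = np.array([0.0, 1.0, 0.0])                        # one-hot: class 1
predicted = np.array([0.09003057, 0.66524096, 0.24472847])  # softmax output
loss = cross_entropy(target, predicted)
print(loss)  # equals -log(0.66524096)
```

Because the target is one-hot, only the predicted probability of the true class contributes to the loss: a confident correct prediction gives a loss near 0, while a confident wrong prediction gives a large loss.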