Derivative of Softmax Loss Function

A softmax classifier maps a score vector \(o\) to class probabilities:

\[ p_j = \frac{\exp(o_j)}{\sum_{k}\exp(o_k)} \]
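
As a minimal sketch (added here, not part of the original post), the softmax above can be computed in NumPy. Subtracting the maximum score before exponentiating is only a numerical-stability trick; it cancels between numerator and denominator and does not change \(p\):

```python
import numpy as np

def softmax(o):
    """Probabilities p_j = exp(o_j) / sum_k exp(o_k) for a score vector o."""
    shifted = o - np.max(o)      # subtract the max only for numerical stability
    exp_o = np.exp(shifted)
    return exp_o / np.sum(exp_o)
```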

It is used together with a cross-entropy loss of the form

\[ L = - \sum_{j} y_j \log p_j \]
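
A small illustrative helper (assumed code, not from the post) that evaluates this loss for a probability vector `p` and a one-hot label vector `y`; the `eps` term only guards against `log(0)`:

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Cross-entropy L = -sum_j y_j * log(p_j); eps guards against log(0)."""
    return -np.sum(y * np.log(p + eps))
```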

where \(o\) is the score vector. We need the derivative of \(L\) with respect to \(o\). First, compute the partial derivative of \(p_j\) with respect to \(o_i\):

\[ \frac{\partial p_j}{\partial o_i} = p_i (1-p_i), \quad i = j \\ \frac{\partial p_j}{\partial o_i} = - p_i p_j, \quad i \ne j \]
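
The two cases combine into the Jacobian matrix \(\mathrm{diag}(p) - p p^\top\). A small added sketch (hypothetical helper name) that builds it:

```python
import numpy as np

def softmax_jacobian(p):
    """Jacobian of softmax: entry (i, j) is p_i*(1 - p_i) when i == j and -p_i*p_j otherwise."""
    return np.diag(p) - np.outer(p, p)
```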

Hence the derivative of the loss with respect to \(o\) is:

\[ \begin{aligned} \frac{\partial L}{\partial o_i} & = - \sum_k y_k \frac{\partial \log p_k}{\partial o_i} \\ & = - \sum_k y_k \frac{1}{p_k} \frac{\partial p_k}{\partial o_i} \\ & = -y_i(1-p_i) - \sum_{k \ne i} y_k \frac{1}{p_k} (-p_k p_i) \\ & = -y_i + y_i p_i + \sum_{k \ne i} y_k p_i \\ & = p_i \Big(\sum_k y_k\Big) - y_i \end{aligned} \]

Since this is a classification problem, \(y\) is a one-hot vector with a single non-zero element equal to 1, so \(\sum_k y_k = 1\) and the expression simplifies to:

\[ \frac{\partial L}{\partial o_i} = p_i - y_i \]
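
As a sanity check (an added sketch, not from the original post), the closed-form gradient \(p - y\) can be compared against a central finite-difference approximation of the loss:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o) + 1e-12))

rng = np.random.default_rng(0)
o = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                           # one-hot label

analytic = softmax(o) - y            # the result derived above: p - y

h = 1e-6
numeric = np.zeros_like(o)
for i in range(len(o)):
    step = np.zeros_like(o)
    step[i] = h
    numeric[i] = (loss(o + step, y) - loss(o - step, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-9 or less)
```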
