Suppose there are m classes in total and the input feature vector x is d-dimensional.
Because the class probabilities must sum to one, the weight vector for one of the classes need not be estimated. Without loss of generality, we thus set w(m) = 0, and the only parameters to be learned are the weight vectors w(i) for i ∈ {1, …, m−1}. For the remainder of the paper, we use w to denote the d(m−1)-dimensional vector of parameters to be learned.
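The claim that one weight vector is redundant can be checked numerically. The sketch below is illustrative only: it assumes the standard softmax form, in which the probability of class i is proportional to exp(w(i)ᵀx), and the helper names (`softmax_probs`, `pin_last_class`) and the example numbers are hypothetical, not taken from the paper. Subtracting w(m) from every weight vector shifts all scores by the same constant, so the class probabilities are unchanged while w(m) becomes 0.

```python
import math

def softmax_probs(weights, x):
    """Class probabilities with p_i proportional to exp(w_i . x) (assumed form)."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in weights]
    top = max(scores)  # subtract the largest score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pin_last_class(weights):
    """Subtract the last class's weight vector from every class, so w(m) = 0."""
    ref = weights[-1]
    return [[wj - rj for wj, rj in zip(w, ref)] for w in weights]

# Hypothetical example: m = 3 classes, d = 2 features.
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.2]]
x = [1.5, -0.4]

W_pinned = pin_last_class(W)
p_full = softmax_probs(W, x)
p_pinned = softmax_probs(W_pinned, x)

# Stack the m-1 free weight vectors into one d(m-1)-dimensional vector w.
w = [wj for row in W_pinned[:-1] for wj in row]
```

Here `p_full` and `p_pinned` agree up to floating-point rounding, and `w` has d(m−1) = 4 entries, matching the parameter count stated above.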
For ordinary softmax regression (also known as multinomial logistic regression, MLR), the probability that x belongs to class i is written as: