2025.12.22 15:00
what is the softmax function, how does it work, and why is its formula like that?

Every now and then I need to remind myself what the softmax function is, why its formula is the way it is, and how temperature works in that formula. The Wikipedia article doesn't explain it in a way I understand, so I'm writing it down here, so that in the future I can come back, read it, and understand.

Sometimes I have a list of things - for example delicacies - and I rate how much I like each one. For instance, I have four delicacies: fried onion, pudding, apple, and garlic sandwich. I give them ratings: 3, 2, 1, 4. Now I want to be able to say: if someone puts these four things on the table in front of me, what is the probability that I'll take the first, second, third, or fourth one? Probabilities of mutually exclusive events must sum to 1, and these numbers sum to... let's check... 3 + 2 + 1 + 4 = 10... they sum to 10. So I'll divide all the numbers by 10. I get: 0.3, 0.2, 0.1, 0.4. Done. The formula I used is:

formula 1: b[i] = a[i] / sigma(j=1 to n) a[j]

Notice that with this formula nothing changes if I make all the ratings ten times (or however many times I want) larger. You know, like whether I rate on the Polish school scale (grades from 1 to 5) or on some other scale. The list 3, 2, 1, 4 and the list 30, 20, 10, 40 will give the same result after passing through formula 1 - the numbers are larger, but I also divide them by a larger sum.
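
This is easy to check with a quick Python sketch (the function name is mine, just for illustration):

# Formula 1: divide each rating by the sum of all ratings.
def naive_probabilities(ratings):
    total = sum(ratings)
    return [r / total for r in ratings]

print(naive_probabilities([3, 2, 1, 4]))      # [0.3, 0.2, 0.1, 0.4]
print(naive_probabilities([30, 20, 10, 40]))  # [0.3, 0.2, 0.1, 0.4] - same result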

Now a harder case: let's assume that when rating delicacies I sometimes assign negative ratings. For example, suppose I assigned these ratings: 3, 0, -2, -1. If I now try to apply formula (1), I'll have a problem, because now the sum comes out to... 3 + 0 - 2 - 1 = 0... it comes out to zero. I don't know how to divide by zero. And even if the sum didn't come out to zero, some probabilities would still come out negative, and I don't know the physical meaning of negative probabilities.

To solve this problem, before applying formula 1 I need to add an additional step - transforming the numbers so that all of them are greater than zero, while still preserving their order (if one is greater than another before the transformation, let it be greater after the transformation too). I know one simple function that is increasing but always positive: exponentiation, that is, raising some base to the power x, where the base is a number greater than 1. For example 10^x. Look, if I pass the list 3, 0, -2, -1 through the function 10^x, I get: 1000, 1, 1/100, 1/10. Now I have normal, positive numbers that I can feed into formula 1.

So now the formula I use to convert ratings into probabilities consists of two steps: first ten to the power x, and then dividing each number by the sum of all.

formula 2:
b[i] = 10 ^ a[i]
c[i] = b[i] / sigma(j=1 to n) b[j]
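
Here's a minimal sketch of formula 2 in Python (again, the names are just mine), applied to the list with the negative ratings:

# Formula 2: raise the base to each rating, then divide by the sum.
def probabilities_base(ratings, base=10):
    powered = [base ** r for r in ratings]
    total = sum(powered)
    return [p / total for p in powered]

print(probabilities_base([3, 0, -2, -1]))
# roughly [0.999, 0.001, 0.00001, 0.0001] - all positive and summing to 1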

But let's see what happens if I try to pass that list we talked about at the beginning - 3, 2, 1, 4 - through formula 2. Will it come out the same as from formula 1?

After step 1 I get: 1000, 100, 10, 10000. The sum is 1000 + 100 + 10 + 10000 = 11110. After the division I have: 0.09000900090009, 0.009000900090009, 0.0009000900090009, 0.9000900090009. Alright, let's not clown around with so many decimal places; we have approximately: 0.09, 0.009, 0.0009, 0.9. It sums to 1, but these are different probabilities than came out of the first formula. All the probabilities came out smaller than from the first formula, except for the garlic sandwich, which had the highest rating and now got an even higher probability. Well yes, that's a side effect of this first step: it boosts the contrast between things rated low and things rated high. Whether this is bad or good I don't know - it really depends on the specific application - but that's how it is. If I wanted a formula without this effect, one that gives reasonable results on a list containing negative numbers but, applied to a list without negative numbers, gives the same result as formula (1), then I don't know if that's possible; I haven't thought it through. Maybe it's not even possible? But somehow in practical applications (in machine learning) it doesn't bother anyone, and certainly not me.

You can influence this contrast-boosting effect by choosing the exponentiation base. If instead of 10^x I did 5^x, the boost would be smaller: the list 3, 2, 1, 4 would become the list 125, 25, 5, 625, which after dividing by the sum gives the probabilities 0.160, 0.032, 0.006, 0.801 (see how the garlic sandwich now doesn't have 0.9 but 0.8?). And if I did 1^x there would be no boost at all - quite the opposite, the first step of formula (2) would flatten the list so that all its elements become equal - the list would become 1, 1, 1, 1, so I'd compute the probability of each delicacy as ¼. And if I took a fractional base, I'd overdo it in the other direction: low ratings would become high. For example, with base ½ the list 3, 2, 1, 4 would become the list 1/8, 1/4, 1/2, 1/16. See how now the garlic sandwich has the lowest value? So it's no surprise that after dividing by the sum we get probabilities where the garlic sandwich is still lowest: 0.133, 0.267, 0.533, 0.067. At one point I thought that maybe somewhere between 1 and 10 there exists a base that would give the same probabilities as formula (1), but no. Or more precisely, such a base exists, but it's different for each list of numbers. In any case, professionals use the base "e" (you know, that mathematical constant, 2.71828), because they say it has some useful properties.
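
The whole base comparison from this paragraph fits in a few lines of Python (just a sketch reproducing the numbers above):

# Compare the contrast-boosting effect of different bases on the same ratings.
ratings = [3, 2, 1, 4]
for base in [10, 5, 1, 0.5]:
    powered = [base ** r for r in ratings]
    total = sum(powered)
    print(base, [round(p / total, 3) for p in powered])

# 10   [0.09, 0.009, 0.001, 0.9]
# 5    [0.16, 0.032, 0.006, 0.801]
# 1    [0.25, 0.25, 0.25, 0.25]
# 0.5  [0.133, 0.267, 0.533, 0.067]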

This formula (2) has yet another interesting difference compared to formula (1): this time it does matter whether I assign the ratings 3, 2, 1, 4 or the ratings 30, 20, 10, 40. Look: let's assume (to keep the calculations simple) that there are two ratings, 1 and 2, and the base is 10. After passing through step 1 of formula (2) we get the numbers 10 and 100. But if I gave the ratings 10 and 20, then after passing through step 1 of formula (2) we get the numbers 10000000000 and 100000000000000000000. And that's not the same thing at all: ten is one tenth of a hundred, but 10000000000 is only one ten-billionth of 100000000000000000000. So see: with the second formula we can alternatively (instead of changing the exponentiation base) boost or flatten the contrast by multiplying or dividing all the numbers by something. And that's what's done in practice: the base is always "e", but before the first step we add one more step: dividing each number on the list by a certain parameter called temperature. It's called temperature because when it's larger, randomness increases in a sense: the contrast between ratings decreases, so when we pick an element using the probabilities that come out at the end, the chances grow that we'll get not the most probable element but some other one.
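
A quick sketch checking that claim about the ratings 1, 2 versus 10, 20 (base 10, function name mine):

# Scaling all the ratings up boosts the contrast between them.
def base10_probabilities(ratings):
    powered = [10 ** r for r in ratings]
    total = sum(powered)
    return [p / total for p in powered]

print(base10_probabilities([1, 2]))    # [0.0909..., 0.909...]
print(base10_probabilities([10, 20]))  # roughly [1e-10, 0.9999999999] - much sharper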

So altogether the formula for softmax consists of three steps - it looks like this:

formula 3:
b[i] = a[i] / temperature
c[i] = e ^ b[i]
d[i] = c[i] / sigma(j=1 to n) c[j]

For example, for numbers 3, 2, 1, 4 and temperature equal to 1, arrays a, b, c and d look like this:

a: [3, 2, 1, 4]
b: [3, 2, 1, 4]
c: [20.09, 7.39, 2.72, 54.60]
d: [0.237, 0.087, 0.032, 0.644]

Where array a is the input data, arrays b and c are intermediate calculations, and array d is the final result.

And for temperature 2, arrays a, b, c and d look like this:

a: [3, 2, 1, 4]
b: [1.5, 1, 0.5, 2]
c: [4.48, 2.72, 1.65, 7.39]
d: [0.276, 0.167, 0.102, 0.455]

Which can be calculated with this code:

import math

a = [3, 2, 1, 4]  # input ratings
temperature = 1   # try 2 to see the flattened version

b = [x / temperature for x in a]  # step 1: divide by the temperature
c = [math.e ** x for x in b]      # step 2: exponentiate with base e
sum_c = sum(c)
d = [x / sum_c for x in c]        # step 3: divide by the sum so it all sums to 1

print(f"a: {a}")
print(f"b: {[round(x, 2) for x in b]}")
print(f"c: {[round(x, 2) for x in c]}")
print(f"d: {[round(x, 3) for x in d]}")


comments:
2025.12.24 02:39 P.

Hm, but if I were inventing this from scratch - how to pick one thing out of many based on scores, in a way that handles negative ratings and lets you smoothly adjust the contrast between the chances of things rated high and things rated low - I would do it differently, more simply. The problem with negative numbers I would solve by first shifting all the ratings up by just enough that the lowest rating becomes 0. And then I would adjust the contrast by shifting all the ratings up a bit, all by the same amount - the more I shift, the more I reduce the contrast. So my formula would be:

formula 4:
b[i] = a[i] - min(a)
c[i] = b[i] + contrast_smoothing_parameter
d[i] = c[i] / sigma(j=1 to n) c[j]

But this formula would have, it seems to me, one serious flaw: I could easily reduce the contrast by increasing contrast_smoothing_parameter, but increasing the contrast wouldn't be easy. What would I do, use a negative value for contrast_smoothing_parameter? Then some ratings would drop below zero. True, I could then clamp to zero the ratings that drop below zero, but that would introduce an ugly nonlinearity. So maybe the softmax that exists is good after all.
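
A minimal Python sketch of this formula 4 (variable names are just illustrative):

def shifted_probabilities(ratings, contrast_smoothing_parameter):
    # step 1: shift so the lowest rating becomes 0
    b = [r - min(ratings) for r in ratings]
    # step 2: shift everything up by the smoothing parameter
    c = [x + contrast_smoothing_parameter for x in b]
    # step 3: divide by the sum
    total = sum(c)
    return [x / total for x in c]

print(shifted_probabilities([3, 0, -2, -1], 1))  # works with negative ratings
print(shifted_probabilities([3, 2, 1, 4], 0))    # [0.333, 0.167, 0.0, 0.5] - lowest item gets zero
print(shifted_probabilities([3, 2, 1, 4], 10))   # flattened toward 1/4 each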

