Reading the variational autoencoder chapter from the "Understanding Deep Learning" book (which is available for free!). Not trivial, which is why I never got around to learning it, I guess. There are a lot of moving math parts to figure out. One of them is called "the reparametrization trick". So what is it about?
Let's say we have a distribution over a variable \(x\) parametrized by \(\theta\), \(P(x;\theta)\), and we also have some differentiable function \(f(x)\). We want to find \(\theta^*\) that maximizes the expectation of \(f\) over this distribution:
\[\theta^* = \arg\max_{\theta}\, \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\right]\]
For simplicity, we assume everything is 1-d. Nowadays the solution is always gradient descent (or ascent in this case), so as a first step we'd like to calculate
\[\frac{\partial}{\partial\theta}\, \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\right]\]
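Question: how do you differentiate an expectation over a distribution? Well, the expectation is defined as an integral, so what we want is
\[\frac{\partial}{\partial\theta} \int f(x)\, P(x;\theta)\, dx\]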
Can we switch the derivative and the integral? I've never been able to answer this confidently. It depends on whether the conditions of Leibniz's integral rule hold. Let's assume we can, and we get
\[\int f(x)\, \frac{\partial}{\partial\theta} P(x;\theta)\, dx\]
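In practice this integral is almost never tractable, so we want to estimate it. We can use the following trick, based on \(\frac{\partial}{\partial\theta}P = P\,\frac{\partial}{\partial\theta}\log P\), to convert it to an expectation and then use sampling to estimate it:
\[\int f(x)\, \frac{\partial}{\partial\theta} P(x;\theta)\, dx = \int f(x)\, P(x;\theta)\, \frac{\partial}{\partial\theta}\log P(x;\theta)\, dx = \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\, \frac{\partial}{\partial\theta}\log P(x;\theta)\right]\]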
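The expression \(\frac{\partial}{\partial \theta}\log{P(x;\theta)}\) is called the score function and is often denoted \(s(x;\theta)\). To sum up what we did so far, we found the following relation
\[\frac{\partial}{\partial\theta}\, \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\right] = \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\, s(x;\theta)\right]\]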
which we can go on to estimate with sampling.
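To make the sampling step concrete, here is a minimal NumPy sketch of the score-function estimator: draw samples from \(P(x;\theta)\) and average \(f(x)\) times the score. The specific choices are assumptions made purely for illustration and are not from the book: \(f(x)=x^2\), \(x\sim\mathcal N(\mu,\sigma^2)\), and differentiation with respect to \(\mu\) only, for which the score is \((x-\mu)/\sigma^2\) and the exact answer is \(2\mu\).

```python
import numpy as np

def f(x):
    # Example objective, chosen only for illustration: f(x) = x^2
    return x ** 2

def score_function_gradient(mu, sigma, n_samples, rng):
    """Estimate d/dmu E[f(x)] for x ~ N(mu, sigma^2) by averaging f(x) * score."""
    x = rng.normal(loc=mu, scale=sigma, size=n_samples)
    # Score w.r.t. mu for a Gaussian: d/dmu log P(x; mu, sigma) = (x - mu) / sigma^2
    score = (x - mu) / sigma ** 2
    return np.mean(f(x) * score)

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
estimate = score_function_gradient(mu, sigma, n_samples=200_000, rng=rng)
exact = 2 * mu  # because E[x^2] = mu^2 + sigma^2
print(f"score-function estimate: {estimate:.3f}, exact: {exact:.3f}")
```

Note that this estimator never needs the derivative of \(f\), only its values at the samples.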
Well, the reparametrization trick takes another approach. Let's say we already estimated the expectation using sampling, i.e. we have \(N\) samples \(x_i\sim P(x;\theta)\) (abuse of notation, but I mean sampled according to this distribution). Then we can approximate the expectation with
\[\mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\]
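What if we tried to differentiate that with respect to \(\theta\)? Let's try using the chain rule:
\[\frac{\partial}{\partial\theta}\, \frac{1}{N}\sum_{i=1}^{N} f(x_i) = \frac{1}{N}\sum_{i=1}^{N} f'(x_i)\, \frac{\partial x_i}{\partial\theta}\]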
What is \(\frac{\partial x_i}{\partial \theta}\)? \(x_i\) depends on \(\theta\), but in a stochastic manner. I don't know how to do this kind of derivative, though I know there's a thing called stochastic calculus. What else can we do? Well, we can consider a special case: what if our random variable \(x\) satisfied an identity, for some function \(g\) that is differentiable in \(\theta\),
\[x = g(\varepsilon;\theta)\]
where \(\varepsilon\) has some distribution that doesn't depend on \(\theta\)? For example, let's say \(x\sim \mathcal N(\mu, \sigma^2)\) is normally distributed, where \(\theta=(\mu,\sigma)\). Another way to write \(x\) would be
\[x = \mu + \sigma\varepsilon\]
where \(\varepsilon\sim\mathcal N(0,1)\). In this case the stochasticity of \(x\) doesn't depend on the parameters and
\[\frac{\partial x}{\partial\mu} = 1, \qquad \frac{\partial x}{\partial\sigma} = \varepsilon\]
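In general, writing each sample as \(x_i = g(\varepsilon_i;\theta)\), we would have
\[\frac{\partial x_i}{\partial\theta} = \frac{\partial g(\varepsilon_i;\theta)}{\partial\theta}\]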
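Putting it all together, we can estimate the gradient by sampling \(\varepsilon_i\) \(N\) times from its \(\theta\)-independent distribution, setting \(x_i = g(\varepsilon_i;\theta)\), and calculating
\[\frac{\partial}{\partial\theta}\, \mathbb{E}_{x\sim P(x;\theta)}\left[f(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} f'(x_i)\, \frac{\partial g(\varepsilon_i;\theta)}{\partial\theta}\]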
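As a sanity check, here is a minimal NumPy sketch of the reparametrized estimator for the Gaussian example, where \(x = \mu + \sigma\varepsilon\). The objective \(f(x)=x^2\) (so \(f'(x)=2x\)) is an assumption chosen only for illustration, since then \(\mathbb{E}[f(x)]=\mu^2+\sigma^2\) and the exact gradients are \(2\mu\) and \(2\sigma\). The function name `reparam_gradient` is likewise made up for this sketch.

```python
import numpy as np

def f_prime(x):
    # Derivative of the illustrative objective f(x) = x^2
    return 2 * x

def reparam_gradient(mu, sigma, n_samples, rng):
    """Estimate (d/dmu, d/dsigma) of E[f(x)] for x = mu + sigma * eps, eps ~ N(0, 1)."""
    # The noise is sampled independently of the parameters
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps
    # Chain rule per sample: f'(x_i) * dx_i/dtheta, with dx/dmu = 1 and dx/dsigma = eps
    grad_mu = np.mean(f_prime(x) * 1.0)
    grad_sigma = np.mean(f_prime(x) * eps)
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
grad_mu, grad_sigma = reparam_gradient(mu, sigma, n_samples=200_000, rng=rng)
print(f"d/dmu:    estimate {grad_mu:.3f}, exact {2 * mu:.3f}")
print(f"d/dsigma: estimate {grad_sigma:.3f}, exact {2 * sigma:.3f}")
```

Unlike the score-function approach, this estimator needs \(f'\), which is why \(f\) has to be differentiable. In an autodiff framework you would typically not write the chain rule by hand; you would compute \(x\) from \(\varepsilon\) and let backpropagation handle the rest.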