Anshul Samar

Variational Machine Translation


In this post, I detail a learning algorithm I worked on in the Stanford Deep Learning Group. It is based on the variational auto-encoder (VAE); see my previous post to learn more about VAEs.

Say we wish to do machine translation in an encoder-decoder setting.

We have three datasets in hand: a corpus of the source language S, a corpus of the target language T, and a parallel text translating source to target, P.

Here, I model translation as source-language \(\rightarrow\) hidden-space \(\rightarrow\) target-language. The hidden space of the underlying representation (the z variables) should stay the same regardless of which language we use. Intuitively, it represents the thoughts, ideas, or reasoning underlying language. In English, I say “cat” and in Hindi, “billi”, but the underlying representation is invariant.

For any input sentence or word, we should be able to use the hidden variable z to both regenerate our original data and translate to the target language.

The hope is that by mapping the source to the target using a parallel text - along with reconstructing monolingual texts - we generalize better to new, unseen data. If successful, this would be beneficial for translating languages for which parallel texts are not as readily available, but monolingual corpora are.
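To make the setup concrete, here is a minimal sketch of how the per-language encoders and decoders might be wired up, written in PyTorch. Everything in it is an illustrative assumption rather than a description of the actual implementation: the names (`Encoder`, `Decoder`, `enc_S`, `dec_S`, ...), the GRU architecture, and the choice of a diagonal Gaussian posterior parameterized by a mean and log-variance.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a sentence (a batch of token ids) to the parameters of q_phi(z|x)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, latent_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        _, h = self.rnn(self.embed(tokens))       # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)   # mean and log-variance of q(z|x)

class Decoder(nn.Module):
    """Scores a sentence given z: token-level logits for P_theta(x|z)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, latent_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(latent_dim, hidden_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, z, tokens):                 # z: (batch, latent_dim)
        h0 = torch.tanh(self.init_h(z)).unsqueeze(0)
        out, _ = self.rnn(self.embed(tokens), h0)
        return self.out(out)                      # (batch, seq_len, vocab_size)

# One encoder/decoder pair per language; both operate on the same latent space.
# enc_S, dec_S = Encoder(...), Decoder(...)      # source language
# enc_T, dec_T = Encoder(...), Decoder(...)      # target language
```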

We model sentences in the source language as generated by:

  1. \(z^{(i)} \sim P_{\theta}(z) \)
  2. \(x_S^{(i)} \sim P_{\theta_S}(x|z) \)

We model sentences in the target language as generated by:

  1. \(z^{(i)} \sim P_{\theta}(z) \)
  2. \(x_T^{(i)} \sim P_{\theta_T}(x|z) \)

We model sentences in our parallel corpus by:

  1. \(z^{(i)} \sim P_{\theta}(z) \)
  2. \(x_T^{(i)} \sim P_{\theta_T}(x|z) \)
  3. \(x_S^{(i)} \sim P_{\theta_S}(x|z) \)
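As a sanity check on the parallel-corpus generative story, here is a sketch of ancestral sampling: draw a single \(z\) and decode it with both language decoders. It assumes a standard normal prior for \(P_\theta(z)\) (a common VAE choice, not stated above) and reuses the hypothetical `Decoder` interface from the sketch earlier.

```python
import torch

def sample_sentence(dec, z, max_len, bos_id):
    """Sample tokens one at a time from P_theta(x|z) using the decoder."""
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = dec(z, tokens)                      # (1, t, vocab_size)
        probs = logits[:, -1].softmax(dim=-1)
        next_id = torch.multinomial(probs, 1)        # sample the next token
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens

def sample_parallel_pair(dec_S, dec_T, latent_dim, max_len, bos_id):
    """One z, two decoders: a sampled (source, target) pair from the model."""
    z = torch.randn(1, latent_dim)                   # z ~ P_theta(z), standard normal
    return (sample_sentence(dec_S, z, max_len, bos_id),
            sample_sentence(dec_T, z, max_len, bos_id))
```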

Let \(q_{\phi_S}(z|x)\) be the approximate posterior for our source language and \(q_{\phi_T}(z|x)\) the approximate posterior for the target language.

We derive the ELBO as before:

\[\begin{align*} \log p(x_S) &= \log \int_z P_{\theta_S}(x|z)P_\theta(z) \\ &= \log \int_z \frac{P_{\theta_S}(x|z)P_\theta(z)}{q_{\phi_S}(z|x)} q_{\phi_S}(z|x) \\ &\geq E_{q_{\phi_S}(z|x)}\left[\log \frac{P_{\theta_S}(x|z)P_\theta(z)}{q_{\phi_S}(z|x)}\right] \\ &= -D_{KL}(q_{\phi_S}(z|x)||P_\theta(z)) + E_{q_{\phi_S}(z|x)}[\log P_{\theta_S}(x|z)] \end{align*}\]

Similarly, \(\log p(x_T) \geq -D_{KL}(q_{\phi_T}(z|x)||P_\theta(z)) + E_{q_{\phi_T}(z|x)}[\log P_{\theta_T}(x|z)]\).
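For intuition, here is a sketch of the corresponding training loss for a single monolingual batch. It assumes, as in a standard VAE, a diagonal Gaussian posterior, a standard normal prior, one reparameterized sample to approximate the expectation, and teacher-forced cross-entropy for the reconstruction term; `enc` and `dec` are the hypothetical modules from the first sketch.

```python
import torch
import torch.nn.functional as F

def monolingual_loss(enc, dec, tokens):
    """Negative monolingual ELBO: KL(q(z|x) || N(0, I)) - E_q[log P(x|z)].
    tokens: (batch, seq_len), assumed to start with a BOS symbol."""
    mu, logvar = enc(tokens)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
    # Closed-form KL between the diagonal Gaussian posterior and a standard normal prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    # Reconstruction: teacher-forced log-likelihood of the sentence under the decoder
    logits = dec(z, tokens[:, :-1])
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          tokens[:, 1:].reshape(-1), reduction='none')
    rec = rec.view(tokens.size(0), -1).sum(dim=1)
    return (kl + rec).mean()       # minimizing this maximizes the ELBO
```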

Now, for our parallel corpus, note that the log likelihood of seeing a pair of sentences \(s,t\) is:

\[\begin{align*} \log p(s,t) &= \log \int_z P_{\theta_S}(s|z)P_{\theta_T}(t|z)P_{\theta}(z) \\ &= \log \int_z P_{\theta_S}(s|z)P_{\theta_T}(t|z)P_{\theta}(z) \frac{q_{\phi_T}(z|t)}{q_{\phi_S}(z|s)} \end{align*}\]

In the last line, we multiply by the ratio \(\frac{q_{\phi_T}(z|t)}{q_{\phi_S}(z|s)}\), which we take to equal 1: for a pair \(s,t\) in the parallel corpus, the two approximate posteriors should define the same distribution over the shared latent space.

Continuing, we have:

\[\begin{align*} \log p(s,t) &= \log \int_z P_{\theta_S}(s|z)P_{\theta_T}(t|z)P_{\theta}(z) \frac{q_{\phi_T}(z|t)}{q_{\phi_S}(z|s)}\frac{q_{\phi_S}(z|s)}{q_{\phi_S}(z|s)} \\ &\geq E_{q_{\phi_S}(z|s)}\left[\log \frac{q_{\phi_T}(z|t)}{q_{\phi_S}(z|s)} + \log P_{\theta_S}(s|z) + \log P_{\theta_T}(t|z) + \log \frac{P_\theta(z)}{q_{\phi_S}(z|s)}\right] \\ &= -D_{KL}(q_{\phi_S}||q_{\phi_T}) + AE(s;\phi,\theta) + TR(s \rightarrow t;\phi,\theta) - D_{KL}(q_{\phi_S}||P_\theta) \end{align*}\]

Here, \(AE\) represents the term \(E_{q_{\phi_S}(z|s)}[\log P_{\theta_S}(s|z)]\). This is the autoencoding term: for a source sentence \(s\), we want to be able to reconstruct it from a latent variable \(z\) drawn from our approximate posterior.

\(TR\) represents the translation term \(E_{q_{\phi_S}(z|s)}[\log P_{\theta_T}(t|z)]\). Here, we use the approximate posterior to infer a latent variable for the input sentence \(s\) and use the target decoder to output the correct translation in the target language.

The first negative KL term encourages the two posteriors to match - this makes intuitive sense, as both the source and target sentences should be generated from the same latent distribution. The second negative KL term acts as regularization towards our prior \(P_\theta\).
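If, as in a standard VAE, both approximate posteriors are taken to be diagonal Gaussians (an assumption the derivation above does not require, but which makes both KL terms tractable), say \(q_{\phi_S}(z|s) = \mathcal{N}(\mu_S, \mathrm{diag}(\sigma_S^2))\) and \(q_{\phi_T}(z|t) = \mathcal{N}(\mu_T, \mathrm{diag}(\sigma_T^2))\), then the first KL term has the closed form

\[ D_{KL}(q_{\phi_S} || q_{\phi_T}) = \frac{1}{2}\sum_{j=1}^{d}\left(\frac{\sigma_{S,j}^2}{\sigma_{T,j}^2} + \frac{(\mu_{T,j}-\mu_{S,j})^2}{\sigma_{T,j}^2} - 1 + \log\frac{\sigma_{T,j}^2}{\sigma_{S,j}^2}\right) \]

and the second reduces to the familiar VAE KL term when the prior is a standard normal.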

We can maximize this bound - equivalently, minimize its negation as a loss - together with the symmetric bound for the pair \(t,s\):

\[-D_{KL}(q_{\phi_T}||q_{\phi_S}) + AE(t;\phi,\theta) + TR(t \rightarrow s;\phi,\theta) - D_{KL}(q_{\phi_T}||P_\theta)\]
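Putting the pieces together, here is a sketch of the combined loss on a parallel batch, i.e. the negation of the two bounds above summed over both directions. It keeps the same assumptions as the earlier sketches (diagonal Gaussian posteriors, standard normal prior, single-sample estimates, hypothetical `enc_S`/`dec_S`/`enc_T`/`dec_T` modules); term weighting, KL annealing, and other practical details are left out.

```python
import torch
import torch.nn.functional as F

def parallel_loss(enc_S, dec_S, enc_T, dec_T, s_tokens, t_tokens):
    """Negation of the two parallel-corpus bounds (s -> t and t -> s), summed."""
    mu_S, logvar_S = enc_S(s_tokens)
    mu_T, logvar_T = enc_T(t_tokens)
    z_S = mu_S + torch.randn_like(mu_S) * torch.exp(0.5 * logvar_S)
    z_T = mu_T + torch.randn_like(mu_T) * torch.exp(0.5 * logvar_T)

    def kl_to_prior(mu, logvar):
        # KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian
        return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

    def kl_between(mu0, logvar0, mu1, logvar1):
        # KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) for diagonal Gaussians
        return 0.5 * torch.sum((logvar0 - logvar1).exp()
                               + (mu1 - mu0).pow(2) / logvar1.exp()
                               - 1 + logvar1 - logvar0, dim=1)

    def nll(dec, z, tokens):
        # Teacher-forced negative log-likelihood of tokens under the decoder
        logits = dec(z, tokens[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1), reduction='none')
        return loss.view(tokens.size(0), -1).sum(dim=1)

    # s -> t bound: -[ -KL(q_S||q_T) + AE(s) + TR(s->t) - KL(q_S||prior) ]
    loss_st = (kl_between(mu_S, logvar_S, mu_T, logvar_T)
               + kl_to_prior(mu_S, logvar_S)
               + nll(dec_S, z_S, s_tokens)        # AE(s)
               + nll(dec_T, z_S, t_tokens))       # TR(s -> t)
    # t -> s bound, symmetrically
    loss_ts = (kl_between(mu_T, logvar_T, mu_S, logvar_S)
               + kl_to_prior(mu_T, logvar_T)
               + nll(dec_T, z_T, t_tokens)        # AE(t)
               + nll(dec_S, z_T, s_tokens))       # TR(t -> s)
    return (loss_st + loss_ts).mean()
```

A full training loop would then mix minibatches of this parallel loss with the two monolingual losses above, one per corpus.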

I have started working on an encoder-decoder machine translation implementation with stochastic gradient variational Bayes (SGVB). Please see this project for more details.

Many thanks to Ziang Xie and Misha Andriluka for their mentorship.