Diffusion Modeling Theory and Implementation for Learning and Generation of Data

Diffusion models have become the state-of-the-art method for many popular generative modeling tasks, including image and speech generation, text-to-image synthesis, inpainting, and image manipulation. Their popularity stems not only from their effectiveness in these tasks, but also from their simpler training procedure relative to the previous state of the art. Generative Adversarial Networks were the dominant generative modeling technique in the years preceding the introduction of Denoising Diffusion Probabilistic Models (DDPMs) in 2020. Compared to diffusion models, they required a notoriously difficult training process, due to their bi-level (min-max) optimization structure, and also exhibited suboptimal performance with respect to multiple perceptual quality metrics. As a result, much attention has since focused on improving the diffusion model and on crafting versions dedicated to the tasks mentioned above.

In this article, we will focus on the two main (and equivalent) diffusion model formulations, the DDPM and the Score-Based Generative Model (SBGM), and will organize the article into the following five sections. Section I covers the theoretical background necessary to understand the architecture of diffusion models, while offering a visual perspective on what the model does and why. This introductory section focuses on the SBGM variant, as it offers an easier-to-understand visual parallel to the DDPM. Section II shifts the focus to the DDPM variant, and builds upon the theory of Section I by introducing and explaining the mathematical equations used by the architecture. Section III covers the implementation of the DDPM model in an object-oriented program, and explains the functionality offered by the different classes. Section IV then introduces the training procedure, where we train our model on MNIST digits and evaluate its performance. Section V collects the references.

Section I: Background - A Visual Perspective

The first diffusion model was introduced in 2015, its name meant to highlight the parallel with the physical diffusion of particles. To understand diffusion in the context of image generation, imagine the space of all equally-sized pixel images, where each point in the space corresponds to a specific image. In this space, we can find one element which corresponds exactly to some specific picture. If we were to add Gaussian noise to it in small doses, the picture would eventually diffuse into pure Gaussian noise, and come to resemble an element from the set of typical (in the information-theoretic sense) images of this space. It is important to note that the original picture is not an element of this typical set, which means that sampling from our space will essentially never yield our original picture. The diffused picture, however, is an element of the typical set, and is therefore a possible observation when sampling from this space. The idea behind these models is to learn how a specific type of picture diffuses into Gaussian noise, such that the reverse procedure, namely organizing the random (sampled) noise into the original image, can be replicated.

Before explaining how a diffusion model is able to learn such transformations, we first need to dive deeper into how gaussian noise can be organized into specific pictures. Let us assume for a moment that the probability distribution, denoted by p(x), describing the statistics of specific pictures is known, and we are interested in sampling a picture x from this distribution. How do we do this?

Diagram by [5]

One might first think to perform gradient ascent on the given (log-)distribution, as depicted by the path on the left, in order to reach a high-probability sample. The issue with this approach is the deterministic nature of the iterations, which means we will end up at the same high-probability samples every time. As a solution, we instead inject a fresh sample of Gaussian noise at each step and superimpose it onto our gradient, slightly randomizing the direction taken at every iteration. Though this approach may seem like a post-hoc heuristic, we can in fact guarantee that any starting point which undergoes many of these steps (as long as they are kept small) will approach a true sample from our distribution, thanks to the asymptotic result proven in [1]. This iterative process is a discretized form of Langevin dynamics.
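
Concretely, writing ε for the (small) step size, each iteration takes the form analyzed in [1]:

        $$ x_{t+1} = x_t + \frac{\epsilon}{2}\,\nabla_x \log p(x_t) + \sqrt{\epsilon}\,z_t, \qquad z_t \sim \mathcal{N}(0, I). $$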

Using the procedure above, we now know how to take any sampled Gaussian random noise and convert it into a sample (i.e. a picture), given a distribution in which our target pictures are typical. In practice, however, such a distribution is not analytically available, so in order to execute the above iterative procedure, we must find a way to access the gradient of the log-probability distribution (also known as the score).

In order to gain access to the score, the idea is to utilize a neural network which learns to approximate it from our working data set of pictures. Such a neural network can learn through a loss function as simple as the squared loss, as shown below.
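
Writing s_θ(x) for the network's estimate of the score ∇ₓ log p(x), this squared loss reads:

        $$ J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\!\left[\, \big\| s_\theta(x) - \nabla_x \log p(x) \big\|^2 \,\right]. $$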

At first glance, the squared loss might seem impossible to implement, due to the explicit appearance of the unknown distribution p(x) within the argument of the expectation. However, Aapo Hyvärinen proved in [2] an equivalent reformulation of the squared loss whose expectation argument is expressed fully in terms of the output of our neural network, thus allowing us to learn the score using only the pictures in our data set. We can furthermore interpret the reformulated loss function as an objective to be minimized under a regularization. The squared norm tells us that the gradient of log p(x) at our training samples should be zero, meaning that our training pictures should be the centers of high-density areas in our probability distribution, while the term within the trace acts as a regularizer preventing our gradient from degenerating to zero everywhere.
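
For reference, the reformulation proven in [2] is, up to an additive constant independent of θ:

        $$ J(\theta) = \mathbb{E}_{p(x)}\!\left[\, \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\,\big\| s_\theta(x) \big\|^2 \,\right] + \text{const}. $$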

The most important thing to be aware of when interpreting our loss function is that the score is only learned in areas where we have data available during training. You can further see that our loss is weighted by p(x), meaning that low-density areas are largely ignored, and our predicted score there will, as a result, be highly inaccurate.

Diagram by [4]

Since we start our denoising procedure by sampling from a normal distribution, we are overwhelmingly likely to begin in one of these inaccurate regions. With essentially no data to inform useful scores there, our gradient ascent steps would wander aimlessly through the space. So how do we resolve this?

The solution is to enlarge our training set by adding noise perturbations of different intensities to our pictures, thereby increasing the regions of accuracy in this space. We organize the noise-addition process by taking all of our original pictures and, for each picture, adding ‘M’ different intensities of Gaussian noise. To keep the illustration simple, let M = 3.

Diagram by [5]

Recall that we are interested in the reverse process (i.e. denoising), so we begin by taking our training data and adding the strongest intensity of noise, rendering our pictures mostly pure noise. As a result, this density space will contain little information about our original pictures; however, it will lead to trained scores which, although somewhat inaccurate, serve as usable starting points for our first point sampled from Gaussian noise. As you can see in the diagram above, the calculated score is then fed into our Langevin dynamics equation, which, after ‘N’ steps, is guaranteed to lead to a sample stemming from this noisy probability distribution. Note that within each Langevin dynamics step, the current iterate is fed through the neural network in order to approximate the necessary gradient. Upon completing all ‘N’ steps, we end the first step in our denoising process, and continue by taking our original data set, adding the next (slightly lower) intensity of Gaussian noise onto it, and executing our ‘N’ neural network forward passes and Langevin dynamics steps.

After ‘M’ such denoising stages, we finally attain a typical picture stemming from a probability distribution which was unknown to us, and which we recovered using only a pre-selected set of pictures.
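
To make the full procedure concrete, below is a minimal sketch of this annealed sampling loop. It assumes a trained score network score_net(x, sigma) and noise levels ordered from strongest to weakest; the names and the step-size scaling are illustrative choices, not fixed by the theory above.

        import torch

        def annealed_langevin_sampling(score_net, noise_levels, N, eps0, shape):
            # Run N Langevin steps at each of the M noise levels (strongest first).
            x = torch.randn(shape)                        # start from pure Gaussian noise
            for sigma in noise_levels:
                # a common choice: scale the step size with the noise level (an assumption)
                eps = eps0 * (sigma / noise_levels[-1]) ** 2
                for _ in range(N):
                    z = torch.randn_like(x)               # fresh Gaussian noise each step
                    x = x + 0.5 * eps * score_net(x, sigma) + eps ** 0.5 * z
            return x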


Before we end this section of the article, we will introduce some slight additional specifications to the SBGM variant we have described so far. These modifications will both lay the groundwork for implementing our specific diffusion model for image generation in Section IV, and offer a parallel, equivalent perspective on how our DDPM variant will operate in Section II. The modifications concern our loss function and probability density function. Previously, we showed how our squared loss can be written so that the argument of the expectation is expressed only in terms of the output of our neural network. We find that we can achieve greater effectiveness with lower complexity for our specific image generation task if we presuppose a specific density structure, namely a Gaussian density function. The modifications and their ramifications are shown below.
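
Sketching the result: if each training picture x is perturbed as x̃ = x + σε with ε ~ N(0, I), the score of the perturbed Gaussian conditional is available in closed form, and the loss at noise level σ becomes the denoising score matching objective

        $$ J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p,\; \tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)}\!\left[\, \Big\| s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \Big\|^2 \,\right], $$

since ∇_x̃ log N(x̃; x, σ²I) = −(x̃ − x)/σ². This is one standard way to write the modification; the exact weighting across noise levels varies by implementation.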

Section II: The Variational Inference Perspective

As described in the previous section, we start our denoising procedure by sampling an image from pure Gaussian noise, and performing (partially randomized) gradient ascent through a sequence of probability densities corrupted by multiple, iteratively decreasing noise levels, until we reach a sample which is typical with respect to our target probability density function. In order to motivate the mathematical equations which describe this procedure, and which will be implemented in our programming section, we begin by modelling this process with the following Markov chain.
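
With x = z₀ the clean image and z_M pure noise, these take the standard form (cf. [3]):

        $$ \max_\theta \sum_i \log p_\theta\big(x^{(i)}\big), \qquad p_\theta(x) = \int p(z_M)\,\prod_{m=1}^{M} p_\theta(z_{m-1} \mid z_m)\, dz_{1:M}, \qquad z_0 := x, $$

        $$ p_\theta(z_{m-1} \mid z_m) = \mathcal{N}\big(z_{m-1};\; \mu_\theta(z_m, m),\; \sigma_m^2 I\big). $$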

The original objective shown above coincides with our aim of maximizing (over some learned variables) the probability of observing the pictures within our data set. We add the logarithm because it simplifies our future calculations, as we are working with presupposed Gaussian density functions, and the summation follows from our assumption that the training samples are statistically independent. As stated in the previous section, the underlying density function is unknown, so we resort to the latent variable model shown above, which allows us to model the (to-be-learned) transformations that take a sample of zero-mean, unit-variance Gaussian noise and iteratively transform (i.e. denoise) it into our original image. As can be seen from the last definition above, each step within the Markov chain follows from a neural network predicting the parameters of some Gaussian, from which we then sample the next latent variable in the chain.

To highlight the parallels between the SBGM variant and the newly described DDPM variant, notice that each step in the Markov chain corresponds to ‘N’ Langevin dynamics steps. These ‘N’ steps were designed to produce a sample from a distribution, and we can see here that the next latent variable is itself a sample from the same distribution. In fact, the same noise levels (previously used to enlarge the region of reliable score retrieval) appear at the same steps of the chain. The main explicit difference between SBGM and DDPM lies in what the neural network predicts. In SBGM, we predicted the gradient of our Gaussian distribution evaluated at the current point, using this point and predicted gradient to reach a sample guaranteed to come from this Gaussian distribution, which serves as the next point in the denoising procedure. In DDPM, we use the current point to predict the mean of the Gaussian which (under the corresponding noise level) yields the sample serving as the next point in our denoising Markov chain. Though the two predictions differ, the purpose they serve is exactly the same.

Since we will be implementing the DDPM variant, we must now ask ourselves: how will this neural network learn? We cannot define a loss function as easily as in the previous section, since we have no training data telling us the means of the Gaussians from which these latent variables come, nor a clever reformulation to use in the fashion of Hyvärinen. What we do have is our original objective, which is clearly intractable; however, if we reformulate it by inserting our latent variable model into the objective function, a solution presents itself: variational inference.

Variational inference is a methodology for approximating intractable probabilities and evidence within complex statistical models made up of observed and unobserved (i.e. unknown or latent) variables, for the purpose of making future predictions or inferring further statistical properties. In variational methods, we try to find a distribution q within a family of tractable distributions Q which approximates the intractable (posterior of the) distribution p. In variational Bayesian methods, this is accomplished by maximizing a lower bound on our objective via a well-known analytical expression known as the Evidence Lower Bound (ELBO). Since our main goal is to find the parameters of the neural network which yield the mean needed for each denoising step, we will optimize this ELBO with respect to both q (as this ensures our lower bound is tight) and our neural network parameters θ. The objective, its lower bound, and the simplification used for deriving our loss are shown below, fitted to our Markov latent chain model.
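
Fitted to our Markov chain, the bound and its simplification read:

        $$ \log p_\theta(x) \;\geq\; \mathbb{E}_{q(z_{1:M}\mid x)}\!\left[ \log \frac{p_\theta(x, z_{1:M})}{q(z_{1:M}\mid x)} \right] $$

        $$ = \mathbb{E}_q\big[\log p_\theta(x \mid z_1)\big] \;-\; \sum_{m=2}^{M} \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big( q(z_{m-1}\mid z_m, x) \,\big\|\, p_\theta(z_{m-1}\mid z_m) \big) \Big] \;-\; D_{\mathrm{KL}}\big( q(z_M \mid x) \,\big\|\, p(z_M) \big). \tag{1} $$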

We can see from the first inequality that, for this lower bound to be as tight as possible, the q distribution which we vary must be as close as possible to the posterior distribution of p, ideally resulting in a Kullback-Leibler divergence of zero. Since we have predefined the Markov chain distributions to be normal, and since the posterior factorizes along this Markov chain, we can forgo having to search for q within some greater family of distributions and can immediately ascertain that q must also be a normal distribution. Generally, when doing variational inference, the approximate posterior q has parameters of its own which often need to be learned (as is done in variational autoencoders). However, diffusion models offer nice flexibility with negligible loss in expressiveness when also forgoing this learning process and instead specifying q's parameters analytically as constant hyperparameters. This pre-specification of q's distribution type and parameter expressions was also recommended by the original architects of the DDPM model in their 2020 paper [3], where they stated that the gains in expressive ability come instead from a long chain of slowly-changing Gaussian conditionals, which allow the corresponding q-conditionals to fit them more easily.

Let us now recall that in our DDPM variant, we need a neural network with parameters θ to estimate the mean of the Gaussian distribution modelling each denoising step within our Markov chain. It turns out that when fixing our distribution q and maximizing our ELBO, we arrive at a loss function which enforces exactly that.
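
Writing μ̃_m for the mean of the reverse-direction conditional q(z_{m−1} | z_m, x₀) (derived below), the per-step KL terms and the resulting loss take the form:

        $$ \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big( q(z_{m-1}\mid z_m, x_0) \,\big\|\, p_\theta(z_{m-1}\mid z_m) \big) \Big] = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_m^2}\,\big\| \tilde{\mu}_m(z_m, x_0) - \mu_\theta(z_m, m) \big\|^2 \right] + \text{const}, \tag{2} $$

        $$ \mathcal{L}(\theta) = \sum_{m=1}^{M} \mathbb{E}_q\!\left[ \frac{1}{2\sigma_m^2}\,\big\| \tilde{\mu}_m(z_m, x_0) - \mu_\theta(z_m, m) \big\|^2 \right] + \text{const}. \tag{3} $$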

This simplified expression comes from the fact that all of our distributions are assumed to be Gaussian, so taking the KL-divergence yields a difference of squares, plus a constant independent of θ. Notice we have a supervised learning setting, where our neural network can use the noise scale and corrupted image at position ‘m’ in our Markov chain to learn the mean shown above (which may or may not be a function of some or all of those arguments). At this point, the mean of our q distribution, which the neural network attempts to approximate, has not yet been derived. We assume, however, that a closed-form expression for this μ is attainable via some combination of the arguments it calls. To find one, we will use the following specification for q, recommended by the authors in [3].
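
        $$ q(z_m \mid z_{m-1}) = \mathcal{N}\big(z_m;\; \sqrt{1-\beta_m}\, z_{m-1},\; \beta_m I\big), \qquad \alpha_m := 1-\beta_m, \quad \bar{\alpha}_m := \prod_{i=1}^{m} \alpha_i, $$

where β₁, …, β_M are small, fixed hyperparameters (the noise schedule).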

Let us recall that in variational inference, for our tractable lower bound (ELBO) to equal our original intractable objective, the approximate posterior q must (ideally) equal the posterior p(z|x). This means that our q distribution starts with some data sample x and models the forward (i.e. noising) process along our Markov chain. As can be seen from equation (1), in order to find an expression for this μ (which is so far unknown, but which we believe to be analytically attainable), we must reformulate these Gaussians to model the reverse Markov procedure, and from this new Gaussian extract its mean. To do this, we begin by noting that the above equations make it possible to describe any latent sample at position m directly from our starting image, as shown below.
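
        $$ q(z_m \mid x_0) = \mathcal{N}\big(z_m;\; \sqrt{\bar{\alpha}_m}\, x_0,\; (1-\bar{\alpha}_m)\, I \big). $$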

Notice that the largest noise scale is chosen such that a sample at this noise level is approximately pure Gaussian noise. We need this so that when we generate data, we can take a sample from a zero-mean, unit-variance Gaussian and denoise it into a typical picture from our target distribution. Finally, we can now give an expression for the conditional q distribution we have been looking for, and from it extract an expression for the μ which our neural network will learn.
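
        $$ q(z_{m-1} \mid z_m, x_0) = \mathcal{N}\big(z_{m-1};\; \tilde{\mu}_m(z_m, x_0),\; \tilde{\beta}_m I\big), \qquad \tilde{\beta}_m = \frac{1-\bar{\alpha}_{m-1}}{1-\bar{\alpha}_m}\,\beta_m, $$

        $$ \tilde{\mu}_m(z_m, x_0) = \frac{\sqrt{\bar{\alpha}_{m-1}}\,\beta_m}{1-\bar{\alpha}_m}\, x_0 \;+\; \frac{\sqrt{\alpha_m}\,\big(1-\bar{\alpha}_{m-1}\big)}{1-\bar{\alpha}_m}\, z_m. \tag{4} $$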

With this expression for μ in hand, a glaring issue arises. As can be seen from the arguments of our neural network in equation (2), the variables available to the network during its forward pass are an intermediate latent variable z and the corresponding noise scale at that point in the Markov chain. We can see that in this supervised learning setting, the “label” it attempts to learn is also a function of our target image x. In the original DDPM paper, the authors stated that in one of their earlier experimental models they formulated a loss function forcing the neural network to learn the clean target picture x, and that this led to poorer results. This may be because pictures exhibit many complex properties which are difficult for a neural network to learn, compared to variables with much simpler structure (such as zero-mean, unit-variance noise). Our loss function incurs similar burdens on the neural network, since our target μ is also a function of the clean image.

To remedy this issue and increase the effectiveness of our model, we use the definition of the Gaussian q(zm | x0) to parameterize a sample from this distribution, as shown in the following equation (5). Rearranging this equation allows us to reparameterize our neural network: instead of estimating the target image x directly from some latent representation z, it predicts the noise ε.
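
In equations: the parameterized sample, its rearrangement, and the network-based estimate read

        $$ z_m = \sqrt{\bar{\alpha}_m}\, x_0 + \sqrt{1-\bar{\alpha}_m}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{5} $$

        $$ x_0 = \frac{1}{\sqrt{\bar{\alpha}_m}}\Big( z_m - \sqrt{1-\bar{\alpha}_m}\,\epsilon \Big), \tag{6} $$

        $$ \hat{x}_0(z_m, m) = \frac{1}{\sqrt{\bar{\alpha}_m}}\Big( z_m - \sqrt{1-\bar{\alpha}_m}\,\epsilon_\theta(z_m, m) \Big). \tag{7} $$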

Another way to view the effectiveness of this reparameterization is that a neural network f which used only the given latent variable and noise level to predict the clean target image would additionally have to learn the transformation shown above which acts upon epsilon. Instead, we opt to learn the epsilon itself, which is why we denote it with a subscript marked by the parameters of the neural network. As a result of this reparameterization, we adopt the following simple, final loss function.
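
        $$ \mathcal{L}_{\text{simple}}(\theta) = \sum_{m=1}^{M} C'\, \mathbb{E}_{x_0,\,\epsilon}\!\left[\, \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_m}\, x_0 + \sqrt{1-\bar{\alpha}_m}\,\epsilon,\; m \big) \big\|^2 \,\right]. $$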

Often C’ = 1 is used in practice, though a different value expressed as a function of the noise scales yields the exact ELBO.

Note that to achieve this simplified form, we simply define μθ to take the same expression as μ̃ in equation (4), where the equations for μθ and μ̃ now differ only in their value of x0. We then plug the value of x0 from equation (6) into μ̃, plug the estimated value of x0 from equation (7) into μθ, and finally substitute these expressions for μθ and μ̃ into our previous loss function (3).

To recap how our DDPM model works, as a refresher of Section II before the programming implementation in Section III: we began by noting that maximizing our main objective is intractable, but found that our denoising steps can be formulated as a latent-variable Markov chain, which lends itself to tractable treatment via variational inference. Instead of having to learn the approximate posterior q along with its parameters, we found it effective and simplest to define it as a Gaussian with constant hyperparameters, enforced to adopt the same reverse conditional structure as the Markov chain. This simplified our ELBO to a squared-error loss, which we then optimize only with respect to the model parameters θ. This supervised learning scenario required the mean of the Gaussian in the reverse direction, while our approximate posterior q pointed in the forward direction. After deriving q for the reverse direction, we found that the expression our neural network effectively had to learn was rather complex; a reparameterization showed that the network could instead predict a much simpler variable, namely the Gaussian noise used to corrupt an image. This simplified our loss function into the final form shown at the end of this section.

Now that we have derived all of the mathematical equations governing the behavior of our model, it will be very easy to understand how our model will learn and generate data.

  • Firstly, to train, we will take an MNIST digit from our training set; randomly pick a target link in our Markov chain; sample some zero-mean, unit-variance Gaussian noise; superimpose this noise onto our training image, as is done in equation (5); feed this corrupted image into our neural network, which is designed to predict the noise that corrupted the clean, original image; calculate the loss between the true noise and the predicted noise; and finally, backpropagate this error through the neural network to update our network parameters.

  • Secondly, to generate MNIST digits, we simply sample some zero-mean, unit-variance Gaussian noise and send this through the reverse direction of our Markov chain. This will iteratively denoise it for ‘M’ steps, and yield a new version of one of the ten MNIST digits.

Section III: Programming Implementation

Having completed our understanding of how our model learns and generates data, we must choose a neural network to assist our DDPM model. Since generating reasonable samples with diffusion requires more advanced architectures than a simple MLP or CNN stack, we will experiment with a simple convolutional ResNet, and a small U-Net variant. If you would like to try executing my code on GitHub, the ResNet offers a more computationally feasible alternative to the U-Net, and can train within a few minutes. If you have more time on your hands, and have a GPU available, then I suggest setting the ‘type’ parameter, shown below, to “unet” instead.

To implement the DDPM model, we create the following class.

        class DDPM(nn.Module):
            def __init__(self, M, type, hidden_dim, num_layers):
                super().__init__()
                self.M = M                # number of links in the Markov chain
                self.type = type          # "resnet" or "unet"
                self.sigma = ...          # noise scale for every link
                self.alpha_bar = ...      # cumulative products of the noise scales

            def loss(self, x0):                                     # Def. 1
                ...                                                 # -> simplified_loss(x0, m, epsilon)

            def simplified_loss(self, x0, m, epsilon):              # Def. 2
                ...                                                 # -> loss_float

            def sample_zm_previous(self, x0_reconstructed, zm, m):  # Def. 3
                ...                                                 # -> zm_prev

            def estimate_x0(self, zm, m, predicted_noise):          # Def. 4
                ...                                                 # -> x0_reconstructed

            def sample(self, batch_size):                           # Def. 5
                ...                                                 # -> x0_new

We can see that whenever this class is instantiated, it initializes the above four attributes (along with the initialization parameters of the parent class), which are required to describe our denoising Markov chain. ‘M’ is an integer giving the number of denoising steps in the Markov chain. The string ‘type’ declares whether our DDPM is assisted in its predictions by a ResNet or a U-Net. ‘sigma’ and ‘alpha_bar’ are vectors, the first containing the noise scale for every link in the Markov chain, and the second containing the corresponding products of noise scales as previously defined. The final components missing from our denoising chain are, of course, the conditional Gaussians which control each step of the process; these live within definition 3 of our class.

Now to highlight the class definitions:

  • Def. 1: This is the loss function we use during training. Of the mentioned training steps at the end of section II, this loss function is responsible simply for choosing a link within our Markov chain by sampling a random integer between 1 and M, and then sampling the random Gaussian noise used to corrupt the clean image from our batch.

  • Def. 2: This function continues the above process by taking the random integer ‘m’ and random noise ‘ε‘ sampled by the previous function, and adding the noise onto our image exactly as done in equation (5), so that we reach the corrupted version ‘zm’ of our image which corresponds exactly to the noisy image present at Markov link ‘m’. The function then feeds ‘zm’ into our ResNet/U-Net to predict the noise which was added to the clean image, and takes the squared difference between the true and predicted noise. This is done for every image in our minibatch, and the squared differences are summed to return our loss. (A sketch of Defs. 1–4 follows this list.)

  • Def. 3: This function is used post-training for image generation. It describes the conditional Gaussian which links each step in the Markov chain, where each link differs by its variance (i.e. noise scale) and mean (which ultimately is a function of the predicted noise by the neural network). Each denoising step in the Markov chain is attained by sampling from this Gaussian.

  • Def. 4: This function is used post-training for image generation. It is an auxiliary function to def. 3 which helps calculate the mean used by the conditional denoising Gaussian for each step in the Markov chain by using the predicted noise calculated by the neural network, and constructing an approximate clean image exactly as done in equation (7). This approximate clean image is then used by def. 3 to calculate the mean it needs for the corresponding denoising step, exactly as done in equation (4).
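
Putting these together, here is a minimal sketch of how Defs. 1–4 might be implemented. The linear beta schedule, the self.network attribute (the ResNet/U-Net), and its (zm, m) call signature are illustrative assumptions; the actual repository code may differ in detail.

        import torch
        import torch.nn as nn

        class DDPM(nn.Module):
            def __init__(self, M, network):
                super().__init__()
                self.M = M
                self.network = network                      # noise predictor, assumed signature network(zm, m)
                beta = torch.linspace(1e-4, 0.02, M)        # assumed linear noise schedule
                alpha = 1.0 - beta
                self.register_buffer("beta", beta)
                self.register_buffer("alpha", alpha)
                self.register_buffer("alpha_bar", torch.cumprod(alpha, dim=0))
                self.register_buffer("sigma", beta.sqrt())  # one common choice of reverse variance

            def loss(self, x0):                             # Def. 1
                m = torch.randint(0, self.M, (x0.shape[0],))  # random link in the chain
                epsilon = torch.randn_like(x0)                # the corrupting noise
                return self.simplified_loss(x0, m, epsilon)

            def simplified_loss(self, x0, m, epsilon):      # Def. 2
                ab = self.alpha_bar[m].view(-1, 1, 1, 1)
                zm = ab.sqrt() * x0 + (1 - ab).sqrt() * epsilon   # corrupt, as in equation (5)
                predicted_noise = self.network(zm, m)
                return ((epsilon - predicted_noise) ** 2).sum()

            def estimate_x0(self, zm, m, predicted_noise):  # Def. 4, equation (7)
                ab = self.alpha_bar[m]
                return (zm - (1 - ab).sqrt() * predicted_noise) / ab.sqrt()

            def sample_zm_previous(self, x0_reconstructed, zm, m):  # Def. 3, mean as in equation (4)
                ab = self.alpha_bar[m]
                ab_prev = self.alpha_bar[m - 1] if m > 0 else torch.tensor(1.0)
                mean = (ab_prev.sqrt() * self.beta[m] * x0_reconstructed
                        + self.alpha[m].sqrt() * (1 - ab_prev) * zm) / (1 - ab)
                noise = torch.randn_like(zm) if m > 0 else torch.zeros_like(zm)
                return mean + self.sigma[m] * noise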

We will review the final definition 5 in more explicit detail, as it brings together many of the aforementioned functions to complete the entire post-training denoising procedure.

        def sample(self, batch_size):
            zm = torch.randn(batch_size, 1, 28, 28)        # start from pure Gaussian noise
            for m in reversed(range(self.M)):
                predicted_noise = self.network(zm, m)      # ResNet/U-Net forward pass
                x0_rec = self.estimate_x0(zm, m, predicted_noise)
                zm = self.sample_zm_previous(x0_rec, zm, m)
            x0_rec = self.estimate_x0(zm, 0, self.network(zm, 0))  # final clean-image estimate
            return x0_rec

Given our trained model, we generate ‘batch_size’ many samples by calling this function. It begins by initializing ‘zm’ with pure noise sampled from a normal distribution of the given dimensions, since we are interested in generating 28×28 MNIST digits. We then begin the first step of our denoising Markov chain by sending this ‘zm’ through the ResNet/U-Net, which predicts the most likely realization of the Gaussian noise which, when added to some highly corrupted structure typical of MNIST digits, would have produced this zero-mean, unit-variance sample. The next functions prepare the variables needed to construct the parameters of the first conditional Gaussian link in the Markov chain. Once this conditional Gaussian is constructed, the next variable in the chain is generated by sampling from it (via sample_zm_previous). This process is repeated for all steps in the Markov chain, until we finally reach an image resembling one of the ten MNIST digits.

Section IV: Program Execution

Now that we have prepared our DDPM model along with an assisting neural network, we are ready to train the model and evaluate its performance. To do this, we initialize our DDPM class and optimizer class, while setting the following training parameters.

        max_epochs = 20

        ddpm = DDPM(M=1000, type="resnet", hidden_dim=16, num_layers=2)
        opt = torch.optim.Adam(ddpm.parameters())

        for epoch in range(max_epochs):
            for x0, _ in train_loader:    # a standard MNIST DataLoader (assumed defined)
                loss = ddpm.loss(x0)      # Def. 1: corrupt a random link and predict the noise
                opt.zero_grad()
                loss.backward()
                opt.step()

Executing yields the following output.

        >>  Epoch 0
        >>     loss = 793.75
        >>     loss = 91.97
        >>     loss = 61.39
        >>     loss = 69.72
        >>     loss = 55.77
       ....
        >>  Epoch 20
        >>     loss = 29.48
        >>     loss = 28.83
        >>     loss = 27.44
        >>     loss = 28.92
        >>     loss = 27.76

And executing the following code,

        samples = ddpm.sample(4)
        visualize_mnist_samples(samples)

yields the following digits: 6, 3, 7, 4

Section V: References

[1]: Chapter 3 in: https://www.di.ens.fr/appstat/spring-2022/lecture_notes/SGLD.pdf

[2]: Hyvärinen, Aapo. "Estimation of non-normalized statistical models by score matching." Journal of Machine Learning Research 6 (2005): 695-709. https://www.jmlr.org/papers/v6/hyvarinen05a.html

[3]: Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851. https://arxiv.org/abs/2006.11239

[4]: Song, Yang. "Generative Modeling by Estimating Gradients of the Data Distribution." Blog post, 2021. https://yang-song.net/blog/2021/score/

[5]: S. Günnemann, Advanced Machine Learning: Deep Generative Models. Technische Universität München, 2023.

Link to GitHub

To download the Python files and test the model for yourself, click here.
