This post builds on the ideas presented in the video and the accompanying blog post for the paper.

TL;DR: A score-based model is built to estimate the data distribution $p(x)$ by using the score function. For any input point in the data space, the model predicts its score, which indicates the direction to higher probability (understanding the score function’s meaning is key). Inference uses Langevin dynamics sampling, iteratively applying the model (ideally many times) to move the input towards higher-probability areas, resulting in a final sample from the highest-probability area.

Recall that we still have a problem with the normalizing constant $Z_\theta$.

Score function

Instead of modeling the density function directly, we model the score function. The score function of a distribution $p(x)$ is defined as

$$ \nabla_x \log p(x). $$

Our model aims to estimate this quantity by approximating

$$ s_\theta(x) \approx \nabla_x \log p(x), $$

an approach known as a score-based model. If we write the density in the energy-based form with normalizing constant $Z_\theta$, $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$, we can do some math to get

$$ s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0} = -\nabla_x E_\theta(x). $$

Note that $s_\theta(x)$ is independent of the normalizing constant $Z_\theta$, which is highly advantageous.
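As a quick sanity check, here is a minimal sketch (PyTorch, with a hypothetical 1D Gaussian example) showing numerically that the score ignores the normalizing constant: the gradient of the unnormalized log-density equals the gradient of the properly normalized one.

```python
import torch

x = torch.tensor([0.7], requires_grad=True)

def log_p_unnormalized(x):   # log of exp(-x^2 / 2), normalizing constant dropped
    return -0.5 * x ** 2

def log_p_normalized(x):     # proper standard normal log-density
    return -0.5 * x ** 2 - 0.5 * torch.log(torch.tensor(2 * torch.pi))

score_unnorm = torch.autograd.grad(log_p_unnormalized(x).sum(), x)[0]
score_norm = torch.autograd.grad(log_p_normalized(x).sum(), x)[0]
print(score_unnorm, score_norm)  # both equal -x, i.e. -0.7
```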

Score matching

The score function, $\nabla_x \log p(x)$, clearly indicates the direction in which the data probability increases. In other words, for any point in your data space, the score function tells you in which direction you have to move to get closer to actual data points (where the probability increases).


In the case of images, we can start with a random Gaussian image, and the score will tell us in which direction to move to get closer to the image manifold (the target images).

To do that we just train the score-based model to minimize the difference between the model's score and the data distribution's score:

$$ \mathbb{E}_{p(x)}\!\left[ \left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2^2 \right] $$

Note: we use the Euclidean distance to measure how far apart the two vectors are, and this is the same as the L2 norm (Euclidean norm) of their difference, so we can write $\left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2$. We commonly square this distance to (1) ensure positive values and (2) penalize larger differences more, so the final formula is $\left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2^2$. Then we take the expected value to get the average across all points with respect to $p(x)$.

However, we do not know $\nabla_x \log p(x)$ (the data score) or $p(x)$ (the data probability density function). Fortunately, by performing some mathematical derivations (integration by parts), we can eliminate $\nabla_x \log p(x)$ and arrive at the following expression (equivalent up to a constant and scaling that do not affect the minimizer):

$$ \mathbb{E}_{p(x)}\!\left[ \operatorname{tr}\!\left( \nabla_x s_\theta(x) \right) + \frac{1}{2} \left\| s_\theta(x) \right\|_2^2 \right] $$

We can train by minimizing this formula, which pushes both terms down:

  1. $\left\| s_\theta(x) \right\|_2^2 \to 0$ means the score is 0, i.e., we are at a stationary point of the density (a data point).
  2. $\operatorname{tr}\!\left( \nabla_x s_\theta(x) \right)$ being pushed down (negative) means the score decreases as we move away from that point, so the data point should be a local maximum of the density.

Note: $p(x)$ is still required in the formula (inside the expectation), but in practice the expectation is just an average, so we can approximate it using samples from $p(x)$ (our training data).

So now our goal is to learn a model $s_\theta(x)$ that, given some point in our data space, predicts the direction in which we should move to get closer to the data. This is much easier than learning to predict the PDF, since the score function does not need the normalizing constant $Z_\theta$.
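As a concrete illustration, here is a minimal sketch (PyTorch, with a hypothetical `score_model`) of the score-matching objective above. Note the loop that computes the trace of the Jacobian one input dimension at a time:

```python
import torch

def exact_score_matching_loss(score_model, x):
    """E[ tr(grad_x s(x)) + 0.5 * ||s(x)||^2 ], estimated on a batch x of shape (B, D)."""
    x = x.requires_grad_(True)
    s = score_model(x)                                 # (B, D) predicted scores
    norm_term = 0.5 * (s ** 2).sum(dim=1)              # 0.5 * ||s(x)||^2

    trace_term = torch.zeros(x.shape[0], device=x.device)
    for d in range(x.shape[1]):                        # one backprop pass per input dimension
        grad_d = torch.autograd.grad(s[:, d].sum(), x, create_graph=True)[0]
        trace_term = trace_term + grad_d[:, d]         # accumulate d s_d / d x_d

    return (trace_term + norm_term).mean()

# Usage sketch with a toy 2D score model:
score_model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Softplus(), torch.nn.Linear(64, 2))
x = torch.randn(128, 2)
loss = exact_score_matching_loss(score_model, x)
loss.backward()
```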


But we run into new problems:

  1. Expensive training (the $\operatorname{tr}\!\left( \nabla_x s_\theta(x) \right)$ term): when the input space is large, we need to compute the derivative of the score for each input dimension, and to do that we need one backpropagation pass per input variable (as in the loop in the sketch above).
  2. Low coverage of the data space: our model is not trained on inputs from the entire space (of course, we train on some data), so when we get an input from a random position outside the data region we trained on, the model will be inaccurate. In other words, the model doesn’t know the correct direction; the estimated scores are only accurate in high-density regions (where the scores point towards the modes).

Note: why is $\operatorname{tr}\!\left( \nabla_x s_\theta(x) \right)$ computable (but expensive) while the normalizing constant $Z_\theta$ is not computable? Answer: $Z_\theta$ is an integral (the total area under the curve), so we must integrate over the entire input space, which is high-dimensional; a derivative, on the other hand, only needs to be evaluated at a specific point $x$.


Noise Perturbation

To overcome the low coverage of the data space problem, we just add some Gaussian noise to the data points:

$$ \tilde{x} = x + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2 I). $$

Now we denote our PDF as $q_\sigma(\tilde{x})$ (the new noise-perturbed PDF depends on $\sigma$, which is the noise level),

$$ q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p(x)\, dx, \qquad q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I), $$

and we train with the noise-perturbed objective

$$ \mathbb{E}_{q_\sigma(\tilde{x})}\!\left[ \left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2 \right]. $$

Here is what happens when we perturb a mixture of two Gaussians with additional Gaussian noise.
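As a small illustration (NumPy, with made-up mixture parameters), this is all the perturbation does to samples from a two-Gaussian mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a mixture of two 1D Gaussians (hypothetical parameters).
n = 10_000
component = rng.integers(0, 2, size=n)
means = np.where(component == 0, -5.0, 5.0)
x = rng.normal(loc=means, scale=1.0)

# Perturb with additional Gaussian noise of level sigma.
sigma = 2.0
x_tilde = x + sigma * rng.normal(size=n)

print(x.std(), x_tilde.std())  # the perturbed samples spread out and fill the gap between the modes
```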

This improves the accuracy of the model in low-density regions, but we still have a tradeoff here.

If we use a high $\sigma$ (more noise):

  • We cover more low-density regions and get better score estimation there (resolving the low coverage of the data space problem).
  • But it over-corrupts the data and alters it significantly from the original distribution.

If we use a low $\sigma$ (less noise):

  • We still face the low coverage of the data space problem.
  • But we do not over-corrupt the original data.

Connection to Denoising Autoencoders

Vincent (2011) found a connection between score matching and denoising autoencoders that lets us overcome the expensive training problem by reformulating the score-matching objective from

$$ \mathbb{E}_{q_\sigma(\tilde{x})}\!\left[ \left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) - s_\theta(\tilde{x}) \right\|_2^2 \right] $$

to this:

$$ \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\!\left[ \left\| \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) - s_\theta(\tilde{x}) \right\|_2^2 \right] $$

This is huge: normally, in the score-matching objective, we don’t know $q_\sigma(\tilde{x})$, so we can’t compute $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$; we do some math as described above and end up with the expensive training problem. But with this new objective we only need to know $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$, and we do know that, just as in denoising autoencoders!

We just need to find $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)$ by writing $q_\sigma(\tilde{x} \mid x)$ as a multivariate Gaussian, $\mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)$, then compute

$$ \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}, $$

and our new objective is to minimize this:

$$ \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\!\left[ \left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \right\|_2^2 \right]. $$

This is beautiful: $\tilde{x} - x$ is the noise that we add to the input, and the model needs to predict $-\frac{\tilde{x} - x}{\sigma^2}$; in other words, the model needs to predict the direction (score) that points back to the original data point.
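A minimal sketch of this denoising score-matching loss at a single noise level (PyTorch, with a hypothetical `score_model`; `x` is a batch of flattened data of shape (B, D)):

```python
import torch

def dsm_loss(score_model, x, sigma):
    """Denoising score matching at one noise level:
    E || s(x_tilde) + (x_tilde - x) / sigma^2 ||^2."""
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise                 # perturbed data
    target = -(x_tilde - x) / sigma ** 2        # true score of q_sigma(x_tilde | x)
    s = score_model(x_tilde)
    return ((s - target) ** 2).sum(dim=-1).mean()
```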


How to generate a new sample?

After training our score-based model, we can generate a (denoised) sample by iteratively following the predicted directions using Langevin dynamics sampling:

$$ x_{t+1} = x_t + \epsilon\, s_\theta(x_t) + \sqrt{2\epsilon}\; z_t, \qquad z_t \sim \mathcal{N}(0, I). $$

At each step, we update the current sample $x_t$ by adding the product of a small step size $\epsilon$ and the estimated score $s_\theta(x_t)$. We also add the term $\sqrt{2\epsilon}\, z_t$ to introduce noise, which prevents all samples from collapsing to a single point. This way, the sample moves closer to the data manifold.
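A minimal sketch of this sampling loop with a trained `score_model` (PyTorch; the step size and number of steps are made-up values):

```python
import torch

@torch.no_grad()
def langevin_sampling(score_model, shape, n_steps=1000, step_size=1e-4):
    x = torch.randn(shape)                       # start from random noise
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + step_size * score_model(x) + (2 * step_size) ** 0.5 * z
    return x
```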


Multiple Noise Perturbation

Recall that for the low coverage of the data space problem we have a tradeoff when selecting $\sigma$. To achieve the best of both worlds, we use multiple scales of noise perturbation simultaneously.

We start from pre-specified noise levels between $\sigma_1$ and $\sigma_L$ (a total of $L$ scales with increasing standard deviations, $\sigma_1 < \sigma_2 < \dots < \sigma_L$), and now get

$$ \tilde{x} = x + \epsilon_i, \quad \text{where } \epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I). $$

Now $\sigma_i$ denotes the different noise levels, and we feed the noise level to the model as an additional input, $s_\theta(\tilde{x}, \sigma_i)$, to give it more information,

and we jointly train the model for all noise levels:

$$ \sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{p(x)}\, \mathbb{E}_{q_{\sigma_i}(\tilde{x} \mid x)}\!\left[ \left\| s_\theta(\tilde{x}, \sigma_i) + \frac{\tilde{x} - x}{\sigma_i^2} \right\|_2^2 \right], $$

where $\lambda(\sigma_i)$ is a positive weighting function (a common choice is $\lambda(\sigma_i) = \sigma_i^2$).

After the model is trained, we can use the same sampling technique, except that we start with the largest noise level and anneal down to the smallest, since we know that large noise helps the model in low-density regions. This method is called annealed Langevin dynamics, and a model trained on different noise scales like this is called a Noise Conditional Score Network (NCSN).
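A minimal sketch of annealed Langevin dynamics with a noise-conditional score model `score_model(x, sigma)` (hypothetical interface; the noise levels, step sizes, and the step-size schedule that shrinks with $\sigma_i^2$ are made-up choices):

```python
import math
import torch

@torch.no_grad()
def annealed_langevin_sampling(score_model, shape, sigmas, n_steps_each=100, eps=2e-5):
    """sigmas: noise levels sorted from largest to smallest."""
    x = torch.randn(shape) * sigmas[0]                  # start at the largest noise level
    for sigma in sigmas:
        step_size = eps * (sigma / sigmas[-1]) ** 2     # shrink the step as the noise decreases
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + step_size * score_model(x, sigma) + (2 * step_size) ** 0.5 * z
    return x

# Usage sketch: 10 geometrically spaced noise levels from 10.0 down to 0.01 (made-up values).
sigmas = torch.exp(torch.linspace(math.log(10.0), math.log(0.01), 10))
```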


In the first paper, they use 1000 noise scales.

Link to Stochastic Process

What are SDEs?

In Multiple Noise Perturbation we used 1000 scales, but if we want to cover as many noise scales as possible, from very small (like 0.001) to very large, we can no longer write them as discrete scales (i.e., 1, 2, …, 1000). It is much better to turn the noise level into a function of time, $\sigma(t)$, where $t$ is the time that controls how much noise is added to the image (it plays the same role as the scale index). The noisy data $x(t)$ is then known as a stochastic process. How?

  1. We start with our noise-added data: $x(t) = x + \sigma(t)\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
  2. Compare this to the general SDE equation: $dx = f(x, t)\, dt + g(x, t)\, dw$, where $f$ is the drift term, $g$ is the diffusion term, and $w$ is a Wiener process.
  3. You can see that the change in $x$ comes only from adding noise; in other words, the change in $x$ is only influenced by the stochastic term (the drift is zero). We can write the SDE as: $dx = g(x, t)\, dw$.
  4. We know that the stochastic term (the noise we add) only changes with time, so we can remove the input $x$. Now we can write the SDE as: $dx = g(t)\, dw$.
  5. If we have the forward SDE, we can get the reverse SDE: $dx = -g(t)^2\, \nabla_x \log p_t(x)\, dt + g(t)\, d\bar{w}$, where $\bar{w}$ is a reverse-time Wiener process and $p_t$ is the distribution of $x(t)$.

You can see that it uses the score function $\nabla_x \log p_t(x)$ to reverse the process.

Note that many diffusion models add noise differently, so the details in the steps above are not exactly the same across models, but the idea and the steps are.

This method offers several advantages over using multiple discrete noise levels:

  1. Continuous noise handling
  2. Improved generalization
  3. More accurate score estimation
  4. Better sampling through continuous annealing
  5. Flexibility in designing the noise schedule
  6. Closer connection to the underlying data distribution

Summary

Score-function

The score function, by definition, gives you the gradient of the log probability density function at a given point. In other words, it tells you the direction in which the data density increases most steeply. Using just the score function, you can:

  • Guide Sampling:
    By following the gradient directions (e.g., via Langevin dynamics), you can iteratively refine a random noise input into a sample that resembles data drawn from the target distribution.
  • Denoising:
    In tasks like denoising, the score function indicates how to adjust a noisy input to move it toward higher probability regions, effectively “cleaning” the data.
  • Solve Inverse Problems:
    The gradient information can be used to guide optimization processes to recover or reconstruct signals from corrupted or incomplete data.

They can be applied to both audio and images

Example Image Generation

Code for this example.

If we want to generate an image (a new sample), the score function alone only gives a direction at a given point, so we need to do sampling to create a new sample. Before the link to SDEs we used Langevin dynamics sampling, but after the link to SDEs we can use the reverse SDE instead.


Steps:

  1. Define how we perturb our data (add noise to the samples); in this example, Gaussian noise whose scale depends on $t$ is used.
  2. Define the forward SDE from step (1) using the forward SDE formula; in this case we get an SDE of the form $dx = g(t)\, dw$.
  3. Solve the forward SDE to get the conditional probability $p_{0t}(x(t) \mid x(0))$, which will be plugged into the loss function later.
  4. Plug the conditional probability from step (3) into the loss function.
  5. Train the score-based model using the loss from step (4).
  6. After the model is trained, create new samples (images) using the methods described below.

Other diffusion papers (e.g., DDPM, DDIM) design step (1) differently.

New loss function

Recall this loss function:

$$ \mathbb{E}_{p(x)}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)}\!\left[ \left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \right\|_2^2 \right] $$

This loss only uses one noise level $\sigma$. From this section we learned that we need to train with multiple noise levels, and this links to SDEs, so the loss becomes

$$ \mathbb{E}_{t \sim \mathcal{U}(0, T)}\, \mathbb{E}_{p(x(0))}\, \mathbb{E}_{p_{0t}(x(t) \mid x(0))}\!\left[ \lambda(t) \left\| s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) \right\|_2^2 \right] $$

These are the changes (a code sketch of this loss follows the list):

  1. Replace $\tilde{x}$ with the stochastic process $x(t)$: $x(t)$ is the noisy image at time $t$, while $x(0)$ is the original image.
  2. Replace the fixed $\sigma$ with the time $t$: $t$ becomes the parameter controlling the noise level, ranging from 0 (original image) to $T$ (fully noisy image); recall the earlier figure.
  3. Make the noise distribution time-dependent: $q_\sigma(\tilde{x} \mid x)$ becomes $p_{0t}(x(t) \mid x(0))$.
  4. Make the score model time-dependent: $s_\theta(\tilde{x})$ becomes $s_\theta(x(t), t)$.
  5. Average the loss over different times (noise levels): introducing $\mathbb{E}_{t \sim \mathcal{U}(0, T)}$.
  6. Weighting function $\lambda(t)$: added to balance the magnitude of the loss across different noise levels.
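A minimal sketch of this time-dependent loss (PyTorch), assuming a hypothetical `marginal_prob_std(t)` that returns the standard deviation of $p_{0t}(x(t) \mid x(0))$ for the chosen SDE, and a score model `score_model(x, t)`. Here $\lambda(t)$ is taken to be $\sigma(t)^2$, which turns the weighted score error into a simple noise-prediction term:

```python
import torch

def time_dependent_dsm_loss(score_model, x, marginal_prob_std, eps=1e-5):
    """E_t E_x(0) E_x(t)|x(0) [ lambda(t) || s(x(t), t) - grad log p_0t(x(t)|x(0)) ||^2 ]
    with lambda(t) = std(t)^2. x: a batch of flattened data, shape (B, D)."""
    t = torch.rand(x.shape[0], device=x.device) * (1.0 - eps) + eps  # t ~ U(eps, 1)
    std = marginal_prob_std(t)                                       # shape (B,)
    z = torch.randn_like(x)
    x_t = x + std[:, None] * z                                       # sample x(t) | x(0)
    s = score_model(x_t, t)
    # With lambda(t) = std^2, the weighted score error equals || std * s + z ||^2.
    return ((s * std[:, None] + z) ** 2).sum(dim=-1).mean()
```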

Build Time-Dependent Score-Based Model

The model can be any architecture as long as the input and output have the same dimension, but an effective one is the U-Net. Our model ideally produces $s_\theta(x(t), t) \approx \nabla_{x(t)} \log p_t(x(t))$, so the input is $x(t)$ (the image) and $t$ (the time, or noise level). But how can we make the model understand $t$?

Recall from the link-to-SDEs section that $t$ is the time in the stochastic process, and as $t$ increases, more noise is added to the image. To make the model understand time, the easiest way is to just add $t$ to every intermediate layer in the U-Net, but:

  1. we can't encode time as 0, 1, 2, ..., T, since $t$ is continuous in $[0, T]$;
  2. we can't add a single number; we must represent time as a vector instead, since a neural network won't learn much from a single scalar.

To solve these problems, we must create a projection function that maps any (continuous) time $t$ to a higher-dimensional vector, so we can add this vector to the U-Net and it can learn the time information. But what projection function should we use?

To create such a projection function, we do the following:

  1. In order to convert the 1D time to a high-dimensional vector:
    1. Create a fixed random N-dimensional vector $W$.
    2. Multiply any time $t$ with that vector.
    3. We get a vector $Wt$ that represents time $t$; different $t$ will always give different vectors.
  2. Step (1) is just a linear function, which is not enough, since the dependence on time can be complex and nonlinear. We further use Fourier features, which are better at capturing complex functions; it is crucial that they are expressive enough to encode the temporal information effectively.
  3. What I mean is that our projection function returns $[\sin(2\pi W t),\ \cos(2\pi W t)]$, where $Wt$ is the vector from (1). See the sketch after the note below.

Note: Projection means mapping data into a new space.
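A minimal sketch of such a projection following the random-Fourier-features idea above (PyTorch; the embedding dimension and scale are made-up values):

```python
import torch
import torch.nn as nn

class GaussianFourierProjection(nn.Module):
    """Map a scalar time t to a vector [sin(2*pi*W*t), cos(2*pi*W*t)] with a fixed random W."""
    def __init__(self, embed_dim=256, scale=30.0):
        super().__init__()
        # Fixed (not trained) random vector; requires_grad=False keeps it frozen.
        self.W = nn.Parameter(torch.randn(embed_dim // 2) * scale, requires_grad=False)

    def forward(self, t):
        # t has shape (B,); the projection has shape (B, embed_dim).
        t_proj = t[:, None] * self.W[None, :] * 2 * torch.pi
        return torch.cat([torch.sin(t_proj), torch.cos(t_proj)], dim=-1)

# Usage sketch: embed a batch of times, then (after a small MLP) add it to the U-Net feature maps.
embed = GaussianFourierProjection(embed_dim=256)
t = torch.rand(16)
print(embed(t).shape)   # torch.Size([16, 256])
```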

Lastly, we normalize the output of the network, for the reason given here.

Create new sample

Recall that for any SDE of the form

$$ dx = f(x, t)\, dt + g(t)\, dw, $$

the reverse-time SDE is given by

$$ dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}. $$

Since we have chosen a forward SDE with no drift term,

$$ dx = g(t)\, dw, $$

the reverse-time SDE is given by

$$ dx = -g(t)^2\, \nabla_x \log p_t(x)\, dt + g(t)\, d\bar{w}. $$

To sample from our time-dependent score-based model $s_\theta(x, t)$, we first draw a sample from the prior distribution $x(T) \sim p_T$, and then solve the reverse-time SDE with numerical methods.

1. Sampling with Numerical SDE Solvers (Euler–Maruyama)

For an SDE of the form

$$ dx = f(x, t)\, dt + g(t)\, dw, $$

the Euler–Maruyama update rule is

$$ x_{t + \Delta t} = x_t + f(x_t, t)\, \Delta t + g(t)\, \sqrt{\Delta t}\; z_t, $$

where $z_t$ is a sample from a standard normal distribution. In the context of the reverse-time SDE, the method approximates

$$ dx = -g(t)^2\, \nabla_x \log p_t(x)\, dt + g(t)\, d\bar{w} $$

with the discretized iteration:

$$ x_{t - \Delta t} = x_t + g(t)^2\, s_\theta(x_t, t)\, \Delta t + g(t)\, \sqrt{\Delta t}\; z_t, $$

  • $s_\theta(x_t, t)$ is the estimated score function.
  • $g(t)$ is the diffusion coefficient.
  • $\Delta t$ is the (positive) time step.
  • $z_t \sim \mathcal{N}(0, I)$ is Gaussian noise.

We can loop this, stepping backwards from $t = T$ to $t \approx 0$, to generate a new sample.
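A minimal sketch of this loop (PyTorch), assuming a trained `score_model(x, t)`, a diffusion-coefficient function `g(t)`, and a prior standard deviation `prior_std` for $p_T$ (all hypothetical names):

```python
import torch

@torch.no_grad()
def euler_maruyama_sampler(score_model, g, shape, prior_std=1.0, n_steps=500, T=1.0, eps=1e-3):
    x = torch.randn(shape) * prior_std                 # sample from the prior p_T
    time_steps = torch.linspace(T, eps, n_steps)       # integrate backwards from T to ~0
    dt = time_steps[0] - time_steps[1]                 # positive step size
    for t in time_steps:
        t_batch = torch.ones(shape[0]) * t
        drift = g(t) ** 2 * score_model(x, t_batch)    # reverse-SDE drift (f = 0)
        x_mean = x + drift * dt
        x = x_mean + g(t) * dt.sqrt() * torch.randn_like(x)
    return x_mean                                      # return the noise-free mean at the last step
```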

2. Sampling with Predictor-Corrector Methods

We combine a predictor (Euler–Maruyama) with a corrector (Langevin MCMC). Recall that the classical Langevin MCMC update is given by

$$ x_{i+1} = x_i + \epsilon\, \nabla_x \log p(x_i) + \sqrt{2\epsilon}\; z_i, $$

but instead of trying to get the next time step from this, we refine the sample at the current time step and use Euler–Maruyama to get to the next time step, so the corrector update is

$$ x_t \leftarrow x_t + \epsilon\, s_\theta(x_t, t) + \sqrt{2\epsilon}\; z_t. $$

(There are details on how to select the step size $\epsilon$.) After we refine using the corrector, we use the predictor to get to the next time step (see the sketch after the summary below). In summary:

  1. Refines the sample using Langevin MCMC (corrector step) to reduce discretization error.
  2. Propagates the refined sample using an Euler–Maruyama update (predictor step).
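A minimal predictor-corrector sketch combining the two updates above (PyTorch; `score_model`, `g`, the step counts, and the corrector step size are hypothetical choices):

```python
import torch

@torch.no_grad()
def pc_sampler(score_model, g, shape, prior_std=1.0, n_steps=500, n_corrector=1,
               corrector_step=1e-4, T=1.0, eps=1e-3):
    x = torch.randn(shape) * prior_std
    time_steps = torch.linspace(T, eps, n_steps)
    dt = time_steps[0] - time_steps[1]
    for t in time_steps:
        t_batch = torch.ones(shape[0]) * t
        # Corrector: a few Langevin MCMC steps at the current time / noise level.
        for _ in range(n_corrector):
            z = torch.randn_like(x)
            x = x + corrector_step * score_model(x, t_batch) + (2 * corrector_step) ** 0.5 * z
        # Predictor: one Euler-Maruyama step of the reverse-time SDE.
        x_mean = x + g(t) ** 2 * score_model(x, t_batch) * dt
        x = x_mean + g(t) * dt.sqrt() * torch.randn_like(x)
    return x_mean
```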

3. Sampling with Numerical ODE Solvers

In this paper they found that, for any forward SDE of the form

$$ dx = f(x, t)\, dt + g(t)\, dw $$

with reverse SDE

$$ dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}, $$

the reverse SDE has an associated ODE,

$$ dx = \left[ f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x) \right] dt, $$

which is known as the probability flow ODE. Despite being deterministic (i.e., without the random noise term), this ODE has trajectories whose marginal distributions match those of the original SDE. This means that if you solve this ODE from time $T$ to $0$, starting with a sample from $p_T$, you’ll obtain a sample from $p_0$ (typically the data distribution).

Below is a schematic figure showing how trajectories from this probability flow ODE differ from SDE trajectories, while still sampling from the same distribution.

Therefore, we can start from a sample from $p_T$, integrate the ODE in the reverse-time direction, and get a sample from $p_0$. In particular, for the SDE in our running example (no drift term), we can integrate the following ODE from $t = T$ to $t = 0$ for sample generation:

$$ dx = -\frac{1}{2}\, g(t)^2\, s_\theta(x, t)\, dt. $$
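A minimal sketch of integrating this probability flow ODE with a simple fixed-step Euler solver (PyTorch; `score_model`, `g`, and `prior_std` are hypothetical names, and in practice an adaptive black-box ODE solver is typically used instead):

```python
import torch

@torch.no_grad()
def probability_flow_ode_sampler(score_model, g, shape, prior_std=1.0, n_steps=500, T=1.0, eps=1e-3):
    x = torch.randn(shape) * prior_std                 # sample from the prior p_T
    time_steps = torch.linspace(T, eps, n_steps)
    dt = time_steps[0] - time_steps[1]                 # positive step size
    for t in time_steps:
        t_batch = torch.ones(shape[0]) * t
        # dx = -0.5 * g(t)^2 * score dt, integrated backwards in time (no noise term).
        x = x + 0.5 * g(t) ** 2 * score_model(x, t_batch) * dt
    return x
```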