Translate to conditional probability
The multivariate normal distribution is given by the formula
$$\mathcal{N}(x;\,\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right),$$
where:
- $d$ is the dimension of the vector $x$,
- $\mu$ is the mean vector,
- $\Sigma$ is the covariance matrix, and
- $|\Sigma|$ denotes the determinant of $\Sigma$.
In your case, the model is
$$x_t = x_0 + \sigma(t)\,z, \qquad z \sim \mathcal{N}(0, I).$$
This means that conditioned on $x_0$, the random variable $x_t$ is distributed as a multivariate normal with:
- Mean: $\mu = x_0$
- Covariance: $\Sigma = \sigma^2(t)\,I$
Substitute these into the multivariate normal formula:
- Determinant and inverse of the covariance:
Since $\Sigma = \sigma^2(t)\,I$ (a diagonal matrix with each diagonal entry equal to $\sigma^2(t)$):
- The determinant is $|\Sigma| = \left(\sigma^2(t)\right)^d$.
- The inverse is $\Sigma^{-1} = \frac{1}{\sigma^2(t)}\,I$.
- Plug into the formula:
$$p(x_t \mid x_0) = \frac{1}{(2\pi)^{d/2}\left(\sigma^2(t)\right)^{d/2}}\exp\!\left(-\frac{1}{2}(x_t-x_0)^\top \frac{1}{\sigma^2(t)} I\,(x_t-x_0)\right)$$
Simplify:
- The identity matrix doesn’t change the vector, so
$$(x_t-x_0)^\top \frac{1}{\sigma^2(t)} I\,(x_t-x_0) = \frac{(x_t-x_0)^\top(x_t-x_0)}{\sigma^2(t)} = \frac{\|x_t-x_0\|^2}{\sigma^2(t)}.$$
- Thus, the exponent simplifies to $-\frac{\|x_t-x_0\|^2}{2\sigma^2(t)}$.
- Final expression:
$$p(x_t \mid x_0) = \frac{1}{\left(2\pi\sigma^2(t)\right)^{d/2}}\exp\!\left(-\frac{\|x_t-x_0\|^2}{2\sigma^2(t)}\right)$$
In summary, starting from $x_t = x_0 + \sigma(t)\,z$ with $z \sim \mathcal{N}(0, I)$, the conditional probability is
$$p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; x_0,\; \sigma^2(t)\,I\right).$$
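As a numerical sanity check, the isotropic simplification can be compared with the general multivariate normal formula. This is a minimal NumPy sketch; the particular values of $x_0$, $x_t$, and $\sigma(t)$ are arbitrary illustrations.

```python
import numpy as np

d = 3
x0 = np.array([0.5, -1.0, 2.0])   # conditioning point x_0 (illustrative)
xt = np.array([0.7, -0.9, 1.5])   # point at which to evaluate the density
sigma_t = 0.8                     # sigma(t) at some fixed t (illustrative)

diff = xt - x0
Sigma = sigma_t**2 * np.eye(d)

# general multivariate normal density with Sigma = sigma^2(t) I
p_general = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / np.sqrt(
    (2 * np.pi) ** d * np.linalg.det(Sigma)
)

# simplified isotropic form derived above
p_simple = (2 * np.pi * sigma_t**2) ** (-d / 2) * np.exp(
    -np.sum(diff**2) / (2 * sigma_t**2)
)

assert np.isclose(p_general, p_simple)
```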
Why $(x_t - x_0)^\top(x_t - x_0) = \|x_t - x_0\|^2$
Answer: The Euclidean distance between two vectors $u$ and $v$ is defined as the length of the vector difference $w = u - v$. Concretely, if
$$w = u - v = (w_1, w_2, \ldots, w_d),$$
then the Euclidean distance is given by
$$\|u - v\| = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}.$$
When you square this distance, you get
$$\|u - v\|^2 = w_1^2 + w_2^2 + \cdots + w_d^2.$$
This is exactly the same as computing the dot product of the vector $w$ with itself:
$$w^\top w = w_1^2 + w_2^2 + \cdots + w_d^2.$$
Thus, we have
$$\|u - v\|^2 = (u - v)^\top (u - v).$$
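A two-line numerical check of this identity (NumPy sketch with arbitrary example vectors):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 8.0])
w = u - v

dist_sq = np.linalg.norm(w) ** 2   # squared Euclidean distance
dot = w @ w                        # dot product of w with itself
assert np.isclose(dist_sq, dot)
```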
From SDE to conditional probability
The key idea is to solve the stochastic differential equation (SDE) by integrating its diffusion term and then computing the variance of the resulting stochastic integral. Here’s a step-by-step explanation:
1. Writing the SDE and Its Integral Form
You start with the SDE:
$$dx = \sigma^t\,dw,$$
where $w$ is a standard Wiener process (Brownian motion). Integrating both sides from $0$ to $t$ gives:
$$x(t) = x(0) + \int_0^t \sigma^s\,dw(s).$$
2. Distribution of the Stochastic Integral
The integral
$$\int_0^t \sigma^s\,dw(s)$$
is a stochastic integral with a deterministic integrand. By properties of such integrals, it is normally distributed with mean zero and covariance given by the Itô isometry:
$$\mathrm{Cov}\!\left[\int_0^t \sigma^s\,dw(s)\right] = \left(\int_0^t \sigma^{2s}\,ds\right) I,$$
where $I$ is the identity matrix.
3. Evaluating the Variance Integral
We need to compute the integral
$$\int_0^t \sigma^{2s}\,ds.$$
This can be computed by rewriting the integrand in exponential form:
$$\sigma^{2s} = e^{2s\ln\sigma}.$$
Thus,
$$\int_0^t \sigma^{2s}\,ds = \int_0^t e^{2s\ln\sigma}\,ds.$$
This is an elementary integral:
$$\int_0^t e^{2s\ln\sigma}\,ds = \frac{e^{2t\ln\sigma} - 1}{2\ln\sigma} = \frac{\sigma^{2t} - 1}{2\ln\sigma}.$$
4. Putting It All Together
Since
$$x(t) = x(0) + \int_0^t \sigma^s\,dw(s),$$
and the stochastic integral is Gaussian with mean $0$ and covariance $\frac{\sigma^{2t}-1}{2\ln\sigma}\,I$, it follows that:
$$x(t) \mid x(0) \sim \mathcal{N}\!\left(x(0),\; \frac{\sigma^{2t}-1}{2\ln\sigma}\,I\right).$$
Thus, the transition probability density is:
$$p(x(t) \mid x(0)) = \mathcal{N}\!\left(x(t);\; x(0),\; \frac{\sigma^{2t}-1}{2\ln\sigma}\,I\right).$$
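The closed-form variance can be checked against a direct simulation of the SDE. This is a sketch using a one-dimensional Euler–Maruyama discretization; the value of $\sigma$ is an arbitrary illustration, and the agreement is up to Monte Carlo and discretization error.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 25.0                      # hypothetical noise scale; any sigma > 1 works
T, n_steps, n_paths = 1.0, 500, 50_000
dt = T / n_steps

# Euler-Maruyama simulation of dx = sigma^t dw, one coordinate, x(0) = 0
x = np.zeros(n_paths)
t = 0.0
for _ in range(n_steps):
    x += sigma**t * np.sqrt(dt) * rng.standard_normal(n_paths)
    t += dt

var_empirical = x.var()
var_closed = (sigma ** (2 * T) - 1) / (2 * np.log(sigma))
# the two should agree up to Monte Carlo and discretization error
```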
Plug conditional probability into loss function
1. Gaussian Conditional Distribution and Its Score
The conditional distribution of $x_t$ given $x_0$ is a Gaussian:
$$p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; x_0,\; \sigma^2(t)\,I\right),$$
where the variance is given by
$$\sigma^2(t) = \frac{\sigma^{2t} - 1}{2\ln\sigma}.$$
For a Gaussian distribution with mean $\mu$ and covariance $\sigma^2 I$, its probability density function is:
$$p(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right).$$
Taking the logarithm, we get:
$$\log p(x) = -\frac{d}{2}\log(2\pi\sigma^2) - \frac{\|x-\mu\|^2}{2\sigma^2}.$$
The score function is defined as the gradient of the log-density with respect to $x$:
$$\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2}.$$
In our setting:
- The “mean” is $\mu = x_0$
- The variance is $\sigma^2 = \sigma^2(t)$
Thus, the score (the gradient of the log-density) becomes:
$$\nabla_{x_t} \log p(x_t \mid x_0) = -\frac{x_t - x_0}{\sigma^2(t)},$$
in other words:
$$\nabla_{x_t} \log p(x_t \mid x_0) = \frac{x_0 - x_t}{\sigma^2(t)}.$$
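The analytic score can be verified against a numerical gradient of the log-density. This is a NumPy sketch using central finite differences; the values of $x_0$, $x_t$, and $\sigma^2(t)$ are arbitrary illustrations.

```python
import numpy as np

d = 3
x0 = np.array([0.5, -1.0, 2.0])   # conditioning point (illustrative)
xt = np.array([0.7, -0.9, 1.5])   # point at which to evaluate the score
sigma2 = 0.64                     # sigma^2(t) at some fixed t (illustrative)

def log_p(x):
    # log N(x; x0, sigma2 * I), including the normalizing constant
    return -d / 2 * np.log(2 * np.pi * sigma2) - np.sum((x - x0) ** 2) / (2 * sigma2)

# analytic score from the derivation
score = -(xt - x0) / sigma2

# numerical gradient via central differences
eps = 1e-6
num_grad = np.array([
    (log_p(xt + eps * e) - log_p(xt - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
assert np.allclose(score, num_grad, atol=1e-5)
```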
2. Plugging into the Loss Function
The loss function is defined as:
$$\mathcal{L} = \mathbb{E}_{t}\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\!\left[\lambda(t)\,\big\|s_\theta(x_t, t) - \nabla_{x_t}\log p(x_t \mid x_0)\big\|^2\right].$$
Substituting the score:
$$\mathcal{L} = \mathbb{E}_{t}\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\!\left[\lambda(t)\,\Big\|s_\theta(x_t, t) + \frac{x_t - x_0}{\sigma^2(t)}\Big\|^2\right].$$
3. Complete loss function
Consider the loss
$$\mathcal{L} = \mathbb{E}\!\left[\lambda(t)\,\Big\|s_\theta(x_t, t) + \frac{x_t - x_0}{\sigma^2(t)}\Big\|^2\right].$$
We know that the magnitude (scale) of the true score varies with time, since the variance $\sigma^2(t)$ increases with time $t$.
This means the model must predict the correct scale for each time $t$ to balance the loss across all time steps. However, the model will likely struggle with this, so we need to assist it in finding the correct scale for each $t$. To do that:
- The “typical” difference $x_t - x_0$ is roughly $\sigma(t)$, since
$$x_t - x_0 = \sigma(t)\,z, \qquad z \sim \mathcal{N}(0, I);$$
that means its typical size (or scale) is $\sigma(t)$. In other words, although individual samples vary, most values of $x_t - x_0$ are on the order of $\sigma(t)$. “On the order of” is a shorthand for saying “approximately proportional to” or “roughly of the same scale as.”
- The “typical” magnitude of the true score: since the score is
$$\nabla_{x_t}\log p(x_t \mid x_0) = -\frac{x_t - x_0}{\sigma^2(t)},$$
its magnitude is roughly
$$\frac{\|x_t - x_0\|}{\sigma^2(t)}.$$
Given that $x_t - x_0$ is typically on the order of $\sigma(t)$, the typical magnitude of the true score becomes approximately
$$\frac{\sigma(t)}{\sigma^2(t)} = \frac{1}{\sigma(t)};$$
that means, on average at each time $t$, it will have magnitude (scale) equal to $\frac{1}{\sigma(t)}$.
- We simply divide the output of the model by $\sigma(t)$; then its scale will match the scale of the true score.
Lastly, we choose the weighting function
$$\lambda(t) = \sigma^2(t)$$
to avoid division (reason here), so our loss now:
$$\mathcal{L} = \mathbb{E}\!\left[\sigma^2(t)\,\Big\|s_\theta(x_t, t) + \frac{x_t - x_0}{\sigma^2(t)}\Big\|^2\right] = \mathbb{E}\!\left[\Big\|\sigma(t)\,s_\theta(x_t, t) + \frac{x_t - x_0}{\sigma(t)}\Big\|^2\right],$$
therefore, with $x_t - x_0 = \sigma(t)\,z$,
$$\mathcal{L} = \mathbb{E}\!\left[\big\|\sigma(t)\,s_\theta(x_t, t) + z\big\|^2\right].$$
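Putting the pieces together, a toy version of the weighted training objective might look like the following sketch. Here `model` is only a placeholder for a learned network, and `sigma_t` is the VE-SDE marginal standard deviation derived above; the batch values are random illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 25.0        # hypothetical noise scale of the VE SDE
d, batch = 3, 8

def sigma_t(t):
    # marginal std from the VE SDE: sigma^2(t) = (sigma^(2t) - 1) / (2 ln sigma)
    return np.sqrt((sigma ** (2 * t) - 1) / (2 * np.log(sigma)))

def model(x, t):
    # placeholder "network": raw output rescaled by 1/sigma(t) to match score scale
    raw = -x
    return raw / sigma_t(t)[:, None]

x0 = rng.standard_normal((batch, d))          # stand-in data batch
t = rng.uniform(1e-3, 1.0, size=batch)        # random time steps
z = rng.standard_normal((batch, d))
xt = x0 + sigma_t(t)[:, None] * z             # perturbed samples

# weighted loss with lambda(t) = sigma^2(t) pulled inside the norm:
# || sigma(t) * s_theta(x_t, t) + z ||^2
residual = sigma_t(t)[:, None] * model(xt, t) + z
loss = np.mean(np.sum(residual**2, axis=1))
```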
How to compute step size for PC
The step size $\epsilon$ is chosen adaptively to balance the contribution of the score (signal) and the injected noise in the Langevin MCMC update. Recall that the Langevin update is:
$$x_{i+1} = x_i + \epsilon\, s_\theta(x_i, t) + \sqrt{2\epsilon}\; z_i,$$
where:
- $s_\theta(x_i, t)$ is the score (i.e., an estimate of $\nabla_x \log p_t(x)$),
- $\epsilon$ is the step size, and
- $z_i \sim \mathcal{N}(0, I)$ is standard Gaussian noise.
The goal is to set $\epsilon$ such that the update from the score is proportional to the noise level, scaled by a desired signal-to-noise ratio (SNR) $r$. That is, we want the magnitude of the score update to be $r$ times the magnitude of the noise update.
- Magnitude of the Score Update:
The change due to the score is approximately
$$\epsilon\,\|g\|,$$
where $\|g\| = \|s_\theta(x_i, t)\|$ is the norm of the score.
- Magnitude of the Noise Update:
The noise term has a typical magnitude of
$$\sqrt{2\epsilon}\,\|z\|,$$
where $\|z\|$ is an estimate of the norm of a standard Gaussian noise vector (in $d$ dimensions $\mathbb{E}\|z\|^2 = d$, so $\|z\| \approx \sqrt{d}$ is the typical norm of a standard Gaussian vector in that space).
- Balancing the Two Terms:
To enforce a desired signal-to-noise ratio ($r$), we set:
$$\epsilon\,\|g\| = r\,\sqrt{2\epsilon}\,\|z\|.$$
- Solving for $\epsilon$:
Rearranging, we have:
$$\sqrt{\epsilon} = \frac{\sqrt{2}\,r\,\|z\|}{\|g\|}.$$
Squaring both sides gives:
$$\epsilon = 2\left(\frac{r\,\|z\|}{\|g\|}\right)^2.$$
So the step size is adaptively set to ensure that the update from the score is $r$ times the size of the noise update, balancing the two contributions in the Langevin MCMC step. This adaptive choice helps maintain stability and improves the quality of the refined sample during the corrector step.
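The adaptive step size and corrector update can be sketched as follows. This assumes a stand-in score function (the exact score of a standard normal, $-x$); in practice `score_fn` would be the trained network $s_\theta$, and the SNR value $r$ is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
r = 0.16                          # target signal-to-noise ratio (illustrative)
x = rng.standard_normal(d)        # current sample

def score_fn(x):
    # stand-in score: the exact score of N(0, I) is -x
    return -x

for _ in range(10):
    g = score_fn(x)
    z = rng.standard_normal(d)
    # adaptive step size: eps = 2 * (r * ||z|| / ||g||)^2
    eps = 2 * (r * np.linalg.norm(z) / np.linalg.norm(g)) ** 2
    # check the balance: score update is exactly r times the noise update
    assert np.isclose(eps * np.linalg.norm(g),
                      r * np.sqrt(2 * eps) * np.linalg.norm(z))
    # Langevin corrector step
    x = x + eps * g + np.sqrt(2 * eps) * z
```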