These are the overall frameworks, or recipes, for generating new data. They define the strategy for learning the data distribution and generating samples from it.

TL;DR

The normalizing constant $Z_\theta$ is crucial for accurate probability calculations in generative models. For discrete data (e.g. classification, token prediction), $Z_\theta$ is easily computed: the space of possible outcomes is finite and often manageable. This makes calculating $Z_\theta$ feasible, leading to exact, properly normalized probabilities and strong model performance. However, with continuous data (e.g. image generation, audio waveforms, time series), calculating $Z_\theta$ is intractable, making it hard to obtain true probability density functions (PDFs) and hindering performance. Score-based models solve this by learning the gradient of the log density (the score function) instead of the density itself, bypassing $Z_\theta$ and enabling impressive results in continuous data generation.
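In symbols (a brief sketch using the energy-based parameterization $p_\theta(x) = e^{-f_\theta(x)} / Z_\theta$ that is introduced in the next section):

```latex
\begin{align*}
  % Discrete data: Z_theta is a finite sum (e.g., a softmax denominator), so it is computable.
  Z_\theta &= \sum_{x \in \mathcal{X}} e^{-f_\theta(x)} \\
  % Continuous data: Z_theta is a high-dimensional integral, generally intractable.
  Z_\theta &= \int e^{-f_\theta(x)} \, \mathrm{d}x \\
  % The score never touches Z_theta, because \log Z_\theta is constant in x.
  \nabla_x \log p_\theta(x)
    &= \nabla_x \left( -f_\theta(x) - \log Z_\theta \right)
     = -\nabla_x f_\theta(x)
\end{align*}
```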

How to create a generative model

What is a PDF?

Suppose we are given a dataset $\{x_1, x_2, \dots, x_N\}$, where each point is drawn independently from an underlying data distribution $p(x)$. Given this dataset, the goal of generative modeling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution.

In order to build such a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f.). Let $f_\theta(x) \in \mathbb{R}$ be a real-valued function parameterized by a learnable parameter $\theta$. We can define a p.d.f. via

$$p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}$$

where

  • $Z_\theta = \int e^{-f_\theta(x)}\,dx > 0$ is a normalizing constant that makes the total area (probability) equal to 1
  • $e^{-f_\theta(x)} \ge 0$ for any value of $x$; $f_\theta(x)$ is also called an unnormalized probabilistic model, or energy-based model
  • $\theta$ is a learnable parameter
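Below is a minimal PyTorch sketch of this parameterization (the network `EnergyNet`, its sizes, and the toy data are illustrative assumptions, not from the original text). It only produces the unnormalized value $e^{-f_\theta(x)}$; turning that into an actual p.d.f. would still require $Z_\theta$.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """f_theta: a real-valued function of x with learnable parameters theta."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),          # scalar energy f_theta(x)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

f_theta = EnergyNet(dim=2)
x = torch.randn(16, 2)                     # a batch of 2-D points

unnormalized = torch.exp(-f_theta(x))      # e^{-f_theta(x)} >= 0
# p_theta(x) = unnormalized / Z_theta, where Z_theta = integral of e^{-f_theta(x)} dx
# is unknown; this is exactly the quantity discussed next.
```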

We can train $p_\theta(x)$ by maximizing the log-likelihood of the data

$$\max_\theta \sum_{i=1}^N \log p_\theta(x_i),$$

but to compute $\log p_\theta(x) = -f_\theta(x) - \log Z_\theta$ we need to find $Z_\theta$, which is hard for several key reasons:

  1. High-Dimensional Integration: $Z_\theta = \int e^{-f_\theta(x)}\,dx$ requires computing an integral over the entire input space. For high-dimensional data (like images), this becomes computationally intractable due to the curse of dimensionality (see the sketch after this list).
  2. Parameter Dependence: The normalizing constant depends on the model parameters $\theta$. This means it needs to be recomputed every time the parameters are updated during training, making the optimization process extremely expensive.
  3. Non-Analytical Form: For a complex $f_\theta$ (like a neural network), the integral typically doesn’t have a closed-form solution. This means numerical methods would be needed, which are often impractical for high-dimensional spaces.
  4. Global Computation: Computing $Z_\theta$ requires integrating over the entire input space, not just the observed data points. This makes it particularly challenging for high-dimensional data, where most of the space contains negligible probability mass.
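To make reason 1 concrete, here is a minimal sketch (assumptions: PyTorch, a toy 1-D Gaussian energy) of brute-force grid estimation of $Z_\theta$. It works in one dimension, but the number of grid evaluations grows as points_per_dim ** dim, which is hopeless for image-sized inputs.

```python
import torch

def grid_estimate_Z(f_theta, low=-5.0, high=5.0, points_per_dim=1000, dim=1):
    """Riemann-sum estimate of Z_theta = integral of exp(-f_theta(x)) dx on a regular grid."""
    axes = [torch.linspace(low, high, points_per_dim)] * dim
    grids = torch.meshgrid(*axes, indexing="ij")
    x = torch.stack([g.reshape(-1) for g in grids], dim=-1)   # (points^dim, dim)
    cell_volume = ((high - low) / (points_per_dim - 1)) ** dim
    return torch.exp(-f_theta(x)).sum() * cell_volume

# Toy check: for f(x) = x^2 / 2 (a standard Gaussian energy), Z = sqrt(2*pi) ~ 2.5066.
f = lambda x: 0.5 * (x ** 2).sum(dim=-1)
print(grid_estimate_Z(f, dim=1))   # ~ 2.5066

# For a 64x64 RGB image, dim = 64 * 64 * 3 = 12288; even a coarse 10-point grid
# per dimension would need 10**12288 evaluations of f_theta.
```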

Generative models can be split into two main types:

Explicit Density: Tries to explicitly model $p_\theta(x)$ (e.g., VAEs, Flow models).

Implicit Density: Learns to generate samples without explicitly modeling $p_\theta(x)$ (e.g., GANs, Diffusion Models).

| Category | Subcategory | Normalizing Constant $Z_\theta$ | Example Models |
| --- | --- | --- | --- |
| Explicit Density (Likelihood) | Tractable Density | Computable | Autoregressive (GPT, PixelCNN), Normalizing Flows (RealNVP) |
| Explicit Density (Likelihood) | Approximate Density | Approximated | Variational Autoencoders (VAEs) |
| Implicit Density (Sampling) | Generative Adversarial Networks | Avoided | GANs (StyleGAN, BigGAN) |
| Implicit Density (Sampling) | Diffusion Models | Avoided | DDPMs, Score-based (Stable Diffusion, DALL-E 2) |

Note: Transformers are not a type of generative model; the Transformer is a neural network architecture, a tool that can be integrated within any of these generative models.

1. Likelihood-Based Models (Explicit Density)

Definition: These models directly define a probability distribution $p_\theta(x)$ over the data space, parameterized by $\theta$. The goal is to find parameters $\theta$ that maximize the likelihood of the observed data.

Normalizing Constant ($Z_\theta$): The key challenge here is often the normalizing constant $Z_\theta$, which ensures that the distribution integrates to 1.

Subcategories

  1. Tractable Density Models: These models are designed such that $Z_\theta$ is either analytically computable or can be efficiently estimated. Examples include:
    • Autoregressive Models: They decompose the joint probability of a data point into a product of conditional probabilities, $p_\theta(x) = \prod_i p_\theta(x_i \mid x_{<i})$, where each conditional is normalized via a softmax (for discrete data), making the likelihood tractable (see the sketch after this list).

    • Normalizing Flows: These models learn a series of invertible transformations to map a simple base distribution (like a Gaussian) to the complex data distribution. The change-of-variables formula makes it possible to compute the likelihood exactly. Examples include RealNVP and Glow.

  2. Approximate Density Models: $Z_\theta$ is intractable to compute directly, so these models rely on approximations:
    • Variational Autoencoders (VAEs): They use variational inference to approximate the posterior distribution and the marginal likelihood (evidence) by maximizing a lower bound called the Evidence Lower Bound (ELBO). The ELBO can be seen as a way to deal with the intractable $Z_\theta$.
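Here is a minimal sketch of the autoregressive factorization mentioned above (the vocabulary size, sequence length, and the stand-in `next_token_logits` function are illustrative assumptions, with random logits in place of a real model such as a Transformer decoder):

```python
import torch
import torch.nn.functional as F

# Autoregressive factorization over discrete tokens:
# log p_theta(x) = sum_i log p_theta(x_i | x_{<i}), each conditional a softmax.
vocab_size, seq_len = 100, 8
x = torch.randint(vocab_size, (seq_len,))          # an observed token sequence

def next_token_logits(prefix: torch.Tensor) -> torch.Tensor:
    # Stand-in for a real autoregressive model conditioned on `prefix`.
    return torch.randn(vocab_size)

log_prob = torch.tensor(0.0)
for i in range(seq_len):
    logits = next_token_logits(x[:i])
    # The softmax denominator is a finite sum over vocab_size outcomes, so each
    # conditional is exactly normalized and the full likelihood is tractable.
    log_prob = log_prob + F.log_softmax(logits, dim=-1)[x[i]]

print(log_prob)   # log-likelihood of the sequence; training maximizes this over theta
```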

2. Implicit Density Models

Definition: These models don’t explicitly define a probability density function. Instead, they learn a process to generate samples from the target distribution, often without having access to $p_\theta(x)$ or $Z_\theta$.

Normalizing Constant ($Z_\theta$): Implicit models generally avoid dealing with $Z_\theta$ altogether.

Subcategories

  1. Generative Adversarial Networks (GANs): GANs train two networks – a generator and a discriminator – in an adversarial game. The generator learns to produce realistic samples to fool the discriminator, while the discriminator learns to distinguish between real and generated samples. The generator implicitly learns to sample from the data distribution. Examples include: DCGAN, StyleGAN, BigGAN
  2. Diffusion Models: Inspired by non-equilibrium thermodynamics, diffusion models learn to reverse a gradual noising process. They start with the data distribution and progressively add noise until the data is completely destroyed. The model then learns to reverse this process, generating samples by gradually removing noise from a pure noise input.
    • Score-based Generative Models: Learn the score function (the gradient of the log-density) of the data distribution at different noise levels. Sampling is done via methods like Langevin dynamics, sketched below. Examples include: Stable Diffusion, Imagen, DALL-E 2, DDPMs
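Because the score of the energy-based model is $\nabla_x \log p_\theta(x) = -\nabla_x f_\theta(x)$, the normalizing constant never appears, which is why a learned score is enough to draw samples. Below is a minimal sketch of (unadjusted) Langevin dynamics, assuming `score_fn` is some already-trained score estimator; the step size, step count, and the toy Gaussian check are illustrative choices, not a recipe from the original text.

```python
import torch

def langevin_sample(score_fn, shape, n_steps=1000, step_size=1e-3):
    """Unadjusted Langevin dynamics: x <- x + (eps / 2) * score(x) + sqrt(eps) * z.

    `score_fn(x)` should return an estimate of grad_x log p(x); Z_theta is never needed.
    """
    x = torch.randn(shape)                          # start from pure noise
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
    return x

# Toy check: for a standard Gaussian, score(x) = -x, so the samples should be
# approximately standard normal.
samples = langevin_sample(lambda x: -x, shape=(5000, 2))
print(samples.mean(dim=0), samples.std(dim=0))      # roughly 0 and roughly 1
```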