The Problem Normalization Addresses: Internal Covariate Shift

Before diving into the specifics, it’s crucial to understand the problem these techniques aim to solve: Internal Covariate Shift.

  • Covariate Shift: In machine learning, covariate shift refers to a change in the distribution of a model's inputs between training and test time.
  • Internal Covariate Shift (ICS): In deep neural networks, ICS refers to the change in the distribution of activations inside the network during training: as the parameters of earlier layers are updated, the input distribution seen by later layers shifts as well (illustrated by the sketch below).
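
To make this concrete, here is a minimal sketch (assuming a toy two-layer PyTorch MLP trained on synthetic data with a deliberately large learning rate) that logs the mean and standard deviation of the activations feeding the last layer at each step; the drift in those statistics from step to step is the internal covariate shift described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer MLP on synthetic data (illustrative only).
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.5)  # large lr exaggerates the shift
x = torch.randn(256, 20)
y = torch.randn(256, 1)

for step in range(5):
    # Distribution of the inputs seen by the last layer at this point in training.
    h = net[1](net[0](x))
    print(f"step {step}: hidden mean={h.mean():.3f}, std={h.std():.3f}")
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```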

Why is ICS a problem?

  • Training Instability: Each layer must continually adapt to a shifting input distribution, making it harder for the network to learn stable representations.
  • Vanishing/Exploding Gradients: ICS can contribute to the vanishing and exploding gradient problems, especially in deep networks.
  • Slower Convergence: This constant re-adaptation slows convergence, typically forcing smaller learning rates and more training steps.

Normalization Techniques: The Solutions

| Feature | Batch Normalization (BatchNorm) | Layer Normalization (LayerNorm) | Group Normalization (GroupNorm) |
| --- | --- | --- | --- |
| Normalization Dimension | Across the batch (N, H, W), per channel (C) | Across features (C, H, W), per sample (N) | Across channel groups (C/G, H, W), per sample (N) |
| Batch Size Dependence | Highly dependent; degrades with small batches | Independent; works well with small batches | Independent (statistics are per sample); works well with small batches |
| Suitable Architectures | CNNs (with large batches) | RNNs, Transformers, CNNs | CNNs (especially with small batches), object detection, segmentation |
| Synchronization across GPUs | Often needed (SyncBatchNorm) for accurate batch statistics | Not required | Not required |
| Inference Behavior | Uses moving averages of batch statistics | Consistent with training | Consistent with training |
| Hyperparameters | None (besides learnable scale/shift) | None (besides learnable scale/shift) | Number of groups (G), plus learnable scale/shift |
| Strengths | Effective with large batches, speeds up training, mild regularization effect | Batch-size independent, works well with sequences, consistent at inference | Good for small batches, balances batch and feature normalization, effective in object detection |
| Weaknesses | Sensitive to batch size, not ideal for RNNs, may need cross-GPU synchronization, train/inference discrepancies | May be less effective than BatchNorm for CNNs in some cases, may weaken modeling of cross-feature relationships | Number of groups needs tuning, may not outperform BatchNorm with large batches |
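
For reference, the following minimal PyTorch sketch applies the standard torch.nn modules for the three techniques to a 4-D CNN activation of shape (N, C, H, W) and notes which axes each one normalizes over; the tensor shapes and the group count are illustrative choices, not values implied by the table.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)  # (N, C, H, W): batch of 8 feature maps, 32 channels

# BatchNorm2d: statistics over (N, H, W), separately for each of the C channels.
bn = nn.BatchNorm2d(num_features=32)

# LayerNorm: statistics over (C, H, W), separately for each of the N samples.
ln = nn.LayerNorm(normalized_shape=[32, 16, 16])

# GroupNorm: statistics over (C/G, H, W) within each of G channel groups,
# separately for each sample -- no dependence on the batch dimension.
gn = nn.GroupNorm(num_groups=8, num_channels=32)

for name, layer in [("BatchNorm", bn), ("LayerNorm", ln), ("GroupNorm", gn)]:
    y = layer(x)
    print(f"{name}: output shape {tuple(y.shape)}, "
          f"mean={y.mean():.4f}, std={y.std():.4f}")

# Inference note: BatchNorm switches to its running (moving-average) statistics
# in eval() mode, while LayerNorm and GroupNorm behave identically in train()
# and eval().
bn.eval()
_ = bn(torch.randn(1, 32, 16, 16))  # works even with batch size 1 at inference
```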