The Problem Normalization Addresses: Internal Covariate Shift
Before diving into the specifics, it’s crucial to understand the problem these techniques aim to solve: Internal Covariate Shift.
- Covariate Shift: In machine learning, covariate shift refers to a change in the distribution of a model's inputs between training and test time.
- Internal Covariate Shift (ICS): In deep neural networks, ICS refers to the change in the distribution of each layer's inputs (the activations produced by earlier layers) during training. As the parameters of earlier layers are updated, the input distribution seen by later layers keeps shifting (see the sketch after this list).
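To make ICS concrete, here is a minimal sketch, assuming PyTorch and a toy two-layer MLP (both illustrative choices, not anything prescribed above). It logs the mean and standard deviation of the second layer's input while the first layer's weights are being updated; the drift in those statistics is exactly the shifting distribution described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer MLP: the input distribution of layer2 depends on layer1's weights.
layer1 = nn.Linear(32, 64)
layer2 = nn.Linear(64, 1)
act = nn.ReLU()
opt = torch.optim.SGD(list(layer1.parameters()) + list(layer2.parameters()), lr=0.05)

for step in range(201):
    x = torch.randn(128, 32)   # the raw data distribution never changes
    y = torch.randn(128, 1)    # dummy regression targets
    h = act(layer1(x))         # this is the input that layer2 sees
    loss = nn.functional.mse_loss(layer2(h), y)

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 50 == 0:
        # These statistics drift as layer1's weights change, even though the
        # data distribution is fixed: that drift is internal covariate shift.
        print(f"step {step:3d}  h.mean={h.mean().item():+.3f}  h.std={h.std().item():.3f}")
```

With a normalization layer inserted after layer1, those statistics would stay roughly fixed no matter how layer1's weights move, which is the intuition behind the techniques compared below.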
Why is ICS a problem?
- Training Instability: ICS can make training slower and more difficult. Each layer needs to constantly adapt to the changing input distribution, making it harder for the network to learn stable representations.
- Vanishing/Exploding Gradients: ICS can contribute to the vanishing and exploding gradient problems, especially in deep networks.
- Slower Convergence: The constant adaptation to shifting distributions slows down the convergence of the training process.
Normalization Techniques: The Solutions
Feature | Batch Normalization (BatchNorm) | Layer Normalization (LayerNorm) | Group Normalization (GroupNorm) |
---|---|---|---|
Normalization Dimension | Batch and spatial dims (N, H, W) per channel (C) | Feature dims (C, H, W) per sample (N) | Channel groups (C/G, H, W) per sample (N) |
Batch Size Dependence | Highly dependent, degrades with small batches | Independent, works well with small batches | Independent, works well with small batches |
Suitable Architectures | CNNs (with large batches) | RNNs, Transformers, CNNs | CNNs (especially with small batches), Object Detection, Segmentation |
Synchronization across GPUs | Needed (e.g. SyncBatchNorm) for accurate batch statistics when per-GPU batches are small | Not required | Not required |
Inference Behavior | Uses moving averages of batch statistics | Consistent with training | Consistent with training |
Hyperparameters | None (besides learnable scale/shift) | None (besides learnable scale/shift) | Number of groups (G), learnable scale/shift |
Strengths | Effective with large batches, speeds up training, regularization effect | Batch size independent, works well with sequences, consistent inference | Good for small batches, balances batch and feature normalization, effective in object detection |
Weaknesses | Batch size sensitive, not ideal for RNNs, may need cross-GPU synchronization, train/inference behavior differs | May be less effective than BatchNorm for CNNs in some cases, normalizing all channels together can blur channel-specific statistics | Requires tuning the number of groups, may not outperform BatchNorm with large batches |
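The table maps directly onto PyTorch's built-in layers. The sketch below (the tensor shape, the choice of G = 8 groups, and the print loop are arbitrary illustrative assumptions) applies all three to the same (N, C, H, W) activation, with comments noting which axes each layer computes its statistics over, and shows the one behavior that differs at inference: BatchNorm switching to its running averages.

```python
import torch
import torch.nn as nn

# A CNN-style activation tensor: (N, C, H, W) = (batch, channels, height, width).
x = torch.randn(16, 32, 8, 8)

# BatchNorm2d: statistics over (N, H, W), separately for each channel C.
bn = nn.BatchNorm2d(num_features=32)

# LayerNorm: statistics over (C, H, W), separately for each sample N.
# No dependence on the batch dimension at all.
ln = nn.LayerNorm(normalized_shape=[32, 8, 8])

# GroupNorm: channels split into G groups; statistics over (C/G, H, W)
# per group, per sample. Also batch-size independent. G is a hyperparameter.
gn = nn.GroupNorm(num_groups=8, num_channels=32)

for name, layer in [("BatchNorm", bn), ("LayerNorm", ln), ("GroupNorm", gn)]:
    y = layer(x)
    print(f"{name:10s} out: {tuple(y.shape)}  mean={y.mean().item():+.4f}")

# BatchNorm behaves differently at inference time: in eval mode it uses the
# running (moving-average) statistics accumulated during training instead of
# the current batch, so even a tiny batch is handled consistently.
bn.eval()
with torch.no_grad():
    y_eval = bn(x[:2])
```

Note that num_groups must evenly divide num_channels. Setting num_groups=1 makes GroupNorm behave like LayerNorm over (C, H, W) (up to the shape of the learnable affine parameters), while num_groups=num_channels recovers InstanceNorm, so the group count interpolates between the two.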