Reading

The mandatory reading material for the week is

Exercises

During the exercise session, we will work on

Slides

Notes

Week 3 – Tricks of the Trade

🔄 Recap

- Neural networks consist of layers: h^{(\ell)} = \sigma(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)})
- Training involves:
  - A loss function (e.g. negative log-likelihood / cross-entropy).
  - Gradient descent via backpropagation.
  - Careful initialization (to avoid vanishing/exploding gradients).
- Model complexity is determined by:
  - The number of hidden units.
  - The number of hidden layers.
- More complexity → better fit, but a higher risk of overfitting.
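The layer recursion above can be sketched in NumPy. This is a minimal illustration, not the course's reference implementation: the layer sizes, the tanh nonlinearity on hidden layers, and the small-scale Gaussian initialization are all illustrative choices.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass through a stack of fully connected layers:
    h^(l) = sigma(W^(l) h^(l-1) + b^(l)), with tanh on the hidden
    layers and a linear output layer (illustrative choices)."""
    h = x
    for layer, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = np.tanh(z) if layer < len(weights) - 1 else z
    return h

# Two hidden units, one output; weights drawn with small scale so
# tanh does not saturate (cf. "careful initialization" above).
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (2, 3)), rng.normal(0, 0.1, (1, 2))]
biases = [np.zeros(2), np.zeros(1)]
y = forward(np.ones(3), weights, biases)
```

Adding hidden layers or units only changes the shapes in `weights` and `biases`, which is exactly the complexity knob mentioned above.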

📏 Measuring Performance

- Generalisation error: E_{\text{gen}} = \mathbb{E}_{(x,y)\sim p} \, L(f_\theta(x), y)
- The true measure of performance is the error on unseen data.
- Estimated via Monte Carlo using held-out data.
- Bias–variance tradeoff:
  - Low complexity → high bias, low variance (underfitting).
  - High complexity → low bias, high variance (overfitting).
  - Sweet spot: balance between bias and variance.
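The Monte Carlo estimate of the generalisation error is just the average loss over held-out samples. A small sketch under assumed toy data (a noisy linear target and squared loss, neither of which comes from the notes):

```python
import numpy as np

def held_out_error(f, loss, X_val, y_val):
    """Monte Carlo estimate of E_gen: average the loss of predictor f
    over held-out samples drawn from the data distribution p."""
    return np.mean([loss(f(x), y) for x, y in zip(X_val, y_val)])

# Hypothetical setup: y = 2x + Gaussian noise with std 0.1.
rng = np.random.default_rng(1)
X_val = rng.normal(size=(1000, 1))
y_val = 2.0 * X_val[:, 0] + rng.normal(scale=0.1, size=1000)

sq_loss = lambda yhat, y: (yhat - y) ** 2
# A predictor that matches the true mean: its held-out error should
# be close to the irreducible noise variance (0.01).
err = held_out_error(lambda x: 2.0 * x[0], sq_loss, X_val, y_val)
```

The larger the held-out set, the lower the variance of this estimate, which is why the test set should not be too small.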

⚖️ Model Selection

- Split the dataset into train / validation / test sets.
- Train on the training set.
- Tune hyperparameters on the validation set.
- Evaluate generalisation on the test set.
- Avoid estimating generalisation from the training error (optimistic) or the validation error (biased by hyperparameter tuning).
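The three-way split can be done by shuffling indices once and cutting them into disjoint sets. A minimal sketch (the 60/20/20 fractions are an assumed, common choice, not prescribed by the notes):

```python
import numpy as np

def three_way_split(n, rng, frac=(0.6, 0.2, 0.2)):
    """Shuffle indices 0..n-1 and cut them into disjoint
    train / validation / test index sets."""
    idx = rng.permutation(n)
    n_tr = int(frac[0] * n)
    n_va = int(frac[1] * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

rng = np.random.default_rng(2)
train_idx, val_idx, test_idx = three_way_split(100, rng)
# The three sets are disjoint and together cover every example once,
# so the test error is never contaminated by training or tuning.
```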

🛡️ Regularisation

- Explicit regularisation:
  - Weight decay (L2): penalizes large weights.
  - L1 regularisation: induces sparsity.
- Implicit regularisation:
  - Gradient descent naturally biases toward simpler solutions.
  - Stochastic gradient descent (SGD) adds noise, discouraging overfitting.
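Weight decay adds a term λ/2 · ‖w‖² to the loss, so its gradient contribution is λw, which shrinks the weights toward zero on every step. A minimal sketch isolating that effect (the data-loss terms are set to zero here purely for illustration):

```python
import numpy as np

def l2_regularised(w, data_loss, data_grad, lam):
    """Add weight decay lam/2 * ||w||^2 to a loss; its gradient
    contribution lam * w pulls the weights toward zero."""
    loss = data_loss + 0.5 * lam * np.sum(w ** 2)
    grad = data_grad + lam * w
    return loss, grad

# One gradient-descent step on the decay term alone: with learning
# rate eta, the weights shrink multiplicatively by (1 - eta * lam).
w = np.array([1.0, -2.0])
_, g = l2_regularised(w, data_loss=0.0, data_grad=np.zeros(2), lam=0.1)
w_new = w - 1.0 * g  # learning rate eta = 1.0, so w_new = 0.9 * w
```

L1 regularisation instead adds λ·‖w‖₁, whose constant-magnitude gradient λ·sign(w) drives small weights exactly to zero, which is why it induces sparsity.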