K-nearest neighbors (KNN)

Classify observations according to the majority class of the K nearest neighbors.

· Define a distance measure of proximity between observations, e.g. Euclidean distance.

· It is common practice to standardize each variable to mean zero and variance one before computing distances.

· K is a positive integer of your choice. Small values of K give low bias but high variance; large values give low variance but high bias.
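The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the function names `standardize` and `knn_predict` are chosen here for the example.

```python
from collections import Counter
import math

def standardize(X):
    """Scale each variable to mean zero and variance one (common practice for KNN)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training observations."""
    # Sort all training points by Euclidean distance to x.
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train))
    # Majority class among the k nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with two well-separated clusters labelled 0 and 1, a query point near the first cluster is classified as 0 by a 3-nearest-neighbor vote.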


Overfitting and underfitting

We prefer a simple model

We prefer models that predict well (low expected prediction error, EPE)

These two goals pull in opposite directions.

Too simple: our data set will not be accurately described; the model assumptions are wrong.

Too complex: the model becomes too flexible and we fit the noise in the data. We also need a lot of data to support such a model.

Cross-validation