Machine Learning has become one of those buzzwords that many people use without really knowing what it means, let alone what it takes to undertake. The main goal is, as its name suggests, to teach computers how to make decisions. In general, we want computers to learn how to make these decisions so that we can task them with boring or repetitive work. Other times we use it to create autonomous systems, or to assist humans in improving productivity and making smarter business decisions.
Even though becoming proficient in modern Machine Learning requires a working understanding of many concepts and techniques from computer science, in the end it is nothing more than good old statistical modeling. From linear regression (generally attributed to Galton, 1877) to neural networks (introduced by McCulloch & Pitts in 1943), the theoretical underpinnings of Machine Learning have been around for decades, if not centuries. What has definitely changed in the past 10 years is our capacity to train these models and embed them in other systems.
This past semester I taught an undergraduate course on Machine Learning covering a wide range of methods, including LM, GLM, GAM, Decision Trees, Random Forests, Bagging, Boosting, and Neural Networks. Below I share some notes from the introductory lecture (“What is statistical learning?”) and links to the practice exercises we covered during the course.
The introduction is based on James, Witten, Hastie & Tibshirani (2013), An Introduction to Statistical Learning. Course materials also included Hastie, Tibshirani & Friedman (2009), The Elements of Statistical Learning, and some of the original papers for the different methods. The practice exercises are based on James, Witten, Hastie & Tibshirani (2013), An Introduction to Statistical Learning; Boehmke & Greenwell (2020), Hands-On Machine Learning with R; Hothorn & Everitt (2014), A Handbook of Statistical Analyses Using R; and Chollet & Allaire (2017), Deep Learning with R.
If we assume that \(Y = f(X) + \varepsilon\), with \(\mathbf{E}(\varepsilon) = 0\) and \(\varepsilon\) independent of \(X\), and we predict \(Y\) with \(\hat{Y} = \hat{f}(X)\), then (treating \(\hat{f}\) and \(X\) as fixed) the expected squared prediction error decomposes into a reducible and an irreducible part:

\[\begin{array}{rcl} \mathbf{E} \left[ \big( Y - \hat{Y} \big)^2 \right] & = & \mathbf{E} \left[ \big( f(X) + \varepsilon - \hat{f}(X) \big)^2 \right] \\ & = & \mathbf{E} \left[ \Big( \big( f(X) - \hat{f}(X) \big) + \varepsilon \Big)^2 \right] \\ & = & \mathbf{E} \left[ \big( f(X) - \hat{f}(X) \big)^2 + 2 \varepsilon \big( f(X) - \hat{f}(X) \big) + \varepsilon^2 \right] \\ & = & \mathbf{E} \left[ \big( f(X) - \hat{f}(X) \big)^2 \right] + 2 \mathbf{E} \left[ \varepsilon \big( f(X) - \hat{f}(X) \big) \right] + \mathbf{E} \left( \varepsilon^2 \right) \\ & = & \left[ f(X) - \hat{f}(X) \right]^2 + 2 \big( f(X) - \hat{f}(X) \big) \underbrace{ \mathbf{E} \left( \varepsilon \right) }_{= 0} + \mathbf{Var} \left( \varepsilon \right) \\ & = & \underbrace{ \left[ f(X) - \hat{f}(X) \right]^2 }_{ \text{reducible error} } + \underbrace{ \mathbf{Var} (\varepsilon) }_{ \substack{ \text{irreducible} \\ \text{error} } } \end{array}\]
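As a quick numerical illustration of the irreducible error (not part of the original lecture notes; the sine curve and noise level below are arbitrary assumptions), the following R sketch shows that even if we could use the true \(f\) as our estimate, the test MSE would not drop below \(\mathbf{Var}(\varepsilon)\):

```r
# Toy illustration of the irreducible error.
# Assumed for illustration: f(x) = sin(x) and epsilon ~ N(0, 0.5^2).
set.seed(1)
f     <- function(x) sin(x)   # "true" regression function (assumption)
sigma <- 0.5                  # sd of the irreducible noise

x_test <- runif(10000, 0, 2 * pi)
y_test <- f(x_test) + rnorm(10000, sd = sigma)

# Even the perfect estimate f_hat = f cannot beat Var(epsilon):
mean((y_test - f(x_test))^2)  # ~ 0.25 = sigma^2, the irreducible error
```

Any realistic \(\hat{f}\) can only add reducible error on top of that floor.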
\[\mathbf{E} \left[ \Big( y_0 - \hat{f}(x_0) \Big)^2 \right] = \mathbf{Var} \Big( \hat{f}(x_0) \Big) + \mathbf{Bias}^2 \Big( \hat{f}(x_0) \Big) + \mathbf{Var}(\varepsilon)\] where \(\mathbf{E} \left[ \Big( y_0 - \hat{f}(x_0) \Big)^2 \right]\) is the expected test MSE. That is, it’s the average test MSE that we would obtain if we repeatedly estimated \(f\) using a large number of training sets, and tested each at \(x_0\). The overall expected test MSE can be computed by averaging \(\mathbf{E} \left[ \Big( y_0 - \hat{f}(x_0) \Big)^2 \right]\) over all possible values of \(x_0\) in the test set.
Proof.
\[\begin{array}{r c l} \mathbf{E} \left[ \Big( y_0 - \hat{f}(x_0) \Big)^2 \right] & = & \mathbf{E} \Big[ \big( y_0 \color{magenta}{- \mathbf{E}(\hat{f}(x_0)) + \mathbf{E}(\hat{f}(x_0)) } - \hat{f}(x_0) \big)^2 \Big] \\ & = & \mathbf{E} \left\{ \left( \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right] + \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right] \right)^2 \right\} \\ & = & \mathbf{E} \left\{ \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right]^2 + 2 \, \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right] \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right] + \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right]^2 \right\} \\ & = & \underbrace{ \mathbf{E} \left\{ \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right]^2 \right\} }_{\text{Statement A}} + 2 \underbrace{ \mathbf{E} \left\{ \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right] \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right] \right\} }_{\text{Statement B}} + \underbrace{ \mathbf{E} \left\{ \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right]^2 \right\} }_{\text{Statement C}} \\ \end{array}\]
Proof. Statement A
\[\begin{array}{r c l} \mathbf{E} \left\{ \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right]^2 \right\} & = & \mathbf{E} \left\{ y_0^2 - 2 y_0 \mathbf{E} \big( \hat{f}(x_0) \big) + \mathbf{E}^2 \big( \hat{f}(x_0) \big) \right\} \\ & = & \mathbf{E} \left\{ \big( f(x_0) + \varepsilon \big)^2 - 2 \big( f(x_0) + \varepsilon \big) \mathbf{E} \big( \hat{f}(x_0) \big) + \mathbf{E}^2 \big( \hat{f}(x_0) \big) \right\} \\ & = & \mathbf{E} \left\{ \color{dodgerblue}{ f^2(x_0) } + \color{orange}{ \varepsilon^2 } + \color{magenta}{ 2 \varepsilon f(x_0) } - \color{dodgerblue}{ 2 f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) } - \color{red}{ 2 \varepsilon \mathbf{E} \big( \hat{f}(x_0) \big) } + \color{dodgerblue}{ \mathbf{E}^2 \big( \hat{f}(x_0) \big) } \right\} \\ & = & \color{dodgerblue}{ \mathbf{E} \left[ f^2(x_0) \right] } + \color{orange}{ \mathbf{E} \left( \varepsilon^2 \right) } + \color{magenta}{ 2 \mathbf{E} \left[ \varepsilon f(x_0) \right] } - \color{dodgerblue}{ 2 \mathbf{E} \left[ f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) \right] } - \color{red}{ 2 \mathbf{E} \left[ \varepsilon \mathbf{E} \big( \hat{f}(x_0) \big) \right] } + \color{dodgerblue}{ \mathbf{E} \left[ \mathbf{E}^2 \big( \hat{f}(x_0) \big) \right] } \\ & = & \color{orange}{ \mathbf{Var} ( \varepsilon ) } + \color{dodgerblue}{ f^2(x_0) } + \color{magenta}{ 2 f(x_0) } \underbrace{ \color{magenta}{ \mathbf{E} ( \varepsilon ) } } _{= 0} - \color{dodgerblue}{ 2 f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) } - \color{red}{ 2 } \underbrace{ \color{red}{ \mathbf{E} ( \varepsilon ) } }_{= 0} \color{red}{ \mathbf{E} \big( \hat{f}(x_0) \big) } + \color{dodgerblue}{ \mathbf{E}^2 \big( \hat{f}(x_0) \big) } \\ & = & \color{orange}{ \mathbf{Var} ( \varepsilon ) } + \color{dodgerblue}{ f^2(x_0) - 2 f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) + \mathbf{E}^2 \big( \hat{f}(x_0) \big) } \\ & = & \color{orange}{ \mathbf{Var} ( \varepsilon ) } + \color{dodgerblue}{ \mathbf{Bias}^2 \big( \hat{f}(x_0) \big) } \end{array}\]
Proof. Statement B
\[\begin{array}{r c l} \mathbf{E} \left\{ \left[ y_0 - \mathbf{E} \big( \hat{f}(x_0) \big) \right] \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right] \right\} & = & \mathbf{E} \left\{ y_0 \mathbf{E} \big( \hat{f}(x_0) \big) - y_0 \hat{f}(x_0) - \mathbf{E}^2 \big( \hat{f}(x_0) \big) + \mathbf{E} \big( \hat{f}(x_0) \big) \hat{f}(x_0) \right\} \\ & = & \mathbf{E} \left[ y_0 \mathbf{E} \big( \hat{f}(x_0) \big) \right] - \mathbf{E} \left[ y_0 \hat{f}(x_0) \right] - \mathbf{E} \left[ \mathbf{E}^2 \big( \hat{f}(x_0) \big) \right] + \mathbf{E} \left[\mathbf{E} \big( \hat{f}(x_0) \big) \hat{f}(x_0) \right] \\ & = & \mathbf{E} ( \color{orange}{ y_0 }) \mathbf{E} \big( \hat{f}(x_0) \big) - \mathbf{E} \left[ \color{dodgerblue}{ y_0 } \hat{f}(x_0) \right] - \mathbf{E}^2 \big( \hat{f}(x_0) \big) + \mathbf{E} \big( \hat{f}(x_0) \big) \mathbf{E} \big( \hat{f}(x_0) \big) \\ & = & \mathbf{E} \big( \color{orange}{ f(x_0) + \varepsilon } \big) \mathbf{E} \big( \hat{f}(x_0) \big) - \mathbf{E} \left[ \big( \color{dodgerblue}{ f(x_0) + \varepsilon } \big) \hat{f}(x_0) \right] \color{magenta}{ - \mathbf{E}^2 \big( \hat{f}(x_0) \big) + \mathbf{E}^2 \big( \hat{f}(x_0) \big) } \\ & = & \mathbf{E} \big( \color{orange}{ f(x_0) } \big) \mathbf{E} \big( \hat{f}(x_0) \big) + \underbrace{ \mathbf{E} \big( \color{orange}{ \varepsilon } \big) }_{= 0} \mathbf{E} \big( \hat{f}(x_0) \big) - \mathbf{E} \left[ \big( \color{dodgerblue}{ f(x_0) + \varepsilon } \big) \hat{f}(x_0) \right] \\ & = & \color{orange}{ f(x_0) } \mathbf{E} \big( \hat{f}(x_0) \big) - \mathbf{E} \left[ \color{dodgerblue}{ f(x_0) } \hat{f}(x_0) \right] - \mathbf{E} \left[ \color{dodgerblue}{ \varepsilon } \hat{f}(x_0) \right] \\ & = & \color{magenta}{ f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) - f(x_0) \mathbf{E} \big( \hat{f}(x_0) \big) } - \mathbf{E} \left[ \color{dodgerblue}{ \varepsilon } \hat{f}(x_0) \right] \\ & = & - \mathbf{E} \left[ \color{dodgerblue}{ \varepsilon } \hat{f}(x_0) \right] \\ & = & - \underbrace{ \mathbf{E} ( \color{dodgerblue}{ \varepsilon } ) }_{= 0} \mathbf{E} \big( \hat{f}(x_0) \big) \\ & = & 0 \end{array}\]
Proof. Statement C
\[\begin{array}{r c l} \mathbf{E} \left\{ \left[ \mathbf{E} \big( \hat{f}(x_0) \big) - \hat{f}(x_0) \right]^2 \right\} & = & \mathbf{E} \left\{ \mathbf{E}^2 \big( \hat{f}(x_0) \big) - 2 \mathbf{E} \big( \hat{f}(x_0) \big) \hat{f}(x_0) + \hat{f}^2 (x_0) \right\} \\ & = & \mathbf{E} \left\{ \mathbf{E}^2 \big( \hat{f}(x_0) \big) \right\} - 2 \mathbf{E} \left\{ \mathbf{E} \big( \hat{f}(x_0) \big) \hat{f}(x_0) \right\} + \mathbf{E} \left\{ \hat{f}^2 (x_0) \right\} \\ & = & \mathbf{E}^2 \big( \hat{f}(x_0) \big) - 2 \mathbf{E}^2 \big( \hat{f}(x_0) \big) + \mathbf{E} \left( \hat{f}^2 (x_0) \right) \\ & = & - \mathbf{E}^2 \big( \hat{f}(x_0) \big) + \mathbf{E} \left( \hat{f}^2 (x_0) \right) \\ & = & \mathbf{Var} \big( \hat{f}(x_0) \big) \end{array}\]
Proof. Finally, since Statement B is equal to zero, summing Statements A and C gives \[\mathbf{E} \left[ \Big( y_0 - \hat{f}(x_0) \Big)^2 \right] = \mathbf{Var} \Big( \hat{f}(x_0) \Big) + \mathbf{Bias}^2 \Big( \hat{f}(x_0) \Big) + \mathbf{Var}(\varepsilon)\]
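To see the decomposition in action, here is a small Monte Carlo sketch in R (the data-generating process, sample size, and the deliberately rigid straight-line fit are all assumptions made for illustration, not part of the course materials). It repeatedly draws training sets, records \(\hat{f}(x_0)\) for each, and compares the simulated expected test MSE at \(x_0\) with \(\mathbf{Var}\big(\hat{f}(x_0)\big) + \mathbf{Bias}^2\big(\hat{f}(x_0)\big) + \mathbf{Var}(\varepsilon)\):

```r
# Monte Carlo check of the bias-variance decomposition at a single test point x0.
set.seed(1)
f     <- function(x) sin(x)   # assumed true regression function
sigma <- 0.5                  # assumed noise sd
x0    <- 2                    # fixed test point
n     <- 50                   # training-set size
B     <- 5000                 # number of simulated training sets

# Fit a rigid (linear) model to each training set and predict at x0
fhat_x0 <- replicate(B, {
  x   <- runif(n, 0, 2 * pi)
  y   <- f(x) + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  predict(fit, newdata = data.frame(x = x0))
})

# Independent test responses at x0
y0 <- f(x0) + rnorm(B, sd = sigma)

mean((y0 - fhat_x0)^2)                                  # simulated expected test MSE at x0
var(fhat_x0) + (mean(fhat_x0) - f(x0))^2 + sigma^2      # Var(f_hat) + Bias^2(f_hat) + Var(eps)
```

Both quantities should agree up to simulation noise, with most of the error here coming from the squared bias of the linear fit.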
As a general rule, increasing the flexibility of a method increases the variance of \(\hat{f}\) and decreases its bias: \(\uparrow\) Flexibility \(\Rightarrow\) \(\uparrow\) Variance, \(\downarrow\) Bias
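A minimal sketch of that tradeoff, reusing the same assumed data-generating process as above and swapping in polynomial fits of increasing degree as a stand-in for “flexibility”:

```r
# Variance and squared bias of f_hat(x0) as polynomial degree (flexibility) grows
set.seed(1)
f     <- function(x) sin(x)   # assumed true regression function
sigma <- 0.5                  # assumed noise sd
x0    <- 2
n     <- 50
B     <- 2000

for (degree in c(1, 3, 5, 9)) {
  fhat_x0 <- replicate(B, {
    x   <- runif(n, 0, 2 * pi)
    y   <- f(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = x0))
  })
  cat(sprintf("degree %d: variance = %.4f, bias^2 = %.4f\n",
              degree, var(fhat_x0), (mean(fhat_x0) - f(x0))^2))
}
```

As the degree grows, the variance of \(\hat{f}(x_0)\) increases while its squared bias shrinks, which is exactly the tension the bias-variance decomposition makes explicit.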
In the following links you can find the scripts for the different practice exercises we covered during the course.