Chapter 7 Machine Learning

Parametric models, such as generalized linear regression and logistic regression, have both advantages and disadvantages.

Strengths:

  • The effects of individual predictors on the outcome are easily understood
  • Statistical inference, such as hypothesis testing or interval estimation, is straightforward
  • Methods and procedures for selecting, comparing, and summarizing these models are well-established and extensively studied

Disadvantages arise in the following scenarios:

  • Complex, non-linear relationships between predictors and the outcome
  • High degrees of interaction between predictors
  • Nominal outcome variables with several categories

In these situations, non-parametric or algorithmic modeling approaches have the potential to better capture the underlying trends in the data.

Here we introduce three models: classification and regression trees (CART), random forests, and k-nearest neighbors.

  • Classification and regression trees (CART) are “trained” by recursively partitioning the \(p\)-dimensional space (defined by the explanatory variables) until an acceptable level of homogeneity or “purity” is achieved within each partition (a fitted example appears in the sketch following this list).

  • A major issue with tree-based models is that they tend to be high variance (and hence have a strong propensity toward over-fitting). The random forest is a non-parametric, tree-based modeling algorithm built on the idea that averaging a set of independent elements yields an outcome with lower variability than any individual element in the set.

    This general concept should seem familiar. Thinking back to your introductory statistics course, recall that the sample mean, \(\overline{x}\), of a dataset has substantially less variability (\(\frac{\sigma}{\sqrt{n}}\)) than the individual data points themselves (\(\sigma\)).
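
To make these two ideas concrete, here is a minimal sketch in Python using scikit-learn; the synthetic dataset, the hyperparameters, and the choice of scikit-learn itself are assumptions made for illustration, not details from this chapter. A single tree is grown by recursive partitioning, then a forest of 500 trees is averaged; the forest’s test accuracy is typically higher and more stable.

```python
# Minimal sketch: a single CART model versus a random forest, fit with
# scikit-learn. The synthetic dataset and hyperparameters are
# illustrative assumptions, not taken from the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 20 explanatory variables define the p-dimensional
# space that the tree recursively partitions.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A single tree: grown until each leaf is sufficiently "pure".
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A random forest: averages many de-correlated trees, each grown on a
# bootstrap sample using a random subset of predictors at each split.
forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)

print(f"single tree test accuracy:   {tree.score(X_test, y_test):.3f}")
print(f"random forest test accuracy: {forest.score(X_test, y_test):.3f}")
```

The variance-reduction argument above is exactly why n_estimators is set large here: each additional de-correlated tree nudges the averaged prediction toward a stable value, just as a larger \(n\) shrinks \(\frac{\sigma}{\sqrt{n}}\).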

Q: What is the bias-variance trade-off in machine learning?
A:

  • Bias refers to the error introduced when a model for a complex problem is oversimplified: it makes strong assumptions and so misses important relationships in the data.
  • Variance error is the variability of the learned target function’s form across different training sets. A model with low variance will change little if a couple of samples in the training set are replaced; a model with high variance can be affected by even small changes to the training set. High-variance models fit the training data too well, learning the noise in addition to the inherent patterns in the data (the sketch following this list illustrates both kinds of error).
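
The trade-off is easy to see empirically. Below is a small sketch, again using scikit-learn under assumed settings (the dataset, the injected label noise, and the depth grid are all illustrative choices): a depth-1 tree is too simple to capture the signal (high bias), while a very deep tree chases the label noise (high variance), so training accuracy climbs toward 1 while test accuracy levels off or falls.

```python
# Minimal sketch of the bias-variance trade-off using decision trees of
# increasing depth. The synthetic dataset (with deliberate label noise
# via flip_y) and the depth grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.15 randomly mislabels ~15% of samples, so a model that fits
# the training data perfectly is necessarily learning noise.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.15,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=7
)

for depth in (1, 3, 5, 10, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=7)
    model.fit(X_train, y_train)
    print(f"depth={depth:2d}  "
          f"train acc={model.score(X_train, y_train):.3f}  "
          f"test acc={model.score(X_test, y_test):.3f}")
```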