Statistical Learning

Statistical learning refers to a set of procedures in which we attempt to identify a relationship between a set of predictors (or independent variables) and some outcome (or dependent variable). Typically, we seek to represent the relationship between the input and output variables mathematically through a function, f, where f represents the systematic information that X provides about Y. A typical relationship can be written as follows:

Y = f(X) + ϵ

where f(X) is some function that describes the relationship between a set of independent variables, X, and a dependent variable, Y. In the above equation, ϵ is an error term that reflects the discrepancy typically observed between the function, f, and the observed dependent variable, Y.
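As a minimal sketch of this setup, the Python snippet below simulates data from Y = f(X) + ϵ. The quadratic form of f, the sample size, and the noise level are all illustrative assumptions, not part of the article.

    import numpy as np

    rng = np.random.default_rng(0)

    # A hypothetical "true" relationship f(X); the quadratic form is an
    # illustrative assumption.
    def f(x):
        return 2.0 + 0.5 * x**2

    x = rng.uniform(0, 10, size=100)      # independent variable X
    epsilon = rng.normal(0, 1, size=100)  # irreducible error term ϵ
    y = f(x) + epsilon                    # observed outcome Y = f(X) + ϵ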

Why Estimate f?

In general, we are interested in estimating f for one of two reasons: prediction or inference. By knowing f (or at least having a good estimate of f), we are able to make predictions about Y given a set of input variables, X. This is particularly useful when Y has not yet been observed. For example, given a patient's age, sex, race, total cholesterol, HDL cholesterol, LDL cholesterol, systolic blood pressure, history of diabetes, smoking history, and medication history, we can predict whether or not the patient is likely to experience a cardiac event at some point in the future [1]. Alternatively, we may not be interested in making predictions given a set of input variables but rather in describing the relationship between a set of input variables and an output variable. That is, we want to answer the question: is variable X related to variable Y? If so, we would like to characterize this relationship in terms of its direction and magnitude. In other words, is X positively or negatively associated with Y, and by how much will a unit change in X change Y?

Prediction

In many cases in medicine, we may be able to observe X but not Y. This is particularly the case when Y is a future outcome that cannot be observed along with X, as in the risk prediction example above. In such cases we are interested in predicting whether or not Y is likely to happen given a set of observed variables, X. Alternatively, we may be interested in estimating the value of Y directly given a set of input variables, X, which is commonly referred to as regression. We can summarize the task of prediction mathematically as follows:


Ŷ = f̂(X)

where f̂ represents our estimate of f, and Ŷ represents the resulting prediction for Y. In the prediction setting, we are often not concerned with the exact functional form of f̂, provided it produces accurate predictions of Y.
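To make this concrete, here is a minimal prediction sketch using scikit-learn's LinearRegression. The library choice, the single-feature training data, and the variable names are assumptions for illustration; the article does not prescribe any particular method.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training data where both X and Y were observed (values are made up).
    X_train = np.array([[50.0], [60.0], [70.0], [80.0]])  # e.g. age in years
    y_train = np.array([120.0, 130.0, 145.0, 150.0])      # e.g. systolic blood pressure

    model = LinearRegression().fit(X_train, y_train)  # estimate f̂ from training data
    y_hat = model.predict(np.array([[65.0]]))         # Ŷ = f̂(X) for a new, unseen X
    print(y_hat)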

Inference

Another reason we may be interested in estimating f is so that we can make inferences about the relationship between X and Y. In this setting, we are not particularly interested in making accurate predictions of Y; rather, we are interested in discovering and describing possible relationships between X and Y. That is, we are interested in testing whether or not an association exists between a set of independent variables and some dependent variable. This is often useful when trying to understand causal relationships between X and Y. However, it should be noted that establishing causality is often difficult, particularly with observational study designs. As a result, relationships observed between X and Y should be interpreted with caution.
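A minimal sketch of the inference setting, fitting an ordinary least squares model with statsmodels on simulated data. The simulated association (slope of 1.5) and the choice of statsmodels are illustrative assumptions; the point is that the fitted coefficients and p-values, not the predictions, are the quantities of interest.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 3.0 + 1.5 * x + rng.normal(size=200)  # simulated positive association

    X = sm.add_constant(x)    # add an intercept term to the design matrix
    fit = sm.OLS(y, X).fit()  # ordinary least squares
    print(fit.params)         # direction and magnitude of the association
    print(fit.pvalues)        # evidence that the association is not due to chance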

Estimating f

The overall goal of statistical learning is to identify a learning method to estimate f. In general, there are two approaches: parametric and non-parametric. Parametric approaches assume the relationship between X and Y takes on a particular functional form. For example, we may assume the relationship between a particular variable and some outcome is linear. In this case, we assume the relationship between X and Y takes the following form:


f(X) = β0 + β1X1 + β2X2 + ... + βpXp

where β0, β1, ..., βp are model parameters that must be estimated from training data. This approach is referred to as parametric since it reduces the problem of estimating f to one of estimating a set of parameters (β0, β1, ..., βp).
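For illustration, a minimal parametric fit in NumPy, where estimating f amounts to estimating the coefficients by least squares. The true coefficient values, the number of predictors, and the sample size are made-up assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = 4.0 + X @ beta_true + rng.normal(size=n)

    # Estimating f reduces to estimating the parameters (β0, β1, ..., βp):
    design = np.column_stack([np.ones(n), X])              # prepend an intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)  # least squares estimates
    print(beta_hat)  # approximately [4.0, 1.0, -2.0, 0.5]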

In contrast to the parametric methods described above, non-parametric methods do not make assumptions about the functional form of f. This may be advantageous in cases where our assumption about the relationship between X and Y is badly off. However, by not making assumptions about the functional form of f, a larger number of observations is typically required to accurately estimate f compared to parametric methods.
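As one non-parametric example, k-nearest neighbors regression predicts a new point's outcome by averaging the outcomes of its closest training points, with no assumed form for f. The scikit-learn implementation, the sine-shaped data, and k = 5 are all illustrative assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, size=200).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(0, 0.1, size=200)  # a non-linear f

    # No functional form is assumed: each prediction is the mean outcome
    # of the 5 nearest training observations.
    knn = KNeighborsRegressor(n_neighbors=5).fit(x, y)
    print(knn.predict([[2.5]]))  # close to sin(2.5) ≈ 0.60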

Model Accuracy and Model Bias

Some of the methods used to model the relationship between X and Y are less flexible (e.g. a simple linear model), while others are highly flexible (e.g. cubic splines). Models that are highly flexible can take on a variety of shapes and can model more complex relationships between X and Y (i.e. non-linear relationships). Depending on the context, less flexible models may be preferable to more flexible models since they are usually easier to interpret and are more useful for making inferences about the relationship between X and Y.
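One way to see this trade-off, under the assumption that polynomial degree is a reasonable stand-in for flexibility (the article itself only names linear models and cubic splines as examples):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=50)
    y = np.sin(x) + rng.normal(0, 0.2, size=50)

    # A rigid model (degree 1) versus a flexible one (degree 10):
    rigid = np.polyfit(x, y, deg=1)      # two coefficients, easy to interpret
    flexible = np.polyfit(x, y, deg=10)  # tracks the non-linear shape, harder to interpret
    print(np.polyval(rigid, 2.5), np.polyval(flexible, 2.5))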

Supervised and Unsupervised Learning

Most statistical learning problems can be classified as either supervised or unsupervised problems. Supervised learning problems arise when both the dependent (Y) and independent (X) variables are observed. We then model the relationship between X and Y by estimating f, with the intention of accurately predicting Y or of better understanding the relationship between X and Y.

Unsupervised learning problems arise when we only observe the independent variables, X, and not Y. It is referred to as unsupervised learning since we do not have the dependent variable, Y, to supervise our analysis. Cluster analysis is an example of an unsupervised learning technique in which we try to uncover patterns within the data by clustering observations on the basis of X.
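A small clustering sketch using scikit-learn's KMeans. The two simulated patient groups (loosely, pairs of blood pressure readings) are a made-up illustration; note that only X is observed and no outcome Y is involved.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(5)
    # Two hypothetical patient groups, observed only through X (no outcome Y).
    group_a = rng.normal(loc=[120, 80], scale=5, size=(50, 2))
    group_b = rng.normal(loc=[160, 100], scale=5, size=(50, 2))
    X = np.vstack([group_a, group_b])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:5], kmeans.labels_[-5:])  # the two groups are recovered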

Regression Versus Classification

Just as most statistical learning problems can be organized as either supervised or unsupervised, we can further categorize most problems as either regression or classification problems. Regression problems arise when the dependent variable, Y, takes on a quantitative value (e.g. length of stay). Classification problems typically arise when the dependent variable takes on qualitative values (e.g. whether or not a patient experiences a cardiac event in the next 10 years, yes/no). However, the distinction between regression and classification is not always clear, as is the case when estimating risk or the probability of some event happening in the future.
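A sketch of the classification setting, again using scikit-learn as an assumed library. Note how predict_proba returns an estimated probability, the quantitative output that blurs the regression/classification distinction mentioned above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 1))
    y = (X.ravel() + rng.normal(0, 0.5, size=200) > 0).astype(int)  # qualitative yes/no outcome

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1.0]]))        # classification: a predicted class (0 or 1)
    print(clf.predict_proba([[1.0]]))  # estimated probability of each class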

References

1. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY, USA: Springer New York Inc.; 2001. (Springer Series in Statistics).