Bridging the Gap: From Statistical Distributions to Machine Learning Loss Functions

When I consulted for researchers on which statistical analysis to use for their data, a common first step was to think about the distribution of the target variable. Is it a count, like the number of emails received within an hour? Poisson distribution it is. You can also mix, adapt, and customize your distribution.
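To make the distribution-to-loss step concrete, here is a minimal sketch of how the Poisson choice becomes a trainable loss, namely the Poisson negative log-likelihood (the helper `poisson_nll` and the toy numbers are illustrative, not from any particular library):

```python
import numpy as np

def poisson_nll(y, mu):
    """Poisson negative log-likelihood, up to the constant log(y!) term.

    Minimizing this over the predicted rates mu is equivalent to
    maximum-likelihood fitting of a Poisson model for count targets.
    """
    # Full NLL is mu - y*log(mu) + log(y!); log(y!) is constant in mu,
    # so it is dropped for optimization.
    return np.mean(mu - y * np.log(mu))

# Toy example: hourly email counts and the rates a model predicts for them.
y = np.array([3, 0, 7, 2])
mu = np.array([2.5, 0.8, 6.0, 2.2])  # predicted rates, must be positive
print(poisson_nll(y, mu))
```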
Thank you for this well-written and insightful article. Could you please point me to a reference that derives the connection between the data distribution and the loss function?
Proper scoring rules (see e.g. https://www.bundesbank.de/resource/blob/635562/7d3de0f3fc003e5b4864828143f268cf/mL/2012-06-01-eltville-11-gneiting-paper-data.pdf) are another useful concept related to the above: they link statistical functionals (mean, median, etc.) with loss functions. For example, forecasts of the conditional mean of a distribution can be assessed with a range of loss functions beyond mean squared error, each with different sensitivities to over- and under-prediction. One such function is the QLIKE loss, y/x - log(y/x) - 1 (x = forecast, y = observed), which is popular in the volatility-forecasting literature.
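A minimal sketch of the QLIKE loss exactly as defined in the comment above (the function name `qlike` and the toy values are illustrative):

```python
import numpy as np

def qlike(y, x):
    """QLIKE loss: y/x - log(y/x) - 1, with x = forecast, y = observed.

    It is zero exactly when x == y, and it penalizes under-prediction
    (x < y) more heavily than over-prediction at the same ratio.
    """
    r = y / x
    return np.mean(r - np.log(r) - 1)

# Toy check with positive values (e.g., realized vs. forecast variance).
y = np.array([1.2, 0.9, 2.0])
print(qlike(y, y))        # perfect forecast -> 0.0
print(qlike(y, 0.5 * y))  # under-prediction: r = 2,   loss ~ 0.307
print(qlike(y, 2.0 * y))  # over-prediction:  r = 0.5, loss ~ 0.193
```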
Something worth mentioning is that maximizing the likelihood assumes the prior distribution (that of the parameter) is uniform or irrelevant. By Bayes' theorem, the posterior is proportional to the product of the likelihood and the prior. Sometimes, when we know enough about the prior, it is worth maximizing the posterior instead, which can give regularisation out of the box. For example, in the linear regression problem, if we assume a Gaussian prior on the weights, we get the same loss plus an L2 regularisation term.
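To spell out that last step, here is the standard derivation (a sketch assuming Gaussian noise with variance sigma^2 and a zero-mean Gaussian prior with variance tau^2 on the weights w):

```latex
-\log p(w \mid y, X)
  = -\log p(y \mid X, w) - \log p(w) + \text{const}
  = \frac{1}{2\sigma^2}\lVert y - Xw \rVert^2
    + \frac{1}{2\tau^2}\lVert w \rVert^2 + \text{const}.
```

Minimizing this is exactly ridge regression: squared error plus an L2 penalty with weight lambda = sigma^2 / tau^2.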