When I consulted researchers on which statistical analysis to use for their data, a common first step was to think about the distribution of the target variable: Is it a count, like the number of emails received within an hour? Poisson distribution it is.

Thank you for this well-written and insightful article. Could you please point me to a reference that derives the connection between the data distribution and the loss function?

Something worth mentioning is that maximizing the likelihood assumes the prior distribution (that of the parameter) is uniform or irrelevant. By Bayes’ theorem, the posterior is proportional to the product of the likelihood and the prior. Sometimes, when we know enough about the prior, it’s worth maximizing the posterior instead (MAP estimation), which gives regularisation out of the box. For example, in the linear regression problem, if we assume a zero-mean Gaussian prior on the coefficients, then we get the same squared-error loss plus an L2 regularisation term.
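To make the Gaussian-prior example concrete, here is a minimal sketch for a one-parameter regression without intercept. It assumes y_i ~ Normal(w * x_i, sigma^2) and a prior w ~ Normal(0, tau^2); the function names, the simulated data, and the values of sigma and tau are all illustrative assumptions, not something from the article. Under these assumptions the negative log posterior is the squared-error loss plus an L2 penalty with strength sigma^2 / tau^2, so a brute-force MAP search should land on the ridge solution.

```python
import random


def neg_log_posterior(w, xs, ys, sigma, tau):
    """Negative log posterior of the slope w, up to an additive constant.

    Assumes y_i ~ Normal(w * x_i, sigma^2) and a prior w ~ Normal(0, tau^2).
    """
    neg_log_likelihood = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma**2)
    neg_log_prior = w**2 / (2 * tau**2)
    return neg_log_likelihood + neg_log_prior


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w * x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Simulated data; the true slope 1.5 and the scales below are arbitrary choices.
random.seed(0)
sigma, tau = 1.0, 0.5
xs = [random.uniform(-2, 2) for _ in range(50)]
ys = [1.5 * x + random.gauss(0, sigma) for x in xs]

# The Gaussian prior corresponds to an L2 penalty of strength sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_ridge = ridge_slope(xs, ys, lam)

# Brute-force MAP estimate on a grid with spacing 0.001.
grid = [i / 1000 for i in range(-3000, 3001)]
w_map = min(grid, key=lambda w: neg_log_posterior(w, xs, ys, sigma, tau))

print(f"ridge slope: {w_ridge:.4f}, grid MAP slope: {w_map:.4f}")
```

The two estimates should agree up to the grid resolution, and both are shrunk towards zero relative to the unpenalized least-squares slope (`ridge_slope(xs, ys, 0.0)`).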

Proper scoring rules are another useful concept related to the above (see e.g. https://www.bundesbank.de/resource/blob/635562/7d3de0f3fc003e5b4864828143f268cf/mL/2012-06-01-eltville-11-gneiting-paper-data.pdf): they link statistical functionals (mean, median, etc.) with loss functions. For example, forecasts of the conditional mean of the distribution can be assessed with a range of loss functions beyond the squared error, each with a different sensitivity to over- and underprediction. One such function is the QLIKE loss, y/x - log(y/x) - 1 (x = forecast, y = observed), which is popular in the volatility forecasting literature.
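A small numerical check of the QLIKE claim, as a sketch only: the gamma-distributed sample, the grid, and the helper names are assumptions made for illustration. It searches for the constant forecast that minimizes the average loss and compares it with the sample mean, for both QLIKE and the squared error, and then shows that QLIKE penalizes under- and overprediction differently.

```python
import math
import random


def qlike(forecast, observed):
    """QLIKE loss y/x - log(y/x) - 1 with x = forecast, y = observed (both > 0)."""
    ratio = observed / forecast
    return ratio - math.log(ratio) - 1.0


def squared_error(forecast, observed):
    return (observed - forecast) ** 2


def mean_loss(forecast, observations, loss):
    """Average loss of a single constant forecast over all observations."""
    return sum(loss(forecast, y) for y in observations) / len(observations)


# Positive, skewed sample (illustrative choice: gamma with shape 2, scale 1.5).
random.seed(1)
ys = [random.gammavariate(2.0, 1.5) for _ in range(2000)]
sample_mean = sum(ys) / len(ys)

# Candidate constant forecasts from 0.5 to 8.0 in steps of 0.01.
grid = [i / 100 for i in range(50, 801)]
best_qlike = min(grid, key=lambda x: mean_loss(x, ys, qlike))
best_mse = min(grid, key=lambda x: mean_loss(x, ys, squared_error))

print(f"sample mean:        {sample_mean:.3f}")
print(f"QLIKE minimizer:    {best_qlike:.2f}")
print(f"squared-error min.: {best_mse:.2f}")

# Same observation, forecast off by a factor of 2 in each direction.
print(f"QLIKE, forecast too low  (x=0.5, y=1): {qlike(0.5, 1.0):.3f}")
print(f"QLIKE, forecast too high (x=2.0, y=1): {qlike(2.0, 1.0):.3f}")
```

Both losses should pick (up to the grid resolution) the sample mean as the best constant forecast, while QLIKE charges more for forecasting half the observed value than for forecasting double.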

You can also mix, adapt and customize your distribution.

Yes, that's true. But loss functions still give more flexibility, because they allow you to leave the realm of distributions.
