Lorem ipsum dolor sit amet, consectetur adicing elit ut ullamcorper. leo, eget euismod orci. Cum sociis natoque penati bus et magnis dis.Proin gravida nibh vel velit auctor aliquet. Leo, eget euismod orci. Cum sociis natoque penati bus et magnis dis.Proin gravida nibh vel velit auctor aliquet.

  /  Project   /  Blog: Confidence Intervals Without Pain

Blog: Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition we propose a mechanism to sharpen the confidence intervals, to reduce their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non identically distributed, non normal, and even non stationary.

No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers.

1. Principle

We have tested our methodology in cases that are challenging when using traditional methods, such as a non-zero correlation coefficient for non-normal bi-variate data. Our technique is based on re-sampling and on the following, new fundamental theorem:

Theorem: The width L of any confidence interval is asymptotically equal (as n tends to infinity) to a power function of n, namely L = A / n^B where A and B are two positive constants depending on the data set, and n is the sample size.

The standard (textbook) case is when B = 1/2. Typically, B is a function of the intrinsic, underlying variance attached to each observation, in your data set. Any value of B larger than 1/2 results in confidence intervals converging faster than what traditional techniques are able to offer. In particular, it is possible to obtain B = 1.

Our new technique is described in details (with source code, spreadsheet and illustrations) in our previous article, here. When reading that article, you may skip section, 1, and focus on section 2, and especially section 3, where all the results are presented.

2. Example

Below is an illustration for the correlation coefficient using a bi-variate artificial data set that simulates somewhat random observations.

The resulting value of B is 0.46. The boosting technique (to improve B) has not been used here, but it is described and illustrated in section 3.3, in the article in question. The power function mentioned in our above theorem, fitted to this particular data set, is represented by the red, dotted curve, in the top chart below.

The simulated data set was built as follows:

3. About the Constant A

The constant A attached to the power function (see theorem), is related to the intrinsic variance present at the individual observation level, in your data. We provide its value for common estimators, in the ideal case when observations are independently and identically distributed with an underlying normal distribution. These values can still provide a good approximation in the general case.

The formula for the median is a particular case of the p-th quantile with p = 0.5. All the values in the second column represent the unknown theoretical value of the quantities involved. In practice, these values are replaced by their estimates computed on the data set. These estimates converge to the true values, so using one or another does not matter, as far as the correctness of the table is concerned.

The exact formula for the correlation when it is not zero and the normal assumption is not satisfied, is very complicated. See for instance this article. By contrast, dealing with any correlation (or even more complicated estimators such as regression parameters) is just as easy as dealing with the mean, if you use our methodology. Even if none of the standard assumptions is satisfied.

Finally, A, B and n provide more useful information about your estimator, than p-values.

To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.

Source: Artificial Intelligence on Medium

(Visited 4 times, 1 visits today)
Post a Comment