> From: Paul E. McKenney <paulmck@xxxxxxxxxx>
> [...]
> >
> > a's standard deviation is ~0.4.
> > b's standard deviation is ~0.5.
> >
> > a's average 9.0 is at the upper bound of b's one-standard-deviation
> > range [8.0, 9.0].
> > So, the measurements should be statistically significant to some degree.
>
> That single standard deviation means that you have 68% confidence that the
> difference is real.  This is not far above the 50% level of random noise.
> 95% is the lowest level that is normally considered to be statistically
> significant.

95% means there is no overlap between two standard deviations of a and two
standard deviations of b.  That relies on either much less noise during
testing or a big enough difference between a and b.

> > The calculated standard deviations are via:
> > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
>
> Fair enough.  Formulas are readily available as well, and most spreadsheets
> support standard deviation.
>
> [...]
>
> > > Why don't you try applying this approach to the new data?  You will
> > > need the general binomial formula.
> >
> > Thank you Paul for the suggestion.
> > I just tried it, but I'm not sure whether my analysis was correct ...
> >
> > Analysis 1:
> >   a's median is 8.9.
>
> I get 8.95, which is the average of the 24th and 25th members of a in
> numerical order.

Yes, it should be 8.95.  Thanks for correcting me.

> >   35 of b's 48 data points are below (a's median - 0.1).
> >   For a binomial distribution with p = 0.5, P(X >= 35) = 0.1%.
> >   So, we have strong confidence that b is 100ms faster than a.
>
> I of course get quite a bit stronger confidence, but your 99.9% is good
> enough.  And I get even stronger confidence going in the other direction.
> However, the fact that a's median varies from 8.7 in the old experiment to
> 8.95 in this experiment does give some pause.  These are after all
> supposedly drawn from the same distribution.  Or did you use a different
> machine or different OS version or some such in the two sets of
> measurements?  Different time of day and thus different ambient temperature,
> thus different CPU clock frequency?

All the testing setups were identical except for the testing time.

    Old a median: 8.7     New a median: 8.95
    Old b median: 8.2     New b median: 8.45

I'm a bit surprised that both new medians are exactly 0.25 greater than the
old medians.  Coincidence?

> Assuming identical test setups, let's try the old value of 8.7 from old a
> to new b.  There are 14 elements in new b greater than 8.6, for a
> probability of 0.17%, or about 98.3% significance.  This is still OK.
>
> In contrast, the median of the old b is 8.2, which gives extreme confidence.
> So let's be conservative and use the large-set median.
>
> In real life, additional procedures would be needed to estimate the
> confidence in the median, which turns out to be nontrivial.

Luckily, in this case I could simply pick the medians from the data in
numerical order. ;-)

> When I apply this sort of technique, I usually have all data from each
> sample being on one side of the median of the other, which simplifies
> things. ;-)

I like it when all the data points are on one side of the median of the
other. ;-)  But this also relies on either much less noise during testing or
a big enough difference between a and b, right?

> The easiest way to estimate bounds on the median is to "bootstrap", but
> that works best if you have 1000 samples and can randomly draw 1000
> sub-samples, each of size 10, from the larger sample and compute the median
> of each.  You can sort these medians and obtain a cumulative distribution.

Good to know "bootstrap".
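If I understood the procedure correctly, it would look roughly like the
little Python sketch below.  (The boot_times data here are made up purely for
illustration, and drawing the sub-samples with replacement is my assumption
of the usual bootstrap convention; please correct me if you meant something
else.)

import random
import statistics

def bootstrap_medians(samples, n_draws=1000, subsample_size=10):
    """Draw n_draws random sub-samples (with replacement), compute the
    median of each, and return the sorted medians, which form an
    empirical cumulative distribution of the median."""
    medians = []
    for _ in range(n_draws):
        sub = random.choices(samples, k=subsample_size)
        medians.append(statistics.median(sub))
    return sorted(medians)

# Hypothetical example: 1000 boot-time measurements in seconds.
boot_times = [round(random.gauss(8.9, 0.4), 1) for _ in range(1000)]
medians = bootstrap_medians(boot_times)

# Rough 95% bounds on the median: the 2.5th and 97.5th percentiles.
print(medians[24], medians[974])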
> But you have to have an extremely good reason to collect data from 1000
> boots, and I don't believe we have that good of a reason.

1000 boots?  Oh my ... No. No. I don't have a good reason for that. ;-)

> > Analysis 2:
> >   a's median - 0.4 = 8.9 - 0.4 = 8.5.
> >   24 of b's 48 data points are below (a's median - 0.4).
> >   The probability that a's data points are less than 8.5 is
> >   p = 7/48 = 0.1458.
>
> This is only 85.4% significant, so...
>
> >   For a binomial distribution with p = 0.1458, P(X >= 24) = 0.0%.
> >   So, looks like we have confidence that b is 400ms faster than a.
>
> ...we really cannot say anything about 400ms faster.  Again, you need 95%
> and preferably 99% to really make any sort of claim.  You probably need
> quite a few more samples to say much about 200ms, let alone 400ms.

OK.  Thanks for correcting me.

> Plus, you really should select the speedup and only then take the
> measurements.  Otherwise, you end up fitting noise.
>
> However, assuming identical test setups, you really can calculate the
> median from the full data set.
>
> > The cumulative binomial distributions P(X) were calculated via:
> >
> > https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
>
> The maxima program's binomial() function agrees with it, so good.  ;-)
>
> > I apologize if this analysis/discussion bored some of you. ;-)
>
> Let's just say that it is a lot simpler when you are measuring larger
> differences in data with tighter distributions.  Me, I usually just say
> "no" to drawing any sort of conclusion from data sets that overlap this
> much.  Instead, I might check to see if there are some random events adding
> noise to the boot duration, eliminate that, and hopefully get data that is
> easier to analyze.

Agree.

> But I am good with the 98.3% confidence in a 100ms improvement.
>
> So if Joel wishes to make this point, he should feel free to take both of
> your datasets and use the computation with the worse mean.

Thank you so much, Paul, for your patience and detailed comments.

-Qiuxu
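P.S.  In case it helps anyone following along, here is a small Python sketch
of the binomial tail calculation behind the numbers above (just the standard
formula, nothing specific to our test setups; the counts are the ones from
Analysis 1 and Analysis 2):

from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Analysis 1: 35 of b's 48 points fall below a's median.  Under the null
# hypothesis of equal medians, each point falls below with p = 0.5.
print(binom_tail(48, 35, 0.5))    # ~0.001, i.e. ~0.1%

# Analysis 2: p = 7/48 and P(X >= 24), which is essentially 0, but see
# Paul's caveat about the mere 85.4% significance of p itself.
print(binom_tail(48, 24, 7/48))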