> From: Paul E. McKenney <paulmck@xxxxxxxxxx>
> [...]
> >
> > a's standard deviation is ~0.4.
> > b's standard deviation is ~0.5.
> >
> > a's average 9.0 is at the upper bound of b's one-standard-deviation
> > range [8.0, 9.0].
> > So, the measurements should be statistically significant to some degree.
>
> That single standard deviation means that you have 68% confidence that the
> difference is real.  This is not far above the 50% level of random noise.
> 95% is the lowest level that is normally considered to be statistically
> significant.

95% means there is no overlap between two standard deviations of a and two
standard deviations of b.  That relies on either much less noise during
testing or a big enough difference between a and b.

> > The calculated standard deviations are via:
> > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
>
> Fair enough.  Formulas are readily available as well, and most spreadsheets
> support standard deviation.
>
> [...]
>
> > > Why don't you try applying this approach to the new data?  You will
> > > need the general binomial formula.
> >
> > Thank you Paul for the suggestion.
> > I just tried it, but I'm not sure whether my analysis was correct ...
> >
> > Analysis 1:
> >   a's median is 8.9.
>
> I get 8.95, which is the average of the 24th and 25th members of a in
> numerical order.

Yes, it should be 8.95.  Thanks for correcting me.

> >   35 of b's 48 data points are below (a's median - 0.1).
> >   For a binomial distribution with p = 0.5, P(X >= 35) = 0.1%.
> >   So, we have strong confidence that b is 100ms faster than a.
>
> I of course get quite a bit stronger confidence, but your 99.9% is good
> enough.  And I get even stronger confidence going in the other direction.
> However, the fact that a's median varies from 8.7 in the old experiment to
> 8.95 in this experiment does give some pause.  These are after all
> supposedly drawn from the same distribution.  Or did you use a different
> machine or different OS version or some such in the two sets of
> measurements?  Different time of day and thus different ambient temperature,
> thus different CPU clock frequency?

All the testing setups were identical except for the testing time.

    Old a median: 8.7     New a median: 8.95
    Old b median: 8.2     New b median: 8.45

I'm a bit surprised that both new medians are exactly 0.25 greater than the
old medians.  Coincidence?

> Assuming identical test setups, let's try the old value of 8.7 from old a
> to new b.  There are 14 elements in new b greater than 8.6, for a
> probability of 0.17%, or about 98.3% significance.  This is still OK.
>
> In contrast, the median of the old b is 8.2, which gives extreme confidence.
> So let's be conservative and use the large-set median.
>
> In real life, additional procedures would be needed to estimate the
> confidence in the median, which turns out to be nontrivial.

Luckily, in this case I could simply pick the medians from the data in
numerical order. ;-)

> When I apply this sort of technique, I usually have all data from each
> sample being on one side of the median of the other, which simplifies
> things. ;-)

I like it when all the data points are on one side of the median of the
other. ;-)  But this also relies on either much less noise during testing or
a big enough difference between a and b, right?

> The easiest way to estimate bounds on the median is to "bootstrap", but
> that works best if you have 1000 samples and can randomly draw 1000
> sub-samples, each of size 10, from the larger sample and compute the median
> of each.  You can sort these medians and obtain a cumulative distribution.

Good to know "bootstrap".
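If I understood the procedure correctly, it would look roughly like the
little Python sketch below.  (The boot_times data here are made up purely for
illustration, and drawing the sub-samples with replacement is my assumption
of the usual bootstrap convention; please correct me if you meant something
else.)

import random
import statistics

def bootstrap_medians(samples, n_draws=1000, subsample_size=10):
    """Draw n_draws random sub-samples (with replacement), compute the
    median of each, and return the sorted medians, which form an
    empirical cumulative distribution of the median."""
    medians = []
    for _ in range(n_draws):
        sub = random.choices(samples, k=subsample_size)
        medians.append(statistics.median(sub))
    return sorted(medians)

# Hypothetical example: 1000 boot-time measurements in seconds.
boot_times = [round(random.gauss(8.9, 0.4), 1) for _ in range(1000)]
medians = bootstrap_medians(boot_times)

# Rough 95% bounds on the median: the 2.5th and 97.5th percentiles.
print(medians[24], medians[974])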
> But you have to have an extremely good reason to collect data from 1000
> boots, and I don't believe we have that good of a reason.

1000 boots?  Oh my ... No. No. I don't have a good reason for that. ;-)

> > Analysis 2:
> >   a's median - 0.4 = 8.9 - 0.4 = 8.5.
> >   24 of b's 48 data points are below (a's median - 0.4).
> >   The probability that a's data points are less than 8.5 is
> >   p = 7/48 = 0.1458.
>
> This is only 85.4% significant, so...
>
> >   For a binomial distribution with p = 0.1458, P(X >= 24) = 0.0%.
> >   So, looks like we have confidence that b is 400ms faster than a.
>
> ...we really cannot say anything about 400ms faster.  Again, you need 95%
> and preferably 99% to really make any sort of claim.  You probably need
> quite a few more samples to say much about 200ms, let alone 400ms.

OK.  Thanks for correcting me.

> Plus, you really should select the speedup and only then take the
> measurements.  Otherwise, you end up fitting noise.
>
> However, assuming identical test setups, you really can calculate the
> median from the full data set.
>
> > The cumulative binomial distributions P(X) were calculated via:
> >
> > https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
>
> The maxima program's binomial() function agrees with it, so good.  ;-)
>
> > I apologize if this analysis/discussion bored some of you. ;-)
>
> Let's just say that it is a lot simpler when you are measuring larger
> differences in data with tighter distributions.  Me, I usually just say
> "no" to drawing any sort of conclusion from data sets that overlap this
> much.  Instead, I might check to see if there are some random events adding
> noise to the boot duration, eliminate that, and hopefully get data that is
> easier to analyze.

Agree.

> But I am good with the 98.3% confidence in a 100ms improvement.
>
> So if Joel wishes to make this point, he should feel free to take both of
> your datasets and use the computation with the worse mean.

Thank you so much, Paul, for your patience and detailed comments.

-Qiuxu
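P.S.  In case it helps anyone following along, here is a small Python sketch
of the binomial tail calculation behind the numbers above (just the standard
formula, nothing specific to our test setups; the counts are the ones from
Analysis 1 and Analysis 2):

from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Analysis 1: 35 of b's 48 points fall below a's median.  Under the null
# hypothesis of equal medians, each point falls below with p = 0.5.
print(binom_tail(48, 35, 0.5))    # ~0.001, i.e. ~0.1%

# Analysis 2: p = 7/48 and P(X >= 24), which is essentially 0, but see
# Paul's caveat about the mere 85.4% significance of p itself.
print(binom_tail(48, 24, 7/48))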