On Wed, Jun 03, 2020 at 08:05:48AM +0900, Akira Yokosawa wrote:
> On Tue, 2 Jun 2020 08:28:09 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 02, 2020 at 11:27:37PM +0900, Akira Yokosawa wrote:
> >> On Mon, 1 Jun 2020 16:45:45 -0700, Paul E. McKenney wrote:
> >>> On Tue, Jun 02, 2020 at 07:51:31AM +0900, Akira Yokosawa wrote:
> >>>> On Mon, 1 Jun 2020 09:13:49 -0700, Paul E. McKenney wrote:
> >>>>> On Tue, Jun 02, 2020 at 12:10:06AM +0900, Akira Yokosawa wrote:
> >>>>>> On Sun, 31 May 2020 18:18:38 -0700, Paul E. McKenney wrote:
> >>>>>>> On Mon, Jun 01, 2020 at 08:11:06AM +0900, Akira Yokosawa wrote:
> >>>>>>>> On Sun, 31 May 2020 09:50:23 -0700, Paul E. McKenney wrote:
> >>>>>>>>> On Sun, May 31, 2020 at 09:30:44AM +0900, Akira Yokosawa wrote:
> >>>>>>>>>> Hi Paul,
> >>>>>>>>>>
> >>>>>>>>>> This is misc updates in response to your recent updates.
> >>>>>>>>>>
> >>>>>>>>>> Patch 1/3 treats QQZ annotations for "nq" build.
> >>>>>>>>>
> >>>>>>>>> Good reminder, thank you!
> >>>>>>>>>
> >>>>>>>>>> Patch 2/3 adds a paragraph in #9 of FAQ.txt. The wording may need
> >>>>>>>>>> your retouch for fluency.
> >>>>>>>>>> Patch 3/3 is an independent improvement of runlatex.sh. It will avoid
> >>>>>>>>>> a few redundant runs of pdflatex when you have some typo in labels/refs.
> >>>>>>>>>
> >>>>>>>>> Nice, queued and pushed, thank you!
> >>>>>>>>>
> >>>>>>>>>> Another suggestion to Figures 9.25 and 9.29.
> >>>>>>>>>> Wouldn't these graphs look better with log scale x-axis?
> >>>>>>>>>>
> >>>>>>>>>> X range can be 0.001 -- 10.
> >>>>>>>>>>
> >>>>>>>>>> You'll need to add a few data points in sub-microsecond critical-section
> >>>>>>>>>> duration to show plausible shapes in those regions, though.
> >>>>>>>>>
> >>>>>>>>> I took a quick look and didn't find any nanosecond delay primitives
> >>>>>>>>> in the Linux kernel, but yes, that would be nicer looking.
> >>>>>>>>>
> >>>>>>>>> I don't expect to make further progress on this particular graph
> >>>>>>>>> in the immediate future, but if you know of such a delay primitive,
> >>>>>>>>> please don't keep it a secret! ;-)
> >>>>>>>>
> >>>>>>>> I find ndelay() defined in include/asm-generic/delay.h.
> >>>>>>>> I'm not sure if it works as you would expect, though.
> >>>>>>>
> >>>>>>> I must be going blind, given that I missed that one!
> >>>>>>
> >>>>>> :-) :-)
> >>>>>>
> >>>>>>> I did try it out, and it suffers from about 10% timing errors. In
> >>>>>>> contrast, udelay is usually less than 1%.
> >>>>>>
> >>>>>> You mean udelay(1)'s error is less than 10ns, whereas ndelay(1000)'s
> >>>>>> error is about 100ns?
> >>>>>
> >>>>> Yuck. The 10% was a preliminary eyeballing. An overnight run showed it
> >>>>> to be worse than that. 100ns gets me about 130ns, 200ns gets me about
> >>>>> 270ns, and 500ns gets me about 600ns. So ndelay() is useful only for
> >>>>> very short delays.
> >>>>
> >>>> To compensate for the error, how about doing the appended?
> >>>> Yes, this is kind of ugly...
> >>>>
> >>>> Another point you should be aware of. It looks like arch/powerpc
> >>>> does not have __ndelay defined. Which means ndelay() would cause
> >>>> a build error. Still, I might be missing something.
> >>>
> >>> That is quite clever! It does turn ndelay(1) into ndelay(0), but it
> >>> probably costs more than a nanosecond to do the integer division, so
> >>> that shouldn't be a problem.
> >>>
> >>> However, I believe that any such compensatory schemes should be done
> >>> within ndelay() rather than by its users.
> >>
> >> I'm not brave enough to change the behavior of ndelay() seeing the
> >> number of call sites in kernel code base, especially under drivers/.
> >>
> >> Looking at the updated Figures 9.25 and 9.29, the timing error of
> >> ndelay() results in the discrepancy of "rcu" plots from the ideal
> >> orthogonal lines in sub-microseconds regions (0.1, 0.2, and 0.5us).
> >> I don't think you like such misleading plots.
> >>
> >> You could instead compensate the x-values you give to ndelay().
> >>
> >> On x86, you know the resolution of xdelay() is 1.164153ns.
> >> Which means if you want a time delay of 100ns, ndelay(86) will
> >> be 100.117ns.
> >> ndelay(172) will be 200.234ns and ndelay(429) will be 499.422ns.
> >> ndelay(430) will be 500.586ns, which is the 2nd closest.
> >> If you don't want to fall short of 500ns, ndelay(430) would be your choice.
> >>
> >> I think this level of tweak is worthwhile, especially as it will
> >> result in a better looking plot of RCU scaling.
> >>
> >> Thoughts?
> >
> > Huh.
> >
> > What we could do is to do a calibration pass where we sample a
> > fine-grained timesource, spin on a series of ndelay() calls that last for
> > a few microseconds, then resample the fine-grained timestamp. We could
> > then do a binary search so as to compute a corrected ndelay argument.
> > We would then need to verify the corrected argument.
> >
> > This procedure would be architecture independent, and might also account
> > for instruction-stream differences.
>
> This calibration part could be implemented and tested on a small system,
> assuming you have sub-microsecond ndelay() and fine-grained timer.

Just to be clear, my thought is to do a short calibration cycle on the
system running the actual test as part of refperf initialization.

> For example, powerpc I mentioned earlier uses the fallback definition
> in linux/delay.h:
>
> #ifndef ndelay
> static inline void ndelay(unsigned long x)
> {
> 	udelay(DIV_ROUND_UP(x, 1000));
> }
> #define ndelay(x) ndelay(x)
> #endif

Indeed, any calibration would need to be careful of this!

> > Is there a better way? Seems like there should be. ;-)
>
> There may be someone who has already done a similar thing.

Quite possibly.

Thanx, Paul

> Thanks, Akira
>
> >
> > Thanx, Paul
> >
> >> PS: The bumps in Figures 9.25 and 9.29 in the sub-microsecond region
> >> might be the effect of difference of instruction stream.
> >> As we have seen in Figure 9.22, slight changes in the code path,
> >> e.g. jump target alignment, can cause 10% -- 20% of performance
> >> difference.
> >>
> >> Enforcing inlining of un_delay() might or might not help. Just guessing.
> >>
> >>
> >>> Plus, as you imply, different
> >>> architectures might need different adjustments. My concern is that
> >>> different CPU generations within a given architecture might also need
> >>> different adjustments. :-(
> >>>
> >>> Thanx, Paul
> >>>
> >>>> Thanks, Akira
> >>>>
> >>>> diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c
> >>>> index 5db165ecd465..0a3764ea220c 100644
> >>>> --- a/kernel/rcu/refperf.c
> >>>> +++ b/kernel/rcu/refperf.c
> >>>> @@ -122,7 +122,7 @@ static void un_delay(const int udl, const int ndl)
> >>>>  	if (udl)
> >>>>  		udelay(udl);
> >>>>  	if (ndl)
> >>>> -		ndelay(ndl);
> >>>> +		ndelay((ndl * 859) / 1000); // 5 : 2^32/1000000000 (4.295)
> >>>>  }
> >>>>
> >>>>  static void ref_rcu_read_section(const int nloops)
> >>>>
> >>>>
> >>>>
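
For illustration only, the calibration pass proposed in the thread
(measure, then binary-search for a corrected ndelay() argument) might be
sketched in user-space C roughly as follows. None of this is code from
refperf: calibrate_ndelay(), measure_fn, and model_measure_ps() are
hypothetical names. The model is an assumption standing in for a real
timestamp measurement; it reuses the x86 scaling quoted in the thread,
about 1.164153ns per unit, i.e. 5 / (2^32 / 10^9). The search also
assumes that the measured delay never decreases as the argument grows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: observed delay, in picoseconds, for one argument. */
typedef uint64_t (*measure_fn)(unsigned long arg);

/*
 * Assumed stand-in for a real measurement: models ndelay(x) lasting
 * x * 5 / (2^32 / 10^9) ns, i.e. x * 5 * 10^12 / 2^32 ps.
 */
static uint64_t model_measure_ps(unsigned long arg)
{
	return ((uint64_t)arg * 5000000000000ULL) >> 32;
}

static uint64_t abs_diff(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/*
 * Binary-search for the argument whose measured delay is closest to
 * target_ns.  Assumes measure() is non-decreasing in its argument and
 * eventually reaches the target.
 */
static unsigned long calibrate_ndelay(unsigned long target_ns,
				      measure_fn measure)
{
	uint64_t target_ps = (uint64_t)target_ns * 1000;
	unsigned long lo = 0;
	unsigned long hi = target_ns ? target_ns : 1;

	assert(measure != NULL);

	/* Grow the upper bound until it delays at least as long as wanted. */
	while (measure(hi) < target_ps)
		hi *= 2;

	/* Narrow down to adjacent arguments that bracket the target. */
	while (hi - lo > 1) {
		unsigned long mid = lo + (hi - lo) / 2;

		if (measure(mid) < target_ps)
			lo = mid;
		else
			hi = mid;
	}

	/* Pick whichever neighbor lands closer to the target. */
	if (abs_diff(measure(lo), target_ps) <= abs_diff(measure(hi), target_ps))
		return lo;
	return hi;
}
```

With this model, calibrate_ndelay(100, model_measure_ps) selects 86, and
targets of 200 and 500 select 172 and 429, matching the figures quoted
above. In a kernel setting the callback would instead time a series of
real ndelay() calls against a fine-grained clock, and the result would
still need the verification step mentioned in the thread.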