On 2022-01-27 00:29:37 [+0100], Mario Kleiner wrote:
> Hi, first thank you for implementing these preempt disables according to

Hi Mario,

> the markers I left long ago. And sorry for the rather late reply.
>
> I had a look at the code, as of Linux 5.16, and did also a little test
> run (of a standard kernel, not with PREEMPT_RT, only
> CONFIG_PREEMPT_VOLUNTARY=y) on my Intel Kabylake GT2, so some thoughts:
>
> > The area covers only register reads and writes. The part that worries
> > me is:
> > - __intel_get_crtc_scanline() the worst case is 100us if no match is
> >   found.
>
> This one can be a problem indeed on (maybe all?) modern Intel GPUs since
> Haswell, i.e. the last ~10 years. I was able to reproduce it on my
> Kabylake Intel GPU.
>
> Most of the time that for-loop with up to 100 repetitions (~ 100 x
> udelay(1) + one mmio register read) (cf.
> https://elixir.bootlin.com/linux/v5.17-rc1/source/drivers/gpu/drm/i915/i915_irq.c#L856)
> will not execute, because most of the time that function gets called
> from the vblank irq handler and then that trigger condition (if
> (HAS_DDI(dev_priv) && !position)) is not true. However, it also gets
> called as part of power-saving on behalf of userspace context, whenever
> the desktop graphics goes idle for two video refresh cycles. If the
> desktop shows graphics activity again, and vblank interrupts need to get
> reenabled, the probability of hitting that case is then ~1-4% depending
> on the video mode. How many loops it runs also varies.
>
> On my little Intel(R) Core(TM) i5-8250U CPU machine with a mostly idle
> desktop, I observed about one hit every couple of seconds of regular
> use, and each hit took between 125 usecs and almost 250 usecs. I guess
> udelay(1) can take a bit longer than 1 usec?

It should get very close to this. Maybe something else extended the time,
depending on what you observe.

> So that's too much for preempt-rt. What one could do is the following:
>
> 1. In the for-loop in __intel_get_crtc_scanline(), add a preempt_enable()
> before the udelay(1); and a preempt_disable() again after it. Or
> potentially around the whole for-loop if the overhead of
> preempt_en/disable() is significant?

It is very optimized on x86 ;)

> 2. In intel_get_crtc_scanline() also wrap the call to
> __intel_get_crtc_scanline() into a preempt_disable() and
> preempt_enable(), so we can be sure that __intel_get_crtc_scanline()
> always gets called with preemption disabled.
>
> Why should this work ok'ish? The point of the original preempt disable
> inside i915_get_crtc_scanoutpos
> <https://elixir.bootlin.com/linux/v5.17-rc1/C/ident/i915_get_crtc_scanoutpos>
> is that those two *stime = ktime_get() and *etime = ktime_get() clock
> queries happen as close to the scanout position query as possible, to
> get a small confidence interval for when exactly the scanoutpos was
> read/determined from the display hardware. error = (etime - stime) is
> the error margin. If that margin becomes greater than 20 usecs, then the
> higher-level code will consider the measurement invalid and repeat the
> whole procedure up to 3 times before giving up.

So the preempt-disable is needed then? The task is preemptible here on
PREEMPT_RT, but it _may_ not come to this. The difference vs !RT is that,
without it, an interrupt will preempt this code.
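For illustration, proposal 1 above would look roughly like the sketch
below. This is a minimal sketch, not the actual i915 code:
read_scanline_reg() is a made-up stand-in for the PIPEDSL mmio read, and
only the shape matters - the busy-wait becomes preemptible while the
register read itself stays inside the preempt-off section.

  /*
   * Minimal sketch of proposal 1, assuming the function is entered with
   * preemption disabled (proposal 2 would ensure that for the
   * __intel_get_crtc_scanline() path).  read_scanline_reg() is a
   * hypothetical helper standing in for the real PIPEDSL mmio read.
   */
  static int poll_scanline_change(struct drm_i915_private *dev_priv,
                                  enum pipe pipe, int position)
  {
          int i;

          for (i = 0; i < 100; i++) {
                  int temp;

                  /* Let other tasks run while we busy-wait ... */
                  preempt_enable();
                  udelay(1);
                  preempt_disable();

                  /* ... but keep the read in the preempt-off section. */
                  temp = read_scanline_reg(dev_priv, pipe);
                  if (temp != position) {
                          position = temp;
                          break;
                  }
          }

          return position;
  }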
> Normally, in my experience with different graphics chips, one would
> observe error < 3 usecs, so the measurement almost always succeeds at
> first try, only very rarely takes two attempts. The preempt disable is
> meant to make sure that this stays the case on a PREEMPT_RT kernel.

Was it needed?

> The problem here are the relatively rare cases where we hit that
> for-loop with up to 100 iterations. Here even on a regular kernel, due
> to hardware quirks, we already exceed the 20 usecs tolerance by a huge
> amount of more than 100 usecs, leading to a retry of the measurement.
> And my tests showed that often the two succeeding retries also fail,
> because hardware quirks can apparently create a blackout situation
> approaching 1 msec, so we lose anyway, regardless of whether we get
> preempted on a RT kernel or not. That's why enabling preemption on RT
> again during that for-loop should not make the situation worse and at
> least keep RT as real-time as intended.
>
> I would also expect that this failure case is the one least likely to
> impair userspace applications greatly in practice. The cases that
> mostly matter are the ones executed during the vblank hardware irq,
> where the for-loop never executes and the error margin and preempt-off
> time are only about 1 usec. My own software, which depends on very
> precise timestamps from this mechanism, never reported >> 20 usecs
> errors during startup tests or runtime tests.

That is without RT, I assume?

> > - intel_crtc_scanlines_since_frame_timestamp() not sure how long this
> >   may take in the worst case.
>
> intel_crtc_scanlines_since_frame_timestamp() should be harmless. That
> do-while loop just tries to make sure that two register reads that
> should happen within the same video refresh cycle really do happen in
> the same refresh cycle. As such, the loop will almost always execute
> only once, and at most two times, so that's at most 6 mmio register
> reads for two loop iterations.
>
> In the long run one could try to test if
> __intel_get_crtc_scanline_from_timestamp
> <https://elixir.bootlin.com/linux/v5.17-rc1/C/ident/__intel_get_crtc_scanline_from_timestamp>()
> wouldn't always be the better choice for the affected hardware. Code and
> register descriptions suggest the feature is supported by all
> potentially affected hardware, so if it turned out that that method
> works as accurately and reliably as the old one, we could save the
> overhead and time delays of those 100 for-loop iterations and make the
> whole timestamping more reliable on modern hw.
>
> > It was in the RT queue for a while and nobody complained.
> > Disable preemption on PREEMPT_RT during timestamping.
>
> Do other patches exist to implement the preempt_dis/enable() also for
> AMD and NVidia / nouveau / vc4? I left corresponding markers for
> radeon/amdgpu-kms and RaspberryPi's vc4 driver.

No, nobody complained. Most likely the i915 is more widely used since it
is built into many chipsets which then run RT, and some of them use the
display in production.

> Ideally all kms drivers which use scanout position queries should have
> such code. Always a preempt_disable() before the "if (stime) *stime =
> ktime_get();" statement, and a preempt_enable() after the "if (etime)
> *etime = ktime_get();" statement.
>
> Checking Linux 5.16 code, this should be safe - short preempt off
> interval with only a few mmio register reads - for all kms drivers that
> support it atm.
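For reference, the placement described above would look roughly like the
sketch below, in the shape of a generic get_scanout_position() helper
callback. It is not taken from any real driver;
example_get_scanout_position() and read_hw_scanout_position() are made-up
names.

  /*
   * Sketch of the preempt_disable()/preempt_enable() placement around
   * the stime/etime clock queries, following the drm
   * get_scanout_position() helper-hook shape.  Not taken from any real
   * driver; read_hw_scanout_position() is a hypothetical mmio-read
   * helper.
   */
  static bool example_get_scanout_position(struct drm_crtc *crtc,
                                           bool in_vblank_irq,
                                           int *vpos, int *hpos,
                                           ktime_t *stime, ktime_t *etime,
                                           const struct drm_display_mode *mode)
  {
          /* Keep the clock queries tight around the hardware read. */
          preempt_disable();

          if (stime)
                  *stime = ktime_get();

          read_hw_scanout_position(crtc, vpos, hpos);     /* hypothetical */

          if (etime)
                  *etime = ktime_get();

          preempt_enable();

          return true;
  }

That keeps the preempt-off section down to the two clock queries and a
few mmio reads, which is the "short interval" case mentioned above.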
> I found the following functions to modify:
>
> amdgpu: amdgpu_display_get_crtc_scanoutpos()
> radeon: radeon_get_crtc_scanoutpos()
> msm: mdp5_crtc_get_scanout_position() and dpu_crtc_get_scanout_position()
> ltdc: ltdc_crtc_get_scanout_position()
> vc4: vc4_crtc_get_scanout_position()

If that is "small" with locks and such, then it should work.

> nouveau: In nvkm_head_mthd_scanoutpos(), one needs to put the
> preempt_disable() right before
> …
>
> Is the plan to integrate these patches into the mainline kernel soon, as
> part of the ongoing preempt-rt upstreaming?

I want to get the i915 in as part of RT upstreaming. But now I've been
thinking about not allowing i915 on RT via Kconfig and worrying about it
afterwards.

> thanks,
> -mario

Sebastian