On 2022-01-27 00:29:37 [+0100], Mario Kleiner wrote:
> Hi, first thank you for implementing these preempt disables according to

Hi Mario,

> the markers I left long ago. And sorry for the rather late reply.
>
> I had a look at the code, as of Linux 5.16, and did also a little test
> run (of a standard kernel, not with PREEMPT_RT, only
> CONFIG_PREEMPT_VOLUNTARY=y) on my Intel Kabylake GT2, so some thoughts:
>
> > The area covers only register reads and writes. The part that worries
> > me is:
> > - __intel_get_crtc_scanline() the worst case is 100us if no match is
> >   found.
>
> This one can be a problem indeed on (maybe all?) modern Intel GPUs since
> Haswell, i.e. the last ~10 years. I was able to reproduce it on my
> Kabylake Intel GPU.
>
> Most of the time that for-loop with up to 100 repetitions (~ 100 x
> udelay(1) + one mmio register read) (cf.
> https://elixir.bootlin.com/linux/v5.17-rc1/source/drivers/gpu/drm/i915/i915_irq.c#L856)
> will not execute, because most of the time that function gets called
> from the vblank irq handler and then that trigger condition (if
> (HAS_DDI(dev_priv) && !position)) is not true. However, it also gets
> called as part of power-saving on behalf of userspace context, whenever
> the desktop graphics goes idle for two video refresh cycles. If the
> desktop shows graphics activity again, and vblank interrupts need to get
> reenabled, the probability of hitting that case is then ~1-4% depending
> on the video mode. How many loops it runs also varies.
>
> On my little Intel(R) Core(TM) i5-8250U CPU machine with a mostly idle
> desktop, I observed about one hit every couple of seconds of regular
> use, and each hit took between 125 usecs and almost 250 usecs. I guess
> udelay(1) can take a bit longer than 1 usec?

It should get very close to this. Maybe something else extended the time,
depending on what you observe.

> So that's too much for preempt-rt. What one could do is the following:
>
> 1. In the for-loop in __intel_get_crtc_scanline(), add a preempt_enable()
> before the udelay(1); and a preempt_disable() again after it. Or
> potentially around the whole for-loop if the overhead of
> preempt_en/disable() is significant?

It is very optimized on x86 ;)

> 2. In intel_get_crtc_scanline() also wrap the call to
> __intel_get_crtc_scanline() into a preempt_disable() and
> preempt_enable(), so we can be sure that __intel_get_crtc_scanline()
> always gets called with preemption disabled.
>
> Why should this work ok'ish? The point of the original preempt disable
> inside i915_get_crtc_scanoutpos
> <https://elixir.bootlin.com/linux/v5.17-rc1/C/ident/i915_get_crtc_scanoutpos>
> is that those two *stime = ktime_get() and *etime = ktime_get() clock
> queries happen as close to the scanout position query as possible, to
> get a small confidence interval for when exactly the scanoutpos was
> read/determined from the display hardware. error = (etime - stime) is
> the error margin. If that margin becomes greater than 20 usecs, then the
> higher-level code will consider the measurement invalid and repeat the
> whole procedure up to 3 times before giving up.

So the preempt-disable is needed then? The task is preemptible here on
PREEMPT_RT, but it _may_ not come to this. The difference vs !RT is that,
without it, an interrupt will preempt this code.
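For illustration, proposal 1 above would look roughly like the sketch
below. This is a minimal sketch, not the actual i915 code:
read_scanline_reg() is a made-up stand-in for the PIPEDSL mmio read, and
only the shape matters - the busy-wait becomes preemptible while the
register read itself stays inside the preempt-off section.

  /*
   * Minimal sketch of proposal 1, assuming the function is entered with
   * preemption disabled (proposal 2 would ensure that for the
   * __intel_get_crtc_scanline() path).  read_scanline_reg() is a
   * hypothetical helper standing in for the real PIPEDSL mmio read.
   */
  static int poll_scanline_change(struct drm_i915_private *dev_priv,
                                  enum pipe pipe, int position)
  {
          int i;

          for (i = 0; i < 100; i++) {
                  int temp;

                  /* Let other tasks run while we busy-wait ... */
                  preempt_enable();
                  udelay(1);
                  preempt_disable();

                  /* ... but keep the read in the preempt-off section. */
                  temp = read_scanline_reg(dev_priv, pipe);
                  if (temp != position) {
                          position = temp;
                          break;
                  }
          }

          return position;
  }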
> Normally, in my experience with different graphics chips, one would
> observe error < 3 usecs, so the measurement almost always succeeds at
> first try, only very rarely takes two attempts. The preempt disable is
> meant to make sure that this stays the case on a PREEMPT_RT kernel.

Was it needed?

> The problem here are the relatively rare cases where we hit that
> for-loop with up to 100 iterations. Here even on a regular kernel, due
> to hardware quirks, we already exceed the 20 usecs tolerance by a huge
> amount of more than 100 usecs, leading to a retry of the measurement.
> And my tests showed that often the two succeeding retries also fail,
> because hardware quirks can apparently create a blackout situation
> approaching 1 msec, so we lose anyway, regardless of whether we get
> preempted on a RT kernel or not. That's why enabling preemption on RT
> again during that for-loop should not make the situation worse and at
> least keep RT as real-time as intended.
>
> I would also expect that this failure case is the one least likely to
> impair userspace applications greatly in practice. The cases that
> mostly matter are the ones executed during the vblank hardware irq,
> where the for-loop never executes and the error margin and preempt-off
> time are only about 1 usec. My own software, which depends on very
> precise timestamps from this mechanism, never reported >> 20 usecs
> errors during startup tests or runtime tests.

That is without RT, I assume?

> > - intel_crtc_scanlines_since_frame_timestamp() not sure how long this
> >   may take in the worst case.
>
> intel_crtc_scanlines_since_frame_timestamp() should be harmless. That
> do-while loop just tries to make sure that two register reads that
> should happen within the same video refresh cycle really do happen in
> the same refresh cycle. As such, the loop will almost always execute
> only once, and at most two times, so that's at most 6 mmio register
> reads for two loop iterations.
>
> In the long run one could try to test if
> __intel_get_crtc_scanline_from_timestamp
> <https://elixir.bootlin.com/linux/v5.17-rc1/C/ident/__intel_get_crtc_scanline_from_timestamp>()
> wouldn't always be the better choice for the affected hardware. Code and
> register descriptions suggest the feature is supported by all
> potentially affected hardware, so if it turned out that that method
> works as accurately and reliably as the old one, we could save the
> overhead and time delays of those 100 for-loop iterations and make the
> whole timestamping more reliable on modern hw.
>
> > It was in the RT queue for a while and nobody complained.
> > Disable preemption on PREEMPT_RT during timestamping.
>
> Do other patches exist to implement the preempt_dis/enable() also for
> AMD and NVidia / nouveau / vc4? I left corresponding markers for
> radeon/amdgpu-kms and RaspberryPi's vc4 driver.

No, nobody complained. Most likely the i915 is more widely used since it
is built into many chipsets which then run RT, and some of them use the
display in production.

> Ideally all kms drivers which use scanout position queries should have
> such code. Always a preempt_disable() before the "if (stime) *stime =
> ktime_get();" statement, and a preempt_enable() after the "if (etime)
> *etime = ktime_get();" statement.
>
> Checking Linux 5.16 code, this should be safe - short preempt off
> interval with only a few mmio register reads - for all kms drivers that
> support it atm.
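For reference, the placement described above would look roughly like the
sketch below, in the shape of a generic get_scanout_position() helper
callback. It is not taken from any real driver;
example_get_scanout_position() and read_hw_scanout_position() are made-up
names.

  /*
   * Sketch of the preempt_disable()/preempt_enable() placement around
   * the stime/etime clock queries, following the drm
   * get_scanout_position() helper-hook shape.  Not taken from any real
   * driver; read_hw_scanout_position() is a hypothetical mmio-read
   * helper.
   */
  static bool example_get_scanout_position(struct drm_crtc *crtc,
                                           bool in_vblank_irq,
                                           int *vpos, int *hpos,
                                           ktime_t *stime, ktime_t *etime,
                                           const struct drm_display_mode *mode)
  {
          /* Keep the clock queries tight around the hardware read. */
          preempt_disable();

          if (stime)
                  *stime = ktime_get();

          read_hw_scanout_position(crtc, vpos, hpos);     /* hypothetical */

          if (etime)
                  *etime = ktime_get();

          preempt_enable();

          return true;
  }

That keeps the preempt-off section down to the two clock queries and a
few mmio reads, which is the "short interval" case mentioned above.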
> I found the following functions to modify:
>
> amdgpu: amdgpu_display_get_crtc_scanoutpos()
> radeon: radeon_get_crtc_scanoutpos()
> msm: mdp5_crtc_get_scanout_position() and dpu_crtc_get_scanout_position()
> ltdc: ltdc_crtc_get_scanout_position()
> vc4: vc4_crtc_get_scanout_position()

If that is "small" with locks and such, then it should work.

> nouveau: In nvkm_head_mthd_scanoutpos(), one needs to put the
> preempt_disable() right before
> …
>
> Is the plan to integrate these patches into the mainline kernel soon, as
> part of the ongoing preempt-rt upstreaming?

I want to get the i915 in as part of RT upstreaming. But now I've been
thinking about not allowing i915 on RT via Kconfig and worrying about it
afterwards.

> thanks,
> -mario

Sebastian