Hi Daniel, On Wed, Apr 13, 2022 at 01:20:11PM +0200, Daniel Vetter wrote: > On Wed, 13 Apr 2022 at 01:36, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote: > > On 4/8/2022 9:04 PM, Abhinav Kumar wrote: > > > > > > > > > On 4/7/2022 4:12 PM, Rob Clark wrote: > > >> On Thu, Apr 7, 2022 at 3:59 PM Abhinav Kumar > > >> <quic_abhinavk@xxxxxxxxxxx> wrote: > > >>> > > >>> Hi Rob and Daniel > > >>> > > >>> On 4/7/2022 3:51 PM, Rob Clark wrote: > > >>>> On Wed, Apr 6, 2022 at 6:27 PM Jessica Zhang > > >>>> <quic_jesszhan@xxxxxxxxxxx> wrote: > > >>>>> > > >>>>> > > >>>>> > > >>>>> On 3/31/2022 8:20 AM, Daniel Vetter wrote: > > >>>>>> The stuff never really worked, and leads to lots of fun because it > > >>>>>> out-of-order frees atomic states. Which upsets KASAN, among other > > >>>>>> things. > > >>>>>> > > >>>>>> For async updates we now have a more solid solution with the > > >>>>>> ->atomic_async_check and ->atomic_async_commit hooks. Support for > > >>>>>> that > > >>>>>> for msm and vc4 landed. nouveau and i915 have their own commit > > >>>>>> routines, doing something similar. > > >>>>>> > > >>>>>> For everyone else it's probably better to remove the use-after-free > > >>>>>> bug, and encourage folks to use the async support instead. The > > >>>>>> affected drivers which register a legacy cursor plane and don't > > >>>>>> either > > >>>>>> use the new async stuff or their own commit routine are: amdgpu, > > >>>>>> atmel, mediatek, qxl, rockchip, sti, sun4i, tegra, virtio, and > > >>>>>> vmwgfx. > > >>>>>> > > >>>>>> Inspired by an amdgpu bug report. > > >>>>>> > > >>>>>> v2: Drop RFC, I think with amdgpu converted over to use > > >>>>>> atomic_async_check/commit done in > > >>>>>> > > >>>>>> commit 674e78acae0dfb4beb56132e41cbae5b60f7d662 > > >>>>>> Author: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx> > > >>>>>> Date: Wed Dec 5 14:59:07 2018 -0500 > > >>>>>> > > >>>>>> drm/amd/display: Add fast path for cursor plane updates > > >>>>>> > > >>>>>> we don't have any driver anymore where we have userspace expecting > > >>>>>> solid legacy cursor support _and_ they are using the atomic > > >>>>>> helpers in > > >>>>>> their fully glory. So we can retire this. > > >>>>>> > > >>>>>> v3: Paper over msm and i915 regression. The complete_all is the only > > >>>>>> thing missing afaict. > > >>>>>> > > >>>>>> v4: Fixup i915 fixup ... > > >>>>>> > > >>>>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=199425 > > >>>>>> References: > > >>>>>> https://lore.kernel.org/all/20220221134155.125447-9-maxime@xxxxxxxxxx/ > > >>>>>> > > >>>>>> References: https://bugzilla.kernel.org/show_bug.cgi?id=199425 > > >>>>>> Cc: Maxime Ripard <maxime@xxxxxxxxxx> > > >>>>>> Tested-by: Maxime Ripard <maxime@xxxxxxxxxx> > > >>>>>> Cc: mikita.lipski@xxxxxxx > > >>>>>> Cc: Michel Dänzer <michel@xxxxxxxxxxx> > > >>>>>> Cc: harry.wentland@xxxxxxx > > >>>>>> Cc: Rob Clark <robdclark@xxxxxxxxx> > > >>>>> > > >>>>> Hey Rob, > > >>>>> > > >>>>> I saw your tested-by and reviewed-by tags on Patchwork. Just curious, > > >>>>> what device did you test on? > > >>>> > > >>>> I was testing on strongbad.. v5.18-rc1 + patches (notably, revert > > >>>> 80253168dbfd ("drm: of: Lookup if child node has panel or bridge") > > >>>> > > >>>> I think the display setup shouldn't be significantly different than > > >>>> limozeen (ie. it's an eDP panel). But I didn't do much start/stop > > >>>> ui.. I was mostly looking to make sure cursor movements weren't > > >>>> causing fps drops ;-) > > >>>> > > >>>> BR, > > >>>> -R > > >>> > > >>> start ui/ stop ui is a basic operation for us to use IGT on msm-next. > > >>> So we cannot let that break. > > >>> > > >>> I think we need to check whats causing this splat. > > >>> > > >>> Can we hold back this change till then? > > >> > > >> Can you reproduce on v5.18-rc1 (plus 80253168dbfd)? I'm running a > > >> loop of stop ui / start ui and hasn't triggered a splat yet. > > >> > > >> Otherwise maybe you can addr2line to figure out where it crashed? > > >> > > >> BR, > > >> -R > > > > > > So this is not a crash. Its a warning splat coming from > > > > > > https://gitlab.freedesktop.org/drm/msm/-/blob/msm-next/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c#L785 > > > > > > > > > Looks like the complete_commit() which should signal the event has not > > > happened before the next cursor commit. > > > > > > Somehow this change is affecting the flow to miss the event signaling > > > that the event is done. > > > > > > We tried a couple of approaches but couldnt still fix the warning. > > > > > > Will continue to check further next week. > > > > > >> > > >>> Thanks > > >>> > > >>> Abhinav > > > > After checking this more this week, I think the current patch needs to > > be changed a bit. > > > > So, here you are removing the complete_all part and leaving that to the > > individual drivers, which is fine. > > > > But, you are also removing the continue part which I think seems > > incorrect and causing these warnings for MSM driver. > > > > @@ -2135,12 +2128,6 @@ int drm_atomic_helper_setup_commit(struct > > drm_atomic_state *state, > > continue; > > } > > > > - /* Legacy cursor updates are fully unsynced. */ > > - if (state->legacy_cursor_update) { > > - complete_all(&commit->flip_done); > > - continue; > > - } > > - > > > > Thats because MSM driver thinks that if the previous crtc_state->event > > was not consumed, then something went wrong and throws a warning. > > > > if (!new_crtc_state->event) { > > commit->event = kzalloc(sizeof(*commit->event), > > GFP_KERNEL); > > if (!commit->event) > > return -ENOMEM; > > > > new_crtc_state->event = commit->event; > > } > > > > But for a cursor update, we should not or need not populate the event at > > all because it is async. > > > > So i think we should still keep the continue, rest of the patch is fine. > > > > @@ -2128,6 +2128,9 @@ int drm_atomic_helper_setup_commit(struct > > drm_atomic_state *state, > > continue; > > } > > > > + if (state->legacy_cursor_update) > > + continue; > > + > > > > Let me know your comments. > > Thanks a lot for your excellent analysis. I need to think this through > some more and figure out what exactly we should be doing. We integrated this in the (downstream) RaspberryPi kernel, and it seems to trigger some weird regressions: - If we move the cursor under X, the primary plane update is stuck: https://github.com/raspberrypi/linux/issues/4988 - Switching back and forth between VT gets the kernel stuck (with a locking issue in fb_release?) https://github.com/raspberrypi/linux/issues/5011 I haven't been able to look into it at this point, I'll let you know if I find anything relevant Maxime
Attachment:
signature.asc
Description: PGP signature