Re: [PATCH v2] drm/atomic-helpers: Invoke end_fb_access while owning plane state

Alyssa Ross <hi@xxxxxxxxx> · Wed, 29 Nov 2023 14:49:36 +0100

Thomas Zimmermann <tzimmermann@xxxxxxx> writes:

> Hi
>
> Am 27.11.23 um 17:25 schrieb Alyssa Ross:
>> Thomas Zimmermann <tzimmermann@xxxxxxx> writes:
>> 
>>> Invoke drm_plane_helper_funcs.end_fb_access before
>>> drm_atomic_helper_commit_hw_done(). The latter function hands over
>>> ownership of the plane state to the following commit, which might
>>> free it. Releasing resources in end_fb_access then operates on undefined
>>> state. This bug has been observed with non-blocking commits when they
>>> are being queued up quickly.
>>>
>>> Here is an example stack trace from the bug report. The plane state has
>>> been free'd already, so the pages for drm_gem_fb_vunmap() are gone.
>>>
>>> Unable to handle kernel paging request at virtual address 0000000100000049
>>> [...]
>>>   drm_gem_fb_vunmap+0x18/0x74
>>>   drm_gem_end_shadow_fb_access+0x1c/0x2c
>>>   drm_atomic_helper_cleanup_planes+0x58/0xd8
>>>   drm_atomic_helper_commit_tail+0x90/0xa0
>>>   commit_tail+0x15c/0x188
>>>   commit_work+0x14/0x20
>>>
>>> For aborted commits, it is still ok to run end_fb_access as part of the
>>> plane's cleanup. Add a test to drm_atomic_helper_cleanup_planes().
>>>
>>> v2:
>>> 	* fix test in drm_atomic_helper_cleanup_planes()
>>>
>>> Reported-by: Alyssa Ross <hi@xxxxxxxxx>
>>> Closes: https://lore.kernel.org/dri-devel/87leazm0ya.fsf@xxxxxxxxx/
>>> Suggested-by: Daniel Vetter <daniel@xxxxxxxx>
>>> Fixes: 94d879eaf7fb ("drm/atomic-helper: Add {begin,end}_fb_access to plane helpers")
>>> Signed-off-by: Thomas Zimmermann <tzimmermann@xxxxxxx>
>>> Cc: <stable@xxxxxxxxxxxxxxx> # v6.2+
>>> ---
>>>   drivers/gpu/drm/drm_atomic_helper.c | 17 +++++++++++++++++
>>>   1 file changed, 17 insertions(+)
>> 
>> Got this basically immediately. :(
>
> I've never seen such problems on other systems. Is there anything 
> different about the Mac systems? How do you trigger these errors?

My understanding is that all sorts of things are different, but I don't
know too much about the details.  There's of course a chance that some
other change in the Asahi Linux kernel causes this problem to surface.
As I said, I reviewed the diff against mainline and didn't see anything
that looked relevant, but I could well have missed something.  I don't
think I can test mainline directly, as it doesn't yet support enough of
the hardware.  For slightly older Apple Silicon Mac models, I think
enough is upstream that this would be possible, but I don't have access
to any.

I started off encountering these errors every few days.  I noticed them
because they would sometimes make my system freeze, either for 10
seconds at a time or until I switched VT.  They seem to correlate with
the system being under high CPU load.  I was also able to substantially
increase the frequency with which they occurred by adding logging to
the kernel.  Even just drm.debug=0x10 makes a big difference, and when
I also added a few dump_backtrace() calls while trying to understand
the code and diagnose the problem, I would fairly consistently
encounter an Oops within a few minutes of load.

BTW: v3 is looking good so far.  I've only been testing it since this
morning, though, so I'll keep trying it out for a bit longer before I
declare the problem to have been solved and send a Tested-by.