On 12/14/18 3:46 AM, Michel Dänzer wrote: > On 2018-12-13 9:59 p.m., Kazlauskas, Nicholas wrote: >> On 12/13/18 2:21 PM, Kazlauskas, Nicholas wrote: >>> On 12/13/18 11:01 AM, Kazlauskas, Nicholas wrote: >>>> On 12/13/18 10:48 AM, Michel Dänzer wrote: >>>>> On 2018-12-05 8:59 p.m., Nicholas Kazlauskas wrote: >>>>>> [Why] >>>>>> Legacy cursor plane updates from drm helpers go through the full >>>>>> atomic codepath. A high volume of cursor updates through this slow >>>>>> code path can cause subsequent page-flips to skip vblank intervals >>>>>> since each individual update is slow. >>>>>> >>>>>> This problem is particularly noticeable for the compton compositor. >>>>>> >>>>>> [How] >>>>>> A fast path for cursor plane updates is added by using DRM asynchronous >>>>>> commit support provided by async_check and async_update. These don't do >>>>>> a full state/flip_done dependency stall and they don't block other >>>>>> commit work. >>>>>> >>>>>> However, DC still expects itself to be single-threaded for anything >>>>>> that can issue register writes. Screen corruption or hangs can occur >>>>>> if write sequences overlap. Every call that potentially perform >>>>>> register writes needs to be guarded for asynchronous updates to work. >>>>>> The dc_lock mutex was added for this. >>>>>> >>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106175 >>>>>> >>>>>> Cc: Leo Li <sunpeng.li@xxxxxxx> >>>>>> Cc: Harry Wentland <harry.wentland@xxxxxxx> >>>>>> Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx> >>>>> >>>>> Looks like this change introduced (or at least exposed) a reference >>>>> counting bug resulting in use-after-free when Xorg shuts down[0]. See >>>>> the attached dmesg excerpt (note that I wrapped the !bo->pin_count check >>>>> in amdgpu_bo_unpin in WARN_ON_ONCE). >>>>> >>>>> >>>>> [0] Only with >>>>> https://gitlab.freedesktop.org/xorg/driver/xf86-video-amdgpu/commit/0d60233d26ec70d4e1faa343b438e33829c6d5e4 >>>>> , i.e. alternating between two BOs for the HW cursor, instead of always >>>>> using the same one. >>>>> >>>>> >>>> >>>> Thanks for the heads up, I don't think I had that patch in my >>>> xf86-video-amdgpu when testing the desktop stack. >>>> >>>> The async atomic helper does the: >>>> >>>> drm_atomic_helper_prepare_planes >>>> drm_atomic_helper_async_commit >>>> drm_atomic_helper_cleanup_planes >>>> >>>> ...sequence correctly from what I can tell, so maybe it's something with >>>> dm_plane_helper_prepare_fb or dm_plane_helper_cleanup_fb itself. >>>> >>>> One case where unref could be called (not following a ref) is during >>>> drm_atomic_helper_prepare_planes - if prepare_fb fails then cleanup_fb >>>> gets called regardless, and we only ref the fb if prepare_fb is in the >>>> success path. >>> >>> The prepare_fb/cleanup_fb calls are actually fine since cleanup_fb only >>> gets called on planes that had prepare_fb succeed in all cases as far as >>> I can tell. >>> >>> I think the bug here might be forgetting to set the plane->state to the >>> new_state. The cleanup fb callback decides whether to call it on the old >>> plane state or new plane state depending on if the commit was aborted or >>> not. I think every fast plane update might be treated as aborted in this >>> case. >> >> This is a bug with DRM, actually. >> >> Typically for a regular atomic commit the prepare_fb callback is called >> for the new_plane_state and cleanup_fb is called for the old_plane_state >> at the end of commit tail. >> >> However, for asynchronous commits this isn't the same - prepare_fb is >> called for new_plane_state and cleanup_fb is then immediately called >> after also for the new_plane_state. >> >> Looking at your stack trace I can see that this is exactly what causes >> the use after free, >> >> The CRTC has changed so it's stuck in the slow path (commit_tail is in >> the trace). However, the plane->state->fb has already been unpinned and >> unref. But the plane->state->fb is *not* NULL from the previous fast >> update, so when it gets to cleanup planes it tries to free the >> old_plane_state it unpins and unrefs the bo a second time. >> >> Then a new fast cursor update comes along (and the fb hasn't changed) so >> it tries to prepare_fb on the same freed bo. > > Do you have an idea for a fix? If not, I'm afraid we need to revert this > change again for now, as the consequences can be severe (in one case, > ext4 code started complaining, I couldn't reboot cleanly and had to fsck > afterwards). > > Yeah, I have a workaround that I will post for this. I'm not sure if this is intended behavior or not for DRM - it doesn't feel correct to me, but I can understand why it'd make sense in some cases. It really depends on what the cleanup_fb is intending to do. It certainly doesn't work with the pin/unpin model though. Nicholas Kazlauskas _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx