Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 15.03.23 um 18:54 schrieb Steven Rostedt:
On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

The WARN_ON triggered:

[   21.481449] mpls_gso: MPLS GSO support
[   21.488795] IPI shorthand broadcast: enabled
[   21.488873] ------------[ cut here ]------------
[   21.490101] ------------[ cut here ]------------

[   21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc  <<<---- Line of the added WARN_ON()

[   21.492940] refcount_t: underflow; use-after-free.
[   21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[   21.496116] Modules linked in:
[   21.497197] Modules linked in:

The problem here is that two backtraces mix together. So it's pretty hard to figure out what's going on.


[   21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[   21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[   21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
[   21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[   21.504089] Workqueue: ttm ttm_bo_delayed_delete
[   21.507196] Workqueue: events drm_fb_helper_damage_work
[   21.509235]
[   21.510291] registered taskstats version 1
[   21.510302] Running ring buffer tests...
[   21.511792]
[   21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
[   21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
[   21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
[   21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[   21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
[   21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
[   21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
[   21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
[   21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[   21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[   21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
[   21.526166] Call Trace:
[   21.526189]  ? ww_mutex_unlock+0x3a/0x94
[   21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
[   21.531722]  ? ttm_bo_cleanup_refs+0xc4/0x1e0
[   21.533114] Call Trace:
[   21.534516]  ttm_mem_evict_first+0x3d3/0x568
[   21.535901]  ttm_bo_delayed_delete+0x9c/0xa4
[   21.537391]  ? kfree+0x6b/0xdc
[   21.538901]  process_one_work+0x21a/0x484
[   21.540279]  ? ttm_range_man_alloc+0xe0/0xec
[   21.540854]  worker_thread+0x14a/0x39c
[   21.541714]  ? ttm_range_man_fini_nocheck+0xe8/0xe8
[   21.543332]  kthread+0xea/0x10c
So I triggered it again, and the same backtrace is there.

[   21.544301]  ttm_bo_mem_space+0x1d0/0x1e4
It looks like the object is being reserved before it's fully removed. And
it's somewhere in this tty_bo_mem_space() (which comes from the
qxl_bo_create()).

I don't know this code at all, nor do I have any idea of what it's trying
to do. All I know is that this is triggering often (not always), and it has
to do with some race.

Now my config has lots of debugging enabled, which slows down the system
quite a bit. This also happens to open up race windows. Just because your
testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
just likely to be very hard to hit.

[   21.544942]  ? process_one_work+0x484/0x484
[   21.545887]  ttm_bo_validate+0xc5/0x19c
[   21.546986]  ? kthread_complete_and_exit+0x1c/0x1c
[   21.547680]  ttm_bo_init_reserved+0x15e/0x1fc
[   21.548716]  ret_from_fork+0x1c/0x28
[   21.549650]  qxl_bo_create+0x145/0x20c
Here's the latest backtrace:

[  170.817449] ------------[ cut here ]------------
[  170.817455] ------------[ cut here ]------------
[  170.818210] refcount_t: underflow; use-after-free.
[  170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[  170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x278/0x2c8
[  170.820124] Modules linked in:
[  170.822127] Modules linked in:
[  170.823829]
[  170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[  170.824610]
[  170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  170.825124] Workqueue: ttm ttm_bo_delayed_delete
[  170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[  170.826996]
[  170.827367] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  170.828839] EIP: refcount_warn_saturate+0xb6/0xfc
[  170.829842] Workqueue: events drm_fb_helper_damage_work
[  170.831325] Code: 68 50 ab ef d3 e8 76 ce b2 ff 0f 0b 58 c9 c3 90 80 3d 4e e0 18 d4 00 75 8a c6 05 4e e0 18 d4 01 68 7c ab ef d3 e8 56 ce b2 ff <0f> 0b 59 c9 c3 80 3d 4c e0 18 d4 00 0f 85 67 ff ff ff c6 05 4c e0
[  170.831670]
[  170.833138] EAX: 00000026 EBX: c2aa6ef4 ECX: 00000002 EDX: 80000002
[  170.834080] EIP: ttm_bo_release+0x278/0x2c8
[  170.834910] ESI: c27744e0 EDI: f758eaa8 EBP: c3083f34 ESP: c3083f30
[  170.838453] Code: 00 90 89 f8 e9 91 fe ff ff 90 0f 0b 8b 43 10 85 c0 0f 84 d5 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 d0 fd ff ff 8d 76 00 <0f> 0b e9 c6 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[  170.838714] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[  170.839872] EAX: d3835cf0 EBX: c2aa6ed8 ECX: c2aa6ed8 EDX: 00000001
[  170.840546] CR0: 80050033 CR2: 005ce01c CR3: 03afa000 CR4: 00150ef0
[  170.841669] ESI: c2aa6e00 EDI: c11333d8 EBP: c3b15ca0 ESP: c3b15c68
[  170.844583] Call Trace:
[  170.845738] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[  170.846732]  ttm_bo_delayed_delete+0x7c/0x84
[  170.847766] CR0: 80050033 CR2: 005b59b1 CR3: 03b12000 CR4: 00150ef0
[  170.848761]  process_one_work+0x18a/0x350
[  170.849177] Call Trace:
[  170.850250]  worker_thread+0x136/0x378
[  170.850942]  ? ttm_resource_free+0x57/0x6c
[  170.851928]  ? _raw_spin_unlock_irqrestore+0x25/0x40
[  170.852583]  ? ttm_bo_cleanup_memtype_use+0x5f/0x64
[  170.853002]  kthread+0xda/0xfc
[  170.853608]  ? ttm_bo_cleanup_refs+0xbb/0x1c8
[  170.854266]  ? rescuer_thread+0x340/0x340
[  170.855058]  ttm_mem_evict_first+0x324/0x3a0
[  170.855840]  ? kthread_complete_and_exit+0x1c/0x1c
[  170.856341]  ttm_bo_mem_space+0x1d8/0x1ec
[  170.857056]  ret_from_fork+0x1c/0x28
[  170.857702]  ttm_bo_validate+0xb3/0x13c
[  170.858393] ---[ end trace 0000000000000000 ]---
[  170.859167]  ttm_bo_init_reserved+0x12e/0x1cc
[  170.863261] systemd[1]: Starting Create System Users...
[  170.863964]  qxl_bo_create+0x14a/0x1f8
[  170.892037]  ? qxl_ttm_debugfs_init+0x48/0x48
[  170.892043]  qxl_alloc_bo_reserved+0x39/0x98
          Starting ESC[0;1;39mCreat[  170.892047]  ? kmalloc_trace+0x22/0x84
e System UsersESC[0m...
[  170.895826]  qxl_image_alloc_objects+0x91/0xfc
[  170.897122]  qxl_draw_dirty_fb+0x159/0x440
[  170.898207] systemd[1]: Finished Set the console keyboard layout.
[  170.898329]  qxl_framebuffer_surface_dirty+0xfb/0x1cc
[  170.901528]  ? qxl_create_plane+0xc4/0xc4
[  170.901533]  drm_fbdev_fb_dirty+0x129/0x1ac
[ESC[0;32m  OK  ESC[0m] Finished ESC[0[  170.901537]  drm_fb_helper_damage_work+0x8f/0x158
;1;39mSet the console keyboard layoutESC[0m.
[  170.905435]  process_one_work+0x18a/0x350
[  170.906578]  worker_thread+0x136/0x378
[  170.906832] systemd[1]: Finished Load/Save Random Seed.
[  170.907195]  ? _raw_spin_unlock_irqrestore+0x25/0x40
[  170.910267]  kthread+0xda/0xfc
[  170.910271]  ? rescuer_thread+0x340/0x340
[  170.910275]  ? kthread_complete_and_exit+0x1c/0x1c
[  170.910278]  ret_from_fork+0x1c/0x28
[  170.910282] ---[ end trace 0000000000000000 ]---

That looks like an eviction is racing with finishing this BO. Let me take a closer look.

It would help if you replace the WARN_ON(bo->delayed_delete.func) with an "if (WARN_ON(bo->delayed_delete.func)) return;" and disable the refcount warning.

This way we should not get other warnings except for the one we are looking for.

Regards,
Christian.



-- Steve




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux