On Tue, 14 May 2024 at 23:21, Dave Airlie <airlied@xxxxxxxxx> wrote:
>
> In drivers the main thing is a new driver for ARM Mali firmware based
> GPUs, otherwise there are a lot of changes to amdgpu/xe/i915/msm and
> scattered changes to everything else.

Hmm. There's something seriously wrong with amdgpu. I'm getting a ton
of __force_merge warnings:

  WARNING: CPU: 0 PID: 1069 at drivers/gpu/drm/drm_buddy.c:199 __force_merge+0x14f/0x180 [drm_buddy]
  Modules linked in: hid_logitech_hidpp hid_logitech_dj uas usb_storage amdgpu drm_ttm_helper ttm video drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm ghash_clmulni_intel igb atlantic nvme dca macsec ccp i2c_algo_bit nvme_core sp5100_tco wmi ip6_tables ip_tables fuse
  CPU: 0 PID: 1069 Comm: plymouthd Not tainted 6.9.0-07381-g3860ca371740 #60
  Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F7 09/07/2022
  RIP: 0010:__force_merge+0x14f/0x180 [drm_buddy]
  Code: 74 0d 49 8b 44 24 18 48 d3 e0 49 29 44 24 30 4c 89 e7 ba 01 00 00 00 e8 9f 00 00 00 44 39 e8 73 1f 49 8b 04 24 e9 25 ff ff ff <0f> 0b 4c 39 c3 75 a3 eb 99 b8 f4 ff ff ff c3 b8 f4 ff ff ff eb 02
  RSP: 0018:ffffb87a81cb7908 EFLAGS: 00010246
  RAX: ffff9b1915de8000 RBX: ffff9b1919478288 RCX: 000000000ffff800
  RDX: ffff9b19194782f8 RSI: ffff9b19194782d0 RDI: ffff9b19194782b0
  RBP: 0000000000000000 R08: ffff9b1919478288 R09: 0000000000001000
  R10: 0000000000000800 R11: 0000000000000000 R12: ffff9b192590fa18
  R13: 000000000000000d R14: 0000000010000000 R15: 0000000000000000
  FS:  00007fa06bfa9740(0000) GS:ffff9b281e000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000555adb857000 CR3: 000000011b516000 CR4: 0000000000350ef0
  Call Trace:
   ? __force_merge+0x14f/0x180 [drm_buddy]
   drm_buddy_alloc_blocks+0x249/0x400 [drm_buddy]
   ? __cond_resched+0x16/0x40
   amdgpu_vram_mgr_new+0x204/0x3f0 [amdgpu]
   ttm_resource_alloc+0x31/0x120 [ttm]
   ttm_bo_alloc_resource+0xbc/0x260 [ttm]
   ttm_bo_validate+0x9f/0x210 [ttm]
   ttm_bo_init_reserved+0x103/0x130 [ttm]
   amdgpu_bo_create+0x246/0x400 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   amdgpu_bo_create_user+0x29/0x40 [amdgpu]
   amdgpu_mode_dumb_create+0x108/0x190 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   drm_ioctl_kernel+0xad/0xd0 [drm]
   drm_ioctl+0x330/0x4b0 [drm]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   amdgpu_drm_ioctl+0x41/0x80 [amdgpu]
   __x64_sys_ioctl+0xd2a/0xe00
   ? update_process_times+0x89/0xa0
   ? tick_nohz_handler+0xe2/0x120
   ? timerqueue_add+0x94/0xa0
   ? __hrtimer_run_queues+0x12b/0x250
   ? ktime_get+0x34/0xb0
   ? lapic_next_event+0x12/0x20
   ? clockevents_program_event+0x78/0xd0
   ? hrtimer_interrupt+0x118/0x390
   ? sched_clock+0x5/0x10
   do_syscall_64+0x68/0x130
   ? __irq_exit_rcu+0x53/0xb0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

and eventually the whole thing just crashes entirely, with a bad page
state in the VM:

  BUG: Bad page state in process kworker/u261:13  pfn:31fb9a
  page: refcount:0 mapcount:0 mapping:00000000ff0b239e index:0x37ce8 pfn:0x31fb9a
  aops:btree_aops ino:1
  flags: 0x2fffc600000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fff)
  page_type: 0xffffffff()

which comes from a btrfs worker (btrfs-delayed-meta btrfs_work_helper),
but I would not be surprised if that was caused by whatever odd thing
is going on with the DRM code.

IOW, it *looks* like this code ends up just corrupting memory in
horrible ways.

              Linus