On Tue, 14 May 2024 at 23:21, Dave Airlie <airlied@xxxxxxxxx> wrote:
>
> In drivers the main thing is a new driver for ARM Mali firmware based
> GPUs, otherwise there are a lot of changes to amdgpu/xe/i915/msm and
> scattered changes to everything else.

Hmm. There's something seriously wrong with amdgpu. I'm getting a ton
of __force_merge warnings:

  WARNING: CPU: 0 PID: 1069 at drivers/gpu/drm/drm_buddy.c:199 __force_merge+0x14f/0x180 [drm_buddy]
  Modules linked in: hid_logitech_hidpp hid_logitech_dj uas usb_storage amdgpu drm_ttm_helper ttm video drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm ghash_clmulni_intel igb atlantic nvme dca macsec ccp i2c_algo_bit nvme_core sp5100_tco wmi ip6_tables ip_tables fuse
  CPU: 0 PID: 1069 Comm: plymouthd Not tainted 6.9.0-07381-g3860ca371740 #60
  Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F7 09/07/2022
  RIP: 0010:__force_merge+0x14f/0x180 [drm_buddy]
  Code: 74 0d 49 8b 44 24 18 48 d3 e0 49 29 44 24 30 4c 89 e7 ba 01 00 00 00 e8 9f 00 00 00 44 39 e8 73 1f 49 8b 04 24 e9 25 ff ff ff <0f> 0b 4c 39 c3 75 a3 eb 99 b8 f4 ff ff ff c3 b8 f4 ff ff ff eb 02
  RSP: 0018:ffffb87a81cb7908 EFLAGS: 00010246
  RAX: ffff9b1915de8000 RBX: ffff9b1919478288 RCX: 000000000ffff800
  RDX: ffff9b19194782f8 RSI: ffff9b19194782d0 RDI: ffff9b19194782b0
  RBP: 0000000000000000 R08: ffff9b1919478288 R09: 0000000000001000
  R10: 0000000000000800 R11: 0000000000000000 R12: ffff9b192590fa18
  R13: 000000000000000d R14: 0000000010000000 R15: 0000000000000000
  FS:  00007fa06bfa9740(0000) GS:ffff9b281e000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000555adb857000 CR3: 000000011b516000 CR4: 0000000000350ef0
  Call Trace:
   ? __force_merge+0x14f/0x180 [drm_buddy]
   drm_buddy_alloc_blocks+0x249/0x400 [drm_buddy]
   ? __cond_resched+0x16/0x40
   amdgpu_vram_mgr_new+0x204/0x3f0 [amdgpu]
   ttm_resource_alloc+0x31/0x120 [ttm]
   ttm_bo_alloc_resource+0xbc/0x260 [ttm]
   ttm_bo_validate+0x9f/0x210 [ttm]
   ttm_bo_init_reserved+0x103/0x130 [ttm]
   amdgpu_bo_create+0x246/0x400 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   amdgpu_bo_create_user+0x29/0x40 [amdgpu]
   amdgpu_mode_dumb_create+0x108/0x190 [amdgpu]
   ? amdgpu_bo_destroy+0x70/0x70 [amdgpu]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   drm_ioctl_kernel+0xad/0xd0 [drm]
   drm_ioctl+0x330/0x4b0 [drm]
   ? drm_mode_create_dumb+0xa0/0xa0 [drm]
   amdgpu_drm_ioctl+0x41/0x80 [amdgpu]
   __x64_sys_ioctl+0xd2a/0xe00
   ? update_process_times+0x89/0xa0
   ? tick_nohz_handler+0xe2/0x120
   ? timerqueue_add+0x94/0xa0
   ? __hrtimer_run_queues+0x12b/0x250
   ? ktime_get+0x34/0xb0
   ? lapic_next_event+0x12/0x20
   ? clockevents_program_event+0x78/0xd0
   ? hrtimer_interrupt+0x118/0x390
   ? sched_clock+0x5/0x10
   do_syscall_64+0x68/0x130
   ? __irq_exit_rcu+0x53/0xb0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

and eventually the whole thing just crashes entirely, with a bad page
state in the VM:

  BUG: Bad page state in process kworker/u261:13  pfn:31fb9a
  page: refcount:0 mapcount:0 mapping:00000000ff0b239e index:0x37ce8 pfn:0x31fb9a
  aops:btree_aops ino:1
  flags: 0x2fffc600000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fff)
  page_type: 0xffffffff()

which comes from a btrfs worker (btrfs-delayed-meta btrfs_work_helper),
but I would not be surprised if that was caused by whatever odd thing
is going on with the DRM code.

IOW, it *looks* like this code ends up just corrupting memory in
horrible ways.

              Linus