On Fri, Jan 17, 2025 at 11:47:41AM +0200, Gwan-gyeong Mun wrote:
> Hi,
> This kernel oops, which I reported before, was caused by my incorrect
> modification (incorrectly applying review comments) of this patch
> "[v3,19/30] drm/xe: Add SVM device memory mirroring"
> (the kernel oops occurred because the xe_drm_pagemap_map_dma() and
> xe_devm_add() functions were built as empty functions).
>
> This issue disappeared after proper patch modifications were applied.
> So please ignore the previously reported kernel oops.
>

My post unfortunately had some bugs masked by the Kconfig issue Himal
pointed out. If you want to test this code, I suggest you pull this
branch [1]; it should be stable with all the fixes I mentioned on this
rev.

Matt

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-stable-1-13-25/-/tree/stable-1.13.25?ref_type=heads

> Br,
>
> G.G.
>
> On 1/7/25 2:19 PM, Gwan-gyeong Mun wrote:
> > Hi Matthew Brost,
> >
> > After applying this patch series and the following to the latest
> > drm-tip, while testing[1] with the mentioned IGT, I faced a kernel oops[3].
> > It prevents the mentioned IGT tests from progressing.
> > Could you please check the following oops log?
> >
> > (1) apply comments of "[v3,05/30] drm/gpusvm: Add support for GPU Shared
> > Virtual Memory"
> > (2) apply comments of "[v3,15/30] drm/xe: Add unbind to SVM garbage
> > collector"
> > (3) drop "[v3,27/30] drm/xe: Basic SVM BO eviction" patch
> >
> > The kernel config used, the entire dmesg, and detailed information can
> > be found in [2].
> >
> > br,
> >
> > G.G.
> >
> > [1] used igt command: xe_exec_system_allocator --run-subtest once-malloc
> > [2] https://gitlab.freedesktop.org/elongbug/drm-tip/-/snippets/7823
> >
> > [3] kernel oops dmesg
> > [ 51.365230] Console: switching to colour VGA+ 80x25
> > [ 51.367772] [IGT] xe_exec_system_allocator: executing
> > [ 51.383611] [IGT] xe_exec_system_allocator: starting subtest once-malloc
> > [ 51.386066] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]]
> > op=0, addr=0x0000000000000000, range=0x0001000000000000,
> > bo_offset_or_userptr=0x0000000000000000
> > [ 51.386171] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]] MAP:
> > addr=0x0000000000000000, range=0x0001000000000000
> > [ 51.389429] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] PAGE
> > FAULT: asid=1, gpusvm=0xffff8881775e9188, vram=0,0,
> > seqno=9223372036854775807, start=0x005584e8400000, end=0x005584e8410000,
> > size=65536
> > [ 51.389529] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]]
> > ALLOCATE VRAM: asid=1, gpusvm=0xffff8881775e9188, vram=0,0,
> > seqno=9223372036854775807, start=0x005584e8400000, end=0x005584e8410000,
> > size=65536
> > [ 51.389935] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] ALLOC
> > VRAM: asid=1, gpusvm=0xffff8881775e9188, pfn=3126960, npages=16
> > [ 51.390048] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] INVALIDATE:
> > asid=1, gpusvm=0xffff8881775e9188, seqno=3, start=0x00005584e8400000,
> > end=0x00005584e8410000, event=6
> > [ 51.390440] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] NOTIFIER:
> > asid=1, gpusvm=0xffff8881775e9188, vram=0,0, seqno=9223372036854775807,
> > start=0x005584e8400000, end=0x005584e8410000, size=65536
> > [ 51.390948] Oops: general protection fault, probably for
> > non-canonical address 0x3fff88842fc80000: 0000 [#1] PREEMPT SMP NOPTI
> > [ 51.391624] CPU: 1 UID: 0 PID: 76 Comm: kworker/u17:0 Not tainted
> > 6.13.0-rc4-drm-tip-test+ #48
> > [ 51.392088] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.15.0-1 04/01/2014
> > [ 51.392527] Workqueue: xe_gt_page_fault_work_queue pf_queue_work_func
> > [xe]
> > [ 51.392947] RIP: 0010:zone_device_page_init+0x5d/0x240
> > [ 51.393228] Code: 04 dd ff e8 d5 d2 a1 00 5a 85 c0 0f 85 ba 00 00 00
> > e8 d7 bb df ff 85 c0 0f 84 9d 01 00 00 48 8b 45 38 a8 03 0f 85 ec 00 00
> > 00 <65> 48 ff 00 e8 aa d2 a1 00 85 c0 0f 85 0d 01 00 00 48 c7 c7 20 cb
> > [ 51.394247] RSP: 0018:ffffc9000039fb48 EFLAGS: 00010246
> > [ 51.394570] RAX: 4000000000000000 RBX: ffffea000bedac00 RCX:
> > 0000000000000000
> > [ 51.394950] RDX: 0000000000000046 RSI: ffffffff824c67b4 RDI:
> > ffffffff824e58f5
> > [ 51.395328] RBP: ffffea000bedac08 R08: 0000000000000015 R09:
> > 0000000000000004
> > [ 51.395709] R10: 0000000000000001 R11: 0000000000000004 R12:
> > 0000000000000001
> > [ 51.396093] R13: ffff888170fd8d40 R14: ffff88817f922640 R15:
> > ffffea000bedac00
> > [ 51.396472] FS: 0000000000000000(0000) GS:ffff88842fc80000(0000)
> > knlGS:0000000000000000
> > [ 51.396925] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 51.397237] CR2: 0000563d1f7ecbe4 CR3: 000000017c212000 CR4:
> > 0000000000750ef0
> > [ 51.397618] PKRU: 55555554
> > [ 51.397768] Call Trace:
> > [ 51.397904] <TASK>
> > [ 51.398024] ? __die_body.cold+0x19/0x26
> > [ 51.398238] ? die_addr+0x38/0x60
> > [ 51.398420] ? exc_general_protection+0x19e/0x450
> > [ 51.398678] ? asm_exc_general_protection+0x22/0x30
> > [ 51.398942] ? zone_device_page_init+0x5d/0x240
> > [ 51.399188] ? zone_device_page_init+0x49/0x240
> > [ 51.399433] drm_gpusvm_migrate_to_devmem+0x379/0x9e0 [drm_gpusvm]
> > [ 51.399768] xe_svm_handle_pagefault+0x62c/0xa60 [xe]
> > [ 51.400110] ? xe_vm_find_overlapping_vma+0xa4/0x1d0 [xe]
> > [ 51.400475] pf_queue_work_func+0x1ba/0x450 [xe]
> > [ 51.400777] process_one_work+0x1fe/0x580
> > [ 51.400996] worker_thread+0x1d1/0x3b0
> > [ 51.401201] ? __pfx_worker_thread+0x10/0x10
> > [ 51.401433] kthread+0xeb/0x120
> > [ 51.401609] ? __pfx_kthread+0x10/0x10
> > [ 51.401813] ret_from_fork+0x2d/0x50
> > [ 51.402008] ? __pfx_kthread+0x10/0x10
> > [ 51.402211] ret_from_fork_asm+0x1a/0x30
> > [ 51.402427] </TASK>
> > [ 51.402551] Modules linked in: xe drm_ttm_helper gpu_sched
> > drm_suballoc_helper drm_gpuvm drm_exec drm_gpusvm i2c_algo_bit drm_buddy
> > video wmi ttm drm_display_helper drm_kms_helper crct10dif_pclmul
> > crc32_pclmul e1000 ghash_clmulni_intel i2c_piix4 i2c_smbus fuse
> > [ 51.403779] ---[ end trace 0000000000000000 ]---
> > [ 51.404106] RIP: 0010:zone_device_page_init+0x5d/0x240
> > [ 51.404393] Code: 04 dd ff e8 d5 d2 a1 00 5a 85 c0 0f 85 ba 00 00 00
> > e8 d7 bb df ff 85 c0 0f 84 9d 01 00 00 48 8b 45 38 a8 03 0f 85 ec 00 00
> > 00 <65> 48 ff 00 e8 aa d2 a1 00 85 c0 0f 85 0d 01 00 00 48 c7 c7 20 cb
> > [ 51.405408] RSP: 0018:ffffc9000039fb48 EFLAGS: 00010246
> > [ 51.405725] RAX: 4000000000000000 RBX: ffffea000bedac00 RCX:
> > 0000000000000000
> > [ 51.406110] RDX: 0000000000000046 RSI: ffffffff824c67b4 RDI:
> > ffffffff824e58f5
> > [ 51.406518] RBP: ffffea000bedac08 R08: 0000000000000015 R09:
> > 0000000000000004
> > [ 51.406905] R10: 0000000000000001 R11: 0000000000000004 R12:
> > 0000000000000001
> > [ 51.407312] R13: ffff888170fd8d40 R14: ffff88817f922640 R15:
> > ffffea000bedac00
> > [ 51.407691] FS: 0000000000000000(0000) GS:ffff88842fc80000(0000)
> > knlGS:0000000000000000
> > [ 51.408135] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 51.408484] CR2: 0000563d1f7ecbe4 CR3: 000000017c212000 CR4:
> > 0000000000750ef0
> > [ 51.408877] PKRU: 55555554
> > [ 51.409047] BUG: sleeping function called from invalid context at
> > ./include/linux/percpu-rwsem.h:49
> > [ 51.409528] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid:
> > 76, name: kworker/u17:0
> > [ 51.409976] preempt_count: 0, expected: 0
> > [ 51.410212] RCU nest depth: 1, expected: 0
> > [ 51.410435] INFO: lockdep is turned off.
> > [ 51.410648] CPU: 1 UID: 0 PID: 76 Comm: kworker/u17:0 Tainted: G
> > D 6.13.0-rc4-drm-tip-test+ #48
> > [ 51.411180] Tainted: [D]=DIE
> > [ 51.411338] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.15.0-1 04/01/2014
> > [ 51.411859] Workqueue: xe_gt_page_fault_work_queue pf_queue_work_func
> > [xe]
> > [ 51.412269] Call Trace:
> > [ 51.412404] <TASK>
> > [ 51.412525] dump_stack_lvl+0x69/0xa0
> > [ 51.412724] __might_resched.cold+0xe5/0x120
> > [ 51.412956] exit_signals+0x1a/0x360
> > [ 51.413150] do_exit+0x122/0xbd0
> > [ 51.413328] ? __pfx_worker_thread+0x10/0x10
> > [ 51.413562] make_task_dead+0x88/0x90
> > [ 51.413783] rewind_stack_and_make_dead+0x16/0x20
> > [ 51.414045] RIP: 0000:0x0
> > [ 51.414191] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > [ 51.414595] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX:
> > 0000000000000000
> > [ 51.414993] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > 0000000000000000
> > [ 51.415369] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> > 0000000000000000
> > [ 51.415746] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 51.416123] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [ 51.416501] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [ 51.416899] </TASK>
> >
> >
> > On 12/18/24 1:33 AM, Matthew Brost wrote:
> > > Version 3 of GPU SVM has been promoted to the proper series from an RFC.
> > > Thanks to everyone (especially Sima and Thomas) for their numerous
> > > reviews on revisions 1 and 2 and for helping to address many design
> > > issues.
> > >
> > > This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
> > > tested with level0 (UMD) PR [2].
> > >
> > > Major changes in v2:
> > > - Dropped mmap write abuse
> > > - Core MM locking and retry loops instead of driver locking to avoid
> > >   races
> > > - Removed physical to virtual references
> > > - Embedded structure/ops for drm_gpusvm_devmem
> > > - Fixed mremap and fork issues
> > > - Added DRM pagemap
> > > - Included RFC documentation in the kernel doc
> > >
> > > Major changes in v3:
> > > - Moved GPU SVM and DRM pagemap to DRM level
> > > - Mostly addresses Thomas's feedback; lots of small changes documented
> > >   in each individual patch change log
> > >
> > > Known issues in v3:
> > > - Check pages still exists; it was changed to a threshold in this
> > >   version, which is better, but we still need to root-cause the
> > >   cross-process page finding on small user allocations.
> > > - Dropped the documentation patch; it needs a fairly large rewrite and
> > >   will be sent out independently once finished.
> > >
> > > Matt
> > >
> > > [1] https://patchwork.freedesktop.org/series/137545/#rev3
> > > [2] https://github.com/intel/compute-runtime/pull/782
> > >
> > > Matthew Brost (27):
> > >   drm/xe: Retry BO allocation
> > >   mm/migrate: Add migrate_device_pfns
> > >   mm/migrate: Trylock device page in do_swap_page
> > >   drm/gpusvm: Add support for GPU Shared Virtual Memory
> > >   drm/xe: Select DRM_GPUSVM Kconfig
> > >   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR flag
> > >   drm/xe: Add SVM init / close / fini to faulting VMs
> > >   drm/xe: Nuke VM's mapping upon close
> > >   drm/xe: Add SVM range invalidation and page fault handler
> > >   drm/gpuvm: Add DRM_GPUVA_OP_DRIVER
> > >   drm/xe: Add (re)bind to SVM page fault handler
> > >   drm/xe: Add SVM garbage collector
> > >   drm/xe: Add unbind to SVM garbage collector
> > >   drm/xe: Do not allow CPU address mirror VMA unbind if the GPU has
> > >     bindings
> > >   drm/xe: Enable CPU address mirror uAPI
> > >   drm/xe: Add migrate layer functions for SVM support
> > >   drm/xe: Add SVM device memory mirroring
> > >   drm/xe: Add drm_gpusvm_devmem to xe_bo
> > >   drm/xe: Add GPUSVM device memory copy vfunc functions
> > >   drm/xe: Add Xe SVM populate_devmem_pfn GPU SVM vfunc
> > >   drm/xe: Add Xe SVM devmem_release GPU SVM vfunc
> > >   drm/xe: Add BO flags required for SVM
> > >   drm/xe: Add SVM VRAM migration
> > >   drm/xe: Basic SVM BO eviction
> > >   drm/xe: Add SVM debug
> > >   drm/xe: Add modparam for SVM notifier size
> > >   drm/xe: Add always_migrate_to_vram modparam
> > >
> > > Thomas Hellström (3):
> > >   drm/pagemap: Add DRM pagemap
> > >   drm/xe: Add dma_addr res cursor
> > >   drm/xe: Add drm_pagemap ops to SVM
> > >
> > >  drivers/gpu/drm/Kconfig                     |    8 +
> > >  drivers/gpu/drm/Makefile                    |    1 +
> > >  drivers/gpu/drm/drm_gpusvm.c                | 2220 +++++++++++++++++++
> > >  drivers/gpu/drm/xe/Kconfig                  |   10 +
> > >  drivers/gpu/drm/xe/Makefile                 |    1 +
> > >  drivers/gpu/drm/xe/xe_bo.c                  |   20 +-
> > >  drivers/gpu/drm/xe/xe_bo.h                  |    1 +
> > >  drivers/gpu/drm/xe/xe_bo_types.h            |    4 +
> > >  drivers/gpu/drm/xe/xe_device_types.h        |   15 +
> > >  drivers/gpu/drm/xe/xe_gt_pagefault.c        |   17 +-
> > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   24 +
> > >  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    2 +
> > >  drivers/gpu/drm/xe/xe_migrate.c             |  175 ++
> > >  drivers/gpu/drm/xe/xe_migrate.h             |   10 +
> > >  drivers/gpu/drm/xe/xe_module.c              |    7 +
> > >  drivers/gpu/drm/xe/xe_module.h              |    2 +
> > >  drivers/gpu/drm/xe/xe_pt.c                  |  393 +++-
> > >  drivers/gpu/drm/xe/xe_pt.h                  |    5 +
> > >  drivers/gpu/drm/xe/xe_pt_types.h            |    2 +
> > >  drivers/gpu/drm/xe/xe_res_cursor.h          |  116 +-
> > >  drivers/gpu/drm/xe/xe_svm.c                 |  948 ++++++++
> > >  drivers/gpu/drm/xe/xe_svm.h                 |   83 +
> > >  drivers/gpu/drm/xe/xe_tile.c                |    5 +
> > >  drivers/gpu/drm/xe/xe_vm.c                  |  375 +++-
> > >  drivers/gpu/drm/xe/xe_vm.h                  |   15 +-
> > >  drivers/gpu/drm/xe/xe_vm_types.h            |   57 +
> > >  include/drm/drm_gpusvm.h                    |  445 ++++
> > >  include/drm/drm_gpuvm.h                     |    5 +
> > >  include/drm/drm_pagemap.h                   |  103 +
> > >  include/linux/migrate.h                     |    1 +
> > >  include/uapi/drm/xe_drm.h                   |   19 +-
> > >  mm/memory.c                                 |   13 +-
> > >  mm/migrate_device.c                         |  116 +-
> > >  33 files changed, 5061 insertions(+), 157 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/drm_gpusvm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
> > >  create mode 100644 include/drm/drm_gpusvm.h
> > >  create mode 100644 include/drm/drm_pagemap.h
> > >
> > >