Re: [PATCH v4 00/33] Introduce GPU SVM and Xe SVM implementation

Hi Matt,

I am seeing the VM_WARN_ON_ONCE_FOLIO() below when testing with this version. It also occurred when testing with v3, with the same call stack.

G.G.

[  249.486325] [IGT] xe_exec_system_allocator: executing
[  249.530682] [IGT] xe_exec_system_allocator: starting subtest once-malloc-race
[  249.536822] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]] op=0, addr=0x0000000000000000, range=0x0001000000000000, bo_offset_or_userptr=0x0000000000000000
[  249.536981] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]] MAP: addr=0x0000000000000000, range=0x0001000000000000
[  249.539658] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] PAGE FAULT: asid=17, gpusvm=ffff888179f09188, vram=0,0, seqno=9223372036854775807, start=0x005562fec30000, end=0x005562fec40000, size=65536
[  249.539801] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] ALLOCATE VRAM: asid=17, gpusvm=ffff888179f09188, vram=0,0, seqno=9223372036854775807, start=0x005562fec30000, end=0x005562fec40000, size=65536
[  249.540518] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] ALLOC VRAM: asid=17, gpusvm=ffff888179f09188, pfn=17179850416, npages=16
[  249.540709] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] INVALIDATE: asid=17, gpusvm=ffff888179f09188, seqno=3, start=0x00005562fec30000, end=0x00005562fec40000, event=6
[  249.541133] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] NOTIFIER: asid=17, gpusvm=ffff888179f09188, vram=0,0, seqno=9223372036854775807, start=0x005562fec30000, end=0x005562fec40000, size=65536
[  249.542416] xe 0000:00:04.0: [drm:xe_svm_copy [xe]] COPY TO VRAM - 0x0000000157564000 -> 0x00000002fb6b0000, NPAGES=16
[  249.543466] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] GET PAGES: asid=17, gpusvm=ffff888179f09188, vram=0,0, seqno=9223372036854775807, start=0x005562fec30000, end=0x005562fec40000, size=65536
[  249.543476] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] INVALIDATE: asid=17, gpusvm=ffff888179f09188, seqno=5, start=0x00005562fec30000, end=0x00005562fec40000, event=6
[  249.543585] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] NOTIFIER: asid=17, gpusvm=ffff888179f09188, vram=0,0, seqno=9223372036854775807, start=0x005562fec30000, end=0x005562fec40000, size=65536
[  249.543800] xe 0000:00:04.0: [drm:xe_svm_copy [xe]] COPY TO SRAM - 0x00000002fb6b0000 -> 0x000000017687c000, NPAGES=16
[  249.544575] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x5562fec30 pfn:0x157564
[  249.545266] anon flags: 0x4000000000020018(uptodate|dirty|swapbacked|zone=2)
[  249.545786] raw: 4000000000020018 dead000000000100 dead000000000122 ffff88817c2cad19
[  249.546368] raw: 00000005562fec30 0000000000000000 00000001ffffffff 0000000000000000
[  249.546957] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled())
[  249.547534] ------------[ cut here ]------------
[  249.547903] WARNING: CPU: 2 PID: 398 at ./include/linux/memcontrol.h:730 folio_lruvec_lock_irqsave+0x121/0x1e0
[  249.548608] Modules linked in: xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec drm_gpusvm i2c_algo_bit drm_buddy video wmi ttm drm_display_helper drm_kms_helper crct10dif_pclmul e1000 crc32_pclmul ghash_clmulni_intel i2c_piix4 i2c_smbus fuse
[  249.550223] CPU: 2 UID: 0 PID: 398 Comm: xe_exec_system_ Not tainted 6.13.0-drm-tip-test+ #59
[  249.550863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[  249.551445] RIP: 0010:folio_lruvec_lock_irqsave+0x121/0x1e0
[  249.551876] Code: ff ff 0f 1f 44 00 00 80 3d ea 97 4b 01 00 0f 85 47 ff ff ff 48 c7 c6 c8 4b 44 82 48 89 df e8 36 60 f5 ff c6 05 ce 97 4b 01 01 <0f> 0b e9 2a ff ff ff e8 a3 b6 e0 ff 85 c0 75 bb be ff ff ff ff 48
[  249.553067] RSP: 0018:ffffc90001e1b7e0 EFLAGS: 00010246
[  249.553465] RAX: 000000000000004c RBX: ffffea00055d5900 RCX: 0000000000000000
[  249.553923] RDX: 0000000000000000 RSI: ffffffff824dbf9f RDI: 00000000ffffffff
[  249.554391] RBP: 0000000000000000 R08: 00000000ffff7fff R09: ffff88842fbfffa8
[  249.554882] R10: ffff88842f940000 R11: 0000000000000002 R12: ffffc90001e1b808
[  249.555351] R13: ffffffff812d7a10 R14: ffffc90001e1b808 R15: ffffea00055d5900
[  249.555854] FS: 00007f05ce05bf00(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
[  249.556382] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  249.556851] CR2: 00007f05cf99f460 CR3: 0000000165226000 CR4: 0000000000750ef0
[  249.557324] PKRU: 55555554
[  249.557531] Call Trace:
[  249.557679]  <TASK>
[  249.557804]  ? __warn.cold+0xb7/0x155
[  249.558040]  ? folio_lruvec_lock_irqsave+0x121/0x1e0
[  249.558330]  ? report_bug+0xe6/0x170
[  249.558560]  ? handle_bug+0x53/0x90
[  249.558755]  ? exc_invalid_op+0x13/0x60
[  249.558962]  ? asm_exc_invalid_op+0x16/0x20
[  249.559187]  ? __pfx_lru_add+0x10/0x10
[  249.559407]  ? folio_lruvec_lock_irqsave+0x121/0x1e0
[  249.559707]  folio_batch_move_lru+0x89/0x160
[  249.559941]  ? find_held_lock+0x2b/0x80
[  249.560151]  ? __pfx_lru_add+0x10/0x10
[  249.560368]  __folio_batch_add_and_move+0x1a8/0x350
[  249.560652]  folio_putback_lru+0xe/0x40
[  249.560865]  __migrate_device_finalize+0xbc/0x370
[  249.561123]  drm_gpusvm_migrate_to_ram+0x276/0x3a0 [drm_gpusvm]
[  249.561460]  do_swap_page+0x129e/0x2160
[  249.561710]  ? __pfx_default_wake_function+0x10/0x10
[  249.561985]  ? rcu_is_watching+0xd/0x40
[  249.562196]  __handle_mm_fault+0x566/0x940
[  249.562488]  handle_mm_fault+0xae/0x280
[  249.562699]  do_user_addr_fault+0x168/0x700
[  249.562930]  exc_page_fault+0x72/0x230
[  249.563135]  asm_exc_page_fault+0x22/0x30
[  249.563363] RIP: 0010:_copy_from_user+0x41/0x90
[  249.563639] Code: 00 00 48 83 ec 08 e8 7e a2 be ff 48 b8 00 f0 ff ff ff 7f 00 00 48 39 d8 48 19 c0 0f 01 cb 48 09 c3 4c 89 e1 48 89 ef 48 89 de <f3> a4 0f 1f 00 0f 01 ca 48 85 c9 75 10 48 83 c4 08 48 89 c8 5b 5d
[  249.564621] RSP: 0018:ffffc90001e1bcb0 EFLAGS: 00050206
[  249.564943] RAX: 0000000000000000 RBX: 00005562fec37090 RCX: 0000000000000008
[  249.565331] RDX: 0000000000000000 RSI: 00005562fec37090 RDI: ffffc90001e1bcd8
[  249.565736] RBP: ffffc90001e1bcd8 R08: 0000000000000188 R09: 0000000000000000
[  249.566113] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
[  249.566500] R13: 00005562fec37090 R14: ffffc90001e1be10 R15: 0000000000000001
[  249.566915]  do_compare+0x33/0x110 [xe]
[  249.567195]  xe_wait_user_fence_ioctl+0x182/0x410 [xe]
[  249.567576]  ? __pfx_woken_wake_function+0x10/0x10
[  249.567839]  ? __pfx_xe_wait_user_fence_ioctl+0x10/0x10 [xe]
[  249.568219]  drm_ioctl_kernel+0xa4/0x100
[  249.568469]  drm_ioctl+0x21f/0x4d0
[  249.568655]  ? __pfx_xe_wait_user_fence_ioctl+0x10/0x10 [xe]
[  249.569020]  ? _raw_spin_unlock_irqrestore+0x53/0x80
[  249.569299]  ? lockdep_hardirqs_on+0xba/0x140
[  249.569575]  ? _raw_spin_unlock_irqrestore+0x3c/0x80
[  249.569845]  xe_drm_ioctl+0x4f/0x80 [xe]
[  249.570092]  __x64_sys_ioctl+0x7e/0xb0
[  249.570308]  do_syscall_64+0x64/0x140
[  249.570535]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  249.570811] RIP: 0033:0x7f05cf841ced
[  249.571006] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[  249.571994] RSP: 002b:00007ffea49d2280 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  249.572436] RAX: ffffffffffffffda RBX: 00007ffea49d2388 RCX: 00007f05cf841ced
[  249.572842] RDX: 00007ffea49d2310 RSI: 00000000c048644a RDI: 0000000000000003
[  249.573228] RBP: 00007ffea49d22d0 R08: 00007ffea49d2388 R09: 00007f05cf9140a0
[  249.573631] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffea49d2310
[  249.574009] R13: 00000000c048644a R14: 0000000000000003 R15: 0000000000000001
[  249.574405]  </TASK>
[  249.574529] irq event stamp: 30955
[  249.574755] hardirqs last enabled at (30963): [<ffffffff811a6c3e>] __up_console_sem+0x5e/0x80
[  249.575220] hardirqs last disabled at (30972): [<ffffffff811a6c23>] __up_console_sem+0x43/0x80
[  249.575725] softirqs last enabled at (30378): [<ffffffff8110d147>] __irq_exit_rcu+0xb7/0x110
[  249.576181] softirqs last disabled at (30359): [<ffffffff8110d147>] __irq_exit_rcu+0xb7/0x110
[  249.576645] ---[ end trace 0000000000000000 ]---


On 1/29/25 9:51 PM, Matthew Brost wrote:
Version 4 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
Alistair, Himal) for their numerous reviews on revisions 1, 2, and 3, and for
helping to address many design issues.

This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
tested with level0 (UMD) PR [2].

Major changes in v2:
- Dropped mmap write abuse
- Use core MM locking and retry loops instead of driver locking to avoid races
- Removed physical to virtual references
- Embedded structure/ops for drm_gpusvm_devmem
- Fixed mremap and fork issues
- Added DRM pagemap
- Included RFC documentation in the kernel doc

Major changes in v3:
- Move GPU SVM and DRM pagemap to DRM level
- Mostly addresses Thomas's feedback, lots of small changes documented
   in each individual patch change log

Major changes in v4:
- Pull documentation patch in
- Fix Kconfig / VRAM migration issue
- Address feedback which came out of internal multi-GPU implementation

Known issues in v4:
- The check-pages logic still exists. It was changed to a threshold in this
   version, which is better, but the cross-process page finding on small
   user allocations still needs to be root-caused.

Matt

[1] https://patchwork.freedesktop.org/series/137545/#rev3
[2] https://github.com/intel/compute-runtime/pull/782

Matthew Brost (29):
   drm/xe: Retry BO allocation
   mm/migrate: Add migrate_device_pfns
   mm/migrate: Trylock device page in do_swap_page
   drm/gpusvm: Add support for GPU Shared Virtual Memory
   drm/xe: Select DRM_GPUSVM Kconfig
   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR flag
   drm/xe: Add SVM init / close / fini to faulting VMs
   drm/xe: Nuke VM's mapping upon close
   drm/xe: Add SVM range invalidation and page fault handler
   drm/gpuvm: Add DRM_GPUVA_OP_DRIVER
   drm/xe: Add (re)bind to SVM page fault handler
   drm/xe: Add SVM garbage collector
   drm/xe: Add unbind to SVM garbage collector
   drm/xe: Do not allow CPU address mirror VMA unbind if the GPU has
     bindings
   drm/xe: Enable CPU address mirror uAPI
   drm/xe/uapi: Add DRM_XE_QUERY_CONFIG_FLAG_HAS_CPU_ADDR_MIRROR
   drm/xe: Add migrate layer functions for SVM support
   drm/xe: Add SVM device memory mirroring
   drm/xe: Add drm_gpusvm_devmem to xe_bo
   drm/xe: Add GPUSVM device memory copy vfunc functions
   drm/xe: Add Xe SVM populate_devmem_pfn GPU SVM vfunc
   drm/xe: Add Xe SVM devmem_release GPU SVM vfunc
   drm/xe: Add BO flags required for SVM
   drm/xe: Add SVM VRAM migration
   drm/xe: Basic SVM BO eviction
   drm/xe: Add SVM debug
   drm/xe: Add modparam for SVM notifier size
   drm/xe: Add always_migrate_to_vram modparam
   drm/doc: gpusvm: Add GPU SVM documentation

Thomas Hellström (4):
   drm/pagemap: Add DRM pagemap
   drm/xe/bo: Introduce xe_bo_put_async
   drm/xe: Add dma_addr res cursor
   drm/xe: Add drm_pagemap ops to SVM

  Documentation/gpu/rfc/gpusvm.rst            |   84 +
  Documentation/gpu/rfc/index.rst             |    4 +
  drivers/gpu/drm/Kconfig                     |    9 +
  drivers/gpu/drm/Makefile                    |    1 +
  drivers/gpu/drm/drm_gpusvm.c                | 2240 +++++++++++++++++++
  drivers/gpu/drm/xe/Kconfig                  |   10 +
  drivers/gpu/drm/xe/Makefile                 |    1 +
  drivers/gpu/drm/xe/xe_bo.c                  |   63 +-
  drivers/gpu/drm/xe/xe_bo.h                  |   14 +
  drivers/gpu/drm/xe/xe_bo_types.h            |    4 +
  drivers/gpu/drm/xe/xe_device.c              |    3 +
  drivers/gpu/drm/xe/xe_device_types.h        |   22 +
  drivers/gpu/drm/xe/xe_gt_pagefault.c        |   17 +-
  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   24 +
  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    2 +
  drivers/gpu/drm/xe/xe_migrate.c             |  175 ++
  drivers/gpu/drm/xe/xe_migrate.h             |   10 +
  drivers/gpu/drm/xe/xe_module.c              |    7 +
  drivers/gpu/drm/xe/xe_module.h              |    2 +
  drivers/gpu/drm/xe/xe_pt.c                  |  393 +++-
  drivers/gpu/drm/xe/xe_pt.h                  |    5 +
  drivers/gpu/drm/xe/xe_pt_types.h            |    2 +
  drivers/gpu/drm/xe/xe_query.c               |    5 +-
  drivers/gpu/drm/xe/xe_res_cursor.h          |  116 +-
  drivers/gpu/drm/xe/xe_svm.c                 |  946 ++++++++
  drivers/gpu/drm/xe/xe_svm.h                 |   84 +
  drivers/gpu/drm/xe/xe_tile.c                |    5 +
  drivers/gpu/drm/xe/xe_vm.c                  |  375 +++-
  drivers/gpu/drm/xe/xe_vm.h                  |   15 +-
  drivers/gpu/drm/xe/xe_vm_types.h            |   57 +
  include/drm/drm_gpusvm.h                    |  445 ++++
  include/drm/drm_gpuvm.h                     |    5 +
  include/drm/drm_pagemap.h                   |  105 +
  include/linux/migrate.h                     |    1 +
  include/uapi/drm/xe_drm.h                   |   22 +-
  mm/memory.c                                 |   13 +-
  mm/migrate_device.c                         |  116 +-
  37 files changed, 5245 insertions(+), 157 deletions(-)
  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
  create mode 100644 drivers/gpu/drm/drm_gpusvm.c
  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
  create mode 100644 include/drm/drm_gpusvm.h
  create mode 100644 include/drm/drm_pagemap.h




