On Tue, 2024-06-11 at 17:10 +0200, David Hildenbrand wrote:
> > >
> > > which checks mmap_assert_write_locked().
> > >
> > > Setting VMA flags would be racy with the mmap lock in read mode.
> > >
> > >
> > > remap_pfn_range() documents: "this is only safe if the mm semaphore is
> > > held when called." which doesn't spell out if it needs to be held in
> > > write mode (which I think it does) :)
> >
> > Logically this makes sense to me. At the same time it looks like
> > fixup_user_fault() expects the caller to only hold mmap_read_lock() as
> > I do here. In there it even retakes mmap_read_lock(). But then wouldn't
> > any fault handling by its nature need to hold the write lock?
>
> Well, if you're calling remap_pfn_range() right now the expectation is
> that we hold it in write mode. :)
>
> Staring at some random users, they all call it from mmap(), where you
> hold the mmap lock in write mode.
>
> > I wonder why we are not seeing that splat with vfio all of the time?
>
> That mmap lock check was added "recently". In 1c71222e5f23 we started
> using vm_flags_set(). That (including the mmap_assert_write_locked())
> check was added via bc292ab00f6c almost 1.5 years ago.
>
> Maybe vfio is a bit special and was never really run with lockdep?
>
> >
> > >
> > >
> > > My best guess is: if you are using remap_pfn_range() from a fault
> > > handler (not during mmap time) you are doing something wrong, that's why
> > > you get that report.
> >
> > @Alex: I guess so far the vfio_pci_mmap_fault() handler is only ever
> > triggered by "normal"/"actual" page faults where this isn't a problem?
> > Or could it be a problem there too?
> >
>
> I think we should see it there as well, unless I am missing something.

Well good news for me, bad news for everyone else. I just reproduced the
same problem on my x86_64 workstation.
I "ported over" (hacked it until it compiles) an x86 version of my
trivial vfio-pci user-space test code that mmaps() the BAR 0 of an NVMe
and MMIO reads the NVMe version field at offset 8. On my x86_64 box
this leads to the following splat (still on v6.10-rc1).

[ 555.396773] ------------[ cut here ]------------
[ 555.396774] WARNING: CPU: 3 PID: 1424 at include/linux/rwsem.h:85 remap_pfn_range_notrack+0x625/0x650
[ 555.396778] Modules linked in: vfio_pci <-- 8< -->
[ 555.396877] CPU: 3 PID: 1424 Comm: vfio-test Tainted: G W 6.10.0-rc1-niks-00007-gb19d6d864df1 #4 d09afec01ce27ca8218580af28295f25e2d2ed53
[ 555.396880] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Creator, BIOS P3.40 01/28/2021
[ 555.396881] RIP: 0010:remap_pfn_range_notrack+0x625/0x650
[ 555.396884] Code: a8 00 00 00 75 39 44 89 e0 48 81 c4 b0 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 26 a7 e5 00 cc 0f 0b 41 bc ea ff ff ff eb c9 <0f> 0b 49 8b 47 10 e9 72 fa ff ff e8 8b 56 b5 ff e9 c0 fa ff ff e8
[ 555.396887] RSP: 0000:ffffaf8b04ed3bc0 EFLAGS: 00010246
[ 555.396889] RAX: ffff9ea747cfe300 RBX: 00000000000ee200 RCX: 0000000000000100
[ 555.396890] RDX: 00000000000ee200 RSI: ffff9ea747cfe300 RDI: ffff9ea76db58fd0
[ 555.396892] RBP: 00000000ffffffea R08: 8000000000000035 R09: 0000000000000000
[ 555.396894] R10: ffff9ea76d9bbf40 R11: ffffffff96e5ce50 R12: 0000000000004000
[ 555.396895] R13: 00007f23b988a000 R14: ffff9ea76db58fd0 R15: ffff9ea76db58fd0
[ 555.396897] FS: 00007f23b9561740(0000) GS:ffff9eb66e780000(0000) knlGS:0000000000000000
[ 555.396899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 555.396901] CR2: 00007f23b988a008 CR3: 0000000136bde000 CR4: 0000000000350ef0
[ 555.396903] Call Trace:
[ 555.396904] <TASK>
[ 555.396905] ? __warn+0x18c/0x2a0
[ 555.396908] ? remap_pfn_range_notrack+0x625/0x650
[ 555.396911] ? report_bug+0x1bb/0x270
[ 555.396915] ? handle_bug+0x42/0x70
[ 555.396917] ? exc_invalid_op+0x1a/0x50
[ 555.396920] ? asm_exc_invalid_op+0x1a/0x20
[ 555.396923] ? __pfx_is_ISA_range+0x10/0x10
[ 555.396926] ? remap_pfn_range_notrack+0x625/0x650
[ 555.396929] ? asm_exc_invalid_op+0x1a/0x20
[ 555.396933] ? track_pfn_remap+0x170/0x180
[ 555.396936] remap_pfn_range+0x6f/0xc0
[ 555.396940] vfio_pci_mmap_fault+0xf3/0x1b0 [vfio_pci_core 6df3b7ac5dcecb63cb090734847a65c799a8fef2]
[ 555.396946] __do_fault+0x11b/0x210
[ 555.396949] do_pte_missing+0x239/0x1350
[ 555.396953] handle_mm_fault+0xb10/0x18b0
[ 555.396959] do_user_addr_fault+0x293/0x710
[ 555.396963] exc_page_fault+0x82/0x1c0
[ 555.396966] asm_exc_page_fault+0x26/0x30
[ 555.396968] RIP: 0033:0x55b0ea8bb7ac
[ 555.396972] Code: 00 00 b0 00 e8 e5 f8 ff ff 31 c0 48 83 c4 20 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 89 7d f8 48 8b 45 f8 <8b> 00 89 c0 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
[ 555.396974] RSP: 002b:00007fff80973530 EFLAGS: 00010202
[ 555.396976] RAX: 00007f23b988a008 RBX: 00007fff80973738 RCX: 00007f23b988a000
[ 555.396978] RDX: 0000000000000001 RSI: 00007fff809735e8 RDI: 00007f23b988a008
[ 555.396979] RBP: 00007fff80973530 R08: 0000000000000005 R09: 0000000000000000
[ 555.396981] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[ 555.396982] R13: 0000000000000000 R14: 00007f23b98c8000 R15: 000055b0ea8bddc0
[ 555.396986] </TASK>
[ 555.396987] ---[ end trace 0000000000000000 ]---
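
For anyone who wants to reproduce this, the user-space access that takes
the fault is nothing more than a 32-bit load at BAR 0 + 8 (note CR2
ending in ...008 above). Below is a rough sketch of that kind of read,
not the actual test program: the names mmio_read32(), nvme_decode_vs()
and struct nvme_version are made up for illustration, and the usual VFIO
container/group/device setup plus the mmap() of the BAR 0 region are
assumed to have happened already. The VS field layout (major in bits
31:16, minor in 15:8, tertiary in 7:0) is as I understand it from the
NVMe spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NVMe controller register VS (Version) lives at offset 0x08 in BAR 0. */
#define NVME_REG_VS 0x08

/* Hypothetical helper type: decoded NVMe version number. */
struct nvme_version {
	unsigned int major;
	unsigned int minor;
	unsigned int tertiary;
};

/*
 * Read a 32-bit register from a mapped BAR. The volatile access stops
 * the compiler from caching or dropping the load. On a freshly mmap()ed
 * vfio-pci region the first such access is the one that page faults.
 */
static inline uint32_t mmio_read32(const volatile void *bar, size_t offset)
{
	/* 32-bit MMIO registers are expected to be naturally aligned. */
	assert(offset % sizeof(uint32_t) == 0);
	return *(const volatile uint32_t *)((const volatile uint8_t *)bar + offset);
}

/* Split the raw VS value into major/minor/tertiary fields. */
static inline struct nvme_version nvme_decode_vs(uint32_t vs)
{
	struct nvme_version v = {
		.major = vs >> 16,
		.minor = (vs >> 8) & 0xff,
		.tertiary = vs & 0xff,
	};
	return v;
}
```

With bar0 being the pointer returned by mmap() on the vfio device fd,
nvme_decode_vs(mmio_read32(bar0, NVME_REG_VS)) would give the controller
version, and that first load should be enough to end up in
vfio_pci_mmap_fault().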