Re: [PATCH v3 1/3] s390/pci: Fix s390_mmio_read/write syscall page fault handling

David Hildenbrand <david@xxxxxxxxxx> · Wed, 12 Jun 2024 09:28:29 +0200

On 12.06.24 00:21, Alex Williamson wrote:
On Tue, 11 Jun 2024 17:37:20 +0200
Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:

On Tue, 2024-06-11 at 17:10 +0200, David Hildenbrand wrote:

which checks mmap_assert_write_locked().

Setting VMA flags would be racy with the mmap lock in read mode.

remap_pfn_range() documents: "this is only safe if the mm semaphore is
held when called." which doesn't spell out if it needs to be held in
write mode (which I think it does) :)

Logically this makes sense to me. At the same time it looks like
fixup_user_fault() expects the caller to only hold mmap_read_lock() as
I do here. In there it even retakes mmap_read_lock(). But then wouldn't
any fault handling by its nature need to hold the write lock?

Well, if you're calling remap_pfn_range() right now the expectation is
that we hold it in write mode. :)

Staring at some random users, they all call it from mmap(), where you
hold the mmap lock in write mode.

I wonder why we are not seeing that splat with vfio all of the time?

That mmap lock check was added "recently". In 1c71222e5f23 we started
using vm_flags_set(). That (including the mmap_assert_write_locked()
check) was added via bc292ab00f6c almost 1.5 years ago.

Maybe vfio is a bit special and was never really run with lockdep?

My best guess is: if you are using remap_pfn_range() from a fault
handler (not during mmap time) you are doing something wrong, that's why
you get that report.

@Alex: I guess so far the vfio_pci_mmap_fault() handler is only ever
triggered by "normal"/"actual" page faults where this isn't a problem?
Or could it be a problem there too?

I think we should see it there as well, unless I am missing something.

Well good news for me, bad news for everyone else. I just reproduced
the same problem on my x86_64 workstation. I "ported over" (hacked it
until it compiles) an x86 version of my trivial vfio-pci user-space
test code that mmap()s BAR 0 of an NVMe and MMIO reads the NVMe
version field at offset 8. On my x86_64 box this leads to the following
splat (still on v6.10-rc1).
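
For reference, the MMIO read in such a test boils down to roughly the
following. This is only a minimal sketch, not the actual test code: the
names nvme_read_version(), nvme_decode_vs(), struct nvme_version and
NVME_REG_VS are made up for illustration, and the VFIO setup that
produces the mmap()ed BAR 0 pointer is left out. The field layout of the
version register (major in bits 31:16, minor in bits 15:8, tertiary in
bits 7:0) is assumed from the NVMe specification. The plain load is the
x86_64 variant; on s390 user space would have to go through the
s390_pci_mmio_read syscall instead.

```c
#define _DEFAULT_SOURCE   /* for le32toh() in <endian.h> on glibc */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the NVMe Version (VS) register within BAR 0. */
#define NVME_REG_VS 0x08

/* Hypothetical helper type, only for this sketch. */
struct nvme_version {
	uint16_t major;
	uint8_t minor;
	uint8_t tertiary;
};

/* Split a CPU-endian VS register value into its three fields. */
static struct nvme_version nvme_decode_vs(uint32_t vs)
{
	struct nvme_version v;

	v.major = (uint16_t)(vs >> 16);
	v.minor = (uint8_t)(vs >> 8);
	v.tertiary = (uint8_t)vs;
	return v;
}

/*
 * Read VS through a mapped BAR 0. The first access to the mapping is
 * what ends up in the vfio-pci fault handler discussed in this thread.
 */
static struct nvme_version nvme_read_version(const volatile void *bar0)
{
	const volatile uint32_t *reg;

	assert(bar0 != NULL);
	reg = (const volatile uint32_t *)
		((const volatile uint8_t *)bar0 + NVME_REG_VS);

	/* NVMe registers are little-endian; convert to CPU order. */
	return nvme_decode_vs(le32toh(*reg));
}
```

With bar0 being the pointer returned by mmap() on the vfio device fd,
nvme_read_version(bar0) triggers the first fault on that VMA.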

There's already a fix for this queued[1] in my for-linus branch for
v6.10.  The problem has indeed existed with lockdep for some time but
only with the recent lockdep changes to generate a warning regardless
of debug kernel settings has it gone from just sketchy to having a fire
under it.  There's still an outstanding question of whether we
can/should insert as many pfns as we can during the fault[2] to reduce
the new overhead and hopefully at some point we'll have an even cleaner
option to use huge_fault for pfnmaps, but currently
vmf_insert_pfn_{pmd,pud} don't work with those pfnmaps.

So hopefully this problem disappears on current linux-next, but let me
know if there's still an issue.  Thanks,

I see us now using vmf_insert_pfn(), which should be the right thing to 
do. So I suspect this problem should be disappearing.

--
Cheers,

David / dhildenb