On 09/27/22 12:45, Peter Xu wrote: > On Tue, Sep 27, 2022 at 09:24:37AM -0700, Mike Kravetz wrote: > > This should guarantee a read fault independent of what pthread_mutex_lock > > does. However, it still results in the occasional "ERROR: unexpected write > > fault". So, something else if happening. I will continue to experiment > > and think about this. > > Thanks for verifying this, Mike. I didn't yet reply but I did have some > update on my side too. I plan to look deeper and wanted to reply until > that, because I do think there's something special on hugetlb and I still > don't know. I just keep getting distracted by other things.. but maybe I > should still share out what I have already. > > I think I already know what had guaranteed the read faults - the NPTL > pthread lib implemented mutex in different types, and the 1st instruction > of lock() is to fetch the mutex type (at offset 0x10) then we know how we > should move on: > > (gdb) disas pthread_mutex_lock > Dump of assembler code for function ___pthread_mutex_lock: > 0x00007ffff7e3b0d0 <+0>: endbr64 > 0x00007ffff7e3b0d4 <+4>: mov 0x10(%rdi),%eax <---- read 0x10 of &mutex > 0x00007ffff7e3b0d7 <+7>: mov %eax,%edx > 0x00007ffff7e3b0d9 <+9>: and $0x17f,%edx > (gdb) ptype pthread_mutex_t > type = union { > struct __pthread_mutex_s __data; > char __size[40]; > long __align; > } > (gdb) ptype struct __pthread_mutex_s > type = struct __pthread_mutex_s { > int __lock; > unsigned int __count; > int __owner; > unsigned int __nusers; > int __kind; <--- 0x10 offset here > short __spins; > short __elision; > __pthread_list_t __list; > } > > I looked directly at asm dumped from the binary I tested to make sure it's > accurate. So it means with current uffd selftest logically there should > never be a write missing fault (I still don't think it's good to have the > write check though.. but that's separate question to ask). > > It also means hugetlb does something special here. It smells really like > for some reason the hugetlb pgtable got evicted after UFFDIO_COPY during > locking_thread running, then any further lock() (e.g. cmpxchg) or modifying > the counter could trigger a write fault. > > OTOH this also explained why futex code was never tested on userfaultfd > selftest, simply because futex() will always to be after that "read the > type of mutex" thing.. which is something I want to rework a bit, so as to > have uffd selftest to cover gup as you used to rightfully suggest. But > that'll be nothing urgent, and be something after we know what's special > with hugetlb new code. > I was able to do a little more debugging: As you know the hugetlb calling path to handle_userfault is something like this, hugetlb_fault mutex_lock(&hugetlb_fault_mutex_table[hash]); ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h)); if (huge_pte_none_mostly()) hugetlb_no_page() page = find_lock_page(mapping, idx); if (!page) { if (userfaultfd_missing(vma)) mutex_unlock(&hugetlb_fault_mutex_table[hash]); return handle_userfault() } For anon mappings, find_lock_page() will never find a page, so as long as huge_pte_none_mostly() is true we will call into handle_userfault(). Since your analysis shows the testcase should never call handle_userfault() for a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before the call to handle_userfault(). Sure enough, I saw plenty of printk messages. Then, before calling handle_userfault() I added code to take the page table lock and test huge_pte_none_mostly() again. In every FAULT_FLAG_WRITE case, this second test of huge_pte_none_mostly() was false. So, the condition changed from the check in hugetlb_fault until the check (with page table lock) in hugetlb_no_page. IIUC, the only code that should be modifying the pte in this test is hugetlb_mcopy_atomic_pte(). It also holds the hugetlb_fault_mutex while updating the pte. It 'appears' that hugetlb_fault is not seeing the updated pte and I can only guess that it is due to some caching issues. After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment. /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); I suspect that is true. However, it seems like this test depends on all CPUs seeing the updated pte immediately? I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make any difference. Suggestions would be appreciated as cache/tlb/??? flushing issues take me a while to figure out. -- Mike Kravetz