On Fri, Apr 7, 2023 at 3:40 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Fri, Apr 7, 2023 at 2:36 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > On Fri, Apr 07, 2023 at 01:26:08PM -0700, Suren Baghdasaryan wrote: > > > True. do_swap_page() has the same issue. Can we move these > > > count_vm_event() calls to the end of handle_mm_fault(): > > > > I was going to suggest something similar, so count that as an > > Acked-by. This will change the accounting for the current retry > > situation, where we drop the mmap_lock in filemap_fault(), initiate I/O > > and return RETRY. I think that's probably a good change though; I don't > > see why applications should have their fault counters incremented twice > > for that situation. > > > > > mm_account_fault(regs, address, flags, ret); > > > +out: > > > + if (ret != VM_FAULT_RETRY) { > > > + count_vm_event(PGFAULT); > > > + count_memcg_event_mm(vma->vm_mm, PGFAULT); > > > + } > > > > Nit: this is a bitmask, so we should probably have: > > > > if (!(ret & VM_FAULT_RETRY)) { > > > > in case somebody's ORed it with VM_FAULT_MAJOR or something. > > > > Hm, I wonder if we're handling that correctly; if we need to start I/O, > > do we preserve VM_FAULT_MAJOR returned from the first attempt? Not > > in a position where I can look at the code right now. > > Interesting question. IIUC mm_account_fault() is the place where > VM_FAULT_MAJOR is used to perform minor/major fault accounting. > However it has an early exit: > > /* > * We don't do accounting for some specific faults: > * > * - Unsuccessful faults (e.g. when the address wasn't valid). That > * includes arch_vma_access_permitted() failing before reaching here. > * So this is not a "this many hardware page faults" counter. We > * should use the hw profiling for that. > * > * - Incomplete faults (VM_FAULT_RETRY). They will only be counted > * once they're completed. > */ > if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY)) > return; > > So, VM_FAULT_RETRY is ignored in the accounting path. > For the current do_swap_page (the only user returning VM_FAULT_RETRY > this late) it's fine because we did not really do anything. However > with your series and with my current attempts to drop the vma lock, do > swapin_readahead() and return VM_FAULT_RETRY the situation changes. We > need to register such (VM_FAULT_MAJOR | VM_FAULT_RETRY) case as a > major fault, otherwise after retry it will be accounted as only a > minor fault. Accounting the retry as a minor fault after the original > major fault is technically also double-counting but probably not as > big of a problem as missing the major fault accounting. Actually, looks like in most cases where VM_FAULT_MAJOR is returned we also do count_vm_event(PGMAJFAULT) and count_memcg_event_mm(vma->vm_mm, PGMAJFAULT). mm_account_fault() modifies only current->{maj|min}_flt and perf_sw_event().