On Mon, Apr 17, 2023 at 2:14 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Apr 17, 2023 at 01:29:45PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Apr 17, 2023 at 12:40 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 05:08:18PM -0700, Suren Baghdasaryan wrote:
> > > > If the page fault handler requests a retry, we will count the fault
> > > > multiple times. This is a relatively harmless problem as the retry paths
> > > > are not often requested, and the only user-visible problem is that the
> > > > fault counter will be slightly higher than it should be. Nevertheless,
> > > > userspace only took one fault, and should not see the fact that the
> > > > kernel had to retry the fault multiple times.
> > > > Move page fault accounting into mm_account_fault() and skip incomplete
> > > > faults which will be accounted upon completion.
> > > >
> > > > Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
> > > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > > ---
> > > >  mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
> > > >  1 file changed, 26 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 01a23ad48a04..c3b709ceeed7 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -5080,24 +5080,30 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > > >   * updates. However, note that the handling of PERF_COUNT_SW_PAGE_FAULTS should
> > > >   * still be in per-arch page fault handlers at the entry of page fault.
> > > >   */
> > > > -static inline void mm_account_fault(struct pt_regs *regs,
> > > > +static inline void mm_account_fault(struct mm_struct *mm, struct pt_regs *regs,
> > > >                                      unsigned long address, unsigned int flags,
> > > >                                      vm_fault_t ret)
> > > >  {
> > > >         bool major;
> > > >
> > > >         /*
> > > > -        * We don't do accounting for some specific faults:
> > > > -        *
> > > > -        * - Unsuccessful faults (e.g. when the address wasn't valid). That
> > > > -        *   includes arch_vma_access_permitted() failing before reaching here.
> > > > -        *   So this is not a "this many hardware page faults" counter. We
> > > > -        *   should use the hw profiling for that.
> > > > -        *
> > > > -        * - Incomplete faults (VM_FAULT_RETRY). They will only be counted
> > > > -        *   once they're completed.
> > > > +        * Do not account for incomplete faults (VM_FAULT_RETRY). They will be
> > > > +        * counted upon completion.
> > > >          */
> > > > -       if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
> > > > +       if (ret & VM_FAULT_RETRY)
> > > > +               return;
> > > > +
> > > > +       /* Register both successful and failed faults in PGFAULT counters. */
>
> [1]
>
> > > > +       count_vm_event(PGFAULT);
> > > > +       count_memcg_event_mm(mm, PGFAULT);
> > >
> > > Is there reason on why vm events accountings need to be explicitly
> > > different from perf events right below on handling ERROR?
> > >
> > > I get the point if this is to make sure ERROR accountings untouched for
> > > these two vm events after this patch. IOW probably the only concern right
> > > now is having RETRY counted much more than before (perhaps worse with vma
> > > locking applied).
> > >
> > > But since we're on this, I'm wondering whether we should also align the two
> > > events (vm, perf) so they represent in an aligned manner if we'll change it
> > > anyway. Any future reader will be confused on why they account
> > > differently, IMHO, so if we need to differenciate we'd better add a comment
> > > on why.
> > >
> > > I'm wildly guessing the error faults are indeed very rare and probably not
> > > matter much at all. I just think the code can be slightly cleaner if
> > > vm/perf accountings match and easier if we treat everything the same. E.g.,
> > > we can also drop the below "goto out"s too. What do you think?
> >
> > I think the rationale might be that vm accounting should account for
> > *all* events, including failing page faults while for perf, the corner
> > cases which rarely happen would not have tangible effect.
>
> Note that it's not only for perf, but also task_struct.maj_flt|min_flt.
>
> If we check the reasoning of "why ERROR shouldn't be accounted for perf
> events", there're actually something valid in the comment:
>
>          * - Unsuccessful faults (e.g. when the address wasn't valid). That
>          *   includes arch_vma_access_permitted() failing before reaching here.
>          *   So this is not a "this many hardware page faults" counter. We
>          *   should use the hw profiling for that.
>
> IMHO it suggests that if someone wants to trap either ERROR or RETRY one
> can use the hardware counters instead. The same reasoning just sounds
> applicable to vm events too, because vm events are not special in this case
> to me.
>
> > I don't have a strong position on this issue and kept it as is to
> > avoid changing the current accounting approach. If we are fine with
> > such consolidation which would miss failing faults in vm accounting, I
> > can make the change.
>
> I don't have a strong opinion either. We used to change this path before
> for perf events and task events and no one complains with the change. I'd
> just bet the same to vm events:
>
> https://lore.kernel.org/all/20200707225021.200906-1-peterx@xxxxxxxxxx/

Ok, if these rare failures don't change anything then let's
consolidate the code. It should simplify things a bit and will account
faults in a consistent way. I'll post v3 shortly incorporating your
and Matthew's feedbacks.
Thanks for the input!

>
> Please feel free to keep it as-is if you still feel unsafe on changing
> ERROR handling. If so, would you mind slightly modify [1] above explaining
> why we need ERROR to be accounted for vm accountings with the reasoning?
> Current comment only says "what it does" rather than why.
>
> Thanks,
>
> --
> Peter Xu
>
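
For illustration, the consolidation agreed on above (one early return
that skips both failed and retried faults for every counter) could be
modelled in userspace roughly as below. This is only a sketch and not
the v3 patch: the MODEL_FAULT_* values, struct fault_counters and
model_account_fault() are hypothetical stand-ins for the kernel's
VM_FAULT_* bits, the PGFAULT vm event and task_struct.maj_flt|min_flt,
and the major/minor decision is simplified to a single result bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result bits for this model; the kernel's real
 * VM_FAULT_* values are different. */
#define MODEL_FAULT_ERROR 0x1u
#define MODEL_FAULT_MAJOR 0x2u
#define MODEL_FAULT_RETRY 0x4u

/* Hypothetical stand-in for the counters discussed in the thread. */
struct fault_counters {
	unsigned long pgfault;	/* models the PGFAULT vm event */
	unsigned long maj_flt;	/* models task_struct.maj_flt */
	unsigned long min_flt;	/* models task_struct.min_flt */
};

/*
 * Account one fault result. Failed and incomplete (retry) results are
 * skipped for all counters alike, so vm events and the per-task
 * counters stay aligned and a retried fault is counted once, when it
 * finally completes.
 */
static void model_account_fault(struct fault_counters *c, unsigned int ret)
{
	bool major;

	assert(c != NULL);

	if (ret & (MODEL_FAULT_ERROR | MODEL_FAULT_RETRY))
		return;

	c->pgfault++;

	/* Simplification: only the result bit decides major vs. minor. */
	major = (ret & MODEL_FAULT_MAJOR) != 0;
	if (major)
		c->maj_flt++;
	else
		c->min_flt++;
}
```

With this shape a fault that is retried once and then completes
contributes a single increment, and a failing fault contributes none,
which is the behaviour the per-task and perf counters already have in
the quoted comment.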