On Wed, Jul 08, 2020 at 02:54:32AM +0800, Yang Shi wrote:
> Recently we found regression when running will_it_scale/page_fault3 test
> on ARM64. Over 70% down for the multi processes cases and over 20% down
> for the multi threads cases. It turns out the regression is caused by commit
> 89b15332af7c0312a41e50846819ca6613b58b4c ("mm: drop mmap_sem before
> calling balance_dirty_pages() in write fault").
>
> The test mmaps a memory size file then write to the mapping, this would
> make all memory dirty and trigger dirty pages throttle, that upstream
> commit would release mmap_sem then retry the page fault. The retried
> page fault would see correct PTEs installed by the first try then update
> access flags and flush TLBs. The regression is caused by the excessive
> TLB flush. It is fine on x86 since x86 doesn't need flush TLB for
> access flag update.
>
> The page fault would be retried due to:
> 1. Waiting for page readahead
> 2. Waiting for page swapped in
> 3. Waiting for dirty pages throttling
>
> The first two cases don't have PTEs set up at all, so the retried page
> fault would install the PTEs, so they don't reach there. But the #3
> case usually has PTEs installed, the retried page fault would reach the
> access flag update. But it seems not necessary to update access flags
> for #3 since retried page fault is not real "second access", so it
> sounds safe to skip access flag update for retried page fault.
>
> With this fix the test result get back to normal.
>
> Reported-by: Xu Yu <xuyu@xxxxxxxxxxxxxxxxx>
> Debugged-by: Xu Yu <xuyu@xxxxxxxxxxxxxxxxx>
> Tested-by: Xu Yu <xuyu@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
> ---
> I'm not sure if this is safe for non-x86 machines, we did some tests on arm64, but
> there may be still corner cases not covered.
>
>  mm/memory.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87ec87c..3d4e671 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,8 +4241,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
>  		if (!pte_write(entry))
>  			return do_wp_page(vmf);
> -		entry = pte_mkdirty(entry);
>  	}
> +
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vmf->flags & FAULT_FLAG_TRIED))
> +		entry = pte_mkdirty(entry);
> +	else if (vmf->flags & FAULT_FLAG_TRIED)
> +		goto unlock;
> +

Can you rewrite this as:

	if (vmf->flags & FAULT_FLAG_TRIED)
		goto unlock;

	if (vmf->flags & FAULT_FLAG_WRITE)
		entry = pte_mkdirty(entry);

? (I'm half-asleep this morning and there are people screaming and shouting
outside my window, so this might be rubbish)

If you _can_ make that change, then I don't understand why the existing
pte_mkdirty() line needs to move at all. Couldn't you just add:

	if (vmf->flags & FAULT_FLAG_TRIED)
		goto unlock;

after the existing "vmf->flags & FAULT_FLAG_WRITE" block?

Will
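
For what it's worth, the three forms discussed above (the hunk as posted, the FAULT_FLAG_TRIED-first rewrite, and the unmoved pte_mkdirty() line followed by the FAULT_FLAG_TRIED check) can be compared with a small userspace model. This is only a sketch under stated assumptions: the flag bit values are placeholders for the model, the PTE is assumed to be writable so do_wp_page() is never taken, and `entry` is assumed to be a local whose value is simply dropped once control reaches `unlock`. The enum and function names are hypothetical and do not exist in mm/memory.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this model only (assumption). */
#define FAULT_FLAG_WRITE 0x01u
#define FAULT_FLAG_TRIED 0x20u

/* What reaches the access-flag update at the end of the function. */
enum outcome { SKIP_UPDATE, UPDATE_YOUNG, UPDATE_YOUNG_DIRTY };

/* The hunk as posted: combined condition, then else-if on TRIED. */
enum outcome as_posted(unsigned int flags)
{
	bool dirty = false;

	if ((flags & FAULT_FLAG_WRITE) && !(flags & FAULT_FLAG_TRIED))
		dirty = true;
	else if (flags & FAULT_FLAG_TRIED)
		return SKIP_UPDATE;		/* goto unlock */
	return dirty ? UPDATE_YOUNG_DIRTY : UPDATE_YOUNG;
}

/* First suggestion: test TRIED first, then WRITE. */
enum outcome tried_first(unsigned int flags)
{
	bool dirty = false;

	if (flags & FAULT_FLAG_TRIED)
		return SKIP_UPDATE;		/* goto unlock */
	if (flags & FAULT_FLAG_WRITE)
		dirty = true;
	return dirty ? UPDATE_YOUNG_DIRTY : UPDATE_YOUNG;
}

/* Second suggestion: leave pte_mkdirty() in place, add the TRIED check after. */
enum outcome tried_after_write_block(unsigned int flags)
{
	bool dirty = false;

	if (flags & FAULT_FLAG_WRITE)
		dirty = true;			/* existing line, not moved */
	if (flags & FAULT_FLAG_TRIED)
		return SKIP_UPDATE;		/* local entry is discarded */
	return dirty ? UPDATE_YOUNG_DIRTY : UPDATE_YOUNG;
}

/* Walk all four WRITE/TRIED combinations and require identical outcomes. */
void check_variants(void)
{
	const unsigned int combos[] = {
		0u,
		FAULT_FLAG_WRITE,
		FAULT_FLAG_TRIED,
		FAULT_FLAG_WRITE | FAULT_FLAG_TRIED,
	};

	for (unsigned int i = 0; i < sizeof(combos) / sizeof(combos[0]); i++) {
		assert(as_posted(combos[i]) == tried_first(combos[i]));
		assert(as_posted(combos[i]) == tried_after_write_block(combos[i]));
	}
}
```

Calling check_variants() from a test harness passes only if the three forms agree for every combination. Within this model they do, which fits the suggestion that the pte_mkdirty() line may not need to move. The model says nothing about whether skipping the update is safe on a given architecture.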