Re: [patch]x86: clearing access bit don't flush tlb

Rik van Riel <riel@xxxxxxxxxx> · Wed, 26 Mar 2014 19:55:51 -0400

On 03/26/2014 06:30 PM, Shaohua Li wrote:

I posted this patch a year or so ago, but it got lost. I am reposting it here
to see whether we can make progress this time.

I believe we can make progress. However, I also
believe the code could be enhanced to address a
concern that Hugh raised last time this was
proposed...

And according to the Intel manual, the TLB has fewer than 1k entries, which
covers < 4M of memory. In today's systems, several gigabytes of memory is
normal. After page reclaim clears the PTE accessed bit and before the CPU
accesses the page again, it is quite unlikely that this page's PTE is still in
the TLB. A context switch will flush the TLB too. The chance that skipping the
TLB flush affects page reclaim should be very low.
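
For reference, the coverage figure quoted above can be checked with a
few lines of C. This is only a sketch: it assumes 4 KiB pages and a
round 1024 entries, and tlb_reach_bytes() is a hypothetical helper
name, not a kernel function.

```c
#include <stddef.h>

/*
 * Hypothetical helper: bytes of address space that a TLB with
 * `entries` entries can map when every entry covers `page_size` bytes.
 */
static size_t tlb_reach_bytes(size_t entries, size_t page_size)
{
	return entries * page_size;
}
```

With 1024 entries and 4096-byte pages this gives 4 MiB, matching the
"< 4M" bound for a TLB with fewer than 1k entries.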

Context switch to a kernel thread does not result in a
TLB flush, due to the lazy TLB code.

While I agree with you that clearing the TLB right at
the moment the accessed bit is cleared in a PTE is
not necessary, I believe it would be good to clear
the TLB on affected CPUs relatively soon, maybe at the
next time schedule is called?

--- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
+++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
  int ptep_clear_flush_young(struct vm_area_struct *vma,
  			   unsigned long address, pte_t *ptep)
  {
-	int young;
-
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	/*
+	 * In X86, clearing access bit without TLB flush doesn't cause data
+	 * corruption. Doing this could cause wrong page aging and so hot pages
+	 * are reclaimed, but the chance should be very rare.
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
  }


At this point, we could use vma->vm_mm->cpu_vm_mask_var to
set (or clear) some bit in the per-cpu data of each CPU that
has active/valid tlb state for the mm in question.

I could see using cpu_tlbstate.state for this, or maybe
another variable in cpu_tlbstate, so switch_mm will load
both items with the same cache line.

At schedule time, the function switch_mm() can examine that
variable (it already touches that data, anyway), and flush
the TLB even if prev==next.

I suspect that would be both low overhead enough to get you
the performance gains you want, and address the concern that
we do want to flush the TLB at some point.
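
The scheme described above could be modeled in user space roughly as
follows. This is only a sketch under stated assumptions, not kernel
code: struct tlb_state_model, its flush_pending flag, struct mm_model,
mark_deferred_flush() and switch_mm_model() are hypothetical names
invented for illustration. The cpu_mask field stands in for
cpu_vm_mask_var, and a counter stands in for the actual TLB flush.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU cpu_tlbstate data. */
struct tlb_state_model {
	int active_mm;		/* id of the mm loaded on this CPU */
	bool flush_pending;	/* hypothetical deferred-flush flag */
	unsigned long flushes;	/* counts simulated local TLB flushes */
};

static struct tlb_state_model cpu_state[NR_CPUS];

/* Hypothetical stand-in for an mm and its cpu_vm_mask_var. */
struct mm_model {
	int id;
	unsigned long cpu_mask;	/* bit n set: CPU n may cache this mm */
};

/*
 * Would be called where ptep_clear_flush_young() used to flush:
 * mark every CPU in the mask instead of flushing right away.
 */
static void mark_deferred_flush(const struct mm_model *mm)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mm->cpu_mask & (1UL << cpu))
			cpu_state[cpu].flush_pending = true;
}

/* Models the check at schedule time, including the prev == next case. */
static void switch_mm_model(int cpu, struct mm_model *next)
{
	struct tlb_state_model *st = &cpu_state[cpu];

	if (st->active_mm != next->id) {
		/* A real mm switch reloads CR3, which flushes anyway. */
		st->active_mm = next->id;
		next->cpu_mask |= 1UL << cpu;
		st->flushes++;
		st->flush_pending = false;
	} else if (st->flush_pending) {
		/* Same mm, but an accessed bit was cleared earlier. */
		st->flushes++;
		st->flush_pending = false;
	}
}
```

In this model the cost at reclaim time is one flag write per CPU in
the mask, and each marked CPU performs at most one extra flush the
next time it passes through the switch path.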

Does that sound reasonable?
