Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()

On 2024/11/19 17:55, David Hildenbrand wrote:
On 18.11.24 12:13, Qi Zheng wrote:


On 2024/11/18 18:59, David Hildenbrand wrote:
On 18.11.24 11:56, Qi Zheng wrote:


On 2024/11/18 18:41, David Hildenbrand wrote:
On 18.11.24 11:34, Qi Zheng wrote:


On 2024/11/18 17:29, David Hildenbrand wrote:
On 18.11.24 04:35, Qi Zheng wrote:


On 2024/11/15 22:59, David Hildenbrand wrote:
On 15.11.24 15:41, Qi Zheng wrote:


On 2024/11/15 18:22, David Hildenbrand wrote:
*nr_skip = nr;

and then:

zap_pte_range
--> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
                          &nr_skip, rss, &force_flush, &force_break);
    if (can_reclaim_pt) {
        none_nr += count_pte_none(pte, nr);
        none_nr += nr_skip;
    }

Right?

Yes. I did not look closely at the patch that adds the counting of
pte_none though (to digest why it is required :) ).

Got it.

Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect an
empty PTE page.

Okay, so the problem is that "nr" would be "all processed entries" but
there are cases where we "process an entry but not zap it".

What you really only want to know is "was any entry not zapped", which
could be a simple boolean variable passed into do_zap_pte_range?

Because as soon as any entry was processed but not zapped, you can
immediately give up on reclaiming that table.

Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
found in count_pte_none().

I'm not sure if we'll need count_pte_none(), but I'll have to take a
look at your new patch to see how this fits together with doing the
pte_none detection+skipping in do_zap_pte_range().

I was wondering if you could simply avoid the additional scanning and
just clear "can_reclaim_pt" when you skip a zap.

Maybe we can return the information whether the zap was skipped from
zap_present_ptes() and zap_nonpresent_ptes() through parameters, like I
did in [PATCH v1 3/7] and [PATCH v1 4/7].

In theory, we can detect empty PTE pages in the following two ways:

1) If no zap is skipped, it means that all pte entries have been
   zapped, and the PTE page must be empty.
2) If all pte entries are detected to be none, then the PTE page is
   empty.

In the error case, 1) may cause non-empty PTE pages to be reclaimed
(which is unacceptable), while 2) will at most cause empty PTE pages
to not be reclaimed.

So the most reliable and efficient method may be:

a. If there is a zap that is skipped, stop scanning and do not reclaim
   the PTE page;
b. Otherwise, as now, detect the empty PTE page through
   count_pte_none().

Is there a need for count_pte_none() that I am missing?

When any_skipped == false, we could at least add a VM_BUG_ON() to
recheck that the ptes are none.


Assume we have

nr = do_zap_pte_range(&any_skipped)


If "nr" is the number of processed entries (including pte_none()), and
"any_skipped" is set whenever we skipped zapping a !pte_none() entry, we
can detect what we need, no?

If any_skipped == false after the call, we now have "nr" pte_none()
entries. -> We can continue trying to reclaim.

I prefer that "nr" should not include pte_none().


Why? do_zap_pte_range() should tell you how far to advance, nothing
less, nothing more.

Let's just keep it simple and avoid count_pte_none().

I'm probably missing something important?

As we discussed before, we should skip all consecutive none ptes;
pte and addr are already incremented before returning.

It's probably best to send the resulting patch so I can either
understand why count_pte_none() is required or comment on how to get rid
of it.

Something like this:

diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..e9bec3cd49d44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
          return nr;
   }

+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+                                  struct vm_area_struct *vma, pte_t *pte,
+                                  unsigned long addr, unsigned long end,
+                                  struct zap_details *details, int *rss,
+                                  bool *force_flush, bool *force_break,
+                                  bool *any_skipped)
+{
+       pte_t ptent = ptep_get(pte);
+       int max_nr = (end - addr) / PAGE_SIZE;

I'd do here:

       int nr = 0;

+
+       /* Skip all consecutive pte_none(). */
+       if (pte_none(ptent)) {
+

instead of the "int nr" here

+               for (nr = 1; nr < max_nr; nr++) {
+                       ptent = ptep_get(pte + nr);
+                       if (!pte_none(ptent))
+                               break;
+               }
+               max_nr -= nr;
+               if (!max_nr)
+                       return 0;
+               pte += nr;
+               addr += nr * PAGE_SIZE;
+       }
+
+       if (pte_present(ptent))
+               return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+                                       addr, details, rss, force_flush,
+                                       force_break, any_skipped);

and here:

if (pte_present(ptent))
     nr += zap_present_ptes(...)
else
     nr += zap_nonpresent_ptes();
return nr;

So you really return the number of processed items -- how much the caller should advance.

Got it, please ignore my previous stupid considerations. ;)


+
+       return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+                                  details, rss, any_skipped);
+}
+
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+       int none_nr = 0;
+
+       /*
+        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+        * re-installed, so we need to check pte_none() one by one.
+        * Otherwise, checking a single PTE in a batch is sufficient.
+        */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+       for (;;) {
+               if (pte_none(ptep_get(pte)))
+                       none_nr++;
+               if (--nr == 0)
+                       break;
+               pte++;
+       }
+#else
+       if (pte_none(ptep_get(pte)))
+               none_nr = nr;
+#endif
+       return none_nr;
+}
+
+
   static unsigned long zap_pte_range(struct mmu_gather *tlb,
                                  struct vm_area_struct *vma, pmd_t *pmd,
                                  unsigned long addr, unsigned long end,
@@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
          int rss[NR_MM_COUNTERS];
          spinlock_t *ptl;
          pte_t *start_pte;
+       bool can_reclaim_pt;
          pte_t *pte;
          int nr;

@@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
          flush_tlb_batched_pending(mm);
          arch_enter_lazy_mmu_mode();
          do {
-               pte_t ptent = ptep_get(pte);
-               int max_nr;
-
-               nr = 1;
-               if (pte_none(ptent))
-                       continue;
+               bool any_skipped;

                  if (need_resched())
                          break;

-               max_nr = (end - addr) / PAGE_SIZE;
-               if (pte_present(ptent)) {
-                       nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
-                                             addr, details, rss, &force_flush,
-                                             &force_break);
-                       if (unlikely(force_break)) {
-                               addr += nr * PAGE_SIZE;
-                               break;
-                       }
-               } else {
-                       nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
-                                                addr, details, rss);
+               nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+                                     rss, &force_flush, &force_break,
+                                     &any_skipped);
+               if (can_reclaim_pt) {
+                       VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) != nr);

Okay, so this is really only for debugging then? So we can safely drop that for now.

I assume it would make sense to check when reclaiming a page table with CONFIG_DEBUG_VM, that the table is actually all-pte_none instead?

Ah, makes sense. Will change to that in the next version.


(as a side note: no VM_BUG_ON, please :) )

Got it.

Thanks!





