On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +	/*
> >> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +	 * so indicate that we are scanning to synchronise with swapoff.
> >> +	 */
> >> +	si->flags += SWP_SCANNING;
> >> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +	si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>  				if (!can_split_folio(folio, NULL))
> >>  					goto activate_locked;
> >>  				/*
> >> -				 * Split folios without a PMD map right
> >> -				 * away. Chances are some or all of the
> >> -				 * tail pages can be freed without IO.
> >> +				 * Split PMD-mappable folios without a
> >> +				 * PMD map right away. Chances are some
> >> +				 * or all of the tail pages can be freed
> >> +				 * without IO.
> >>  				 */
> >> -				if (!folio_entire_mapcount(folio) &&
> >> +				if (folio_test_pmd_mappable(folio) &&
> >> +				    !folio_entire_mapcount(folio) &&
> >>  				    split_folio_to_list(folio,
> >>  							folio_list))
> >>  					goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. This might cause applications to
> > get wrong data (zeroes) while swapping in.
> >
> > In case one thread (T1) is reclaiming a large folio by some means while
> > another thread (T2) is calling madvise MADV_PAGEOUT, and at the same
> > time we have two threads T3 and T4 swapping in in parallel. T1 doesn't
> > split and T2 does split, as below.

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is
> this a separate problem to the race you solved with TTU_SYNC or is this
> solving the same problem?

They are the same.
After sending you the report about the races, I spent some time and finally
figured out what was happening and why corrupted data appeared while swapping
in. It is absolutely not your fault; TTU_SYNC does somehow resolve my problem,
though it is not the root cause. The corrupted data is only reproducible after
applying patch 4[1] of the swap-in series.

[1] [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@xxxxxxxxx/

In case we have a large folio with 16 PTEs as below: after add_to_swap() it
gets swap offset 0x10000, and its PTEs are all present as it is still mapped.

PTE	pte_stat
PTE0	present
PTE1	present
PTE2	present
PTE3	present
...
PTE15	present

Then we get to try_to_unmap_one(). As try_to_unmap_one() doesn't hold the PTL
from PTE0 while it scans the PTEs, we might have

PTE	pte_stat
PTE0	none (someone is writing PTE0 for various reasons)
PTE1	present
PTE2	present
PTE3	present
...
PTE15	present

We hold the PTL from PTE1. After try_to_unmap_one(), the PTEs become

PTE	pte_stat
PTE0	present (someone finished the write of PTE0)
PTE1	swap 0x10001
PTE2	swap 0x10002
PTE3	swap 0x10003
...
PTE15	swap 0x1000F

Thus, after try_to_unmap_one(), the large folio is still mapped, so its
swapcache entry will still be there. Now a parallel thread runs MADV_PAGEOUT;
it finds this large folio is not completely mapped, so it splits the folio
into 16 small folios but keeps their swap offsets. Now in the swapcache we
have 16 folios with contiguous swap offsets. MADV_PAGEOUT will reclaim these
16 folios; after 16 new try_to_unmap_one() calls,

PTE	pte_stat
PTE0	swap 0x10000	SWAP_HAS_CACHE
PTE1	swap 0x10001	SWAP_HAS_CACHE
PTE2	swap 0x10002	SWAP_HAS_CACHE
PTE3	swap 0x10003	SWAP_HAS_CACHE
...
PTE15	swap 0x1000F	SWAP_HAS_CACHE

From this point, we can have various different cases for these 16 PTEs.
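The first step of the race, try_to_unmap_one() skipping a transiently-none
PTE and therefore leaving the folio partially mapped, can be modeled in plain
userspace C. This is only a toy model for illustration, not kernel code; the
names folio_model and toy_try_to_unmap_one, and the state enum, are all
invented here:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES 16

enum pte_state { PTE_PRESENT, PTE_NONE, PTE_SWAP };

/* Toy model (not kernel code): each "PTE" maps one subpage of the folio. */
struct folio_model {
	enum pte_state pte[NR_PTES];
	int mapcount;	/* how many PTEs still map the folio */
};

/*
 * Toy try_to_unmap_one(): walk the PTEs, skip any that are not present
 * (e.g. transiently none because another thread is mid break-before-make),
 * and convert the rest to swap entries.  Returns the remaining mapcount.
 */
static int toy_try_to_unmap_one(struct folio_model *f)
{
	for (int i = 0; i < NR_PTES; i++) {
		if (f->pte[i] != PTE_PRESENT)
			continue;	/* skipped: this PTE stays mapped */
		f->pte[i] = PTE_SWAP;
		f->mapcount--;
	}
	return f->mapcount;
}
```

With PTE0 transiently none, the toy unmap leaves mapcount at 1 rather than 0,
which is the condition that lets MADV_PAGEOUT later see a partially mapped
large folio and split it.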
For example,

PTE	pte_stat
PTE0	swap 0x10000	SWAP_HAS_CACHE = 0 -> became false after pageout finished and remove_mapping
PTE1	swap 0x10001	SWAP_HAS_CACHE = 0 -> became false after pageout finished and remove_mapping
PTE2	swap 0x10002	SWAP_HAS_CACHE = 0 -> became false due to concurrent swap-in and swap-out
PTE3	swap 0x10003	SWAP_HAS_CACHE = 1
...
PTE13	swap 0x1000D	SWAP_HAS_CACHE = 1
PTE14	swap 0x1000E	SWAP_HAS_CACHE = 1
PTE15	swap 0x1000F	SWAP_HAS_CACHE = 1

All of them have swap_count = 1 but different SWAP_HAS_CACHE; some of these
small folios might be in the swapcache, others might not. Then we run
do_swap_page() at one PTE whose SWAP_HAS_CACHE = 0 and swap_count = 1 (the
folio is not in the swapcache, thus has been written to swap), and we do this
check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
	int i;
	swp_entry_t entry;
	unsigned type;
	pgoff_t start_offset;

	entry = pte_to_swp_entry(ptep_get_lockless(pte));
	if (non_swap_entry(entry))
		return false;
	start_offset = swp_offset(entry);
	if (start_offset % nr_pages)
		return false;

	type = swp_type(entry);
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
	}

	return true;
}

As those swap entries are contiguous, we will call swap_read_folio(). For
those folios which are still in the swapcache and haven't been written back,
we get zero-filled data from zRAM.

So the root cause is that pte_range_swap() should also check that all 16
swap_map entries have the same SWAP_HAS_CACHE, i.e. all false:

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
	...
	count = si->swap_map[start_offset];
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
		/* fall back to small folios if SWAP_HAS_CACHE isn't the same */
		if (si->swap_map[start_offset + i] != count)
			return false;
	}

	return true;
}

Somehow, though, TTU_SYNC "resolves" it by giving MADV_PAGEOUT no chance to
split this folio, as the large folio's PTEs are either entirely converted to
swap entries or entirely kept present.

Though the bug is within the swap-in series, I am still a big fan of TTU_SYNC
for large folio reclamation, for at least three reasons:

1. We remove some possibility of large folios failing to be reclaimed,
   improving reclamation efficiency.

2. We avoid many strange cases and potential folio splits during reclamation.
   Without TTU_SYNC, folios can be split later, or have some PTEs converted
   to swap entries while others are still present.

3. We don't increase PTL contention. My test shows try_to_unmap_one() will
   always get the PTL after it sometimes skips one or two PTEs, because the
   intermediate break-before-makes are short. Of course, most of the time
   try_to_unmap_one() will get the PTL from PTE0.

> Thanks,
> Ryan

Thanks
Barry
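To make the difference between the two checks concrete, here is a userspace
toy model contrasting an offsets-only contiguity check with the fixed check
that also requires identical swap_map bytes. Again, this is only a sketch:
the struct and function names are invented, and only SWAP_HAS_CACHE's value
(0x40) mirrors include/linux/swap.h:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES 16
#define SWAP_HAS_CACHE 0x40	/* mirrors the kernel's swap_map flag */

/* Toy swap state: the offset in each PTE's entry, plus swap_map bytes. */
struct toy_swap {
	unsigned long offset[NR_PTES];	/* swap offset in each PTE */
	unsigned char swap_map[NR_PTES];	/* count | SWAP_HAS_CACHE */
};

/* Buggy check: only verifies the swap offsets are contiguous. */
static bool contig_only(const struct toy_swap *s)
{
	for (int i = 1; i < NR_PTES; i++)
		if (s->offset[i] != s->offset[0] + i)
			return false;
	return true;
}

/* Fixed check: additionally requires every swap_map byte to match slot 0. */
static bool contig_and_same_swap_map(const struct toy_swap *s)
{
	if (!contig_only(s))
		return false;
	for (int i = 1; i < NR_PTES; i++)
		if (s->swap_map[i] != s->swap_map[0])
			return false;
	return true;
}
```

In the racy state above, the offsets are contiguous but some slots still have
SWAP_HAS_CACHE set, so contig_only() wrongly accepts the range while
contig_and_same_swap_map() rejects it and forces the small-folio fallback.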