Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile

On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote:
> > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > > -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > > -                                             vma, vmf->address, false);
> > > > > +                     folio = alloc_swap_folio(vmf);
> > > > >                       page = &folio->page;
> > > >
> > > > This is no longer correct.  You need to set 'page' to the precise page
> > > > that is being faulted rather than the first page of the folio.  It was
> > > > fine before because it always allocated a single-page folio, but now it
> > > > must use folio_page() or folio_file_page() (whichever has the correct
> > > > semantics for you).
> > > >
> > > > Also you need to fix your test suite to notice this bug.  I suggest
> > > > doing that first so that you know whether you've got the calculation
> > > > correct.
> > >
> > > I don't understand why the code is designed in the way the page
> > > is the first page of this folio. Otherwise, we need lots of changes
> > > later while mapping the folio in ptes and rmap.
>
> What?
>
>         folio = swap_cache_get_folio(entry, vma, vmf->address);
>         if (folio)
>                 page = folio_file_page(folio, swp_offset(entry));
>
> page is the precise page, not the first page of the folio.

This is the case where we may get a large folio from the swapcache but
end up mapping only one subpage, because the conditions for mapping the
whole folio are not met. If the conditions are met, we set page to the
head page instead and map the whole mTHP:

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                int nr = folio_nr_pages(folio);
                unsigned long idx = folio_page_idx(folio, page);
                unsigned long folio_start = address - idx * PAGE_SIZE;
                unsigned long folio_end = folio_start + nr * PAGE_SIZE;
                pte_t *folio_ptep;
                pte_t folio_pte;

                if (unlikely(folio_start < max(address & PMD_MASK,
                                               vma->vm_start)))
                        goto check_folio;
                if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
                        goto check_folio;

                folio_ptep = vmf->pte - idx;
                folio_pte = ptep_get(folio_ptep);
                if (!pte_same(folio_pte,
                              pte_move_swp_offset(vmf->orig_pte, -idx)) ||
                    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
                        goto check_folio;

                page_idx = idx;
                address = folio_start;
                ptep = folio_ptep;
                nr_pages = nr;
                entry = folio->swap;
                page = &folio->page;
        }
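
To make the arithmetic above concrete, here is a minimal userspace
sketch (not kernel code: the PAGE_SIZE/PMD_SIZE values and the
whole_folio_fits() helper are made up for illustration) of how the
folio head address is derived from the faulting address and then
checked against the VMA and PMD boundaries, the same checks the quoted
block performs with max() and pmd_addr_end():

#include <stdio.h>
#include <stdbool.h>

/* illustrative values: 4KiB pages, 2MiB PMDs */
#define PAGE_SIZE       4096UL
#define PMD_SIZE        (512 * PAGE_SIZE)
#define PMD_MASK        (~(PMD_SIZE - 1))

/*
 * Place the folio head at (address - idx * PAGE_SIZE) so the faulted
 * page still lands at the faulting address, then check that the whole
 * folio lies inside both the VMA and the current PMD.
 */
static bool whole_folio_fits(unsigned long address, unsigned long idx,
                             unsigned long nr, unsigned long vm_start,
                             unsigned long vm_end)
{
        unsigned long folio_start = address - idx * PAGE_SIZE;
        unsigned long folio_end = folio_start + nr * PAGE_SIZE;
        unsigned long pmd_start = address & PMD_MASK;
        unsigned long pmd_end = pmd_start + PMD_SIZE;

        if (folio_start < (pmd_start > vm_start ? pmd_start : vm_start))
                return false;
        if (folio_end > (pmd_end < vm_end ? pmd_end : vm_end))
                return false;
        return true;
}

int main(void)
{
        /* fault on the 3rd page of a 16-page folio whose head maps at 0x200000 */
        unsigned long address = 0x200000UL + 2 * PAGE_SIZE;

        printf("%d\n", whole_folio_fits(address, 2, 16,
                                        0x200000UL, 0x400000UL)); /* prints 1 */
        return 0;
}

Only when both checks pass does the fault handler retarget address,
ptep and nr_pages to cover the whole folio.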

>
> > For both accessing large folios in the swapcache and allocating
> > new large folios, the page points to the first page of the folio. we
> > are mapping the whole folio not the specific page.
>
> But what address are we mapping the whole folio at?
>
> > for swapcache cases, you can find the same thing here,
> >
> >         if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >                 ...
> >                 entry = folio->swap;
> >                 page = &folio->page;
> >         }
>
> Yes, but you missed some important lines from your quote:
>
>                 page_idx = idx;
>                 address = folio_start;
>                 ptep = folio_ptep;
>                 nr_pages = nr;
>
> We deliberately adjust the address so that, yes, we're mapping the entire
> folio, but we're mapping it at an address that means that the page we
> actually faulted on ends up at the address that we faulted on.

For this zRAM case, it is a newly allocated large folio; only when all
the conditions are met will we allocate and map the whole folio. You
can check can_swapin_thp() and thp_swap_suitable_orders():

static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
{
        struct swap_info_struct *si;
        unsigned long addr;
        swp_entry_t entry;
        pgoff_t offset;
        char has_cache;
        int idx, i;
        pte_t pte;

        addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
        idx = (vmf->address - addr) / PAGE_SIZE;
        pte = ptep_get(ptep);

        if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
                return false;
        entry = pte_to_swp_entry(pte);
        offset = swp_offset(entry);
        if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
                return false;

        si = swp_swap_info(entry);
        has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
        for (i = 1; i < nr_pages; i++) {
                /*
                 * When allocating a large folio and doing swap_read_folio()
                 * via the SWP_SYNCHRONOUS_IO path, the faulted PTE has no
                 * swapcache. We need to ensure that none of the other PTEs
                 * has swapcache either; otherwise, we might read from the
                 * swap device while the content is in the swapcache.
                 */
                if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
                        return false;
        }

        return true;
}
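
The loop over swap_map above is the key SWP_SYNCHRONOUS_IO safety
check. A small userspace sketch of the same scan (the
swap_map_cache_consistent() helper and the sample map are hypothetical;
the SWAP_HAS_CACHE value matches include/linux/swap.h):

#include <stdio.h>
#include <stdbool.h>

#define SWAP_HAS_CACHE  0x40    /* value borrowed from linux/swap.h */

/*
 * All nr entries starting at 'offset' must agree with the first one on
 * whether the content is already in the swapcache; otherwise a direct
 * read from the device could bypass (or race with) cached data.
 */
static bool swap_map_cache_consistent(const unsigned char *swap_map,
                                      unsigned long offset, int nr)
{
        unsigned char has_cache = swap_map[offset] & SWAP_HAS_CACHE;
        int i;

        for (i = 1; i < nr; i++)
                if ((swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
                        return false;
        return true;
}

int main(void)
{
        unsigned char map[8] = { 1, 1, 1, 1 | SWAP_HAS_CACHE, 1, 1, 1, 1 };

        printf("%d\n", swap_map_cache_consistent(map, 0, 4)); /* 0: mixed */
        printf("%d\n", swap_map_cache_consistent(map, 4, 4)); /* 1: uniform */
        return 0;
}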

and

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
        ....
        entry = pte_to_swp_entry(vmf->orig_pte);
        /*
         * Get a list of all the (large) orders below PMD_ORDER that are enabled
         * and suitable for swapping THP.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
                        BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
        orders = thp_swap_suitable_orders(swp_offset(entry),
                                          vmf->address, orders);
....
}

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
                unsigned long addr, unsigned long orders)
{
        int order, nr;

        order = highest_order(orders);

        /*
         * To swap in a THP with nr pages, we require that its first
         * swap_offset is aligned with nr. This filters out most
         * invalid entries.
         */
        while (orders) {
                nr = 1 << order;
                if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
                        break;
                order = next_order(&orders, order);
        }

        return orders;
}
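
A tiny standalone sketch of that congruence filter (order_is_suitable()
is a hypothetical helper, not the kernel function) shows why only
naturally aligned swap-ins survive:

#include <stdio.h>

#define PAGE_SHIFT 12

/*
 * An order-'order' swap-in is only attempted when the faulting virtual
 * page number and the swap offset are congruent modulo the folio size,
 * i.e. the cluster was written out naturally aligned.
 */
static int order_is_suitable(unsigned long addr, unsigned long swp_offset,
                             int order)
{
        unsigned long nr = 1UL << order;

        return ((addr >> PAGE_SHIFT) % nr) == (swp_offset % nr);
}

int main(void)
{
        /* vpn 512 and offset 1024 are both 0 mod 16: order 4 passes */
        printf("%d\n", order_is_suitable(512UL << PAGE_SHIFT, 1024, 4));
        /* offset 1027 is 3 mod 16 while the vpn is 0 mod 16: filtered out */
        printf("%d\n", order_is_suitable(512UL << PAGE_SHIFT, 1027, 4));
        return 0;
}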


An mTHP is swapped out at an aligned swap offset, and we only swap in
aligned mTHPs. If an mTHP has somehow been mremap()ed to an unaligned
address, we won't swap it in as a large folio. For the swapcache case
we still handle unaligned mTHPs, but for a newly allocated mTHP it is
a different story: there is no need to support unaligned mTHPs, and no
way to support them unless something is recorded in the swap device to
say that an mTHP was there.

Thanks
Barry