On Tue, Jul 30, 2024 at 8:03 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> > > for this zRAM case, it is a new allocated large folio, only
> > > while all conditions are met, we will allocate and map
> > > the whole folio. you can check can_swapin_thp() and
> > > thp_swap_suitable_orders().
> >
> > YOU ARE DOING THIS WRONGLY!
> >
> > All of you anonymous memory people are utterly fixated on TLBs AND THIS
> > IS WRONG. Yes, TLB performance is important, particularly with crappy
> > ARM designs, which I know a lot of you are paid to work on. But you
> > seem to think this is the only consideration, and you're making bad
> > design choices as a result. It's overly complicated, and you're leaving
> > performance on the table.
> >
> > Look back at the results Ryan showed in the early days of working on
> > large anonymous folios. Half of the performance win on his system came
> > from using larger TLBs. But the other half came from _reduced software
> > overhead_. The LRU lock is a huge problem, and using large folios cuts
> > the length of the LRU list, hence LRU lock hold time.
> >
> > Your _own_ data on how hard it is to get hold of a large folio due to
> > fragmentation should be enough to convince you that the more large folios
> > in the system, the better the whole system runs. We should not decline to
> > allocate large folios just because they can't be mapped with a single TLB!
>
> I am not convinced. for a new allocated large folio, even alloc_anon_folio()
> of do_anonymous_page() does the exactly same thing
>
> alloc_anon_folio()
> {
>	/*
>	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>	 * for this vma. Then filter out the orders that can't be allocated over
>	 * the faulting address and still be fully contained in the vma.
>	 */
>	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
> }
>
> you are not going to allocate a mTHP for an unaligned address for a new
> PF.
>

Please point out where it is wrong.

Let's assume we have a folio whose virtual address range is
0x500000000000 ~ 0x500000000000 + 64KB, and it is swapped out to
swap offsets 0x10000 ~ 0x10000 + 64KB.

The current code will swap it in as an mTHP if a page fault occurs at any
address within (0x500000000000 ~ 0x500000000000 + 64KB). In this case, the
mTHP enjoys both reduced TLB pressure and reduced software overhead such as
the LRU lock, so we lose nothing.

But if the folio is mremap-ed to an unaligned address like
(0x600000000000 + 16KB ~ 0x600000000000 + 80KB) while its swap offsets stay
(0x10000 ~ 0x10000 + 64KB), the current code won't swap it in as an mTHP.
Sounds like a loss?

If this is the performance problem you are trying to address, my point is
that it is not worth increasing the complexity at this stage, though it
might be doable. We once tracked hundreds of phones running apps randomly
for a couple of days and never encountered such a case, so this is pretty
much a corner case.

If your concern is more than this, for example, if you want to swap in
large folios even when the swap slots are completely non-contiguous, that
is a different story.
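To make the alignment condition above concrete, here is a minimal userspace
sketch. It is my illustration, not the kernel code: order_suitable() and the
page-granular swap offset are hypothetical stand-ins modelling the kind of
congruence check thp_swap_suitable_orders() relies on, namely that an order
is only kept when the virtual page number and the swap offset agree modulo
the folio size, so a naturally aligned virtual range maps onto an equally
aligned, contiguous run of swap slots.

	/* Userspace model only; addresses follow the example above. */
	#include <stdbool.h>
	#include <stdio.h>

	#define PAGE_SHIFT	12

	static bool order_suitable(unsigned long addr, unsigned long swp_off,
				   int order)
	{
		unsigned long nr = 1UL << order;

		/* virtual page number and swap offset must match modulo nr */
		return ((addr >> PAGE_SHIFT) & (nr - 1)) == (swp_off & (nr - 1));
	}

	int main(void)
	{
		int order = 4;			/* 16 pages = 64KB with 4KB pages */
		unsigned long swp_off = 0x10;	/* hypothetical 16-page-aligned swap offset */
		unsigned long aligned = 0x500000000000UL;		/* original mapping */
		unsigned long moved = 0x600000000000UL + 0x4000;	/* after mremap(), +16KB */

		printf("aligned mapping:   order-%d swap-in possible? %d\n",
		       order, order_suitable(aligned, swp_off, order));
		printf("mremap-ed mapping: order-%d swap-in possible? %d\n",
		       order, order_suitable(moved, swp_off, order));
		return 0;
	}

With 4KB pages, the aligned mapping passes for order-4 while the mremap-ed
one fails, which is exactly the case where the current code falls back to
order-0 swap-in.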
I agree this is a potential optimization direction to pursue, but in that
case you still need to find an aligned boundary to handle the page fault,
just like do_anonymous_page() does. Otherwise you end up with all kinds of
pointless intersections where one PF can cover the address ranges of other
PFs, making PTE checks such as pte_range_none() completely disordered:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	....
	/*
	 * Find the highest order where the aligned range is completely
	 * pte_none(). Note that all remaining orders will be completely
	 * pte_none().
	 */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		if (pte_range_none(pte + pte_index(addr), 1 << order))
			break;
		order = next_order(&orders, order);
	}
}

>
> Thanks
> Barry
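As a postscript, a small userspace sketch of the aligned-boundary scan
quoted above; again my illustration rather than kernel code, with
ALIGN_DOWN() re-defined locally for power-of-two sizes and a hypothetical
fault address. For one faulting address it prints the candidate range each
order would probe. Because every candidate is naturally aligned to its own
size, ranges chosen by different faults can only nest or stay disjoint,
never partially overlap, which is what keeps a check like pte_range_none()
simple.

	#include <stdio.h>

	#define PAGE_SHIFT		12
	#define PAGE_SIZE		(1UL << PAGE_SHIFT)
	/* power-of-two alignment only, unlike the kernel macro */
	#define ALIGN_DOWN(x, a)	((x) & ~((unsigned long)(a) - 1))

	int main(void)
	{
		/* hypothetical fault address inside the mremap-ed range above */
		unsigned long fault = 0x600000000000UL + 0x7000;

		for (int order = 4; order >= 0; order--) {
			unsigned long size = PAGE_SIZE << order;
			unsigned long start = ALIGN_DOWN(fault, size);

			printf("order %d: probe [%#lx, %#lx), %lu pages\n",
			       order, start, start + size, 1UL << order);
		}
		return 0;
	}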