On Fri, Nov 8, 2024 at 12:49 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 07/11/2024 10:31, Barry Song wrote:
> > On Thu, Nov 7, 2024 at 11:25 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Thu, Nov 7, 2024 at 5:23 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 22/10/2024 00:28, Barry Song wrote:
> >>>>> From: Tangquan Zheng <zhengtangquan@xxxxxxxx>
> >>>>>
> >>>>> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
> >>>>> +                                       u32 index, int offset, struct bio *bio)
> >>>>> +{
> >>>>> +        if (is_multi_pages_partial_io(bvec))
> >>>>> +                return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
> >>>>> +        return zram_write_page(zram, bvec->bv_page, index);
> >>>>> +}
> >>>>> +
> >>>
> >>> Hi Barry,
> >>>
> >>> I started reviewing this series just to get a better idea if we can do something
> >>> similar for zswap. I haven't looked at zram code before so this might be a basic
> >>> question:
> >>> How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?
> >>
> >> Hi Usama,
> >>
> >> There's a corner case where, for instance, a 32KiB mTHP is swapped out.
> >> Then, if userspace performs MADV_DONTNEED on the 0~16KiB portion of this
> >> original mTHP, it now consists of 8 swap entries (the mTHP has been
> >> released and unmapped). With swap0-swap3 released due to DONTNEED, they
> >> become available for reallocation, and other folios may be swapped out
> >> to those entries. The result is a combination of the new smaller folios
> >> with the original 32KiB mTHP.
>
> Hi Barry,
>
> Thanks for this. So in this example of a 32K folio, when swap slots 0-3 are
> released, zram_slot_free_notify will only clear the ZRAM_COMP_MULTI_PAGES
> flag on the 0-3 index and return (without calling zram_free_page on them).
>
> I am assuming that if another folio is now swapped out to those entries,
> zram allows those pages to be overwritten, even though they haven't been freed?

Correct. This is a typical case for zRAM. zRAM allows zram_slot_free_notify()
to be skipped entirely (known as miss_free). As long as swap_map[] indicates
that the slots are free, they can be reused.

>
> Also, even if it's allowed, I still don't think you will end up in
> zram_bvec_write_multi_pages_partial when you try to write a 16K or
> smaller folio to swap0-3. As want_multi_pages_comp will evaluate to false,
> since 16K is less than 32K, you will just end up in zram_bio_write_page?

Until all slots are cleared of ZRAM_COMP_MULTI_PAGES, these entries remain
available for storing small folios, but the large block stays intact. For
instance, if swap0 to swap3 are free and swap4 to swap7 still reference the
old compressed mTHP, writing only to swap0 has to modify the large block.

static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
{
        int i;
        int count = 0;
        int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);

        for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
                if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
                        count++;
        }

        return count;
}

A mapping exists between the head index and the large block in zsmalloc.
As long as any entry sharing the same head index still has
ZRAM_COMP_MULTI_PAGES set, the large block persists.

Another possible option is to let swap4 to swap7 keep referencing the old
large block while swap0 to swap3 point to new small blocks compressed from
small folios. That approach would greatly increase implementation complexity
and could also raise zRAM's memory consumption.
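To make the partial write described above concrete (writing only swap0 while
swap4-swap7 still reference the old block), here is a minimal sketch. It is
not the patch's code: zram_read_multi_pages_block() and
zram_write_multi_pages_block() are hypothetical stand-ins for "decompress the
whole large block" and "recompress and store the whole large block", and
buffer allocation/locking are omitted.

/*
 * Illustrative sketch only. A write covering just one slot of a
 * multi-page block must read-modify-write the whole block, because the
 * other slots still reference the same compressed object.
 */
static int partial_write_sketch(struct zram *zram, struct page *new_page,
                                u32 index)
{
        u32 head_index = index & ~((u32)ZCOMP_MULTI_PAGES_NR - 1);
        struct page *pages[ZCOMP_MULTI_PAGES_NR];
        int ret;

        /* 1. Decompress the existing large block into individual pages. */
        ret = zram_read_multi_pages_block(zram, head_index, pages);
        if (ret)
                return ret;

        /* 2. Overwrite only the page being written (e.g. swap0). */
        copy_highpage(pages[index - head_index], new_page);

        /*
         * 3. Recompress all ZCOMP_MULTI_PAGES_NR pages and store them
         *    again under the same head index.
         */
        return zram_write_multi_pages_block(zram, head_index, pages);
}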
With Chris's and Kairui's swap allocation optimizations, hopefully, this
corner case will remain rare.

>
> Thanks,
> Usama
> >
> > Sorry, I forgot to mention that the assumption is ZSMALLOC_MULTI_PAGES_ORDER=3,
> > so data is compressed in 32KiB blocks.
> >
> > With Chris' and Kairui's new swap optimization, this should be minor,
> > as each cluster has its own order. However, I recall that order-0 can
> > still steal swap slots from other orders' clusters when swap space is
> > limited, by scanning all slots? Please correct me if I'm wrong, Kairui
> > and Chris.
> >
> >>
> >>>
> >>> We only swap out whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
> >>> than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
> >>> would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
> >>> in 64K chunks. So, e.g., a 128K folio would end up calling zram_bvec_write_multi_pages
> >>> twice.
> >>
> >> In v2, I changed the default order to 2, allowing all anonymous mTHP
> >> to benefit from this feature.
> >>
> >>>
> >>> Or is this for the case when you are using zram not for swap? In that case, I probably
> >>> don't need to consider the zram_bvec_write_multi_pages_partial write case for zswap.
> >>>
> >>> Thanks,
> >>> Usama
> >>
> >
> > Thanks
> > barry
>

Thanks
Barry
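For readers following the thread, a rough sketch of the chunked dispatch Usama
describes above (quoted at the ">>>" level). This is not the series' actual
zram_bio_write_multi_pages(); the loop is illustrative only and ignores
sub-block offsets and error reporting.

/*
 * Illustrative sketch only. Writes are handed to
 * zram_bvec_write_multi_pages() in chunks of at most
 * ZCOMP_MULTI_PAGES_NR pages, so with 64KiB blocks a 128KiB folio ends
 * up making two calls.
 */
static void bio_write_multi_pages_sketch(struct zram *zram, struct bio *bio)
{
        u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;

        do {
                struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);

                /* Cap each chunk at one large block (ZCOMP_MULTI_PAGES_NR pages). */
                bv.bv_len = min_t(u32, bv.bv_len,
                                  ZCOMP_MULTI_PAGES_NR * PAGE_SIZE);

                if (zram_bvec_write_multi_pages(zram, &bv, index, 0, bio) < 0)
                        break;

                bio_advance_iter_single(bio, &bio->bi_iter, bv.bv_len);
                index += bv.bv_len >> PAGE_SHIFT;
        } while (bio->bi_iter.bi_size);
}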