Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails

Barry Song <21cnbao@xxxxxxxxx> · Tue, 26 Nov 2024 07:32:38 +1300

On Tue, Nov 26, 2024 at 5:19 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 24/11/2024 21:47, Barry Song wrote:
> > On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 21/11/2024 22:25, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>>
> >>> The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> >>> CPU usage and improving the compression ratio. However, if allocating an
> >>> mTHP fails and we fall back to a single small folio, the entire large
> >>> block must still be decompressed. This results in a 16 KiB area requiring
> >>> 4 page faults, where each fault decompresses 16 KiB but retrieves only
> >>> 4 KiB of data from the block. To address this inefficiency, we instead
> >>> fall back to 4 small folios, ensuring that each decompression occurs
> >>> only once.
> >>>
> >>> Allowing swap_read_folio() to decompress and read into an array of
> >>> 4 folios would be extremely complex, requiring extensive changes
> >>> throughout the stack, including swap_read_folio, zeromap,
> >>> zswap, and final swap implementations like zRAM. In contrast,
> >>> having these components fill a large folio with 4 subpages is much
> >>> simpler.
> >>>
> >>> To avoid a full-stack modification, we introduce a per-CPU order-2
> >>> large folio as a buffer. This buffer is used for swap_read_folio(),
> >>> after which the data is copied into the 4 small folios. Finally, in
> >>> do_swap_page(), all these small folios are mapped.
> >>>
> >>> Co-developed-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> >>> Signed-off-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> >>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>> ---
> >>>  mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>  1 file changed, 192 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 209885a4134f..e551570c1425 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> >>>       return folio;
> >>>  }
> >>>
> >>> +#define BATCH_SWPIN_ORDER 2
> >>
> >> Hi Barry,
> >>
> >> Thanks for the series and the numbers in the cover letter.
> >>
> >> Just a few things.
> >>
> >> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
> >
> > Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
> > and always setting it to 2, which is the minimum anonymous mTHP order.  The main
> > reason is that it may be difficult for users to select the appropriate Kconfig?
> >
> > On the other hand, 16KB provides the most advantages for zstd compression and
> > decompression with larger blocks. While increasing from 16KB to 32KB or 64KB
> > can offer additional benefits, the improvement is not as significant
> > as the jump from 4KB to 16KB.
> >
> > For example, here is what I get when I use zstd to compress and
> > decompress the 'Beyond Compare' software package:
> >
> > root@barry-desktop:~# ./zstd
> > File size: 182502912 bytes
> > 4KB Block: Compression time = 0.765915 seconds, Decompression time = 0.203366 seconds
> >   Original size: 182502912 bytes
> >   Compressed size: 66089193 bytes
> >   Compression ratio: 36.21%
> > 16KB Block: Compression time = 0.558595 seconds, Decompression time = 0.153837 seconds
> >   Original size: 182502912 bytes
> >   Compressed size: 59159073 bytes
> >   Compression ratio: 32.42%
> > 32KB Block: Compression time = 0.538106 seconds, Decompression time = 0.137768 seconds
> >   Original size: 182502912 bytes
> >   Compressed size: 57958701 bytes
> >   Compression ratio: 31.76%
> > 64KB Block: Compression time = 0.532212 seconds, Decompression time = 0.127592 seconds
> >   Original size: 182502912 bytes
> >   Compressed size: 56700795 bytes
> >   Compression ratio: 31.07%
> >
> > In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
> >
>
> Yes, I think if there isn't a very significant benefit from using a larger order,
> then it's better not to have this option. It would also simplify the code.
>
> >>
> >> Did you check the performance difference with and without patch 4?
> >
> > I retested after reverting patch 4, and the sys time increased to over
> > 40 minutes again, though it was slightly better than without the entire
> > series.
> >
> > *** Executing round 1 ***
> >
> > real 7m49.342s
> > user 80m53.675s
> > sys 42m28.393s
> > pswpin: 29965548
> > pswpout: 51127359
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11347712
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6641230
> > pgpgin: 147376000
> > pgpgout: 213343124
> >
> > *** Executing round 2 ***
> >
> > real 7m41.331s
> > user 81m16.631s
> > sys 41m39.845s
> > pswpin: 29208867
> > pswpout: 50006026
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11104912
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6483827
> > pgpgin: 144057340
> > pgpgout: 208887688
> >
> >
> > *** Executing round 3 ***
> >
> > real 7m47.280s
> > user 78m36.767s
> > sys 37m32.210s
> > pswpin: 26426526
> > pswpout: 45420734
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10104304
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 5884839
> > pgpgin: 132013648
> > pgpgout: 190537264
> >
> > *** Executing round 4 ***
> >
> > real 7m56.723s
> > user 80m36.837s
> > sys 41m35.979s
> > pswpin: 29367639
> > pswpout: 50059254
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11116176
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6514064
> > pgpgin: 144593828
> > pgpgout: 209080468
> >
> > *** Executing round 5 ***
> >
> > real 7m53.806s
> > user 80m30.953s
> > sys 40m14.870s
> > pswpin: 28091760
> > pswpout: 48495748
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10779720
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6244819
> > pgpgin: 138813124
> > pgpgout: 202885480
> >
> > I guess it is due to the occurrence of numerous partial reads
> > (about 10%, 3505537/35159852).
> >
> > root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
> >
> > zram_bio write/read multi_pages count:54452828 35159852
> > zram_bio failed write/read multi_pages count       0        0
> > zram_bio partial write/read multi_pages count       4  3505537
> > multi_pages_miss_free        0
> >
> > This workload doesn't cause fragmentation in the buddy allocator, so it’s
> > likely due to the failure of MEMCG_CHARGE.
> >
> >>
> >> I know that it won't help if you have a lot of unmovable pages
> >> scattered everywhere, but were you able to compare the performance
> >> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> >> then hopefully compaction should be able to do its job and you can
> >> directly fill the large folio if the unmovable pages are better placed.
> >> Johannes' series on preventing type mixing [1] would help.
> >>
> >> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@xxxxxxxxxxx/
> >
> > I believe this could help, but defragmentation is a complex issue. Especially on
> > phones, where various components like drivers, DMA-BUF, multimedia, and
> > graphics utilize memory.
> >
> > We observed that a fresh system could initially provide mTHP, but after a few
> > hours, obtaining mTHP became very challenging. I'm happy to arrange a test
> > of Johannes' series on phones (sometimes it is quite hard to backport to the
> > Android kernel) to see if it brings any improvements.
> >
>
> I think it's definitely worth trying. If we can improve memory allocation/compaction
> instead of patch 4, then we should go for that. Maybe there won't be a need for TAO
> if allocation is done in a smarter way?
>
> Just out of curiosity, what is the base kernel version you are testing with?

This kernel build test was run on my Intel PC with mm-unstable, which
already includes Johannes' series. As mentioned earlier, it still shows
many partial reads without patch 4.
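
For reference, the partial-read ratio quoted above can be reproduced with a
small userspace sketch (not kernel code). The counters are the read numbers
from the multi_pages_debug_stat output; the function names are made up for
illustration, and the wasted-work estimate assumes that every partial read
decompresses a full 16 KiB block and keeps only one 4 KiB page, as described
in the commit message:

```c
#include <assert.h>

/* Sizes taken from the commit message: 4 KiB pages, 16 KiB compressed blocks. */
#define PAGE_KIB  4ULL
#define BLOCK_KIB 16ULL

/* Percentage of multi-page reads that were partial (hypothetical helper). */
static double partial_read_pct(unsigned long long partial,
                               unsigned long long total)
{
    assert(partial <= total);
    if (total == 0)
        return 0.0;
    return 100.0 * (double)partial / (double)total;
}

/*
 * Rough upper bound on decompressed-but-discarded data, in KiB.
 * Assumption: each partial read decompresses BLOCK_KIB and uses PAGE_KIB.
 */
static unsigned long long wasted_kib(unsigned long long partial)
{
    return partial * (BLOCK_KIB - PAGE_KIB);
}
```

With the numbers above, partial_read_pct(3505537, 35159852) comes out at
roughly 9.97%, and wasted_kib(3505537) is 42066444 KiB (about 40 GiB) under
that assumption.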

For phones, we have to backport to an Android kernel such as 6.6 or 6.1:
https://android.googlesource.com/kernel/common/+refs

Testing a new patchset there can sometimes be quite a pain.

>
> Thanks,
> Usama

Thanks
Barry