On Tue, Nov 26, 2024 at 5:19 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 24/11/2024 21:47, Barry Song wrote:
> > On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 21/11/2024 22:25, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>>
> >>> The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> >>> CPU usage and improving the compression ratio. However, if allocating an
> >>> mTHP fails and we fall back to a single small folio, the entire large
> >>> block must still be decompressed. This results in a 16 KiB area requiring
> >>> 4 page faults, where each fault decompresses 16 KiB but retrieves only
> >>> 4 KiB of data from the block. To address this inefficiency, we instead
> >>> fall back to 4 small folios, ensuring that each decompression occurs
> >>> only once.
> >>>
> >>> Allowing swap_read_folio() to decompress and read into an array of
> >>> 4 folios would be extremely complex, requiring extensive changes
> >>> throughout the stack, including swap_read_folio, zeromap,
> >>> zswap, and final swap implementations like zRAM. In contrast,
> >>> having these components fill a large folio with 4 subpages is much
> >>> simpler.
> >>>
> >>> To avoid a full-stack modification, we introduce a per-CPU order-2
> >>> large folio as a buffer. This buffer is used for swap_read_folio(),
> >>> after which the data is copied into the 4 small folios. Finally, in
> >>> do_swap_page(), all these small folios are mapped.
> >>>
> >>> Co-developed-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> >>> Signed-off-by: Chuanhua Han <chuanhuahan@xxxxxxxxx>
> >>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>> ---
> >>>  mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>  1 file changed, 192 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 209885a4134f..e551570c1425 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> >>>         return folio;
> >>>  }
> >>>
> >>> +#define BATCH_SWPIN_ORDER 2
> >>
> >> Hi Barry,
> >>
> >> Thanks for the series and the numbers in the cover letter.
> >>
> >> Just a few things.
> >>
> >> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
> >
> > Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
> > and always setting it to 2, which is the minimum anonymous mTHP order. The main
> > reason is that it may be difficult for users to select the appropriate Kconfig?
> >
> > On the other hand, 16KB provides the most advantages for zstd compression and
> > decompression with larger blocks. While increasing from 16KB to 32KB or 64KB
> > can offer additional benefits, the improvement is not as significant as the
> > jump from 4KB to 16KB.
> >
> > As I use zstd to compress and decompress the 'Beyond Compare' software
> > package:
> >
> > root@barry-desktop:~# ./zstd
> > File size: 182502912 bytes
> > 4KB Block: Compression time = 0.765915 seconds, Decompression time = 0.203366 seconds
> > Original size: 182502912 bytes
> > Compressed size: 66089193 bytes
> > Compression ratio: 36.21%
> > 16KB Block: Compression time = 0.558595 seconds, Decompression time = 0.153837 seconds
> > Original size: 182502912 bytes
> > Compressed size: 59159073 bytes
> > Compression ratio: 32.42%
> > 32KB Block: Compression time = 0.538106 seconds, Decompression time = 0.137768 seconds
> > Original size: 182502912 bytes
> > Compressed size: 57958701 bytes
> > Compression ratio: 31.76%
> > 64KB Block: Compression time = 0.532212 seconds, Decompression time = 0.127592 seconds
> > Original size: 182502912 bytes
> > Compressed size: 56700795 bytes
> > Compression ratio: 31.07%
> >
> > In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
> >
>
> Yes, I think if there isn't a very significant benefit of using a larger order,
> then it's better not to have this option. It would also simplify the code.
>
> >>
> >> Did you check the performance difference with and without patch 4?
> >
> > I retested after reverting patch 4, and the sys time increased to over
> > 40 minutes again, though it was slightly better than without the entire
> > series.
> >
> > *** Executing round 1 ***
> >
> > real 7m49.342s
> > user 80m53.675s
> > sys 42m28.393s
> > pswpin: 29965548
> > pswpout: 51127359
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11347712
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6641230
> > pgpgin: 147376000
> > pgpgout: 213343124
> >
> > *** Executing round 2 ***
> >
> > real 7m41.331s
> > user 81m16.631s
> > sys 41m39.845s
> > pswpin: 29208867
> > pswpout: 50006026
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11104912
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6483827
> > pgpgin: 144057340
> > pgpgout: 208887688
> >
> > *** Executing round 3 ***
> >
> > real 7m47.280s
> > user 78m36.767s
> > sys 37m32.210s
> > pswpin: 26426526
> > pswpout: 45420734
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10104304
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 5884839
> > pgpgin: 132013648
> > pgpgout: 190537264
> >
> > *** Executing round 4 ***
> >
> > real 7m56.723s
> > user 80m36.837s
> > sys 41m35.979s
> > pswpin: 29367639
> > pswpout: 50059254
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11116176
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6514064
> > pgpgin: 144593828
> > pgpgout: 209080468
> >
> > *** Executing round 5 ***
> >
> > real 7m53.806s
> > user 80m30.953s
> > sys 40m14.870s
> > pswpin: 28091760
> > pswpout: 48495748
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10779720
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6244819
> > pgpgin: 138813124
> > pgpgout: 202885480
> >
> > I guess it is due to the occurrence of numerous partial reads
> > (about 10%, 3505537/35159852).
> >
> > root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
> >
> > zram_bio write/read multi_pages count:54452828 35159852
> > zram_bio failed write/read multi_pages count 0 0
> > zram_bio partial write/read multi_pages count 4 3505537
> > multi_pages_miss_free 0
> >
> > This workload doesn't cause fragmentation in the buddy allocator, so it's
> > likely due to the failure of MEMCG_CHARGE.
> >
> >>
> >> I know that it won't help if you have a lot of unmovable pages
> >> scattered everywhere, but were you able to compare the performance
> >> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> >> then hopefully compaction should be able to do its job and you can
> >> directly fill the large folio if the unmovable pages are better placed.
> >> Johannes' series on preventing type mixing [1] would help.
> >>
> >> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@xxxxxxxxxxx/
> >
> > I believe this could help, but defragmentation is a complex issue. Especially on
> > phones, where various components like drivers, DMA-BUF, multimedia, and
> > graphics utilize memory.
> >
> > We observed that a fresh system could initially provide mTHP, but after a few
> > hours, obtaining mTHP became very challenging. I'm happy to arrange a test
> > of Johannes' series on phones (sometimes it is quite hard to backport to the
> > Android kernel) to see if it brings any improvements.
> >
>
> I think it's definitely worth trying. If we can improve memory allocation/compaction
> instead of patch 4, then we should go for that. Maybe there won't be a need for TAO
> if allocation is done in a smarter way?
>
> Just out of curiosity, what is the base kernel version you are testing with?

This kernel build testing was conducted on my Intel PC running mm-unstable,
which includes Johannes' series. As mentioned earlier, it still shows many
partial reads without patch 4.
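
As a quick sanity check on the "about 10%" partial-read figure quoted above,
the share can be recomputed from the two read counters reported by
multi_pages_debug_stat. This is only a sketch in plain userspace C;
share_percent() is a hypothetical helper for illustration and is not part of
the patch series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return part/total as a percentage.
 * The caller must pass a non-zero total.
 */
double share_percent(uint64_t part, uint64_t total)
{
	assert(total != 0);
	return 100.0 * (double)part / (double)total;
}
```

Calling share_percent(3505537, 35159852) with the partial and total multi-page
read counts from the zram0 output gives a value just under 10%, which matches
the estimate in the quoted reply.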
For phones, we have to backport to Android kernels such as 6.6, 6.1, etc.:
https://android.googlesource.com/kernel/common/+refs

Testing a new patchset can sometimes be quite a pain...

>
> Thanks,
> Usama

Thanks
Barry
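
For readers following the thread, the cost described in the patch description
(four faults each decompressing 16 KiB to obtain 4 KiB, versus decompressing
once into a buffer and copying out four pages) can be modelled roughly in
userspace C. This is a sketch under stated assumptions, not the kernel
implementation: fake_decompress(), swapin_per_fault() and swapin_batched() are
hypothetical names, the "decompressor" is just a memcpy() that counts bytes,
and a static array stands in for the per-CPU order-2 folio.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ     4096u
#define BATCH_ORDER 2u                      /* mirrors BATCH_SWPIN_ORDER */
#define BATCH_PAGES (1u << BATCH_ORDER)     /* 4 pages per block */
#define BLOCK_SZ    (PAGE_SZ * BATCH_PAGES) /* 16 KiB */

/* Bytes produced by the stand-in decompressor since the last reset. */
static size_t bytes_decompressed;

/* Stand-in for a real decompressor: always emits the whole block. */
static void fake_decompress(const unsigned char *block, unsigned char *out)
{
	memcpy(out, block, BLOCK_SZ);
	bytes_decompressed += BLOCK_SZ;
}

/* Fallback to one small page per fault: every fault decompresses 16 KiB. */
size_t swapin_per_fault(const unsigned char *block,
			unsigned char pages[BATCH_PAGES][PAGE_SZ])
{
	static unsigned char scratch[BLOCK_SZ];

	bytes_decompressed = 0;
	for (unsigned int i = 0; i < BATCH_PAGES; i++) {
		fake_decompress(block, scratch);
		memcpy(pages[i], scratch + i * PAGE_SZ, PAGE_SZ);
	}
	return bytes_decompressed;
}

/* Batched fallback: decompress once into a buffer, then copy 4 pages out. */
size_t swapin_batched(const unsigned char *block,
		      unsigned char pages[BATCH_PAGES][PAGE_SZ])
{
	static unsigned char buffer[BLOCK_SZ]; /* models the order-2 buffer */

	bytes_decompressed = 0;
	fake_decompress(block, buffer);
	for (unsigned int i = 0; i < BATCH_PAGES; i++)
		memcpy(pages[i], buffer + i * PAGE_SZ, PAGE_SZ);
	return bytes_decompressed;
}
```

In this model the per-fault path decompresses 4 * BLOCK_SZ bytes to fill the
four pages, while the batched path decompresses BLOCK_SZ bytes and pays for
four extra page copies instead. The real trade-off also depends on fault
handling and memcg charging, which this model does not attempt to capture.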