On Wed, Feb 05, 2025 at 11:21:02PM -0800, Kanchana P Sridhar wrote:
> With the previous patch that enables support for batch compressions in
> zswap_compress_folio(), a 6.2% throughput regression was seen with zstd and
> 2M folios, using vm-scalability/usemem.
>
> For compressors that don't support batching, this was root-caused to the
> following zswap_store_folio() structure:
>
>  Batched stores:
>  ---------------
>  - Allocate all entries,
>  - Compress all entries,
>  - Store all entries in xarray/LRU.
>
> Hence, the above structure is maintained only for batched stores, and the
> following structure is implemented for sequential stores of large folio pages,
> that fixes the zstd regression, while preserving common code paths for batched
> and sequential stores of a folio:
>
>  Sequential stores:
>  ------------------
>  For each page in folio:
>   - allocate an entry,
>   - compress the page,
>   - store the entry in xarray/LRU.
>
> This is submitted as a separate patch only for code review purposes. I will
> squash this with the previous commit in subsequent versions of this
> patch-series.

Could it be the cache locality?

I wonder if we should do what Chengming initially suggested and batch
everything at ZSWAP_MAX_BATCH_SIZE instead. Instead of zswap_compress_folio()
operating on the entire folio, we can operate on batches of size
ZSWAP_MAX_BATCH_SIZE, regardless of whether the underlying compressor
supports batching.

If we do this, instead of:

- Allocate all entries
- Compress all entries
- Store all entries

We can do:

- For each batch (8 entries):
  - Allocate all entries
  - Compress all entries
  - Store all entries

This should help unify the code, and I suspect it may also fix the zstd
regression. We can also skip the entries array allocation and use one on the
stack.
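
To make the shape of that loop concrete, here is a rough standalone sketch
(not the real zswap code paths: struct page / struct zswap_entry and the
entry_alloc()/entry_compress()/entry_store() helpers below are simplified
stand-ins I made up for illustration):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define ZSWAP_MAX_BATCH_SIZE 8	/* batch size discussed above */

/* Hypothetical stand-ins for struct page and struct zswap_entry. */
struct page { int idx; };
struct zswap_entry { int idx; };

/* Hypothetical helpers; the real allocate/compress/store paths differ. */
static struct zswap_entry *entry_alloc(const struct page *p)
{
	struct zswap_entry *e = malloc(sizeof(*e));

	if (e)
		e->idx = p->idx;
	return e;
}

static bool entry_compress(struct zswap_entry *e) { (void)e; return true; }

/* In this sketch a stored entry is owned by the store (xarray/LRU in zswap). */
static bool entry_store(struct zswap_entry *e) { (void)e; return true; }

/*
 * Proposed structure: walk the folio's pages in chunks of
 * ZSWAP_MAX_BATCH_SIZE and run allocate -> compress -> store per chunk,
 * with the entries array on the stack instead of a per-folio allocation.
 */
static bool store_folio_in_batches(const struct page *pages, size_t nr_pages)
{
	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
	size_t start, i, n;

	for (start = 0; start < nr_pages; start += ZSWAP_MAX_BATCH_SIZE) {
		n = nr_pages - start;
		if (n > ZSWAP_MAX_BATCH_SIZE)
			n = ZSWAP_MAX_BATCH_SIZE;

		/* Allocate all entries in this batch. */
		for (i = 0; i < n; i++) {
			entries[i] = entry_alloc(&pages[start + i]);
			if (!entries[i])
				goto err_free;
		}

		/* Compress all entries in this batch (batched or one by one). */
		for (i = 0; i < n; i++) {
			if (!entry_compress(entries[i])) {
				i = n;
				goto err_free;
			}
		}

		/* Store all entries in this batch. */
		for (i = 0; i < n; i++) {
			if (!entry_store(entries[i])) {
				i = n;
				goto err_free;
			}
		}
		continue;

err_free:
		/*
		 * Free what this batch allocated; real code would also have
		 * to unwind entries already stored by earlier batches.
		 */
		while (i--)
			free(entries[i]);
		return false;
	}
	return true;
}

int main(void)
{
	struct page folio[512];	/* e.g. a 2M folio of 4K pages */
	size_t i;

	for (i = 0; i < 512; i++)
		folio[i].idx = (int)i;

	return store_folio_in_batches(folio, 512) ? 0 : 1;
}

The entries array is bounded by ZSWAP_MAX_BATCH_SIZE, so it fits comfortably
on the stack, and the same loop covers compressors with and without batching
support.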