Hello Qu, I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago: https://lwn.net/Articles/808809/ The workspace buffer size was indeed enlarged for performance reasons. Please see my comments below. On 27.05.2024 11:24, Qu Wenruo wrote: > [BUG] > In function zlib_compress_folios(), we handle the input by: > > - If there are multiple pages left > We copy the page content into workspace->buf, and use workspace->buf > as input for compression. > > But on x86_64 (which doesn't support dfltcc), that buffer size is just > one page, so we're wasting our CPU time copying the page for no > benefit. > > - If there is only one page left > We use the mapped page address as input for compression. > > The problem is, this means we will copy the whole input range except the > last page (can be as large as 124K), without much obvious benefit. > > Meanwhile the cost is pretty obvious. Actually, the behavior for kernels w/o dfltcc support (currently available on s390 only) should not be affected. We copy input pages to the workspace->buf only if the buffer size is larger than 1 page. At least it worked this way after my original btrfs zlib patch: https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@xxxxxxxxxxxxx/ Has this behavior somehow changed after your page->folio conversion performed for btrfs? https://lore.kernel.org/all/cover.1706521511.git.wqu@xxxxxxxx/ Am I missing something? > > [POSSIBLE REASON] > The possible reason may be related to the support of S390 hardware zlib > decompression acceleration. > > As we allocate 4 pages (4 * 4K) as workspace input buffer just for s390. > > [FIX] > I checked the dfltcc code, there seems to be no requirement on the > input buffer size. > The function dfltcc_can_deflate() only checks: > > - If the compression settings are supported > Only level/w_bits/strategy/level_mask is checked. > > - If the hardware supports > > No mention at all on the input buffer size, thus I believe there is no > need to waste time doing the page copying. > > Maybe the hardware acceleration is so good for s390 that they can offset > the page copying cost, but it's definitely a penalty for non-s390 > systems. > > So fix the problem by: > > - Use the same buffer size > No matter if dfltcc support is enabled or not > > - Always use page address as input > > Cc: linux-s390@xxxxxxxxxxxxxxx > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> > --- > fs/btrfs/zlib.c | 67 +++++++++++-------------------------------------- > 1 file changed, 14 insertions(+), 53 deletions(-) > > diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c > index d9e5c88a0f85..9c88a841a060 100644 > --- a/fs/btrfs/zlib.c > +++ b/fs/btrfs/zlib.c > @@ -65,21 +65,8 @@ struct list_head *zlib_alloc_workspace(unsigned int level) > zlib_inflate_workspacesize()); > workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN); > workspace->level = level; > - workspace->buf = NULL; > - /* > - * In case of s390 zlib hardware support, allocate lager workspace > - * buffer. If allocator fails, fall back to a single page buffer. > - */ > - if (zlib_deflate_dfltcc_enabled()) { > - workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE, > - __GFP_NOMEMALLOC | __GFP_NORETRY | > - __GFP_NOWARN | GFP_NOIO); > - workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE; > - } > - if (!workspace->buf) { > - workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); > - workspace->buf_size = PAGE_SIZE; > - } > + workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); > + workspace->buf_size = PAGE_SIZE; > if (!workspace->strm.workspace || !workspace->buf) > goto fail; > > @@ -103,7 +90,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping, > struct folio *in_folio = NULL; > struct folio *out_folio = NULL; > unsigned long bytes_left; > - unsigned int in_buf_folios; > unsigned long len = *total_out; > unsigned long nr_dest_folios = *out_folios; > const unsigned long max_out = nr_dest_folios * PAGE_SIZE; > @@ -130,7 +116,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping, > folios[0] = out_folio; > nr_folios = 1; > > - workspace->strm.next_in = workspace->buf; > workspace->strm.avail_in = 0; > workspace->strm.next_out = cfolio_out; > workspace->strm.avail_out = PAGE_SIZE; > @@ -142,43 +127,19 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping, > */ > if (workspace->strm.avail_in == 0) { > bytes_left = len - workspace->strm.total_in; > - in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE), > - workspace->buf_size / PAGE_SIZE); doesn't this always set *in_buf_folios* to 1 in case no dfltcc support (single page workspace buffer)? > - if (in_buf_folios > 1) { > - int i; > - > - for (i = 0; i < in_buf_folios; i++) { > - if (data_in) { > - kunmap_local(data_in); > - folio_put(in_folio); > - data_in = NULL; > - } > - ret = btrfs_compress_filemap_get_folio(mapping, > - start, &in_folio); > - if (ret < 0) > - goto out; > - data_in = kmap_local_folio(in_folio, 0); > - copy_page(workspace->buf + i * PAGE_SIZE, > - data_in); > - start += PAGE_SIZE; > - } > - workspace->strm.next_in = workspace->buf; > - } else { > - if (data_in) { > - kunmap_local(data_in); > - folio_put(in_folio); > - data_in = NULL; > - } > - ret = btrfs_compress_filemap_get_folio(mapping, > - start, &in_folio); > - if (ret < 0) > - goto out; > - data_in = kmap_local_folio(in_folio, 0); > - start += PAGE_SIZE; > - workspace->strm.next_in = data_in; > + if (data_in) { > + kunmap_local(data_in); > + folio_put(in_folio); > + data_in = NULL; > } > - workspace->strm.avail_in = min(bytes_left, > - (unsigned long) workspace->buf_size); > + ret = btrfs_compress_filemap_get_folio(mapping, > + start, &in_folio); > + if (ret < 0) > + goto out; > + data_in = kmap_local_folio(in_folio, 0); > + start += PAGE_SIZE; > + workspace->strm.next_in = data_in; > + workspace->strm.avail_in = min(bytes_left, PAGE_SIZE); > } > > ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH); Thanks, Mikhail