Re: [PATCH] btrfs: zlib: do not do unnecessary page copying for compression

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Tue, 28 May 2024 07:39:05 +0930

在 2024/5/28 01:55, Zaslonko Mikhail 写道:
Hello Qu,

I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
https://lwn.net/Articles/808809/

The workspace buffer size was indeed enlarged for performance reasons.

Please see my comments below.

On 27.05.2024 11:24, Qu Wenruo wrote:
[BUG]
In function zlib_compress_folios(), we handle the input by:

- If there are multiple pages left
   We copy the page content into workspace->buf, and use workspace->buf
   as input for compression.

   But on x86_64 (which doesn't support dfltcc), that buffer size is just
   one page, so we're wasting our CPU time copying the page for no
   benefit.

- If there is only one page left
   We use the mapped page address as input for compression.

The problem is, this means we will copy the whole input range except the
last page (can be as large as 124K), without much obvious benefit.

Meanwhile the cost is pretty obvious.

Actually, the behavior for kernels w/o dfltcc support (currently available on s390
only) should not be affected.
We copy input pages to the workspace->buf only if the buffer size is larger than 1 page.
At least it worked this way after my original btrfs zlib patch:
https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@xxxxxxxxxxxxx/

Has this behavior somehow changed after your page->folio conversion performed for btrfs?
https://lore.kernel.org/all/cover.1706521511.git.wqu@xxxxxxxx/

My bad, I forgot that the buf_size for non-S390 systems is fixed to one
page thus the page copy is not utilized for x86_64.

But I'm still wondering if we do not go 4 pages as buffer, how much
performance penalty would there be?

One of the objective is to prepare for the incoming sector perfect
subpage compression support, thus I'm re-checking the existing
compression code, preparing to change them to be subpage compatible.

If we can simplify the behavior without too large performance penalty,
can we consider just using one single page as buffer?

Thanks,
Qu