Hello Qu, On 28.05.2024 00:09, Qu Wenruo wrote: > > > 在 2024/5/28 01:55, Zaslonko Mikhail 写道: >> Hello Qu, >> >> I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago: >> https://lwn.net/Articles/808809/ >> >> The workspace buffer size was indeed enlarged for performance reasons. >> >> Please see my comments below. >> >> On 27.05.2024 11:24, Qu Wenruo wrote: >>> [BUG] >>> In function zlib_compress_folios(), we handle the input by: >>> >>> - If there are multiple pages left >>> We copy the page content into workspace->buf, and use workspace->buf >>> as input for compression. >>> >>> But on x86_64 (which doesn't support dfltcc), that buffer size is just >>> one page, so we're wasting our CPU time copying the page for no >>> benefit. >>> >>> - If there is only one page left >>> We use the mapped page address as input for compression. >>> >>> The problem is, this means we will copy the whole input range except the >>> last page (can be as large as 124K), without much obvious benefit. >>> >>> Meanwhile the cost is pretty obvious. >> >> Actually, the behavior for kernels w/o dfltcc support (currently available on s390 >> only) should not be affected. >> We copy input pages to the workspace->buf only if the buffer size is larger than 1 page. >> At least it worked this way after my original btrfs zlib patch: >> https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@xxxxxxxxxxxxx/ >> >> Has this behavior somehow changed after your page->folio conversion performed for btrfs? >> https://lore.kernel.org/all/cover.1706521511.git.wqu@xxxxxxxx/ > > My bad, I forgot that the buf_size for non-S390 systems is fixed to one > page thus the page copy is not utilized for x86_64. > > But I'm still wondering if we do not go 4 pages as buffer, how much > performance penalty would there be? > > One of the objective is to prepare for the incoming sector perfect > subpage compression support, thus I'm re-checking the existing > compression code, preparing to change them to be subpage compatible. > > If we can simplify the behavior without too large performance penalty, > can we consider just using one single page as buffer? Based on my earlier estimates, bigger buffer provided up to 60% performance for inflate and up to 30% for deflate on s390 with dfltcc support. I don't think giving it away for simplification would be a good idea. > > Thanks, > Qu