On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > The current same-filled pages handling supports pages filled with any
> > repeated word-sized pattern. However, in practice, most of these
> > should be zero pages anyway. Other patterns are not nearly as common.
> >
> > Drop the support for non-zero same-filled pages, but keep the names
> > of knobs exposed to userspace as "same_filled", which isn't entirely
> > inaccurate.
> >
> > This yields some nice code simplification and enables a following
> > patch that eliminates the need to allocate struct zswap_entry for
> > those pages completely.
> >
> > There is also a very small performance improvement observed over 50
> > runs of the kernel build test (kernbench) comparing the mean build
> > time on a skylake machine when building the kernel in a cgroup v1
> > container with a 3G limit:
> >
> >            base        patched     % diff
> > real       70.167      69.915      -0.359%
> > user       2953.068    2956.147    +0.104%
> > sys        2612.811    2594.718    -0.692%
> >
> > This probably comes from more optimized operations like memchr_inv()
> > and clear_highpage(). Note that the percentage of zero-filled pages
> > during this test was only around 1.5% on average, and was not
> > affected by this patch. Practical workloads could have a larger
> > proportion of such pages (e.g. Johannes observed around 10% [1]), so
> > the performance improvement should be larger.
> >
> > [1] https://lore.kernel.org/linux-mm/20240320210716.GH294822@xxxxxxxxxxx/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>
> This is an interesting direction to pursue, but I actually think it
> doesn't go far enough. Either way, I think it needs more data.
>
> 1) How frequent are non-zero-same-filled pages? Difficult to
>    generalize, but if you could gather some from your fleet, that
>    would be useful. If you can devise a portable strategy, I'd also be
>    more than happy to gather this on ours (although I think you have
>    more widespread zswap use, whereas we have more disk swap.)

I am trying to collect the data, but there are some hurdles. It would
take some time, so I was hoping the data could be collected elsewhere
if possible.

The idea I had was to hook a BPF program to the entry of
zswap_fill_page() and create a histogram of the "value" argument. We
would get more coverage by hooking it to the return of
zswap_is_page_same_filled() and only updating the histogram if the
return value is true, as that includes pages in zswap that haven't
been swapped in yet. However, with zswap_is_page_same_filled() the BPF
program would run on all zswap stores, whereas with zswap_fill_page()
it would only run when needed. Not sure if this makes a practical
difference tbh. Roughly, I have something like the sketch below in
mind.
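This is untested and the details are assumptions (in particular,
zswap_fill_page() is static, so this only works if it isn't inlined,
and the map sizing is arbitrary):

/* zswap_patterns.bpf.c - count same-filled patterns observed at
 * zswap_fill_page(). Assumes the function is visible to kprobes and
 * that its second argument is the repeated word.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, u64);	/* the repeated word-sized pattern */
	__type(value, u64);	/* number of fills seen with it */
} pattern_counts SEC(".maps");

SEC("kprobe/zswap_fill_page")
int BPF_KPROBE(count_pattern, void *ptr, unsigned long value)
{
	u64 key = value, one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&pattern_counts, &key);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&pattern_counts, &key, &one,
				    BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Dumping the map after a while (e.g. with bpftool map dump) should show
how the zero pattern compares against everything else.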
> 2) The fact that we're doing any of this pattern analysis in zswap at
>    all strikes me as a bit misguided. Being efficient about repetitive
>    patterns is squarely in the domain of a compression algorithm. Do
>    we not trust e.g. zstd to handle this properly?

I thought about this briefly, but I didn't follow through. I could try
to collect some data by swapping out different patterns and observing
how different compression algorithms react. That would be interesting
for sure.

> I'm guessing this goes back to inefficient packing from something
> like zbud, which would waste half a page on one repeating byte.
>
> But zsmalloc can do 32 byte objects. It's also a batching slab
> allocator, where storing a series of small, same-sized objects is
> quite fast.
>
> Add to that the additional branches, the additional kmap, the extra
> scanning of every single page for patterns - all in the fast path
> of zswap, when we already know that the vast majority of incoming
> pages will need to be properly compressed anyway.
>
> Maybe it's time to get rid of the special handling entirely?

We would still be wasting some memory (~96 bytes between zswap_entry
and the zsmalloc object) and wasting cycles allocating them. This
could be made up for by the cycles saved by removing the handling; we
would be saving some branches for sure. I am not worried about kmap,
as I think it's a noop in most cases.

I am interested to see how much we could save by removing the scanning
for patterns. We may not save much if we abort after reading a few
words in most cases, but I guess we could also be scanning a
considerable amount before aborting. On the other hand, we would be
reading the page contents into the cache anyway for compression, so
maybe it doesn't really matter. I will try to collect some data about
this.

I will start by trying to find out how the compression algorithms
handle same-filled pages. If they can compress them efficiently, then
I will try to get more data on the tradeoff from removing the
handling.

Thanks for the insights.
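FWIW, the kind of userspace experiment I have in mind looks roughly
like the following. It is only a rough proxy (zswap goes through the
kernel crypto API rather than libzstd, and level 3 is an arbitrary
choice), but it should show whether a same-filled page compresses down
to a trivial size:

/* same_filled_zstd.c - compress a 4KiB page filled with one repeated
 * word and print the compressed size.
 * Build: cc -O2 same_filled_zstd.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

#define PAGE_SIZE 4096

int main(void)
{
	/* Zero, a repeated byte, and an arbitrary word-sized pattern
	 * (assumes 64-bit unsigned long). */
	unsigned long patterns[] = {
		0x0UL, 0x0101010101010101UL, 0xdeadbeefdeadbeefUL,
	};
	unsigned long page[PAGE_SIZE / sizeof(unsigned long)];
	size_t bound = ZSTD_compressBound(PAGE_SIZE);
	void *dst = malloc(bound);

	if (!dst)
		return 1;

	for (size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++) {
		/* Fill the "page" like a same-filled page in zswap. */
		for (size_t j = 0; j < PAGE_SIZE / sizeof(unsigned long); j++)
			page[j] = patterns[i];

		size_t csize = ZSTD_compress(dst, bound, page, PAGE_SIZE, 3);
		if (ZSTD_isError(csize)) {
			fprintf(stderr, "zstd: %s\n",
				ZSTD_getErrorName(csize));
			return 1;
		}
		printf("pattern %#018lx: %d -> %zu bytes\n",
		       patterns[i], PAGE_SIZE, csize);
	}

	free(dst);
	return 0;
}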