Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

On Thu, Mar 28, 2024 at 1:24 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Mar 28, 2024 at 12:31 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Mon, Mar 25, 2024 at 11:50:14PM +0000, Yosry Ahmed wrote:
> > > The current same-filled pages handling supports pages filled with any
> > > repeated word-sized pattern. However, in practice, most of these should
> > > be zero pages anyway. Other patterns should not be nearly as common.
> > >
> > > Drop the support for non-zero same-filled pages, but keep the names of
> > > knobs exposed to userspace as "same_filled", which isn't entirely
> > > inaccurate.
> > >
> > > This yields some nice code simplification and enables a following patch
> > > that eliminates the need to allocate struct zswap_entry for those pages
> > > completely.
> > >
> > > There is also a very small performance improvement observed over 50 runs
> > > of the kernel build test (kernbench), comparing the mean build time on a
> > > Skylake machine when building the kernel in a cgroup v1 container with a
> > > 3G limit:
> > >
> > >               base            patched         % diff
> > > real          70.167          69.915          -0.359%
> > > user          2953.068        2956.147        +0.104%
> > > sys           2612.811        2594.718        -0.692%
> > >
> > > This probably comes from more optimized operations like memchr_inv() and
> > > clear_highpage(). Note that the percentage of zero-filled pages during
> > > this test was only around 1.5% on average, and was not affected by this
> > > patch. Practical workloads could have a larger proportion of such pages
> > > (e.g. Johannes observed around 10% [1]), so the performance improvement
> > > should be larger.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240320210716.GH294822@xxxxxxxxxxx/
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
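
(As an aside, to make the memchr_inv() point above concrete: the
zero-filled check can be as simple as the sketch below. This is my own
illustration, not the exact helper from the patch, and
page_is_zero_filled() is a hypothetical name. memchr_inv() returns a
pointer to the first byte that differs from the given byte, or NULL if
all PAGE_SIZE bytes match, and architectures can optimize it, which is
presumably where part of the win comes from.)

	#include <linux/mm.h>      /* PAGE_SIZE */
	#include <linux/string.h>  /* memchr_inv() */

	/* Minimal sketch of a zero-filled page check in kernel C.
	 * page_is_zero_filled() is a hypothetical name; the caller is
	 * expected to pass a mapped page, e.g. from kmap_local_page().
	 */
	static bool page_is_zero_filled(void *ptr)
	{
		/* memchr_inv() returns NULL iff every byte equals 0 */
		return memchr_inv(ptr, 0, PAGE_SIZE) == NULL;
	}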
> >
> > This is an interesting direction to pursue, but I actually think it
> > doesn't go far enough. Either way, I think it needs more data.
> >
> > 1) How frequent are non-zero-same-filled pages? Difficult to
> >    generalize, but if you could gather some from your fleet, that
> >    would be useful. If you can devise a portable strategy, I'd also be
> >    more than happy to gather this on ours (although I think you have
> >    more widespread zswap use, whereas we have more disk swap.)
>
> I am trying to collect the data, but there are... hurdles. It would
> take some time, so I was hoping the data could be collected elsewhere
> if possible.
>
> The idea I had was to hook a BPF program to the entry of
> zswap_fill_page() and create a histogram of the "value" argument. We
> would get more coverage by hooking it to the return of
> zswap_is_page_same_filled() and only updating the histogram if the
> return value is true, as it includes pages in zswap that haven't been
> swapped in.
>
> However, with zswap_is_page_same_filled() the BPF program will run on
> all zswap stores, whereas for zswap_fill_page() it will only run when
> needed. Not sure if this makes a practical difference tbh.
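
For the zswap_fill_page() entry hook, an (untested) bpftrace one-liner
along these lines might be all that's needed - assuming the function
isn't inlined, since it's static, and noting that arg1 is the repeated
pattern word:

	# Count same-filled pages per pattern value; bpftrace prints
	# the @pattern map when the tracing session exits.
	bpftrace -e 'kprobe:zswap_fill_page { @pattern[arg1] = count(); }'

That would give a count per observed pattern value, which is
effectively the histogram you describe, though as you note it only
sees pages on the swap-in path.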
>
> >
> > 2) The fact that we're doing any of this pattern analysis in zswap at
> >    all strikes me as a bit misguided. Being efficient about repetitive
> >    patterns is squarely in the domain of a compression algorithm. Do
> >    we not trust e.g. zstd to handle this properly?
>
> I thought about this briefly, but I didn't follow through. I could try
> to collect some data by swapping out different patterns and observing
> how different compression algorithms react. That would be interesting
> for sure.
>
> >
> >    I'm guessing this goes back to inefficient packing from something
> >    like zbud, which would waste half a page on one repeating byte.
> >
> >    But zsmalloc can do 32 byte objects. It's also a batching slab
> >    allocator, where storing a series of small, same-sized objects is
> >    quite fast.
> >
> >    Add to that the additional branches, the additional kmap, the extra
> >    scanning of every single page for patterns - all in the fast path
> >    of zswap, when we already know that the vast majority of incoming
> >    pages will need to be properly compressed anyway.
> >
> >    Maybe it's time to get rid of the special handling entirely?
>
> We would still be wasting some memory (~96 bytes between zswap_entry
> and the zsmalloc object), and wasting cycles allocating them. This could
> be made up for by the cycles saved by removing the handling. We will be
> saving some branches for sure. I am not worried about kmap as I think
> it's a no-op in most cases.

A secondary effect of the current same-filled page handling is that
we're not considering these pages for reclaim, which could potentially
be beneficial: we're not saving much memory (essentially just the
zswap entry and the associated cost of storing it) by writing these
pages back - IOW, the cost / benefit ratio for reclaiming these pages
is quite atrocious.

Again, all of this is just handwaving without numbers. It'd be nice if
we could get more concrete data for this conversation :P




