On Fri, Mar 29, 2024 at 2:17 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Mar 29, 2024 at 11:56:46AM -0700, Yosry Ahmed wrote:
> > > I perf'd zswapping out data that is 10% same-filled and 90% data that
> > > always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> > > and the zswap_store() stack is already only ~60% of the cycles.
> > >
> > > Using zsmalloc + zstd, this is the diff between vanilla and my patch:
> > >
> > > # Baseline  Delta Abs  Shared Object      Symbol
> > > # ........  .........  .................  .................................
> > > #
> > >     4.34%     -3.02%   [kernel.kallsyms]  [k] zswap_store
> > >    11.07%     +1.41%   [kernel.kallsyms]  [k] ZSTD_compressBlock_doubleFast
> > >    15.55%     +0.91%   [kernel.kallsyms]  [k] FSE_buildCTable_wksp
> > >
> > > As expected, we have to compress a bit more; on the other hand we're
> > > removing the content scan for same-filled for 90% of the pages that
> > > don't benefit from it. They almost amortize each other. Let's round it
> > > up and the remaining difference is ~1%.
> >
> > Thanks for the data, this is very useful.
> >
> > There is also the load path. The original patch that introduced
> > same-filled pages reports more gains on the load side, which also
> > happens to be more latency-sensitive. Of course, the data could be
> > outdated -- but it's a different type of workload than what will be
> > running in a data center fleet AFAICT.
> >
> > Is there also no noticeable difference on the load side in your data?
>
>    9.40%     +2.51%   [kernel.kallsyms]  [k] ZSTD_safecopy
>    0.76%     -0.48%   [kernel.kallsyms]  [k] zswap_load
>
> About 2% net.
>
> Checking other compression algorithms, lzo eats 27.58%, and lz4
> 13.82%. The variance between these alone makes our "try to be smarter
> than an actual compression algorithm" code look even sillier.
>
> > Also, how much increase did you observe in the size of compressed data
> > with your patch? Just curious about how zstd ended up handling those
> > pages.
>
> Checking zsmalloc stats, it did pack the same-filled ones down into 32
> byte objects. So 0.7% of their original size. That's negligible, even
> for workloads that have an unusually high share of them.

Glad to see that this was handled appropriately.

> > > It's difficult to make the case that this matters to any real
> > > workloads with actual think time in between paging.
> >
> > If the difference in performance is 1%, I surely agree.
> >
> > The patch reported 19-32% improvement in store time for same-filled
> > pages depending on the workload and platform. For 10% same-filled
> > pages, this should roughly correspond to 2-3% overall improvement,
> > which is not too far from the 1% you observed.
>
> Right.
>
> > The patch reported much larger improvement for load times (which
> > matters more), 49-85% for same-filled pages. If this corresponds to
> > 5-8% overall improvement in zswap load time for a workload with 10%
> > same-filled pages, that would be very significant. I don't expect to
> > see such improvements tbh, but we should check.
>
> No, I'm seeing around 2% net.

You mentioned that other compressors eat more cycles in this case, so
perhaps the data in the original patch came from one of those
compressors.

> > > But let's say you do make the case that zero-filled pages are worth
> > > optimizing for.
> >
> > I am not saying they are for sure, but removing the same-filled
> > checking didn't seem to improve performance much in my testing, so the
> > cost seems to be mostly in maintenance. This makes it seem to me that
> > it *could* be a good tradeoff to only handle zero-filled pages. We can
> > make them slightly faster and we can trim the complexity -- as shown
> > by this patch.
>
> In the original numbers, there was a certain percentage of non-zero
> same-filled pages. You still first have to find that number for real
> production loads to determine what the tradeoff actually is.
>
> And I disagree that the code is less complex. There are fewer lines to
> the memchr_inv() than to the open-coded word-scan, but that scan is
> dead simple, self-contained and out of the way.
>
> So call that a wash in terms of maintenance burden, but for a
> reduction in functionality and a regression risk (however small).
>
> The next patch to store them as special xarray entries is also a wash
> at best. It trades the entry having an implicit subtype for no type at
> all. zswap_load_zero_filled() looks like it's the fast path, and
> decompression the fallback; the entry destructor is now called
> "zswap_tree_free_element" and takes a void pointer. It also re-adds
> most of the lines deleted by the zero-only patch.
>
> I think this is actually a bit worse than the status quo.
>
> But my point is, this all seems like a lot of opportunity cost in
> terms of engineering time, review bandwidth, regression risk, and
> readability, hackability, reliability, predictability of the code -
> for gains that are still not shown to actually matter in practice.

Yeah, in terms of code cleanliness it did not turn out the way I
thought it would. Even using different subtypes would give us either
similarly abstract (yet less clear) helpers, or completely separate
paths with code duplication -- neither of which is ideal.
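
To make the comparison upthread concrete, here is roughly the shape of
the two store-time checks being weighed -- a sketch on my end rather
than the exact code from either version, with made-up helper names:

    /* Zero-only check: one call into memchr_inv(), no fill value. */
    static bool page_is_zero_filled(void *ptr)
    {
            return !memchr_inv(ptr, 0, PAGE_SIZE);
    }

    /* Open-coded same-filled scan: compare each word to the first. */
    static bool page_is_same_filled(void *ptr, unsigned long *value)
    {
            unsigned long *word = ptr;
            unsigned int pos, last = PAGE_SIZE / sizeof(*word) - 1;

            /* Early reject: first and last words rarely match unless
             * the page really is same-filled.
             */
            if (word[0] != word[last])
                    return false;

            for (pos = 1; pos < last; pos++) {
                    if (word[pos] != word[0])
                            return false;
            }

            /* The fill word must be kept so load can reconstruct. */
            *value = word[0];
            return true;
    }

The zero-only variant also has no per-entry fill value to carry
around, which is presumably where most of the deleted lines come from.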
So perhaps it's better to either leave it alone (and clean it up
slightly) or remove it completely. I wanted to see what others
thought, and I was aware it's controversial (hence the RFC) :)

Anyway, I will send a separate series with only the cleanups and the
knob removal. We can discuss completely removing same-filled handling
separately. I suspect a 2% regression in the load path (and
potentially larger with other compressors) may not be acceptable.

If handling for zero-filled pages is added directly in reclaim as you
suggested, though, then the justification for handling other patterns
in zswap becomes much weaker :) Handling it in reclaim also means we
avoid IO for zero pages completely, which would be even more
beneficial than just avoiding compression/decompression in zswap.

> > https://lore.kernel.org/all/20240328122352.a001a56aed97b01ac5931998@xxxxxxxxxxxxxxxxxxxx/
>
> This needs numbers to show not just that your patches are fine, but
> that any version of this optimization actually matters for real
> workloads. Without that, my vote would be NAK.
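
P.S. For anyone who hasn't seen the special-xarray-entries patch
referenced above: the gist is to store a tagged scalar in the tree
slot instead of a pointer to a full zswap_entry, so a zero-filled page
needs no allocation at all. A rough sketch of the mechanism -- invented
helper names, not the actual patch:

    /* xa_mk_value() tags a small integer so it can live directly in
     * an xarray slot; xa_is_value() distinguishes it from a real
     * entry pointer on lookup.
     */
    #define ZSWAP_ZERO_ENTRY        xa_mk_value(0)

    static int zswap_zero_store(struct xarray *tree, pgoff_t offset)
    {
            /* No zswap_entry is allocated for a zero-filled page. */
            return xa_err(xa_store(tree, offset, ZSWAP_ZERO_ENTRY,
                                   GFP_KERNEL));
    }

    static bool zswap_zero_load(struct xarray *tree, pgoff_t offset,
                                struct page *page)
    {
            void *entry = xa_load(tree, offset);

            if (!xa_is_value(entry))
                    return false;   /* real compressed entry */

            /* The "decompression" is just zeroing the page. */
            clear_highpage(page);
            return true;
    }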