On Mon, Mar 03, 2025 at 12:06:27PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.
>
> See [1] for a recent upstream discussion about this.
>
> The zswap writeback case is relatively straightforward to fix. For the
> zswap_load() case, we reorganize the swap read paths, having
> swap_read_folio_zeromap() and zswap_load() return specific error codes:
>
> * 0 indicates the backend owns the swapped out content, which was
>   successfully loaded into the page.
> * -ENOENT indicates the backend does not own the swapped out content.
> * -EINVAL and -EIO indicate the backend own the swapped out content, but
>   the content was not successfully loaded into the page for some
>   reasons. The folio will not be marked up-to-date, which will
>   eventually cause the process requesting the page to SIGBUS (see the
>   handling of not-up-to-date folio in do_swap_page() in mm/memory.c).
>
> zswap decompression failures on the zswap load path are treated as an
> -EIO error, as described above, and will no longer crash the kernel.
>
> As a side effect, we require one extra zswap tree traversal in the load
> and writeback paths. Quick benchmarking on a kernel build test shows no
> performance difference:
>
> With the new scheme:
> real: mean: 125.1s, stdev: 0.12s
> user: mean: 3265.23s, stdev: 9.62s
> sys: mean: 2156.41s, stdev: 13.98s
>
> The old scheme:
> real: mean: 125.78s, stdev: 0.45s
> user: mean: 3287.18s, stdev: 5.95s
> sys: mean: 2177.08s, stdev: 26.52s
>
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@xxxxxxxxxxxxxxxxxxxx/
>
> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Suggested-by: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
>  include/linux/zswap.h |   4 +-
>  mm/page_io.c          |  35 ++++++++----
>  mm/zswap.c            | 130 ++++++++++++++++++++++++++++++------------
>  3 files changed, 120 insertions(+), 49 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index d961ead91bf1..9468cb3e0878 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -26,7 +26,7 @@ struct zswap_lruvec_state {
>
>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
> -bool zswap_load(struct folio *folio);
> +int zswap_load(struct folio *folio);
>  void zswap_invalidate(swp_entry_t swp);
>  int zswap_swapon(int type, unsigned long nr_pages);
>  void zswap_swapoff(int type);
> @@ -46,7 +46,7 @@ static inline bool zswap_store(struct folio *folio)
>
>  static inline bool zswap_load(struct folio *folio)

int?

>  {
> -        return false;
> +        return -ENOENT;
>  }
>
>  static inline void zswap_invalidate(swp_entry_t swp) {}
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 9b983de351f9..8a44faec3f92 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -511,7 +511,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>          mempool_free(sio, sio_pool);
>  }
>
> -static bool swap_read_folio_zeromap(struct folio *folio)
> +/*
> + * Return: one of the following error codes:

nit: "Returns 0 on success, or one of the following errors on failure:"

Also might as well make this a full kerneldoc?
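Something like the below is the shape I have in mind (just a sketch,
untested; it only reshuffles the documentation already in this patch
into kerneldoc form, exact wording up to you):

/**
 * swap_read_folio_zeromap() - read a swapped-out folio from the zeromap.
 * @folio: folio to read.
 *
 * Return: 0 on success, or one of the following errors on failure:
 *
 * -ENOENT: the folio was not zero-filled.
 *
 * -EINVAL: some, but not all, of the subpages in the folio are
 * zero-filled; a large folio that is partially in the zeromap is not
 * currently supported. The folio is unlocked but NOT marked up-to-date,
 * so that an IO error is emitted (e.g. do_swap_page() will sigbus).
 *
 * On success the folio is zero-filled, marked up-to-date and unlocked.
 */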
> + *
> + * 0: the folio is zero-filled (and was populated as such and marked
> + * up-to-date and unlocked).
> + *
> + * -ENOENT: the folio was not zero-filled.
> + *
> + * -EINVAL: some of the subpages in the folio are zero-filled, but not all of
> + * them. This is an error because we don't currently support a large folio
> + * that is partially in the zeromap. The folio is unlocked, but NOT marked
> + * up-to-date, so that an IO error is emitted (e.g. do_swap_page() will
> + * sigbus).
> + */
> +static int swap_read_folio_zeromap(struct folio *folio)
>  {
>          int nr_pages = folio_nr_pages(folio);
>          struct obj_cgroup *objcg;
> @@ -523,11 +537,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>           * that an IO error is emitted (e.g. do_swap_page() will sigbus).
>           */

This comment says "Return true", so it needs to be updated. We probably
don't need to explain the return value anymore since it's documented
above.

>          if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages,
> -                        &is_zeromap) != nr_pages))
> -                return true;
> +                        &is_zeromap) != nr_pages)) {
> +                folio_unlock(folio);
> +                return -EINVAL;
> +        }
>
>          if (!is_zeromap)
> -                return false;
> +                return -ENOENT;
>
>          objcg = get_obj_cgroup_from_folio(folio);
>          count_vm_events(SWPIN_ZERO, nr_pages);
> @@ -538,7 +554,8 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>
>          folio_zero_range(folio, 0, folio_size(folio));
>          folio_mark_uptodate(folio);
> -        return true;
> +        folio_unlock(folio);
> +        return 0;
>  }
>
>  static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
> @@ -635,13 +652,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>          }
>          delayacct_swapin_start();
>
> -        if (swap_read_folio_zeromap(folio)) {
> -                folio_unlock(folio);
> +        if (swap_read_folio_zeromap(folio) != -ENOENT)
>                  goto finish;

I would split the zeromap change into a separate patch, but it's
probably fine either way.

> -        } else if (zswap_load(folio)) {
> -                folio_unlock(folio);
> +
> +        if (zswap_load(folio) != -ENOENT)
>                  goto finish;
> -        }
>
>          /* We have to read from slower devices. Increase zswap protection. */
>          zswap_folio_swapin(folio);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6dbf31bd2218..b67481defc6c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
>  static u64 zswap_reject_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
> +/* Load or writeback failed due to decompression failure */
> +static u64 zswap_decompress_fail;
>  /* Store failed because underlying allocator could not get memory */
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -996,11 +998,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          return comp_ret == 0 && alloc_ret == 0;
>  }
>
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>          struct zpool *zpool = entry->pool->zpool;
>          struct scatterlist input, output;
>          struct crypto_acomp_ctx *acomp_ctx;
> +        int decomp_ret, dlen;
>          u8 *src;
>
>          acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> @@ -1025,12 +1028,31 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>          sg_init_table(&output, 1);
>          sg_set_folio(&output, folio, PAGE_SIZE, 0);
>          acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -        BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -        BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +        decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> +        dlen = acomp_ctx->req->dlen;
>
>          if (src != acomp_ctx->buffer)
>                  zpool_unmap_handle(zpool, entry->handle);
>          acomp_ctx_put_unlock(acomp_ctx);
> +
> +        if (decomp_ret || dlen != PAGE_SIZE) {
> +                zswap_decompress_fail++;
> +                pr_alert_ratelimited(
> +                        "decompression failed with returned value %d on zswap entry with "

nit: Decompression*

I am also wondering what this looks like in dmesg. Is the line too long
to be readable? Should we add some line breaks (e.g. like
warn_sysctl_write())? We could probably also put this in a helper to
keep this function visually easy to follow.

> +                        "swap entry value %08lx, swap type %d, and swap offset %lu. "
> +                        "compression algorithm is %s. compressed size is %u bytes, and "
> +                        "decompressed size is %u bytes.\n",
> +                        decomp_ret,
> +                        entry->swpentry.val,
> +                        swp_type(entry->swpentry),
> +                        swp_offset(entry->swpentry),
> +                        entry->pool->tfm_name,
> +                        entry->length,
> +                        acomp_ctx->req->dlen);

We should probably be using 'dlen' here since we're outside the lock.

> +
> +                return false;
> +        }
> +        return true;
>  }
>
>  /*********************************
[..]
> @@ -1620,7 +1651,29 @@ bool zswap_store(struct folio *folio)
>          return ret;
>  }
>
> -bool zswap_load(struct folio *folio)
> +/**
> + * zswap_load() - load a page from zswap
> + * @folio: folio to load
> + *
> + * Return: one of the following error codes:

nit: Returns 0 on success, or ..

> + *
> + * 0: if the swapped out content was in zswap and was successfully loaded.
> + * The folio is unlocked and marked up-to-date.
> + *
> + * -EIO: if the swapped out content was in zswap, but could not be loaded
> + * into the page (for e.g, because there was a memory corruption, or a
> + * decompression bug). The folio is unlocked, but NOT marked up-to-date,
> + * so that an IO error is emitted (e.g. do_swap_page() will SIGBUS).
> + *
> + * -EINVAL: if the swapped out content was in zswap, but the page belongs
> + * to a large folio, which is not supported by zswap. The folio is unlocked,
> + * but NOT marked up-to-date, so that an IO error is emitted (e.g.
> + * do_swap_page() will SIGBUS).
> + *
> + * -ENOENT: if the swapped out content was not in zswap. The folio remains
> + * locked on return.
> + */
> +int zswap_load(struct folio *folio)
>  {
>          swp_entry_t swp = folio->swap;
>          pgoff_t offset = swp_offset(swp);
[..]
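On the helper idea for the decompression failure message above, this is
roughly what I had in mind (name and message layout are just
placeholders, completely untested; it takes dlen as an argument so we
stay off acomp_ctx->req after the unlock):

/* Report a zswap decompression failure, rate-limited. */
static void zswap_decompress_fail_report(struct zswap_entry *entry,
                                         int ret, unsigned int dlen)
{
        pr_alert_ratelimited("Decompression failed (ret %d) on zswap entry:\n"
                             "  swap entry value %08lx, type %d, offset %lu\n"
                             "  algorithm %s, compressed %u bytes, decompressed %u bytes\n",
                             ret, entry->swpentry.val,
                             swp_type(entry->swpentry),
                             swp_offset(entry->swpentry),
                             entry->pool->tfm_name,
                             entry->length, dlen);
}

Then the failure branch in zswap_decompress() is just the counter bump,
a call to zswap_decompress_fail_report(entry, decomp_ret, dlen), and the
return false.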