On Wed, Feb 26, 2025 at 7:33 AM Yosry Ahmed <yosry.ahmed@xxxxxxxxx> wrote:
>
> On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > > Currently, we crash the kernel when a decompression failure occurs in
> > > > zswap (either because of memory corruption, or a bug in the compression
> > > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > > entry in zswap writeback.
> > >
> > > Some relevant observations/questions, but not really actionable for this
> > > patch, perhaps some future work, or more likely some incoherent
> > > illogical thoughts:
> > >
> > > (1) It seems like not making the folio uptodate will cause shmem faults
> > > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > > other processes mapping the same page could follow in the same
> > > footsteps.
> >
> > It's analogous to what __end_swap_bio_read() does for block backends,
> > so it's hitchhiking on the standard swap protocol for read failures.
>
> Right, that's also how I got the idea when I did the same for large
> folio handling.

And your handling of the large folio (along with the comment in the
other thread) was how I got the idea for this patch :)

> >
> > The page sticks around if there are other users. It can get reclaimed,
> > but since it's not marked dirty, it won't get overwritten. Another
> > access will either find it in the swapcache and die on !uptodate; if
> > it was reclaimed, it will attempt another decompression. If all
> > references have been killed, zswap_invalidate() will finally drop it.
> >
> > Swapoff actually poisons the page table as well (unuse_pte).
>
> Right.
> My question was basically why don't we also poison the page table
> in do_swap_page() in this case. It's likely that we never swapoff.

That would require an rmap walk, right? To also poison the other PTEs
that point to the faulty (z)swap entry?

Or am I misunderstanding your point :)

>
> This will cause subsequent fault attempts to return VM_FAULT_HWPOISON
> quickly without going through the swapcache or decompression. Probably
> not a big deal, but shmem does it so maybe it'd be nice to do it for
> consistency.
>
> > >
> > > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
> > > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
> > > even though we have VM_FAULT_HWPOISON. This patch falls under this
> > > bucket, but unfortunately we cannot tell for sure if it's a hwpoison or
> > > a decompression bug.
> >
> > Are you sure? Actual memory failure should replace the ptes of a
> > mapped shmem page with TTU_HWPOISON, which turns them into special
> > swap entries that trigger VM_FAULT_HWPOISON in do_swap_page().
>
> I was looking at the shmem_fault() path. It seems like for this path we
> end up with VM_FAULT_SIGBUS because shmem_swapin_folio() returns -EIO and
> not -EHWPOISON. This seems like something that can be easily fixed though,
> unless -EHWPOISON is not always correct for a different reason.
>
> >
> > Anon swap distinguishes as long as the swapfile is there. Swapoff
> > installs poison markers, which are then handled the same in future
> > faults (VM_FAULT_HWPOISON):
> >
> > /*
> >  * "Poisoned" here is meant in the very general sense of "future accesses are
> >  * invalid", instead of referring very specifically to hardware memory errors.
> >  * This marker is meant to represent any of various different causes of this.
> >  *
> >  * Note that, when encountered by the faulting logic, PTEs with this marker will
> >  * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
> >  * logic.
>
> If that's the case, maybe it's better for zswap in the future if we stop
> relying on not marking the folio uptodate, and instead propagate an
> error through swap_read_folio() to the callers to make sure we always
> return VM_FAULT_HWPOISON and install poison markers.
>
> The handling is a bit quirky and inconsistent, but it ultimately results
> in VM_FAULT_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now.

Yeah I think it's OK for now. FWIW it's consistent with the way we
treat swap IO error, as you pointed out :)

>
> >  */
> > #define PTE_MARKER_POISONED BIT(1)