On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote: > > > > On 06.06.24 20:48, Yosry Ahmed wrote: > > > With ongoing work to support large folio swapin, it is important to make > > > sure we do not pass large folios to zswap_load() without implementing > > > proper support. > > > > > > For example, if a swapin fault observes that contiguous PTEs are > > > pointing to contiguous swap entries and tries to swap them in as a large > > > folio, swap_read_folio() will pass in a large folio to zswap_load(), but > > > zswap_load() will only effectively load the first page in the folio. If > > > the first page is not in zswap, the folio will be read from disk, even > > > though other pages may be in zswap. > > > > > > In both cases, this will lead to silent data corruption. > > > > > > Proper large folio swapin support needs to go into zswap before zswap > > > can be enabled in a system that supports large folio swapin. > > > > > > Looking at callers of swap_read_folio(), it seems like they are either > > > allocated from __read_swap_cache_async() or do_swap_page() in the > > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we > > > are fine for now. > > > > > > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in > > > the order of those allocations without proper handling of zswap. > > > > > > Alternatively, swap_read_folio() (or its callers) can be updated to have > > > a fallback mechanism that splits large folios or reads subpages > > > separately. Similar logic may be needed anyway in case part of a large > > > folio is already in the swapcache and the rest of it is swapped out. > > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> Acked-by: Barry Song <baohua@xxxxxxxxxx> this has been observed by me[1], that's why you can find the below code in my swapin patch +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + ... + /* + * a large folio being swapped-in could be partially in + * zswap and partially in swap devices, zswap doesn't + * support large folios yet, we might get corrupted + * zero-filled data by reading all subpages from swap + * devices while some of them are actually in zswap + */ + if (is_zswap_enabled()) + goto fallback; [1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@xxxxxxxxx/ > > > --- > > > > > > Sorry for the long CC list, I just found myself repeatedly looking at > > > new series that add swap support for mTHPs / large folios, making sure > > > they do not break with zswap or make incorrect assumptions. This debug > > > check should give us some peace of mind. Hopefully this patch will also > > > raise awareness among people who are working on this. > > > > > > --- > > > mm/zswap.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index b9b35ef86d9be..6007252429bb2 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > > > if (!entry) > > > return false; > > > > > > + /* Zswap loads do not handle large folio swapins correctly yet */ > > > + VM_BUG_ON(folio_test_large(folio)); > > > + > > > > There is no way we could have a WARN_ON_ONCE() and recover, right? > > Not without making more fundamental changes to the surrounding swap > code. Currently zswap_load() returns either true (folio was loaded > from zswap) or false (folio is not in zswap). > > To handle this correctly zswap_load() would need to tell > swap_read_folio() which subpages are in zswap and have been loaded, > and then swap_read_folio() would need to read the remaining subpages > from disk. This of course assumes that the caller of swap_read_folio() > made sure that the entire folio is swapped out and protected against > races with other swapins. > > Also, because swap_read_folio() cannot split the folio itself, other > swap_read_folio_*() functions that are called from it should be > updated to handle swapping in tail subpages, which may be questionable > in its own right. > > An alternative would be that zswap_load() (or a separate interface) > could tell swap_read_folio() that the folio is partially in zswap, > then we can just bail and tell the caller that it cannot read the > large folio and that it should be split. > > There may be other options as well, but the bottom line is that it is > possible, but probably not something that we want to do right now. > > A stronger protection method would be to introduce a config option or > boot parameter for large folio swapin, and then make CONFIG_ZSWAP > depend on it being disabled, or have zswap check it at boot and refuse > to be enabled if it is on. Thanks Barry