On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 06.06.24 22:31, Yosry Ahmed wrote: > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote: > >> > >> On 06.06.24 20:48, Yosry Ahmed wrote: > >>> With ongoing work to support large folio swapin, it is important to make > >>> sure we do not pass large folios to zswap_load() without implementing > >>> proper support. > >>> > >>> For example, if a swapin fault observes that contiguous PTEs are > >>> pointing to contiguous swap entries and tries to swap them in as a large > >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but > >>> zswap_load() will only effectively load the first page in the folio. If > >>> the first page is not in zswap, the folio will be read from disk, even > >>> though other pages may be in zswap. > >>> > >>> In both cases, this will lead to silent data corruption. > >>> > >>> Proper large folio swapin support needs to go into zswap before zswap > >>> can be enabled in a system that supports large folio swapin. > >>> > >>> Looking at callers of swap_read_folio(), it seems like they are either > >>> allocated from __read_swap_cache_async() or do_swap_page() in the > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we > >>> are fine for now. > >>> > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in > >>> the order of those allocations without proper handling of zswap. > >>> > >>> Alternatively, swap_read_folio() (or its callers) can be updated to have > >>> a fallback mechanism that splits large folios or reads subpages > >>> separately. Similar logic may be needed anyway in case part of a large > >>> folio is already in the swapcache and the rest of it is swapped out. > >>> > >>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > >>> --- > >>> > >>> Sorry for the long CC list, I just found myself repeatedly looking at > >>> new series that add swap support for mTHPs / large folios, making sure > >>> they do not break with zswap or make incorrect assumptions. This debug > >>> check should give us some peace of mind. Hopefully this patch will also > >>> raise awareness among people who are working on this. > >>> > >>> --- > >>> mm/zswap.c | 3 +++ > >>> 1 file changed, 3 insertions(+) > >>> > >>> diff --git a/mm/zswap.c b/mm/zswap.c > >>> index b9b35ef86d9be..6007252429bb2 100644 > >>> --- a/mm/zswap.c > >>> +++ b/mm/zswap.c > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > >>> if (!entry) > >>> return false; > >>> > >>> + /* Zswap loads do not handle large folio swapins correctly yet */ > >>> + VM_BUG_ON(folio_test_large(folio)); > >>> + > >> > >> There is no way we could have a WARN_ON_ONCE() and recover, right? > > > > Not without making more fundamental changes to the surrounding swap > > code. Currently zswap_load() returns either true (folio was loaded > > from zswap) or false (folio is not in zswap). > > > > To handle this correctly zswap_load() would need to tell > > swap_read_folio() which subpages are in zswap and have been loaded, > > and then swap_read_folio() would need to read the remaining subpages > > from disk. This of course assumes that the caller of swap_read_folio() > > made sure that the entire folio is swapped out and protected against > > races with other swapins. > > > > Also, because swap_read_folio() cannot split the folio itself, other > > swap_read_folio_*() functions that are called from it should be > > updated to handle swapping in tail subpages, which may be questionable > > in its own right. > > > > An alternative would be that zswap_load() (or a separate interface) > > could tell swap_read_folio() that the folio is partially in zswap, > > then we can just bail and tell the caller that it cannot read the > > large folio and that it should be split. > > > > There may be other options as well, but the bottom line is that it is > > possible, but probably not something that we want to do right now. > > > > A stronger protection method would be to introduce a config option or > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP > > depend on it being disabled, or have zswap check it at boot and refuse > > to be enabled if it is on. > > Right, sounds like the VM_BUG_ON() really is not that easily avoidable. > > I was wondering, if we could WARN_ON_ONCE and make the swap code detect > this like a read-error from disk. > > I think do_swap_page() detects that by checking if the folio is not > uptodate: > > if (unlikely(!folio_test_uptodate(folio))) { > ret = VM_FAULT_SIGBUS; > goto out_nomap; > } > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > system (but the app would crash either way, there is no way around it). > I'd rather fallback to small folios swapin instead crashing apps till we fix the large folio swapin in zswap :-) +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + ... + + if (is_zswap_enabled()) + goto fallback; > -- > Cheers, > > David / dhildenb Thanks Barry