On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >
> > > On 06.06.24 22:31, Yosry Ahmed wrote:
> > > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > >>
> > > >> On 06.06.24 20:48, Yosry Ahmed wrote:
> > > >>> With ongoing work to support large folio swapin, it is important to
> > > >>> make sure we do not pass large folios to zswap_load() without
> > > >>> implementing proper support.
> > > >>>
> > > >>> For example, if a swapin fault observes that contiguous PTEs are
> > > >>> pointing to contiguous swap entries and tries to swap them in as a
> > > >>> large folio, swap_read_folio() will pass in a large folio to
> > > >>> zswap_load(), but zswap_load() will only effectively load the first
> > > >>> page in the folio. If the first page is not in zswap, the folio
> > > >>> will be read from disk, even though other pages may be in zswap.
> > > >>>
> > > >>> In both cases, this will lead to silent data corruption.
> > > >>>
> > > >>> Proper large folio swapin support needs to go into zswap before
> > > >>> zswap can be enabled in a system that supports large folio swapin.
> > > >>>
> > > >>> Looking at callers of swap_read_folio(), it seems like the folios
> > > >>> they pass are allocated either in __read_swap_cache_async() or in
> > > >>> do_swap_page() in the SWP_SYNCHRONOUS_IO path, both of which
> > > >>> allocate order-0 folios, so we are fine for now.
> > > >>>
> > > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect
> > > >>> changes in the order of those allocations without proper handling
> > > >>> of zswap.
> > > >>>
> > > >>> Alternatively, swap_read_folio() (or its callers) can be updated to
> > > >>> have a fallback mechanism that splits large folios or reads
> > > >>> subpages separately. Similar logic may be needed anyway in case
> > > >>> part of a large folio is already in the swapcache and the rest of
> > > >>> it is swapped out.
> > > >>>
> > > >>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > >>> ---
> > > >>>
> > > >>> Sorry for the long CC list, I just found myself repeatedly looking
> > > >>> at new series that add swap support for mTHPs / large folios,
> > > >>> making sure they do not break with zswap or make incorrect
> > > >>> assumptions. This debug check should give us some peace of mind.
> > > >>> Hopefully this patch will also raise awareness among people who
> > > >>> are working on this.
> > > >>>
> > > >>> ---
> > > >>>  mm/zswap.c | 3 +++
> > > >>>  1 file changed, 3 insertions(+)
> > > >>>
> > > >>> diff --git a/mm/zswap.c b/mm/zswap.c
> > > >>> index b9b35ef86d9be..6007252429bb2 100644
> > > >>> --- a/mm/zswap.c
> > > >>> +++ b/mm/zswap.c
> > > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > > >>>  	if (!entry)
> > > >>>  		return false;
> > > >>>
> > > >>> +	/* Zswap loads do not handle large folio swapins correctly yet */
> > > >>> +	VM_BUG_ON(folio_test_large(folio));
> > > >>> +
> > > >>
> > > >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > > >
> > > > Not without making more fundamental changes to the surrounding swap
> > > > code. Currently zswap_load() returns either true (folio was loaded
> > > > from zswap) or false (folio is not in zswap).
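
(Aside: the boolean contract described above is easiest to see at the
call site. Below is a simplified sketch of how swap_read_folio()
consumes zswap_load(), paraphrased from mm/page_io.c of this era, with
workingset/PSI accounting and the entry VM_BUG_ON_FOLIO() checks elided;
it is not a verbatim copy.

	void swap_read_folio(struct folio *folio, bool synchronous,
			     struct swap_iocb **plug)
	{
		struct swap_info_struct *sis = swp_swap_info(folio->swap);

		if (zswap_load(folio)) {
			/*
			 * true: zswap filled the folio. This is only
			 * correct for order-0 folios, hence the proposed
			 * VM_BUG_ON() inside zswap_load().
			 */
			folio_mark_uptodate(folio);
			folio_unlock(folio);
		} else if (data_race(sis->flags & SWP_FS_OPS)) {
			/* false: not in zswap, read from the swap file */
			swap_read_folio_fs(folio, plug);
		} else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
			swap_read_folio_bdev_sync(folio, sis);
		} else {
			/* false: not in zswap, read from the block device */
			swap_read_folio_bdev_async(folio, sis);
		}
	}

Note there is no path on which zswap can report "partially present",
which is exactly the limitation discussed next.)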
> > > >
> > > > To handle this correctly, zswap_load() would need to tell
> > > > swap_read_folio() which subpages are in zswap and have been loaded,
> > > > and then swap_read_folio() would need to read the remaining subpages
> > > > from disk. This of course assumes that the caller of
> > > > swap_read_folio() made sure that the entire folio is swapped out and
> > > > protected against races with other swapins.
> > > >
> > > > Also, because swap_read_folio() cannot split the folio itself, the
> > > > other swap_read_folio_*() functions that are called from it would
> > > > need to be updated to handle swapping in tail subpages, which may be
> > > > questionable in its own right.
> > > >
> > > > An alternative would be that zswap_load() (or a separate interface)
> > > > could tell swap_read_folio() that the folio is only partially in
> > > > zswap; then we could just bail and tell the caller that it cannot
> > > > read the large folio and that it should be split.
> > > >
> > > > There may be other options as well, but the bottom line is that it
> > > > is possible, just probably not something that we want to do right
> > > > now.
> > > >
> > > > A stronger protection method would be to introduce a config option
> > > > or boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > > > depend on it being disabled, or have zswap check it at boot and
> > > > refuse to be enabled if it is on.
> > >
> > > Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >
> > > I was wondering if we could WARN_ON_ONCE() and make the swap code
> > > detect this like a read error from disk.
> > >
> > > I think do_swap_page() detects that by checking if the folio is not
> > > uptodate:
> > >
> > > 	if (unlikely(!folio_test_uptodate(folio))) {
> > > 		ret = VM_FAULT_SIGBUS;
> > > 		goto out_nomap;
> > > 	}
> > >
> > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > > system (but the app would crash either way, there is no way around it).
> > >
> >
> > I'd rather fall back to small folio swapin instead of crashing apps
> > until we fix large folio swapin in zswap :-)
>
> I think David is referring to catching the buggy cases that do not
> properly fall back to small folios with zswap, not as an alternative to
> the fallback. This is at least what I had in mind with the patch.

Cool. Thanks for the clarification. I'm fine with keeping the check,
whether it's the current VM_BUG_ON or David's recommended SIGBUS.

Currently, mainline doesn't support large folio swap-in, so I see your
patch as a helpful warning for those attempting large folio swap-in to
avoid making mistakes like loading large folios from zswap.

In fact, I spent a week trying to figure out why my app was crashing
before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
had come earlier, it would have saved me that week of work :-)

To me, a direct crash seems like a more straightforward way to prompt
people to fall back when attempting large folio swap-ins. So, I am
slightly in favor of your current patch.

Thanks
Barry
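
(Aside: a minimal sketch of the fallback Barry describes, under stated
assumptions: alloc_swap_folio() is a hypothetical helper from an
out-of-tree large folio swapin series, not a mainline function;
zswap_is_enabled() is the check from Barry's quoted snippet and may not
exist in mainline at the time of this thread; vma_alloc_folio() follows
its signature around v6.9.

	/*
	 * Hypothetical swapin allocation helper (not mainline): only
	 * attempt a large folio when zswap is disabled, because
	 * zswap_load() cannot handle large folios yet; otherwise fall
	 * back to an order-0 folio.
	 */
	static struct folio *alloc_swap_folio(struct vm_fault *vmf, int order)
	{
		struct folio *folio;

		/* zswap would silently corrupt a large folio swapin */
		if (order > 0 && zswap_is_enabled())
			goto fallback;

		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order,
					vmf->vma, vmf->address, true);
		if (folio)
			return folio;
fallback:
		/* order-0 swapin is always handled correctly by zswap */
		return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
				       vmf->vma, vmf->address, false);
	}

With a fallback like this in the swapin path, the VM_BUG_ON() in
zswap_load() should never fire; it only catches series that forget the
fallback, which is the buggy case Yosry's patch targets.)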