On Sat, Jun 8, 2024 at 5:43 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 06.06.24 23:53, Barry Song wrote:
> > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >>
> > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >>>
> > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On 06.06.24 22:31, Yosry Ahmed wrote:
> > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
> > >>>>>>> With ongoing work to support large folio swapin, it is important to make
> > >>>>>>> sure we do not pass large folios to zswap_load() without implementing
> > >>>>>>> proper support.
> > >>>>>>>
> > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are
> > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
> > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > >>>>>>> zswap_load() will only effectively load the first page in the folio. If
> > >>>>>>> the first page is not in zswap, the folio will be read from disk, even
> > >>>>>>> though other pages may be in zswap.
> > >>>>>>>
> > >>>>>>> In both cases, this will lead to silent data corruption.
> > >>>>>>>
> > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap
> > >>>>>>> can be enabled in a system that supports large folio swapin.
> > >>>>>>>
> > >>>>>>> Looking at callers of swap_read_folio(), it seems like the folios they
> > >>>>>>> pass are allocated either from __read_swap_cache_async() or from
> > >>>>>>> do_swap_page() in the SWP_SYNCHRONOUS_IO path, both of which allocate
> > >>>>>>> order-0 folios, so we are fine for now.
> > >>>>>>>
> > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > >>>>>>> the order of those allocations without proper handling of zswap.
> > >>>>>>>
> > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > >>>>>>> a fallback mechanism that splits large folios or reads subpages
> > >>>>>>> separately. Similar logic may be needed anyway in case part of a large
> > >>>>>>> folio is already in the swapcache and the rest of it is swapped out.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > >>>>>>> ---
> > >>>>>>>
> > >>>>>>> Sorry for the long CC list; I just found myself repeatedly looking at
> > >>>>>>> new series that add swap support for mTHPs / large folios, making sure
> > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug
> > >>>>>>> check should give us some peace of mind. Hopefully this patch will also
> > >>>>>>> raise awareness among people who are working on this.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>>  mm/zswap.c | 3 +++
> > >>>>>>>  1 file changed, 3 insertions(+)
> > >>>>>>>
> > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644
> > >>>>>>> --- a/mm/zswap.c
> > >>>>>>> +++ b/mm/zswap.c
> > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >>>>>>>        if (!entry)
> > >>>>>>>                return false;
> > >>>>>>>
> > >>>>>>> +      /* Zswap loads do not handle large folio swapins correctly yet */
> > >>>>>>> +      VM_BUG_ON(folio_test_large(folio));
> > >>>>>>> +
> > >>>>>>
> > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > >>>>>
> > >>>>> Not without making more fundamental changes to the surrounding swap
> > >>>>> code. Currently zswap_load() returns either true (folio was loaded
> > >>>>> from zswap) or false (folio is not in zswap).
> > >>>>>
> > >>>>> To handle this correctly, zswap_load() would need to tell
> > >>>>> swap_read_folio() which subpages are in zswap and have been loaded,
> > >>>>> and then swap_read_folio() would need to read the remaining subpages
> > >>>>> from disk. This of course assumes that the caller of swap_read_folio()
> > >>>>> made sure that the entire folio is swapped out and protected against
> > >>>>> races with other swapins.
> > >>>>>
> > >>>>> Also, because swap_read_folio() cannot split the folio itself, the
> > >>>>> swap_read_folio_*() functions that are called from it would have to be
> > >>>>> updated to handle swapping in tail subpages, which may be questionable
> > >>>>> in its own right.
> > >>>>>
> > >>>>> An alternative would be for zswap_load() (or a separate interface) to
> > >>>>> tell swap_read_folio() that the folio is partially in zswap; then we
> > >>>>> can just bail and tell the caller that it cannot read the large folio
> > >>>>> and that it should be split.
> > >>>>>
> > >>>>> There may be other options as well, but the bottom line is that it is
> > >>>>> possible, just probably not something that we want to do right now.
> > >>>>>
> > >>>>> A stronger protection method would be to introduce a config option or
> > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > >>>>> depend on it being disabled, or have zswap check it at boot and refuse
> > >>>>> to be enabled if it is on.
> > >>>>
> > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >>>>
> > >>>> I was wondering if we could WARN_ON_ONCE() and make the swap code detect
> > >>>> this like a read error from disk.
> > >>>>
> > >>>> I think do_swap_page() detects that by checking whether the folio is not
> > >>>> uptodate:
> > >>>>
> > >>>>         if (unlikely(!folio_test_uptodate(folio))) {
> > >>>>                 ret = VM_FAULT_SIGBUS;
> > >>>>                 goto out_nomap;
> > >>>>         }
> > >>>>
> > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > >>>> system (but the app would crash either way, there is no way around it).
> > >>>>
> > >>>
> > >>> I'd rather fall back to small folio swapin instead of crashing apps until
> > >>> we fix large folio swapin in zswap :-)
> > >>
> > >> I think David is referring to catching the buggy cases that do not
> > >> properly fall back to small folios with zswap, not proposing an
> > >> alternative to the fallback. This is at least what I had in mind with
> > >> the patch.
> > >
> > > Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> > > whether the check is the current VM_BUG_ON or David's recommended
> > > SIGBUS.
> > >
> > > Currently, mainline doesn't support large folio swap-in, so I see
> > > your patch as a helpful warning for those attempting large folio
> > > swap-in to avoid making mistakes like loading large folios from
> > > zswap.
> > >
> > > In fact, I spent a week trying to figure out why my app was crashing
> > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> > > had come earlier, it would have saved me that week of work :-)
> > >
> > > To me, a direct crash seems like a more straightforward way to
> > > prompt people to fall back when attempting large folio swap-ins.
> > > So, I am slightly in favor of your current patch.
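(To make Barry's 'if (zswap_is_enabled()) goto fallback' concrete: below is a
minimal sketch, not from any patch in this thread, of how a large-folio
swap-in path could clamp its allocation order while zswap is on.
swapin_folio_order() is a made-up helper name; zswap_is_enabled() is the
check Barry refers to above.)

/*
 * Sketch: clamp the swap-in allocation order to 0 while zswap is
 * enabled, because zswap_load() only restores the first page of a
 * large folio and the remaining subpages would be silently corrupted.
 */
static int swapin_folio_order(int order)
{
	if (zswap_is_enabled())
		return 0;

	return order;
}

(A do_swap_page()-style caller would then allocate its folio with the clamped
order, so zswap_load() only ever sees order-0 folios.)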
> >
> > BUG_ON and friends are frowned upon, in particular in scenarios where we
> > can recover or that are so unexpected that they can be found during
> > early testing.
> >
> > I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > be sufficient to find such issues early during testing. No need to crash
> > the machine.
>
> I thought VM_BUG_ON() was less frowned upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
>
> How about something like the following (untested)? It is the minimal
> recovery we can do, but it should work for a lot of cases, and it does
> nothing beyond a warning if we can swap in the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>        delayacct_swapin_start();
>
>        if (zswap_load(folio)) {
> -              folio_mark_uptodate(folio);
>                folio_unlock(folio);
>        } else if (data_race(sis->flags & SWP_FS_OPS)) {
>                swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
>        VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> +      /*
> +       * Large folios should not be swapped in while zswap is being used, as
> +       * they are not properly handled.
> +       *
> +       * If any of the subpages are in zswap, reading from disk would result
> +       * in data corruption, so return true without marking the folio uptodate
> +       * so that an IO error is emitted (e.g. do_swap_page() will SIGBUS).
> +       *
> +       * Otherwise, return false and read the folio from disk.
> +       */
> +      if (WARN_ON_ONCE(folio_test_large(folio))) {
> +              if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, XA_PRESENT))
> +                      return true;
> +              return false;
> +      }
> +
>        /*
>         * When reading into the swapcache, invalidate our entry. The
>         * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
>                zswap_entry_free(entry);
>                folio_mark_dirty(folio);
>        }
> -
> +      folio_mark_uptodate(folio);
>        return true;
>  }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted simply because CONFIG_ZSWAP is on. Perhaps we need a variable
> or static key recording whether zswap was *ever* enabled.
>
> Barry, I suspect your is_zswap_enabled() check is deficient for
> similar reasons: zswap could have been enabled before and then become
> disabled.

I don't understand this. If zswap was enabled before but is disabled while
I am loading data, will I get corrupted data because zswap was once
enabled? If not, it doesn't seem important.

Thanks
Barry
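(To make the "variable or static key" idea concrete: a minimal sketch of such
an 'ever enabled' latch, assuming it hooks the setter of zswap's bool module
parameter; zswap_ever_enabled and zswap_never_enabled() are illustrative
names, not code from this thread. The underlying point, which answers Barry's
question, is implied by Yosry above: entries stored while zswap was enabled
remain in the zswap trees after it is disabled, so swap-in code must keep
consulting zswap unless it has never been on at all; a point-in-time
zswap_is_enabled() check can therefore miss data that a one-way latch would
catch.)

/* Latched to true the first time zswap is enabled; never cleared. */
static bool zswap_ever_enabled __read_mostly;

static int zswap_enabled_param_set(const char *val,
				   const struct kernel_param *kp)
{
	int ret = param_set_bool(val, kp);

	/* Record that zswap has been on at least once. */
	if (!ret && *(bool *)kp->arg)
		WRITE_ONCE(zswap_ever_enabled, true);
	return ret;
}

/* Swap-in paths would check this rather than the current enabled state. */
bool zswap_never_enabled(void)
{
	return !READ_ONCE(zswap_ever_enabled);
}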