On Mon, 24 Apr 2023, David Stevens wrote: > On Sun, Apr 23, 2023 at 1:47 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > > > Inserting Ivan Orlov's syzbot fix commit 2ce0bdfebc74 > > ("mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()") > > ahead of Jiaqi Yan's and David Stevens's commits > > 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory") > > cae106dd67b9 ("mm/khugepaged: refactor collapse_file control flow") > > ac492b9c70ca ("mm/khugepaged: skip shmem with userfaultfd") > > (all of which restructure collapse_file()) did not work out well. > > > > xfstests generic/086 on huge tmpfs (with accelerated khugepaged) freezes > > (if not on the first attempt, then the 2nd or 3rd) in find_lock_entries() > > while doing drop_caches: the file's xarray seems to have been corrupted, > > with find_get_entry() returning nonsense which makes no progress. > > > > Bisection led to ac492b9c70ca; and diff against earlier working linux-next > > suggested that it's probably down to an errant xas_store(), which does not > > belong with the later changes (and nor does the positioning of warnings). > > The later changes look as if they fix the syzbot issue independently. > > > > Remove most of what's left of 2ce0bdfebc74: just leave one WARN_ON_ONCE > > (xas_error) after the final xas_store() of the multi-index entry. > > > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx> > > --- > > > > mm/khugepaged.c | 23 +---------------------- > > 1 file changed, 1 insertion(+), 22 deletions(-) > > > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1941,16 +1941,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > result = SCAN_FAIL; > > goto xa_locked; > > } > > - xas_store(&xas, hpage); > > - if (xas_error(&xas)) { > > - /* revert shmem_charge performed > > - * in the previous condition > > - */ > > - mapping->nrpages--; > > - shmem_uncharge(mapping->host, 1); > > - result = SCAN_STORE_FAILED; > > With this being removed, SCAN_STORE_FAILED should also be removed from > the scan_result enum and trace event definitions. Only if we also remove your use of SCAN_STORE_FAILED in ac492b9c70ca: what would you want that to say instead? I don't care myself for any of those "SCAN" result codes, nor whether they are few or many: I'd rather have __LINE__ numbers for my own debugging. But if people want to remove SCAN_STORE_FAILED now, sure, send a patch; my intent was to unbreak the breakage. Hugh