On Mon, Dec 05, 2022 at 03:40:59PM -0800, Jiaqi Yan wrote:
> Make collapse_file roll back when copying pages failed. More concretely:
> - extract copying operations into a separate loop
> - postpone the updates for nr_none until both scanning and copying
>   succeeded
> - postpone joining small xarray entries until both scanning and copying
>   succeeded
> - postpone the update operations to NR_XXX_THPS until both scanning and
>   copying succeeded
> - for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
>   copying failed
>
> Tested manually:
> 0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
> 1. Start a two-thread application. Each thread allocates a chunk of
>    non-huge memory buffer from /mnt/ramdisk.
> 2. Pick 4 random buffer address (2 in each thread) and inject
>    uncorrectable memory errors at physical addresses.
> 3. Signal both threads to make their memory buffer collapsible, i.e.
>    calling madvise(MADV_HUGEPAGE).
> 4. Wait and then check kernel log: khugepaged is able to recover from
>    poisoned pages by skipping them.
> 5. Signal both threads to inspect their buffer contents and make sure no
>    data corruption.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>

Okay, looks sane.

Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

--
  Kiryl Shutsemau / Kirill A. Shutemov
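
For readers following the thread, the ordering described in the quoted
patch (scan first, copy in a separate loop, touch shared state only once
both succeeded) can be illustrated with a small userspace model. This is
only a sketch and not the kernel code: struct mapping_model,
copy_checked(), collapse_model(), NR_SMALL and SMALL_SIZE are all
hypothetical names invented for the illustration, and a "poisoned" flag
stands in for an uncorrectable memory error hit during the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_SMALL   8   /* hypothetical stand-in for the pages per huge page */
#define SMALL_SIZE 16  /* hypothetical stand-in for the small page size */

struct small_page {
    bool present;                    /* false models a hole (counted as nr_none) */
    bool poisoned;                   /* true models an uncorrectable memory error */
    unsigned char data[SMALL_SIZE];
};

struct mapping_model {
    struct small_page slots[NR_SMALL];
    unsigned char *huge;             /* non-NULL only after a committed collapse */
    int nr_thps;                     /* models the postponed THP counter update */
    int nr_none_filled;              /* models the postponed nr_none update */
};

/* Hypothetical copy helper that fails instead of consuming a poisoned page. */
static int copy_checked(unsigned char *dst, const struct small_page *src)
{
    if (src->poisoned)
        return -1;
    memcpy(dst, src->data, SMALL_SIZE);
    return 0;
}

/* Returns 0 on success, -1 if copying failed; on failure *m is unchanged. */
int collapse_model(struct mapping_model *m, unsigned char *huge_buf)
{
    int nr_none = 0;
    size_t i;

    assert(m != NULL && huge_buf != NULL);

    /* Phase 1: scan only. Count holes locally, modify nothing shared. */
    for (i = 0; i < NR_SMALL; i++)
        if (!m->slots[i].present)
            nr_none++;

    /* Phase 2: copy in a separate loop. Only the private buffer is written. */
    for (i = 0; i < NR_SMALL; i++) {
        unsigned char *dst = huge_buf + i * SMALL_SIZE;

        if (!m->slots[i].present) {
            memset(dst, 0, SMALL_SIZE);  /* holes become zero-filled */
            continue;
        }
        if (copy_checked(dst, &m->slots[i]) != 0)
            return -1;                   /* nothing was committed, nothing to undo */
    }

    /* Phase 3: commit. Counters and the mapping change only here. */
    m->huge = huge_buf;
    m->nr_thps++;
    m->nr_none_filled += nr_none;
    return 0;
}
```

Because every shared update is deferred to the last phase, the failure
path in this model needs no explicit undo. The real patch cannot defer
everything (the quoted description mentions rolling back
filemap_nr_thps_inc for non-SHMEM files), so this sketch covers only the
postponed updates.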