On Wed, Jan 02, 2019 at 01:56:41PM -0800, David Rientjes wrote: > On Sat, 29 Dec 2018, Kirill A. Shutemov wrote: > > > > > > Smaps reveals the state of each vma at the moment it is read, not the > > > > > future, this has always been the case before and after the broken commit; > > > > > it's not intended to be something that can specify "thp is currently > > > > > disabled, but if we did prctl(PR_SET_THP_DISABLE, 0), this vma has done > > > > > madvise(MADV_HUGEPAGE), and we are on a thp enabled == madvise host, so > > > > > it would be eligible for thp." > > > > > > > > The point of CRIU is that it tries to save and re-create the state of the > > > > process. > > > > > > > > With your fix, "nh" in VMA flags can mean two different things: > > > > - MADV_NOHUGEPAGE was called for the VMA, or > > > > - prctl(PR_SET_THP_DISABLE) was called for the whole process > > > > > > > > Current CRIU code assumes that "nh" means MADV_NOHUGEPAGE. But if it was > > > > due to prctl(PR_SET_THP_DISABLE) the process state will not be restored > > > > correctly: all VMAs will be restored with MADV_NOHUGEPAGE and future > > > > re-enabling THP with prctl(PR_SET_THP_DISABLE) for restored process will > > > > not have any effect. VMAs stay not eligble for THP. > > > > > > > > > > CRIU's usage of "nh" was created specifically for PR_SET_THP_DISABLE: > > > > > > commit 9f61de87768b33d9b187ef71a57262a17276ff3b > > > > > > proc_parse: make smaps parsing helpers globally available > > > > > > The is_vma_range_fmt and parse_vmflags will be required for detection of > > > availability of PR_SET_THP_DISABLE prctl > > > > > > Acked-by: Pavel Emelyanov <xemul@xxxxxxxxxxxxx> > > > Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx> > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxxxxx> > > > > > > > > > That's at > > > https://github.com/checkpoint-restore/criu/commit/9f61de87768b33d9b187ef71a57262a17276ff3b. > > > > > > They are using it for the *exact* same reason that we are. That commit > > > was merged June 2017 based on the kernel code as it existed prior to the > > > broken commit in July 2017. > > > > > > > > > Does this not suggest that CRIU is now equally as broken as my userspace? > > > > I think it *was* broken in the same way as your userspace. And it was > > adjusted to the updated kernel by 5480a2a40683. > > > > Ah, ok, I see this in criu: > > commit 5480a2a4068349c911fd78b9a45850b4b8a5596e > Author: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx> > Date: Wed Jun 28 09:43:25 2017 +0300 > > kerndat: add test for availability of PR_SET_THP_DISABLE prctl > > The PR_SET_THP_DISABLE prctl allows control of transparent huge pages on > per-process basis. It is available since Linux 3.15, but until recently it > set VM_NOHUGEPAGE for all VMAs created after prctl() call, which prevents > proper restore for combination of pre- and post-copy. A recent change to > prctl(PR_SET_THP_DISABLE) behavior eliminates the use of per-VMA flags and > we can use the new version of the prctl() to disable THP. > > Acked-by: Pavel Emelyanov <xemul@xxxxxxxxxxxxx> > Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx> > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxxxxx> > > So commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") > broke our usecase as well as criu. I'm not sure of the implications of > the fix above, it appears that it tests if PR_SET_THP_DISABLE is available > and whether or not we can parse VmFlags from smaps to detect that. > > I assume that criu still works on previous kernels such that if this patch > were merged that it wouldn't be broken. The commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") didn't break CRIU. On the contrary, it allowed combination of pre- and post-copy migration that didn't work with THP otherwise. When CRIU does checkpoint and restore or pre-copy migration it does not care about THP at all, the only important thing in these cases is to have VmFlags identical before and after the checkpoint (migration). With introduction of post-copy migration, we've faced a problem when we tried to combine pre- and post-copy. In this case we populate a part of a memory region with data that was saved during the pre-copy stage. Afterwards, the region is registered with userfaultfd and we expect to get page faults for the parts of the region that were not yet populated. However, khugepaged collapses the pages and the page faults we would expect do not occur. We had a long discussion [1] how to prevent khugepaged from collapsing some regions without altering VmFlags for good. The outcome of that discussion was the commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active"). The CRIU commit 5480a2a40683 ("kerndat: add test for availability of PR_SET_THP_DISABLE prctl") is a part of a larger set that actually enables the use of prctl(PR_SET_THP_DISABLE) to temporarily disable THP for the restored process during post-copy migration and re-enable it once the restore is done. > So we have a case where multiple users have been broken by the commit, is > there an objection to merging this so that we always get "nh" in VmFlags > when PR_SET_THP_DISABLE is used since this was the only way it was > detectable on earlier kernels? This will break CRIU because it won't be able to detect why a VMA has 'nh' flag set as I explained at [2]. [1] https://lore.kernel.org/lkml/1495433562-26625-1-git-send-email-rppt@xxxxxxxxxxxxxxxxxx/#t [2] https://lwn.net/ml/linux-api/20181230175555.GA31291%40rapoport-lnx/ -- Sincerely yours, Mike.