On Tue, Nov 20, 2018 at 12:11:22PM +0300, Kirill A. Shutemov wrote:
> On Sat, Nov 10, 2018 at 11:44:12AM -0500, Andrea Arcangeli wrote:
> > I would prefer to add intelligence to detect when COWs after fork
> > should be done at 2m or 4k granularity (in the latter case by
> > splitting the pmd before the actual COW while leaving the transhuge
> > pmd intact in the other mm), because that would save CPU (and it'd
> > automatically optimize redis). The snapshot process especially would
> > run faster as it will read with THP performance.
>
> I would argue we should switch to 4k COW everywhere. But it requires some

We could do that if MADV_HUGEPAGE is not set, for example. That way
there would still be a way to force the 2M cows if something benefits
from them.

For example, with binaries executed from tmpfs one could want 2M cows
on MAP_PRIVATE to keep the whole executable in 2MB TLB entries despite
the memory loss (but then there are those apparently unreleased libs
that load binaries into anon THP for the same reason, with an even
higher memory waste risk, since unlike tmpfs nothing can be shared if
you run multiple copies of a large Go binary or the like).

Certainly 4k cows would help whenever fork() is used for snapshotting,
but then fork() doesn't look like the best possible mechanism for
atomic snapshots in the first place. It would be interesting to know
which other common workloads would benefit, workloads that, unlike
fork()-for-snapshot, are already as optimal as they can get.

> work on khugepaged side to be able to recover THP back after multiple 4k
> COW in the range. Currently khugepaged is not able to collapse PTE entries
> backed by compound page back to PMD.

Yes, this also answers Anthony's question about what should happen
afterwards to the 4k cows on the doublemap.

The thing is, by the time khugepaged comes around, the child will
hopefully already have quit, so it would be ideal if khugepaged could
notice that the anon page isn't even shared anymore and is fully
private to the process once it holds the mmap_sem for writing. If the
page is not shared anymore and the mapcount is 1, khugepaged doesn't
need to do the 2M cow of the doublemapped THP at all: it just needs to
flush the 4k cow fragments back into the THP, drop the doublemap and
convert the readonly pte entries into a writable pmd_trans_huge (if
VM_WRITE is still set).

> I have this on my todo list for a long time, but...

We're also slowly making progress on uffd-wp, which should offer a
hopefully much more efficient way to do the snapshot than fork();
then the whole fork issue goes away because there will be no fork at
all.
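
A minimal userspace sketch of the MADV_HUGEPAGE opt-in discussed above
(the mapping size is illustrative only): under such a policy, a mapping
like this would keep 2M granularity cows after fork, while mappings
without MADV_HUGEPAGE would get the pmd split and 4k cows instead.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE (64UL << 21)   /* 64 x 2MB, illustrative only */

int main(void)
{
    /* Anonymous private mapping, the usual THP candidate. */
    char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /*
     * Opt this range into THP.  Under the policy discussed above,
     * a range like this would keep whole-2M cows after fork, while
     * ranges that never asked for THP would get the cheaper 4k cows.
     */
    if (madvise(p, MAP_SIZE, MADV_HUGEPAGE))
        return 1;

    memset(p, 0, MAP_SIZE);     /* fault the memory in */
    return 0;
}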
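
And a sketch of the fork()-for-snapshot pattern (redis-style) this is
all about, under the assumption of an anonymous mapping as the dataset
and a plain file dump as the snapshot: the child reads a frozen
copy-on-write view while the parent keeps writing, and the cow
granularity hit on the parent side is exactly what's being debated.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DATA_SIZE (64UL << 20)  /* illustrative dataset size */

int main(void)
{
    /* The "database": anonymous memory the parent keeps mutating. */
    char *data = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED)
        return 1;
    memset(data, 'x', DATA_SIZE);

    pid_t pid = fork();
    if (pid < 0)
        return 1;

    if (pid == 0) {
        /*
         * Child: every page is shared copy-on-write with the parent,
         * so this dumps a consistent point-in-time view.
         */
        int fd = open("snapshot.bin", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0)
            _exit(1);
        if (write(fd, data, DATA_SIZE) != (ssize_t)DATA_SIZE)
            _exit(1);
        close(fd);
        _exit(0);
    }

    /*
     * Parent: keeps serving writes; each store to a still-shared page
     * triggers a cow, either a whole 2M THP copy or, after a pmd
     * split, a single 4k copy.  That granularity is the question.
     */
    data[0] = 'y';

    waitpid(pid, NULL, 0);
    return 0;
}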
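
Finally, a rough sketch of how a uffd-wp based snapshot could look from
userspace. Note that UFFDIO_REGISTER_MODE_WP, UFFDIO_WRITEPROTECT and
UFFD_FEATURE_PAGEFAULT_FLAG_WP here reflect the write-protect interface
as it was later merged upstream, not the API as it stands in this
thread, so treat the details as an assumption about where the work is
heading rather than a finished interface.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define AREA_SIZE (64UL << 21)  /* illustrative */

int main(void)
{
    char *area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return 1;

    /* Open the userfaultfd and negotiate the write-protect feature. */
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return 1;

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    if (ioctl(uffd, UFFDIO_API, &api))
        return 1;

    /* Register the snapshot area in write-protect mode. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = AREA_SIZE },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg))
        return 1;

    /*
     * Write-protect the whole range: from here on, writes raise uffd
     * events instead of relying on fork()-style cow.  A monitor thread
     * (omitted here) would read the uffd, copy each faulting page out
     * to the snapshot, then resolve the fault by un-write-protecting
     * that page with another UFFDIO_WRITEPROTECT (mode 0).
     */
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = AREA_SIZE },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
        return 1;

    return 0;
}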