On Wed, Feb 03, 2021 at 01:15:03PM -0800, Linus Torvalds wrote: > On Wed, Feb 3, 2021 at 1:08 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > This is the last missing piece of the COW-during-fork effort when there're > > pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for > > pinned pages during fork() for ptes", 2020-09-27) for more information, since > > we do similar things here rather than pte this time, but just for hugetlb. > > No issues with the code itself, but.. > > Comments are good, but the comments inside this block of code actually > makes the code *much* harder to read, because now the actual logic is > much more spread out and you can't see what it does so well. > > > + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) { > > + /* This is very possibly a pinned huge page */ > > + if (!prealloc) { > > + /* > > + * Preallocate the huge page without > > + * tons of locks since we could sleep. > > + * Note: we can't use any reservation > > + * because the page will be exclusively > > + * owned by the child later. > > + */ > > + put_page(ptepage); > > + spin_unlock(src_ptl); > > + spin_unlock(dst_ptl); > > + prealloc = alloc_huge_page(vma, addr, 0); > > + if (!prealloc) { > > + /* > > + * hugetlb_cow() seems to be > > + * more careful here than us. > > + * However for fork() we could > > + * be strict not only because > > + * no one should be referencing > > + * the child mm yet, but also > > + * if resources are rare we'd > > + * better simply fail the > > + * fork() even earlier. > > + */ > > + ret = -ENOMEM; > > + break; > > + } > > + goto again; > > + } > > + /* > > + * We have page preallocated so that we can do > > + * the copy right now. > > + */ > > + hugetlb_copy_page(vma, dst_pte, addr, ptepage, > > + prealloc); > > + put_page(ptepage); > > + spin_unlock(src_ptl); > > + spin_unlock(dst_ptl); > > + prealloc = NULL; > > + continue; > > + } > > Can you move the comment above the code? Sure. > And I _think_ the prealloc conditional could be split up to a helper function > (which would help more), but maybe there are too many variables for that to > be practical. It's just that comparing to pte case where we introduced page_copy_prealloc(), we've already got a very nice helper alloc_huge_page() for that for e.g. cgroup charging and so on, so it seems already clean enough to use it. The only difference comparing to the pte case is I moved the reset of "prealloc" to be out of the copy function since we never fail after all, to avoid passing a struct page** double pointer. Would below look better (only comment change)? ---------------8<------------------ @@ -3816,6 +3832,39 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, } set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); } else { + entry = huge_ptep_get(src_pte); + ptepage = pte_page(entry); + get_page(ptepage); + + /* + * This is a rare case where we see pinned hugetlb + * pages while they're prone to COW. We need to do the + * COW earlier during fork. + * + * When pre-allocating the page we need to be without + * all the locks since we could sleep when allocate. + */ + if (unlikely(page_needs_cow_for_dma(vma, ptepage))) { + if (!prealloc) { + put_page(ptepage); + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + prealloc = alloc_huge_page(vma, addr, 0); + if (!prealloc) { + ret = -ENOMEM; + break; + } + goto again; + } + hugetlb_copy_page(vma, dst_pte, addr, ptepage, + prealloc); + put_page(ptepage); + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + prealloc = NULL; + continue; + } + ---------------8<------------------ Thanks, -- Peter Xu