On 2020-01-07 at 16:24 Konstantin Khlebnikov wrote: >On 07/01/2020 05.44, lixinhai.lxh@xxxxxxxxx wrote: >> On 2020-01-07 at 04:35 Konstantin Khlebnikov wrote: >>> On Mon, Jan 6, 2020 at 1:42 PM Konstantin Khlebnikov >>> <khlebnikov@xxxxxxxxxxxxxx> wrote: >>>> >>>> This fixes couple misconceptions in commit 4e4a9eb92133 ("mm/rmap.c: reuse >>>> mergeable anon_vma as parent when fork"). >>>> >>>> First problem caused by initialization order in dup_mmap(): vma->vm_prev >>>> is set after calling anon_vma_fork(). Thus in anon_vma_fork() it points to >>>> previous VMA in parent mm. This is fixed by rearrangement in dup_mmap(). >>>> >>>> If in parent VMAs: SRC1 SRC2 .. SRCn share anon-vma ANON0, then after fork >>>> before all patches in child process related VMAs: DST1 DST2 .. DSTn will >>>> use different anon-vmas: ANON1 ANON2 .. ANONn. Before this patch only DST1 >>>> will fork new ANON1 and following DST2 .. DSTn will share parent's ANON0. >>>> With this patch DST1 will create new ANON1 and DST2 .. DSTn will share it. >>>> >>>> Also this patch moves sharing logic out of anon_vma_clone() into more >>>> specific anon_vma_fork() because this supposed to work only at fork(). >>>> Function anon_vma_clone() is more generic is also used at splitting VMAs. >>>> >>>> Second problem is hidden behind first one: assumption "Parent has vm_prev, >>>> which implies we have vm_prev" is wrong if first VMA in parent mm has set >>>> flag VM_DONTCOPY. Luckily prev->anon_vma doesn't dereference NULL pointer >>>> because in current code 'prev' actually is same as 'pprev'. To avoid that >>>> this patch just checks pointer and compares vm_start to verify relation >>>> between previous VMAs in parent and child. >>>> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> >>>> Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork") >>> >>> Oops, I've forgot to mention that Li Xinhai <lixinhai.lxh@xxxxxxxxx> >>> found and reported this suspicious code. Sorry. >>> >>> Reported-by: Li Xinhai <lixinhai.lxh@xxxxxxxxx> >>> Link: https://lore.kernel.org/linux-mm/CALYGNiNzz+dxHX0g5-gNypUQc3B=8_Scp53-NTOh=zWsdUuHAw@xxxxxxxxxxxxxx/T/#t >>> >> >> Can we change the interface >> int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma), >> to >> int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma, struct vm_area_struct *pcvma), >> and 'pcvma' means previous child vma. >> so highlight the use of that vma, and the current code sequence for linking 'tmp' vma >> in dup_mmap() is not changed(in case some code would have dependency on that >> linking sequence) > >There should be no dependency on linking sequence. >But we could generalize sharing: cloned vma could share prev anon-vma >(or any other actually) if anon_vma->parent == src->anon_vma. >This is more clear than sharing only between related vmas. > I mean in int dup_mmap(...) { tmp = vm_area_dup(mpnt); //.... some code in this part would have dependency on 'tmp->vm_prev, tmp->vm_next' // ... I didn't go through all this part, but need take care about it. // so, better to keep the current semantic and pass in *pprev for anon vma call /* * Link in the new vma and copy the page table entries. */ *pprev = tmp; pprev = &tmp->vm_next; tmp->vm_prev = prev; prev = tmp; } >> >> Another issue is for linking the avc for the reused anon_vma. anon_vma_clone() >> use the iteration >> list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma), >> to link avc for child vma, and it is unable to reach the resued anon_vma because >> that is from the previous vma not from parent vma. So, in anon_vma_fork(), >> we need to setup the avc link for vma->anon. > >Oh, yes. That's another example where current code miraculously stays correct. > >> >>>> --- >>>> kernel/fork.c | 4 ++-- >>>> mm/rmap.c | 25 ++++++++++++------------- >>>> 2 files changed, 14 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>> index 2508a4f238a3..04ee5e243f65 100644 >>>> --- a/kernel/fork.c >>>> +++ b/kernel/fork.c >>>> @@ -548,6 +548,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>> if (retval) >>>> goto fail_nomem_policy; >>>> tmp->vm_mm = mm; >>>> + tmp->vm_prev = prev; /* anon_vma_fork use this */ >>>> + tmp->vm_next = NULL; >>>> retval = dup_userfaultfd(tmp, &uf); >>>> if (retval) >>>> goto fail_nomem_anon_vma_fork; >>>> @@ -559,7 +561,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>> } else if (anon_vma_fork(tmp, mpnt)) >>>> goto fail_nomem_anon_vma_fork; >>>> tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); >>>> - tmp->vm_next = tmp->vm_prev = NULL; >>>> file = tmp->vm_file; >>>> if (file) { >>>> struct inode *inode = file_inode(file); >>>> @@ -592,7 +593,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>> */ >>>> *pprev = tmp; >>>> pprev = &tmp->vm_next; >>>> - tmp->vm_prev = prev; >>>> prev = tmp; >>>> >>>> __vma_link_rb(mm, tmp, rb_link, rb_parent); >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index b3e381919835..77b3aa38d5c2 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -269,19 +269,6 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) >>>> { >>>> struct anon_vma_chain *avc, *pavc; >>>> struct anon_vma *root = NULL; >>>> - struct vm_area_struct *prev = dst->vm_prev, *pprev = src->vm_prev; >>>> - >>>> - /* >>>> - * If parent share anon_vma with its vm_prev, keep this sharing in in >>>> - * child. >>>> - * >>>> - * 1. Parent has vm_prev, which implies we have vm_prev. >>>> - * 2. Parent and its vm_prev have the same anon_vma. >>>> - */ >>>> - if (!dst->anon_vma && src->anon_vma && >>>> - pprev && pprev->anon_vma == src->anon_vma) >>>> - dst->anon_vma = prev->anon_vma; >>>> - >>>> >>>> list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { >>>> struct anon_vma *anon_vma; >>>> @@ -334,6 +321,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) >>>> */ >>>> int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >>>> { >>>> + struct vm_area_struct *prev = vma->vm_prev, *pprev = pvma->vm_prev; >>>> struct anon_vma_chain *avc; >>>> struct anon_vma *anon_vma; >>>> int error; >>>> @@ -345,6 +333,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >>>> /* Drop inherited anon_vma, we'll reuse existing or allocate new. */ >>>> vma->anon_vma = NULL; >>>> >>>> + /* >>>> + * If parent shares anon_vma with its vm_prev, keep this sharing. >>>> + * >>>> + * Previous VMA could be missing or not match previuos in parent >>>> + * if VM_DONTCOPY is set: compare vm_start to avoid this case. >>>> + */ >>>> + if (pvma->anon_vma && pprev && prev && >>>> + pprev->anon_vma == pvma->anon_vma && >>>> + pprev->vm_start == prev->vm_start) >>>> + vma->anon_vma = prev->anon_vma; >>>> + >>>> /* >>>> * First, attach the new VMA to the parent VMA's anon_vmas, >>>> * so rmap can find non-COWed pages in child processes. >>>> >>> >