* Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [240808 14:02]: > On Thu, Aug 08, 2024 at 06:45:43PM GMT, Vlastimil Babka wrote: > > On 8/5/24 14:13, Lorenzo Stoakes wrote: > > > In mmap_region() and do_brk_flags() we open code scenarios where we prefer > > > to use vma_expand() rather than invoke a full vma_merge() operation. > > > > > > Abstract this logic and eliminate all of the open-coding, and also use the > > > same logic for all cases where we add new VMAs to, rather than ultimately > > > use vma_merge(), rather use vma_expand(). > > > > > > We implement this by replacing vma_merge_new_vma() with this newly > > > abstracted logic. > > > > > > Doing so removes duplication and simplifies VMA merging in all such cases, > > > laying the ground for us to eliminate the merging of new VMAs in > > > vma_merge() altogether. > > > > > > This makes it far easier to understand what is happening in these cases > > > avoiding confusion, bugs and allowing for future optimisation. > > > > > > As a result of this change we are also able to make vma_prepare(), > > > init_vma_prep(), vma_complete(), can_vma_merge_before() and > > > can_vma_merge_after() static and internal to vma.c. > > > > That's really great, but it would be even better if these code moves could > > be a separate patch as it would make reviewing so much easier. But with git > > diff's --color-moved to the rescue, let me try... > > Will separate out on respin. > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > --- > > > mm/mmap.c | 79 ++--- > > > mm/vma.c | 482 +++++++++++++++++++------------ > > > mm/vma.h | 51 +--- > > > tools/testing/vma/vma_internal.h | 6 + > > > 4 files changed, 324 insertions(+), 294 deletions(-) ... > > > + */ > > > +struct vm_area_struct *vma_merge_new_vma(struct vma_merge_struct *vmg) > > > +{ > > > + bool is_special = vmg->flags & VM_SPECIAL; > > > + struct vm_area_struct *prev = vmg->prev; > > > + struct vm_area_struct *next = vmg->next; > > > + unsigned long start = vmg->start; > > > + unsigned long end = vmg->end; > > > + pgoff_t pgoff = vmg->pgoff; > > > + pgoff_t pglen = PHYS_PFN(end - start); > > > + > > > + VM_WARN_ON(vmg->vma); > > > + > > > + if (!prev && !next) { > > > + /* > > > + * Since the caller must have determined that the requested > > > + * range is empty, vmg->vmi will be left pointing at the VMA > > > + * immediately prior. > > > + */ > > > > OK that's perhaps not that obvious, as it seems copy_vma() is doing some > > special dance to ensure this. Should we add it to the ASSUMPTIONS and assert > > it, or is there a maple tree operation we can do to ensure it, ideally if > > it's very cheap if the iterator is already set the way we want it to be? > > > > To be fair this is something that was previously assumed, and I just added > a comment. > > Will add to assumptions, and again I think any assert should be done in > such a way that under non-CONFIG_DEBUG_VM nothing happens, maybe > VM_WARN_ON()? > > Will try to come up with something. > > > > + next = vmg->next = vma_next(vmg->vmi); > > > + prev = vmg->prev = vma_prev(vmg->vmi); > > > + > > > + /* Avoid maple tree re-walk. */ > > > + if (is_special && prev) > > > + vma_iter_next_range(vmg->vmi); > > > > I wish I knew what this did but seems it's the same as the old code did so > > hopefully that's fine. > > I think point is that we are about to exit, so we'd be left pointing at > prev. But since we're exiting in just a second, we want to be pointing at > the next vma which will become the prev of the next merge attempt. > > Liam can maybe elucidate further. What you have to remember is that the vma iterator (vmg->vmi above), contains (or, basically is) a maple state (usually written as mas). We keep state of the maple tree walker so that we don't have to keep re-walking to find the same thing. We move around the tree with this maple state because going prev/next is faster from leaves (almost always just the next thing in the nodes array of pointers). We use the maple state to write as well, so the maple state needs to point to the correct location in the tree for a write. The maple tree is a range-based tree, so each entry exists for a span of values. A write happens at the lowest index and can overwrite subsequent values. This means that the maple state needs to point to the range containing the lowest index for the write (if it's pointing to a node - it could walk from the top). A side effect of writing to the lowest index is that we need to point to the previous vma if we are going to 'expand' the vma. The range is essentially going to be from prev->start to "whatever we are expanding over". In the old code, the vm_flags & VM_SPECIAL code meant there was no way an expansion was going to happen, but we've moved the maple state to the wrong location for a write of a new vma - so this vma_iter_next_range() just moves it back. Then we "goto cannot_expand". > > > > > > + } > > > + > > > + /* If special mapping or no adjacent VMAs, nothing to merge. */ > > > + if (is_special || (!prev && !next)) > > > + return NULL; > > > + > > > + /* If we can merge with the following VMA, adjust vmg accordingly. */ > > > + if (next && next->vm_start == end && can_vma_merge_before(vmg)) { > > > + /* > > > + * We can adjust this here as can_vma_merge_after() doesn't > > > + * touch vmg->end. > > > + */ > > > + vmg->end = next->vm_end; > > > + vmg->vma = next; > > > + vmg->pgoff = next->vm_pgoff - pglen; > > > + > > > + vmg->anon_vma = next->anon_vma; > > > + } > > > + > > > + /* If we can merge with the previous VMA, adjust vmg accordingly. */ > > > + if (prev && prev->vm_end == start && can_vma_merge_after(vmg)) { > > > + vmg->start = prev->vm_start; > > > + vmg->vma = prev; > > > + vmg->pgoff = prev->vm_pgoff; > > > + } else if (prev) { > > > + vma_iter_next_range(vmg->vmi); > > > + } > > > > Sigh... ditto. > > > > (Liam can correct me) I think this is just setting up the vmi similar to > the other case such that if expansion fails we can positioned correctly for > the next merge attempt. > > Yes it's fiddly, maybe needs a comment... Yes, ditto.