* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [240808 14:34]: > * Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [240808 14:02]: > > On Thu, Aug 08, 2024 at 06:45:43PM GMT, Vlastimil Babka wrote: > > > On 8/5/24 14:13, Lorenzo Stoakes wrote: > > > > In mmap_region() and do_brk_flags() we open code scenarios where we prefer > > > > to use vma_expand() rather than invoke a full vma_merge() operation. > > > > > > > > Abstract this logic and eliminate all of the open-coding, and also use the > > > > same logic for all cases where we add new VMAs to, rather than ultimately > > > > use vma_merge(), rather use vma_expand(). > > > > > > > > We implement this by replacing vma_merge_new_vma() with this newly > > > > abstracted logic. > > > > > > > > Doing so removes duplication and simplifies VMA merging in all such cases, > > > > laying the ground for us to eliminate the merging of new VMAs in > > > > vma_merge() altogether. > > > > > > > > This makes it far easier to understand what is happening in these cases > > > > avoiding confusion, bugs and allowing for future optimisation. > > > > > > > > As a result of this change we are also able to make vma_prepare(), > > > > init_vma_prep(), vma_complete(), can_vma_merge_before() and > > > > can_vma_merge_after() static and internal to vma.c. > > > > > > That's really great, but it would be even better if these code moves could > > > be a separate patch as it would make reviewing so much easier. But with git > > > diff's --color-moved to the rescue, let me try... > > > > Will separate out on respin. > > > > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > > --- > > > > mm/mmap.c | 79 ++--- > > > > mm/vma.c | 482 +++++++++++++++++++------------ > > > > mm/vma.h | 51 +--- > > > > tools/testing/vma/vma_internal.h | 6 + > > > > 4 files changed, 324 insertions(+), 294 deletions(-) > > ... > > > > + */ > > > > +struct vm_area_struct *vma_merge_new_vma(struct vma_merge_struct *vmg) > > > > +{ > > > > + bool is_special = vmg->flags & VM_SPECIAL; > > > > + struct vm_area_struct *prev = vmg->prev; > > > > + struct vm_area_struct *next = vmg->next; > > > > + unsigned long start = vmg->start; > > > > + unsigned long end = vmg->end; > > > > + pgoff_t pgoff = vmg->pgoff; > > > > + pgoff_t pglen = PHYS_PFN(end - start); > > > > + > > > > + VM_WARN_ON(vmg->vma); > > > > + > > > > + if (!prev && !next) { > > > > + /* > > > > + * Since the caller must have determined that the requested > > > > + * range is empty, vmg->vmi will be left pointing at the VMA > > > > + * immediately prior. > > > > + */ > > > > > > OK that's perhaps not that obvious, as it seems copy_vma() is doing some > > > special dance to ensure this. Should we add it to the ASSUMPTIONS and assert > > > it, or is there a maple tree operation we can do to ensure it, ideally if > > > it's very cheap if the iterator is already set the way we want it to be? > > > > > > > To be fair this is something that was previously assumed, and I just added > > a comment. > > > > Will add to assumptions, and again I think any assert should be done in > > such a way that under non-CONFIG_DEBUG_VM nothing happens, maybe > > VM_WARN_ON()? > > > > Will try to come up with something. Something like: VM_BUG_ON(vma_iter_end(vmg->vmi) > start); > > > > > > + next = vmg->next = vma_next(vmg->vmi); and: VM_BUG_ON(vma_iter_addr(vmg->vmi) < end); > > > > + prev = vmg->prev = vma_prev(vmg->vmi); > > > > + > > > > + /* Avoid maple tree re-walk. */ > > > > + if (is_special && prev) > > > > + vma_iter_next_range(vmg->vmi); > > > > > > I wish I knew what this did but seems it's the same as the old code did so > > > hopefully that's fine. > > > > I think point is that we are about to exit, so we'd be left pointing at > > prev. But since we're exiting in just a second, we want to be pointing at > > the next vma which will become the prev of the next merge attempt. > > > > Liam can maybe elucidate further. > > What you have to remember is that the vma iterator (vmg->vmi above), > contains (or, basically is) a maple state (usually written as mas). We > keep state of the maple tree walker so that we don't have to keep > re-walking to find the same thing. We move around the tree with this > maple state because going prev/next is faster from leaves (almost always > just the next thing in the nodes array of pointers). > > We use the maple state to write as well, so the maple state needs to > point to the correct location in the tree for a write. > > The maple tree is a range-based tree, so each entry exists for a span of > values. A write happens at the lowest index and can overwrite > subsequent values. This means that the maple state needs to point to > the range containing the lowest index for the write (if it's pointing to > a node - it could walk from the top). > > A side effect of writing to the lowest index is that we need to point to > the previous vma if we are going to 'expand' the vma. The range is > essentially going to be from prev->start to "whatever we are expanding > over". > > In the old code, the vm_flags & VM_SPECIAL code meant there was no way > an expansion was going to happen, but we've moved the maple state to the > wrong location for a write of a new vma - so this vma_iter_next_range() > just moves it back. Then we "goto cannot_expand". >