* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [231006 21:11]:
> * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231005 11:56]:
> > 
> > On 2023/10/5 03:53, Liam R. Howlett wrote:
> > > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231004 05:10]:
> > > > 
> > > > On 2023/10/4 02:46, Liam R. Howlett wrote:
> > > > > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230924 23:58]:
> > > > > > In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
> > > > > > directly replacing the entries of VMAs in the new maple tree can result
> > > > > > in better performance. __mt_dup() uses DFS pre-order to duplicate the
> > > > > > maple tree, so it is very efficient. The average time complexity of
> > > > > > duplicating VMAs is reduced from O(n * log(n)) to O(n). The optimization
> > > > > > effect is proportional to the number of VMAs.
> > > > > 
> > > > > I am not confident in the big O calculations here. Although the addition
> > > > > of the tree is reduced, adding a VMA still needs to create the nodes
> > > > > above it - which are a function of n. How did you get O(n * log(n)) for
> > > > > the existing fork?
> > > > > 
> > > > > I would think your new algorithm is n * log(n/16), while the
> > > > > previous was n * log(n/16) * f(n), where f(n) would be something
> > > > > to do with the decision to split/rebalance in bulk insert mode.
> > > > > 
> > > > > It's certainly a better algorithm to duplicate trees, but I don't think
> > > > > it is O(n). Can you please explain?
> > > > 
> > > > The following is a non-professional analysis of the algorithm.
> > > > 
> > > > Let's first analyze the average time complexity of the new algorithm, as
> > > > it is relatively easy to analyze. The maximum number of branches for
> > > > internal nodes in a maple tree in allocation mode is 10. However, to
> > > > simplify the analysis, we will not consider this case and assume that
> > > > all nodes have a maximum of 16 branches.
> > > > 
> > > > The new algorithm assumes that there is no case where a VMA with the
> > > > VM_DONTCOPY flag is deleted. If such a case exists, this analysis cannot
> > > > be applied.
> > > > 
> > > > The operations of the new algorithm consist of three parts:
> > > > 
> > > > 1. DFS traversal of each node in the source tree
> > > > 2. For each node in the source tree, create a copy and construct a new
> > > >    node
> > > > 3. Traverse the new tree using mas_find() and replace each element
> > > > 
> > > > If there are a total of n elements in the maple tree, there are n/16
> > > > leaf nodes, and the second-to-last level has n/16^2 nodes. The total
> > > > number of nodes in the entire tree is given by the sum
> > > > n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric progression with
> > > > log base 16 of n terms. By the formula for the sum of a geometric
> > > > progression, the sum is (n-1)/15. So this tree has a total of (n-1)/15
> > > > nodes and (n-1)/15 - 1 edges.
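As a quick cross-check of the geometric-progression step above, the sum can be written out in full. This is a sketch assuming an idealized, perfectly full tree (n a power of 16, every node with exactly 16 children):

	\sum_{k=1}^{\log_{16} n} \frac{n}{16^k}
	  = \frac{n}{16} \cdot \frac{1 - (1/16)^{\log_{16} n}}{1 - 1/16}
	  = \frac{n}{16} \cdot \frac{1 - 1/n}{15/16}
	  = \frac{n - 1}{15}

A real tree is not perfectly full, but that only changes the node count by a constant factor, which does not affect the O(n) conclusion drawn below.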
> > > > For the operations in the first part of this algorithm, since DFS
> > > > traverses each edge twice, the time complexity would be
> > > > 2*((n-1)/15 - 1).
> > > > 
> > > > For the second part, each operation involves copying a node and making
> > > > necessary modifications. Therefore, the time complexity is
> > > > 16*(n-1)/15.
> > > > 
> > > > For the third part, we use mas_find() to traverse and replace each
> > > > element, which is essentially similar to the combination of the first
> > > > and second parts. mas_find() traverses all nodes, and within each node
> > > > it iterates over all elements and performs replacements. The time
> > > > complexity of traversing the nodes is 2*((n-1)/15 - 1), and for all
> > > > nodes, the time complexity of replacing all their elements is
> > > > 16*(n-1)/15.
> > > > 
> > > > By ignoring all constant factors, each of the three parts of the
> > > > algorithm has a time complexity of O(n). Therefore, this new algorithm
> > > > is O(n).
> > > 
> > > Thanks for the detailed analysis! I didn't mean to cause so much work
> > > with this question. I wanted to know so that future work could rely on
> > > this calculation to demonstrate if it is worth implementing without
> > > going through the effort of coding and benchmarking - after all, this
> > > commit message will most likely be examined during that process.
> > > 
> > > I asked because O(n) vs O(n*log(n)) doesn't seem to fit with your
> > > benchmarking.
> > 
> > It may not be well reflected in the benchmarking of fork() because all
> > the aforementioned time complexity analysis is related to the part
> > involving the maple tree, specifically the time complexity of
> > constructing a new maple tree. However, fork() also includes many other
> > behaviors.
> 
> The forking is allocating VMAs, etc., but that is all a 1-1 mapping per
> VMA, so it should be linear, if not near-linear. There is some setup
> time involved with the mm struct too, but that should matter less as
> more VMAs are added per fork.
> 
> > > > The exact time complexity of the old algorithm is difficult to analyze.
> > > > I can only provide an upper-bound estimation. There are two possible
> > > > scenarios for each insertion:
> > > > 
> > > > 1. Appending at the end of a node.
> > > > 2. Splitting nodes multiple times.
> > > > 
> > > > For the first scenario, the individual operation has a time complexity
> > > > of O(1). As for the second scenario, it involves node splitting. The
> > > > challenge lies in determining which insertions trigger splits and how
> > > > many splits occur each time, which is difficult to calculate. In the
> > > > worst-case scenario, each insertion requires splitting all the way up
> > > > the tree's height, log(n) times. Assuming every insertion is in the
> > > > worst-case scenario, the time complexity would be n*log(n). However,
> > > > not every insertion requires splitting, and the number of splits each
> > > > time may not necessarily be log(n). Therefore, this is an estimation
> > > > of the upper bound.
> > > 
> > > Saying every insert causes a split and adding in n*log(n) is more than
> > > an over-estimation. At worst there is some n + n/16 * log(n) going on
> > > there.
> > > 
> > > During the building of a tree, we are in bulk insert mode. This favours
> > > balancing the tree to the left to maximize the number of inserts being
> > > append operations. The algorithm inserts as many to the left as we can,
> > > leaving the minimum number on the right.
> > > 
> > > We also reduce the number of splits by pushing data to the left whenever
> > > possible, at every level.
> > 
> > Yes, but I don't think pushing data would occur when inserting in
> > ascending order in bulk mode, because the left nodes are all full while
> > there are no nodes on the right side. However, I'm not entirely certain
> > about this since I only briefly looked at the implementation of this
> > part.
> 
> They are not full; the right node keeps just enough entries to be a
> sufficient node, so the left node will have that many spaces for push.
> mab_calc_split():
> 
> 	if (unlikely((mas->mas_flags & MA_STATE_BULK))) {
> 		*mid_split = 0;
> 		split = b_end - mt_min_slots[bn->type];
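For later reference, the two estimates discussed in this exchange can be put side by side. This is only a restatement of the bounds above with arbitrary constants, not a new analysis: the n term counts the O(1) appends, and the (n/16)*log(n) term caps the splits at one possible cascade per filled leaf:

	T_{\mathrm{new}}(n) \le c_1 n = O(n)
	T_{\mathrm{old}}(n) \le c_2 n + c_3 \frac{n}{16} \log n
	                     = O\!\left(n + \frac{n}{16}\log n\right)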
> > > > > > 
> > > > > > As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
> > > > > > fails, there will be a portion of VMAs that have not been duplicated in
> > > > > > the maple tree. This makes it impossible to unmap all VMAs in exit_mmap().
> > > > > > To solve this problem, undo_dup_mmap() is introduced to handle the failure
> > > > > > of dup_mmap(). I have carefully tested the failure path and so far it
> > > > > > seems there are no issues.
> > > > > > 
> > > > > > There is a "spawn" in byte-unixbench[1], which can be used to test the
> > > > > > performance of fork(). I modified it slightly to make it work with
> > > > > > different numbers of VMAs.
> > > > > > 
> > > > > > Below are the test results. By default, there are 21 VMAs. The first row
> > > > > > shows the number of additional VMAs added on top of the default. The last
> > > > > > two rows show the number of fork() calls per ten seconds. The test results
> > > > > > were obtained with CPU binding to avoid scheduler load balancing that
> > > > > > could cause unstable results. There are still some fluctuations in the
> > > > > > test results, but at least they are better than the original performance.
> > > > > > 
> > > > > > Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> > > > > > next-20230921:     112326 75469   54529   34619   20750   11355   6115    3183
> > > > > > Apply this:        116505 85971   67121   46080   29722   16665   9050    4805
> > > > > >                    +3.72% +13.92% +23.09% +33.11% +43.24% +46.76% +48.00% +50.96%
> > > delta                  4179   10502   12592   11461   8972    5310    2935    1622
> > > 
> > > Looking at this data, it is difficult to see what is going on because
> > > there is a doubling of the VMAs per fork per column while the count is
> > > forks per 10 seconds. So this table is really a logarithmic table with
> > > increases growing by 10%. Adding the delta row makes it seem like the
> > > numbers are not growing apart as I would expect.
> > > 
> > > If we normalize this to VMAs per second by dividing the forks by 10,
> > > then multiplying by the number of VMAs, we get this:
> > > 
> > > VMA Count:     21       121       221       421       821       1621      3221      6421
> > > log(VMA)       1.32     2.00      2.30      2.60      2.90      3.20      3.36      3.81
> > > next-20230921: 258349.8 928268.7  1215996.7 1464383.7 1707725.0 1842916.5 1420514.5 2044440.9
> > > this:          267961.5 1057443.3 1496798.3 1949184.0 2446120.6 2704729.5 2102315.0 3086251.5
> > > delta          9611.7   129174.6  280801.6  484800.3  738395.6  861813.0  681800.5  1041810.6
> > > 
> > > The first thing that I noticed was that we hit some dip in the numbers
> > > at 3221. I first thought that might be something else running on the
> > > host machine, but both runs are affected by around the same percent.
> > > 
> > > Here, we do see the delta growing apart, but peaking in growth around
> > > 821 VMAs. Again, that 3221 number is out of line.
> > > 
> > > If we discard 21 and anything above 1621, we still see both lines are
> > > asymptotic curves. I would expect that the new algorithm would be more
> > > linear to represent O(n), but there is certainly a curve when graphed
> > > with a normalized X-axis. The older algorithm, O(n*log(n)), should be
> > > the opposite curve altogether, with a diminishing return, but it seems
> > > the more elements we have, the more operations we can perform in a
> > > second.
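The normalization above is easy to reproduce; below is a standalone userspace sketch (the file name and formatting are mine). One observation from re-running the arithmetic: every quoted row matches the listed VMA count plus two exactly (23, 123, ..., 6423), except the 3200 column, where both rows match a multiplier of 2323 rather than 3223 - quite possibly a transposed digit, which would also explain the otherwise puzzling dip at 3221 noted above.

	/* normalize.c - recompute the VMAs-per-second table quoted above. */
	#include <stdio.h>

	int main(void)
	{
		/* Increment + the default 21 VMAs + 2; these reproduce the quoted rows. */
		const int nvma[]  = {    23,   123,   223,   423,   823,  1623,  3223,  6423 };
		/* forks per 10 seconds, taken from the commit message table */
		const int next[]  = {112326, 75469, 54529, 34619, 20750, 11355,  6115,  3183 };
		const int patch[] = {116505, 85971, 67121, 46080, 29722, 16665,  9050,  4805 };

		for (int i = 0; i < 8; i++)
			printf("%5d VMAs: next %10.1f  patched %10.1f VMAs/second\n",
			       nvma[i],
			       next[i] / 10.0 * nvma[i],
			       patch[i] / 10.0 * nvma[i]);
		return 0;
	}

Build with something like cc -std=c99 -o normalize normalize.c.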
> > 
> > Thank you for your detailed analysis. So, are you expecting the
> > transformed data to be close to a constant value?
> 
> I would expect it to increase linearly, but it's a curve. Also, it
> seems that both methods are near the identical curve, including the dip
> at 3221. I expect the new method to have a different curve, especially
> at the higher numbers where the fork() overhead is much less, but it
> seems they both curve asymptotically. That is, they seem to be the same
> complexity related to n, but with different constants.
> 
> > Please note that besides constructing a new maple tree, there are many
> > other operations in fork(). As the number of VMAs increases, the number
> > of fork() calls decreases. Therefore, the overall cost spent on other
> > operations becomes smaller, while the cost spent on duplicating VMAs
> > increases. That's why this data grows with the increase of VMAs. I
> > speculate that if the number of VMAs is large enough to neglect the time
> > spent on other operations in fork(), this data will approach a constant
> > value.
> 
> If it were the other parts of fork() causing the non-linear growth, then
> I would expect 800 -> 1600 to increase to +53% instead of +46%, and if
> we were hitting the limit of fork affecting the data, then I would
> expect the delta of VMAs/second to not be so high at the upper 6421 -
> both algorithms have more room to get more performance, at least until
> 6421 VMAs/fork.
> 
> > If we want to achieve the expected curve, I think we should simulate the
> > process of constructing the maple tree in user space to avoid the impact
> > of other operations in fork(), just like in the current bench_forking().
> > 
> > > Thinking about what is going on here, I cannot come up with a reason
> > > that there would be a curve to the line at all. If we took more
> > > measurements, I would think the samples would be an ever-increasing line
> > > with variability for some function of 16 - a saw-toothed increasing
> > > line. At least, until an upper limit is reached. We can see that the
> > > upper limit was still not achieved at 1621, since 6421 is higher for
> > > both runs, but a curve is evident on both methods, which suggests
> > > something else is a significant contributor.
> > > 
> > > I would think each VMA requires the same amount of work, so a constant.
> > > The allocations would, again, be some function that would linearly
> > > increase, with the existing method over-estimating by a huge number of
> > > nodes.
> > > 
> > > I'm not trying to nitpick here, but it is important to be accurate in
> > > the statements because it may alter choices on how to proceed in
> > > improving this performance later. Others may be looking through these
> > > commit messages to see if something can be improved.
> > 
> > Thank you for pointing that out. I will try to describe it more
> > accurately in the commit log and see if I can measure the expected curve
> > in user space.
> > 
> > > I also feel like your notes on your algorithm are worth including in the
> > > commit because it could prove rather valuable if we revisit forking in
> > > the future.
> > 
> > Do you mean that I should write the analysis of the time complexity of
> > the new algorithm in the commit log?
> 
> Yes, I think it's worth capturing. What do you think?
> 
> > > The more I look at this, the more questions I have that I cannot answer.
> > > One thing we can see is that the new method is faster in this
> > > micro-benchmark.
> > 
> > Yes. It should be noted that in the field of computer science, if the
> > test results don't align with the expected mathematical calculations, it
> > indicates an error in the calculations - accurate calculations will
> > always be reflected in the test results. 😂
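Peng's speculation above - that VMAs/second should flatten toward a constant once per-VMA work dominates the fixed fork() overhead - can be illustrated with a toy model. The constants c0 and c1 below are invented purely for illustration:

	/* fork_model.c - toy model: one fork() costs t(n) = c0 + c1*n seconds. */
	#include <stdio.h>

	int main(void)
	{
		const double c0 = 5e-5;	/* assumed fixed per-fork overhead (seconds) */
		const double c1 = 4e-8;	/* assumed cost per duplicated VMA (seconds) */
		const int n[] = { 23, 123, 423, 1623, 6423, 100000 };

		/* VMAs per second = n / t(n), which approaches the constant 1/c1. */
		for (int i = 0; i < 6; i++)
			printf("%6d VMAs: %9.0f VMAs/second (limit %.0f)\n",
			       n[i], n[i] / (c0 + c1 * n[i]), 1.0 / c1);
		return 0;
	}

In this model the curve rises steeply and then saturates. If measurements kept rising well past such a plateau, that would point at a non-constant per-VMA cost, which is Liam's objection.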
> > > > > > 
> > > > > > [1] https://github.com/kdlucas/byte-unixbench/tree/master
> > > > > > 
> > > > > > Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > > > > ---
> > > > > >  include/linux/mm.h |  1 +
> > > > > >  kernel/fork.c      | 34 ++++++++++++++++++++----------
> > > > > >  mm/internal.h      |  3 ++-
> > > > > >  mm/memory.c        |  7 ++++---
> > > > > >  mm/mmap.c          | 52 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > > >  5 files changed, 80 insertions(+), 17 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index 1f1d0d6b8f20..10c59dc7ffaa 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -3242,6 +3242,7 @@ extern void unlink_file_vma(struct vm_area_struct *);
> > > > > >  extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
> > > > > >  	unsigned long addr, unsigned long len, pgoff_t pgoff,
> > > > > >  	bool *need_rmap_locks);
> > > > > > +extern void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end);
> > > > > >  extern void exit_mmap(struct mm_struct *);
> > > > > >  
> > > > > >  static inline int check_data_rlimit(unsigned long rlim,
> > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > index 7ae36c2e7290..2f3d83e89fe6 100644
> > > > > > --- a/kernel/fork.c
> > > > > > +++ b/kernel/fork.c
> > > > > > @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >  	int retval;
> > > > > >  	unsigned long charge = 0;
> > > > > >  	LIST_HEAD(uf);
> > > > > > -	VMA_ITERATOR(old_vmi, oldmm, 0);
> > > > > >  	VMA_ITERATOR(vmi, mm, 0);
> > > > > >  
> > > > > >  	uprobe_start_dup_mmap();
> > > > > > @@ -678,16 +677,25 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >  		goto out;
> > > > > >  	khugepaged_fork(mm, oldmm);
> > > > > >  
> > > > > > -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> > > > > > -	if (retval)
> > > > > > +	/* Use __mt_dup() to efficiently build an identical maple tree. */
> > > > > > +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
> > > > > > +	if (unlikely(retval))
> > > > > >  		goto out;
> > > > > >  
> > > > > >  	mt_clear_in_rcu(vmi.mas.tree);
> > > > > > -	for_each_vma(old_vmi, mpnt) {
> > > > > > +	for_each_vma(vmi, mpnt) {
> > > > > >  		struct file *file;
> > > > > >  
> > > > > >  		vma_start_write(mpnt);
> > > > > >  		if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > > +			mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
> > > > > > +
> > > > > > +			/* If failed, undo all completed duplications. */
> > > > > > +			if (unlikely(mas_is_err(&vmi.mas))) {
> > > > > > +				retval = xa_err(vmi.mas.node);
> > > > > > +				goto loop_out;
> > > > > > +			}
> > > > > > +
> > > > > >  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > > >  			continue;
> > > > > >  		}
> > > > > > @@ -749,9 +757,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >  		if (is_vm_hugetlb_page(tmp))
> > > > > >  			hugetlb_dup_vma_private(tmp);
> > > > > >  
> > > > > > -		/* Link the vma into the MT */
> > > > > > -		if (vma_iter_bulk_store(&vmi, tmp))
> > > > > > -			goto fail_nomem_vmi_store;
> > > > > > +		/*
> > > > > > +		 * Link the vma into the MT. After using __mt_dup(), memory
> > > > > > +		 * allocation is not necessary here, so it cannot fail.
> > > > > > +		 */
> > > > > > +		mas_store(&vmi.mas, tmp);
> > > > > >  
> > > > > >  		mm->map_count++;
> > > > > >  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > > > > > @@ -760,15 +770,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >  		if (tmp->vm_ops && tmp->vm_ops->open)
> > > > > >  			tmp->vm_ops->open(tmp);
> > > > > >  
> > > > > > -		if (retval)
> > > > > > +		if (retval) {
> > > > > > +			mpnt = vma_next(&vmi);
> > > > > >  			goto loop_out;
> > > > > > +		}
> > > > > >  	}
> > > > > >  	/* a new mm has just been created */
> > > > > >  	retval = arch_dup_mmap(oldmm, mm);
> > > > > >  loop_out:
> > > > > >  	vma_iter_free(&vmi);
> > > > > > -	if (!retval)
> > > > > > +	if (likely(!retval))
> > > > > >  		mt_set_in_rcu(vmi.mas.tree);
> > > > > > +	else
> > > > > > +		undo_dup_mmap(mm, mpnt);
> > > > > >  out:
> > > > > >  	mmap_write_unlock(mm);
> > > > > >  	flush_tlb_mm(oldmm);
> > > > > > @@ -778,8 +792,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >  	uprobe_end_dup_mmap();
> > > > > >  	return retval;
> > > > > >  
> > > > > > -fail_nomem_vmi_store:
> > > > > > -	unlink_anon_vmas(tmp);
> > > > > >  fail_nomem_anon_vma_fork:
> > > > > >  	mpol_put(vma_policy(tmp));
> > > > > >  fail_nomem_policy:
> > > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > > index 7a961d12b088..288ec81770cb 100644
> > > > > > --- a/mm/internal.h
> > > > > > +++ b/mm/internal.h
> > > > > > @@ -111,7 +111,8 @@ void folio_activate(struct folio *folio);
> > > > > >  
> > > > > >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > >  		   struct vm_area_struct *start_vma, unsigned long floor,
> > > > > > -		   unsigned long ceiling, bool mm_wr_locked);
> > > > > > +		   unsigned long ceiling, unsigned long tree_end,
> > > > > > +		   bool mm_wr_locked);
> > > > > >  void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> > > > > >  
> > > > > >  struct zap_details;
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 983a40f8ee62..1fd66a0d5838 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -362,7 +362,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> > > > > >  
> > > > > >  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > >  		   struct vm_area_struct *vma, unsigned long floor,
> > > > > > -		   unsigned long ceiling, bool mm_wr_locked)
> > > > > > +		   unsigned long ceiling, unsigned long tree_end,
> > > > > > +		   bool mm_wr_locked)
> > > > > >  {
> > > > > >  	do {
> > > > > >  		unsigned long addr = vma->vm_start;
> > > > > > @@ -372,7 +373,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > >  		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
> > > > > >  		 * be 0. This will underflow and is okay.
> > > > > >  		 */
> > > > > > -		next = mas_find(mas, ceiling - 1);
> > > > > > +		next = mas_find(mas, tree_end - 1);
> > > > > >  
> > > > > >  		/*
> > > > > >  		 * Hide vma from rmap and truncate_pagecache before freeing
> > > > > > @@ -393,7 +394,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > >  		while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> > > > > >  		       && !is_vm_hugetlb_page(next)) {
> > > > > >  			vma = next;
> > > > > > -			next = mas_find(mas, ceiling - 1);
> > > > > > +			next = mas_find(mas, tree_end - 1);
> > > > > >  			if (mm_wr_locked)
> > > > > >  				vma_start_write(vma);
> > > > > >  			unlink_anon_vmas(vma);
> > > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > > index 2ad950f773e4..daed3b423124 100644
> > > > > > --- a/mm/mmap.c
> > > > > > +++ b/mm/mmap.c
> > > > > > @@ -2312,7 +2312,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> > > > > >  	mas_set(mas, mt_start);
> > > > > >  	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > > > > >  		      next ? next->vm_start : USER_PGTABLES_CEILING,
> > > > > > -		      mm_wr_locked);
> > > > > > +		      tree_end, mm_wr_locked);
> > > > > >  	tlb_finish_mmu(&tlb);
> > > > > >  }
> > > > > >  
> > > > > > @@ -3178,6 +3178,54 @@ int vm_brk(unsigned long addr, unsigned long len)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(vm_brk);
> > > > > >  
> > > > > > +void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end)
> > > > > > +{
> > > > > > +	unsigned long tree_end;
> > > > > > +	VMA_ITERATOR(vmi, mm, 0);
> > > > > > +	struct vm_area_struct *vma;
> > > > > > +	unsigned long nr_accounted = 0;
> > > > > > +	int count = 0;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * vma_end points to the first VMA that has not been duplicated. We need
> > > > > > +	 * to unmap all VMAs before it.
> > > > > > +	 * If vma_end is NULL, it means that all VMAs in the maple tree have
> > > > > > +	 * been duplicated, so setting tree_end to 0 will overflow to ULONG_MAX
> > > > > > +	 * when using it.
> > > > > > +	 */
> > > > > > +	if (vma_end) {
> > > > > > +		tree_end = vma_end->vm_start;
> > > > > > +		if (tree_end == 0)
> > > > > > +			goto destroy;
> > > > > > +	} else
> > > > > > +		tree_end = 0;
> 
> You need to enclose this statement to meet the coding style. You could
> just set tree_end = 0 at the start of the function instead; actually I
> think tree_end = USER_PGTABLES_CEILING unless there is a vma_end.
> 
> > > > > > +
> > > > > > +	vma = mas_find(&vmi.mas, tree_end - 1);
> 
> 	vma = vma_find(&vmi, tree_end);
> 
> > > > > > +
> > > > > > +	if (vma) {
> 
> Probably would be cleaner to jump to destroy here too:
> 	if (!vma)
> 		goto destroy;
> 
> > > > > > +		arch_unmap(mm, vma->vm_start, tree_end);

One more thing, it seems the maple state that is passed into
unmap_region() needs to point to the _next_ element, or the reset
doesn't work right between the unmap_vmas() and free_pgtables() call:

	vma_iter_set(&vmi, vma->vm_end);

> > > > > > +		unmap_region(mm, &vmi.mas, vma, NULL, NULL, 0, tree_end,
> > > > > > +			     tree_end, true);
> > > > > 
> > > > > next is vma_end, as per your comment above. Using next = vma_end allows
> > > > > you to avoid adding another argument to free_pgtables().
> > > > 
> > > > Unfortunately, it cannot be done this way. I fell into this trap before,
> > > > and it caused incomplete page table cleanup. To solve this problem, the
> > > > only solution I can think of right now is to add an additional
> > > > parameter.
> > > > 
> > > > free_pgtables() will be called in unmap_region() to free the page table,
> > > > like this:
> > > > 
> > > > free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > > > 	      next ? next->vm_start : USER_PGTABLES_CEILING,
> > > > 	      mm_wr_locked);
> > > > 
> > > > The problem is with 'next'. Our 'vma_end' does not exist in the actual
> > > > mmap because it has not been duplicated and cannot be used as 'next'.
> > > > If there is a real 'next', we can use 'next->vm_start' as the ceiling,
> > > > which is not a problem. If there is no 'next' (next is 'vma_end'), we
> > > > can only use 'USER_PGTABLES_CEILING' as the ceiling. Using
> > > > 'vma_end->vm_start' as the ceiling will cause the page table not to be
> > > > fully freed, which may be related to alignment in 'free_pgd_range()'. To
> > > > solve this problem, we have to introduce 'tree_end', and separating
> > > > 'tree_end' and 'ceiling' can solve this problem.
> > > 
> > > Can you just use ceiling? That is, just not pass in next and keep the
> > > code as-is? This is how exit_mmap() does it and should avoid any
> > > alignment issues. I assume you tried that and something went wrong as
> > > well?
> > 
> > I tried that, but it didn't work either. In free_pgtables(), the
> > following line of code is used to iterate over VMAs:
> > 
> > 	mas_find(mas, ceiling - 1);
> > 
> > If next is passed as NULL, ceiling will be 0, resulting in iterating
> > over all the VMAs in the maple tree, including the last portion that was
> > not duplicated.
> 
> If vma_end is NULL, it means that all VMAs in the maple tree have been
> duplicated, so shouldn't the correct action in this case be freeing up
> to ceiling?
> 
> If it isn't null, then vma_end->vm_start should work as the end of the
> area to free.
> 
> With your mas_find(mas, tree_end - 1), then the vma_end will be avoided,
> but free_pgd_range() will use ceiling anyways:
> 
> 	free_pgd_range(tlb, addr, vma->vm_end, floor,
> 		       next ? next->vm_start : ceiling);
> 
> Passing in vma_end as next to unmap_region() functions in my testing
> without adding arguments to free_pgtables().
> 
> How are you producing the accounting issue you mention above? Maybe I
> missed something?
> 
> > > > > > +
> > > > > > +		mas_set(&vmi.mas, vma->vm_end);
> 
> 	vma_iter_set(&vmi, vma->vm_end);
> 
> > > > > > +		do {
> > > > > > +			if (vma->vm_flags & VM_ACCOUNT)
> > > > > > +				nr_accounted += vma_pages(vma);
> > > > > > +			remove_vma(vma, true);
> > > > > > +			count++;
> > > > > > +			cond_resched();
> > > > > > +			vma = mas_find(&vmi.mas, tree_end - 1);
> > > > > > +		} while (vma != NULL);
> 
> You can write this as:
> 	do { ... } for_each_vma_range(vmi, vma, tree_end);
> 
> > > > > > +
> > > > > > +		BUG_ON(count != mm->map_count);
> > > > > > +
> > > > > > +		vm_unacct_memory(nr_accounted);
> > > > > > +	}
> > > > > > +
> > > > > > +destroy:
> > > > > > +	__mt_destroy(&mm->mm_mt);
> > > > > > +}
> > > > > > +
> > > > > >  /* Release all mmaps. */
> > > > > >  void exit_mmap(struct mm_struct *mm)
> > > > > >  {
> > > > > > @@ -3217,7 +3265,7 @@ void exit_mmap(struct mm_struct *mm)
> > > > > >  	mt_clear_in_rcu(&mm->mm_mt);
> > > > > >  	mas_set(&mas, vma->vm_end);
> > > > > >  	free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS,
> > > > > > -		      USER_PGTABLES_CEILING, true);
> > > > > > +		      USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
> > > > > >  	tlb_finish_mmu(&tlb);
> > > > > >  
> > > > > >  	/*
> > > > > > -- 
> > > > > > 2.20.1
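As a closing illustration of the review comments above: with the for_each_vma_range() suggestion applied, the removal loop in undo_dup_mmap() would look roughly like the sketch below. This is an untested, indicative sketch against the patch as posted, not the final code; note the iterator is repositioned to vma->vm_start (instead of the patch's mas_set() to vma->vm_end) so that the VMA already returned by the earlier lookup is visited again by the loop's first vma_find():

	/* Sketch only: reposition so the already-found vma is re-found. */
	vma_iter_set(&vmi, vma->vm_start);
	for_each_vma_range(vmi, vma, tree_end) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		remove_vma(vma, true);
		count++;
		cond_resched();
	}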