On Wed, 2023-07-19 at 14:51 -0400, Liam R. Howlett wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > * John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230712 23:26]: > > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote: > > ... > > > > > > > > > As you mentioned, mas_preallocate() should allocate > enough > > > > > node, > > > > > > > but there is such functions mas_node_count() in > > > > > mas_store_prealloc(). > > > > > > > > In mas_node_count() checks whether the *mas* has enough > > > nodes, > > > > > and > > > > > > > allocate memory for node if there was no enough nodes in > mas. > > > > > > > > > > > > > > Right, we call mas_node_count() so that both code paths > are > > > used > > > > > for > > > > > > > preallocations and regular > mas_store()/mas_store_gfp(). It > > > > > shouldn't > > > > > > > take a significant amount of time to verify there is > enough > > > > > nodes. > > > > > > > > > > > > Yap..., it didn't take a significant amount of time to > verify > > > > > whether > > > > > > there is enough nodes. The problem is why the flow in > > > > > mas_node_count > > > > > > will alloc nodes if there was no enough nodes in mas? > > > > > > > > > > What I meant is that both methods use the same call path > because > > > > > there > > > > > is not a reason to duplicate the path. After > mas_preallocate() > > > has > > > > > allocated the nodes needed, the call to check if there is > enough > > > > > nodes > > > > > will be quick. > > > > > > > > So whether the purpose of mas_preallocate() is decreasing the > lock > > > > retention time? > > > > > > It could be, but in this case it's the locking order. We have to > > > pre-allocate and fail early if we are out of memory, because we > > > _cannot_ > > > use GFP_KERNEL where we call > mas_store_prealloc(). mas_preallocate() > > > will use GFP_KERENL though. We cannot use GFP_KERNEL during the > > > store > > > because reclaim may sleep and we cannot sleep holding the locks > we > > > need > > > to hold at the time of the store operation in __vma_adjust(). > > > > > Yap, if the type of lock is spinlock, the flow shouldn't sleep in > the > > critical sections. mmap_lock is implmented by rw_semaphore(mutex). > Is > > there any other lock in this section? > > Yes, the i_mmap_lock_write(), the anon_vma_lock_write(), > flush_dcache_mmap_lock(). > > > > > > > > > > > > > > > > > > > I think that if mas_preallocate() allocate enough node, > why > > > we > > > > > > > check the node count and allocate nodes if there was no > > > enough > > > > > nodes > > > > > > > in mas in mas_node_count()? > > > > > > > > > > > > > > We check for the above reason. > > > > > > > > > > > > > > > > > > > OK..., this is one of the root cause of this BUG. > > > > > > > > > > The root cause is that there was not enough memory for a > store > > > > > operation. Regardless of if we check the allocations in the > > > > > mas_store_prealloc() path or not, this would fail. If we > remove > > > the > > > > > check for nodes within this path, then we would have to > BUG_ON() > > > when > > > > > we > > > > > run out of nodes to use or have a null pointer dereference > BUG > > > > > anyways. > > > > > > > > > Yap, the root cause is oom. The BUG_ON() for the situations > that > > > the > > > > maple tree struct cannot be maintained because of the lack of > > > memory is > > > > necessary. But the the buddy system in linux kernel can reclaim > > > memory > > > > when the system is under the low memory status. If we use > > > GFP_KERNEL > > > > after trying GFP_NOWAIT to allocate node, maybe we can get > enough > > > > memory when the second try with GFP_KERNEL. > > > > > > Right, but the GFP_KERNEL cannot be used when holding certain > locks > > > because it may sleep. > > > > > > > > > > > > > > > > > > > > > > > > > We have seen that there may be some maple_tree > operations > > > in > > > > > > > merge_vma... > > > > > > > > > > > > > > If merge_vma() does anything, then there was an operation > to > > > the > > > > > > > maple > > > > > > > tree. > > > > > > > > > > > > > > > > > > > > > > > Moreover, would maple_tree provides an API for > assigning > > > user's > > > > > gfp > > > > > > > flag for allocating node? > > > > > > > > > > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an > > > > > > > argument. In > > > > > > > your call stack, it will be called in __vma_adjust() as > such: > > > > > > > > > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL)) > > > > > > > return -ENOMEM; > > > > > > > > > > > > > > line 715 in v6.1.25 > > > > > > > > > > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in > this > > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node > with > > > > > > > GFP_NOWAIT and __GFP_NOWARN. > > > > > > > > > > > > > > We use GFP_KERNEL as I explained above for the VMA tree. > > > > > > > > > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and > > > > > __GFP_NOWARN > > > > > > in inserting tree flow. Do you consider the performance of > > > > > maintaining > > > > > > the structure of maple_tree? > > > > > > > > > > Sorry, I don't understand what you mean by 'consider the > > > performance > > > > > of > > > > > maintaining the structure of maple_tree'. > > > > > > > > > As I mentioned above, GFP_NOWAIT will not allow buddy system > for > > > > reclaiming memory, so "Do you consider the performance of > > > maintaining > > > > the structure of maple_tree" means that: whether the > > > mas_node_count() > > > > path is not allowed to reclaim or compact memory for the > > > performance. > > > > > > Ah, no. This is not for performance. It was initially on the > road > > > map > > > for performance, but it was needed for the complicated locking in > the > > > MM > > > code. > > > > > > rb tree embedded the nodes in the VMA which is allocated outside > this > > > critical section and so it could use GFP_KERNEL. > > > > > As I know, following is rb_tree flow in 5.15.186: > > > > ... > > mmap_write_lock_killable(mm) > > ... > > do_mmap() > > ... > > mmap_region() > > ... > > vm_area_alloc(mm) > > ... > > mmap_write_unlock(mm) > > > > vm_area_alloc is in the mmap_lock hoding period. > > It seems that the flow would sleep here in rb_tree flow. > > If I miss anything, please tell me, thanks! > > Before the mmap_write_unlock(mm) in the above sequence, the > i_mmap_lock_write(), anon_vma_lock_write(), and/or the > flush_dcache_mmap_lock() may be taken. Check __vma_adjust(). > > The insertion into the tree needs to hold some subset of these locks. > The rb-tree insert did not allocate within these locks, but the maple > tree would need to allocate within these locks to insert into the > tree. > This is why the preallocation exists and why it is necessary. > Yap, preallocation is necessary. anon_vma_lock_write() and flush_dcache_mmap_lock() hold the lock and manipulate rb_tree. I think that there is no maple tree manipulations during the lock holding period. Is there any future work in this section? > > > > > > > > > > > > > > > > > It also will drop the lock and retry with GFP_KERNEL on > > > failure > > > > > > > when not using the external lock. The mmap_lock is > > > configured as > > > > > an > > > > > > > external lock. > > > > > > > > > > > > > > > Allocation will not wait for reclaiming and compacting > when > > > > > there > > > > > > > is no enough available memory. > > > > > > > > Is there any concern for this design? > > > > > > > > > > > > > > This has been addressed above, but let me know if I > missed > > > > > anything > > > > > > > here. > > > > > > > > > > > > > > > > > > > I think that the mas_node_count() has higher rate of > triggering > > > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and > > > __GFP_NOWARN. If > > > > > > mas_node_count() use GFP_KERNEL as mas_preallocate() in the > > > mmap.c, > > > > > the > > > > > > allocation fail rate may be lower than use GFP_NOWAIT. > > > > > > > > > > Which BUG_ON() are you referring to? > > > > > > > > > > If I was to separate the code path for mas_store_prealloc() > and > > > > > mas_store_gfp(), then a BUG_ON() would still need to exist > and > > > still > > > > > would have been triggered.. We are in a place in the code > where > > > we > > > > > should never sleep and we don't have enough memory allocated > to > > > do > > > > > what > > > > > was necessary. > > > > > > > > > Yap. There is no reason to seprate mas_store_prealloc() and > > > > mas_store_gfp. Is it possible to retry to allocate mas_node > with > > > > GFP_KERNEL (wait for system reclaim and compact) instead of > > > triggering > > > > BUG_ON once the GFP_NOWAIT allocation failed? > > > > > > Unfortunately not, no. In some cases it may actually work out > that > > > the > > > VMA may not need the locks in question, but it cannot be > generalized > > > for > > > __vma_adjust(). Where I am able, I use the mas_store_gfp() calls > so > > > we > > > can let reclaim and compact run, but it's not possible here. > > > > > We have used GFP_KERNEL as alloc flag in mas_node_count flow about > 2 > > months. The mentioned problem we mentioned in the first mail didn't > > reproduce under high stress stability test. > > > > I have seen the patch provided by you. I will verify this patch in > our > > stability test. But there is a problem, if maple_tree behavior is > > unexpected (e.g. redundant write bug this time). We can only review > the > > whole mm flow to find out the wrong behavior. Do we have an > efficient > > debug method for maple tree? > > There is a test suite for the maple tree found in > tools/testing/radix-tree, but it does not test the mm code and > integration. > > There are other mm tests, but they will not test the OOM scenario you > hit - to the best of my knowledge. > > There are also config options to debug the tree operations, but they > do > not detect the redundant write issues. Perhaps I can look at adding > support for detecting redundant writes, but that will not be > backported > to a stable kernel. > The sufficient test cases of maple tree ensure the function work well. But the redundant operations (alloc node, free node, tree manipulations) of maple_tree are not easy to detect (e.g. the case reported this time and mas_preallocate() allocates redundant nodes with the worst case). The detecting redundant writes mechanism may help the developers to find out the problems easier. Hope it can be establised successfully!! > Thanks, > Liam Best Regards, John Hsu