On Thu, 2023-07-13 at 11:25 +0800, John Hsu wrote: > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote: > > > > External email : Please do not click links or open attachments > > until > > you have verified the sender or the content. > > * John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230710 08:50]: > > ... > > > > > > > > > This BUG_ON() is necessary since this function should > > > > _never_ > > > > run > > > > > > out of > > > > > > > > > > > > > > memory; this function does not return an error code. > > > > > > > > > > > > mas_preallocate() > > > > > > > > > > > > > > should have gotten you the memory necessary (or returned > > > > > > > an > > > > > > > > > > > > -ENOMEM) > > > > > > > > > > > > > > prior to the call to mas_store_prealloc(), so this is > > > > probably > > > > an > > > > > > > > > > > > > > internal tree problem. > > > > > > > > > > > > > > There is a tree operation being performed here. mprotect > > > > is > > > > > > merging a > > > > > > > > > > > > > > vma by the looks of the call stack. Why do you think no > > > > tree > > > > > > operation > > > > > > > > > > > > > > is necessary? > > > > > > > > > > > > > > As you mentioned, mas_preallocate() should allocate > > > > > > > enough > > > > > > > > node, > > > > > > but there is such functions mas_node_count() in > > > > > > > > mas_store_prealloc(). > > > > > > > In mas_node_count() checks whether the *mas* has enough > > > > nodes, > > > > and > > > > > > allocate memory for node if there was no enough nodes in > > > > > > mas. > > > > > > > > > > > > Right, we call mas_node_count() so that both code paths are > > > > used > > > > for > > > > > > preallocations and regular mas_store()/mas_store_gfp(). It > > > > > > > > shouldn't > > > > > > take a significant amount of time to verify there is enough > > > > > > > > nodes. > > > > > > > > > > Yap..., it didn't take a significant amount of time to verify > > > > > > > > whether > > > > > there is enough nodes. The problem is why the flow in > > > > > > > > mas_node_count > > > > > will alloc nodes if there was no enough nodes in mas? > > > > > > > > What I meant is that both methods use the same call path > > > > because > > > > there > > > > is not a reason to duplicate the path. After mas_preallocate() > > > > has > > > > allocated the nodes needed, the call to check if there is > > > > enough > > > > nodes > > > > will be quick. > > > > > > So whether the purpose of mas_preallocate() is decreasing the > > > lock > > > retention time? > > > > It could be, but in this case it's the locking order. We have to > > pre-allocate and fail early if we are out of memory, because we > > _cannot_ > > use GFP_KERNEL where we call > > mas_store_prealloc(). mas_preallocate() > > will use GFP_KERENL though. We cannot use GFP_KERNEL during the > > store > > because reclaim may sleep and we cannot sleep holding the locks we > > need > > to hold at the time of the store operation in __vma_adjust(). > > > > Yap, if the type of lock is spinlock, the flow shouldn't sleep in the > critical sections. mmap_lock is implmented by rw_semaphore(mutex). Is > there any other lock in this section? > > > > > > > > > > > > > > > I think that if mas_preallocate() allocate enough node, > > > > > > > why > > > > we > > > > > > check the node count and allocate nodes if there was no > > > > enough > > > > nodes > > > > > > in mas in mas_node_count()? > > > > > > > > > > > > We check for the above reason. > > > > > > > > > > > > > > > > OK..., this is one of the root cause of this BUG. > > > > > > > > The root cause is that there was not enough memory for a store > > > > operation. Regardless of if we check the allocations in the > > > > mas_store_prealloc() path or not, this would fail. If we > > > > remove > > > > the > > > > check for nodes within this path, then we would have to > > > > BUG_ON() > > > > when > > > > we > > > > run out of nodes to use or have a null pointer dereference BUG > > > > anyways. > > > > > > > > > > Yap, the root cause is oom. The BUG_ON() for the situations that > > > > the > > > maple tree struct cannot be maintained because of the lack of > > > > memory is > > > necessary. But the the buddy system in linux kernel can reclaim > > > > memory > > > when the system is under the low memory status. If we use > > > > GFP_KERNEL > > > after trying GFP_NOWAIT to allocate node, maybe we can get enough > > > memory when the second try with GFP_KERNEL. > > > > Right, but the GFP_KERNEL cannot be used when holding certain locks > > because it may sleep. > > > > > > > > > > > > > > > > > > > > > We have seen that there may be some maple_tree operations > > > > in > > > > > > merge_vma... > > > > > > > > > > > > If merge_vma() does anything, then there was an operation > > > > > > to > > > > the > > > > > > maple > > > > > > tree. > > > > > > > > > > > > > > > > > > > > Moreover, would maple_tree provides an API for assigning > > > > user's > > > > gfp > > > > > > flag for allocating node? > > > > > > > > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an > > > > > > argument. In > > > > > > your call stack, it will be called in __vma_adjust() as > > > > > > such: > > > > > > > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL)) > > > > > > return -ENOMEM; > > > > > > > > > > > > line 715 in v6.1.25 > > > > > > > > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in > > > > > > > this > > > > > > > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node with > > > > > > GFP_NOWAIT and __GFP_NOWARN. > > > > > > > > > > > > We use GFP_KERNEL as I explained above for the VMA tree. > > > > > > > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and > > > > > > > > __GFP_NOWARN > > > > > in inserting tree flow. Do you consider the performance of > > > > > > > > maintaining > > > > > the structure of maple_tree? > > > > > > > > Sorry, I don't understand what you mean by 'consider the > > > > performance > > > > of > > > > maintaining the structure of maple_tree'. > > > > > > > > > > As I mentioned above, GFP_NOWAIT will not allow buddy system for > > > reclaiming memory, so "Do you consider the performance of > > > > maintaining > > > the structure of maple_tree" means that: whether the > > > > mas_node_count() > > > path is not allowed to reclaim or compact memory for the > > > > performance. > > > > Ah, no. This is not for performance. It was initially on the road > > map > > for performance, but it was needed for the complicated locking in > > the > > MM > > code. > > > > rb tree embedded the nodes in the VMA which is allocated outside > > this > > critical section and so it could use GFP_KERNEL. > > > > As I know, following is rb_tree flow in 5.15.186: > > ... > mmap_write_lock_killable(mm) > ... > do_mmap() > ... > mmap_region() > ... > vm_area_alloc(mm) > ... > mmap_write_unlock(mm) > > vm_area_alloc is in the mmap_lock hoding period. > It seems that the flow would sleep here in rb_tree flow. > If I miss anything, please tell me, thanks! > > > > > > > > > > > > > It also will drop the lock and retry with GFP_KERNEL on > > > > failure > > > > > > when not using the external lock. The mmap_lock is > > > > configured as > > > > an > > > > > > external lock. > > > > > > > > > > > > > Allocation will not wait for reclaiming and compacting > > > > > > > when > > > > > > > > there > > > > > > is no enough available memory. > > > > > > > Is there any concern for this design? > > > > > > > > > > > > This has been addressed above, but let me know if I missed > > > > > > > > anything > > > > > > here. > > > > > > > > > > > > > > > > I think that the mas_node_count() has higher rate of > > > > > triggering > > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and > > > > __GFP_NOWARN. If > > > > > mas_node_count() use GFP_KERNEL as mas_preallocate() in the > > > > mmap.c, > > > > the > > > > > allocation fail rate may be lower than use GFP_NOWAIT. > > > > > > > > Which BUG_ON() are you referring to? > > > > > > > > If I was to separate the code path for mas_store_prealloc() and > > > > mas_store_gfp(), then a BUG_ON() would still need to exist and > > > > still > > > > would have been triggered.. We are in a place in the code > > > > where > > > > we > > > > should never sleep and we don't have enough memory allocated to > > > > do > > > > what > > > > was necessary. > > > > > > > > > > Yap. There is no reason to seprate mas_store_prealloc() and > > > mas_store_gfp. Is it possible to retry to allocate mas_node with > > > GFP_KERNEL (wait for system reclaim and compact) instead of > > > > triggering > > > BUG_ON once the GFP_NOWAIT allocation failed? > > > > Unfortunately not, no. In some cases it may actually work out that > > the > > VMA may not need the locks in question, but it cannot be > > generalized > > for > > __vma_adjust(). Where I am able, I use the mas_store_gfp() calls > > so > > we > > can let reclaim and compact run, but it's not possible here. > > > > We have used GFP_KERNEL as alloc flag in mas_node_count flow about 2 > months. The mentioned problem we mentioned in the first mail didn't > reproduce under high stress stability test. > > I have seen the patch provided by you. I will verify this patch in > our > stability test. But there is a problem, if maple_tree behavior is > unexpected (e.g. redundant write bug this time). We can only review > the > whole mm flow to find out the wrong behavior. Do we have an efficient > debug method for maple tree? > > > Thanks, > > Liam Best Regards, John Hsu