On 2024/9/10 19:44, Thomas Gleixner wrote:
> On Tue, Sep 10 2024 at 12:00, Leizhen wrote:
>> On 2024/9/10 2:41, Thomas Gleixner wrote:
>>> All related functions have this problem and all of this code is very
>>> strict about boundaries. Instead of accurately doing the refill, purge
>>> etc. we should look into proper batch mode mechanisms. Let me think
>>> about it.
>> It may be helpful to add several arrays to record the first node of each batch
>> in each free list. Take 'percpu_pool' as an example:
>>
>> struct debug_percpu_free {
>> 	struct hlist_head	free_objs;
>> 	int			obj_free;
>> +	int			batch_idx;
>> +	struct hlist_node	*batch_first[4];	// ODEBUG_POOL_PERCPU_SIZE / ODEBUG_BATCH_SIZE
>> };
>>
>> A new free node is added to the header of the list, and the batch is cut from the tail
>> of the list.
>> NodeA<-->...<-->NodeB<-->...<-->NodeC<-->NodeD<--> free_objs
>>                |---one batch---|---one batch---|
>>                |               |
>>          batch_first[0]   batch_first[1]
> The current data structures are not fit for the purpose. Glueing
> workarounds into the existing mess makes it just worse.
>
> So the data structures need to be redesigned from ground up to be fit
> for the purpose.
>
> allocation:
>
>  1) Using the global pool for single object allocations is wrong
>
>     During boot this can be a completely disconnected list, which does
>     not need any accounting, does not need pool_lock and can be just
>     protected with irqsave like the per CPU pools. It's effectivly a
>     per CPU pool because at that point there is only one CPU and
>     everything is single threaded.
>
>  2) The per CPU pool handling is backwards
>
>     If the per CPU pool is empty, then the pool needs to be refilled
>     with a batch from the global pool and allocated from there.
>
>     Allocation then always happens from the active per CPU batch slot.
>
> free:
>
>  1) Early boot
>
>     Just put it back on the dedicated boot list and be done
>
>  2) After obj_cache is initialized
>
>     Put it back to the per CPU pool into the active batch slot. If
>     the slot becomes full then make the next slot the active slot. It
>     the full slot was the top most slot then move that slot either
>     into the global pool when there is a free slot, or move it to the
>     to_free pool.
>
> That means the per CPU pool is different from the global pools as it can
> allocate/free single objects, while the global pools are strictly stacks
> of batches. Any movement between per CPU pools and global pools is batch
> based and just moves lists from one head to another.
>
> That minimizes the pool lock contention and the cache foot print. The
> global to free pool must have an extra twist to accomodate non-batch
> sized drops and to handle the all slots are full case, but that's just a
> trivial detail.

That's great. I really admire you for completing the refactor in such a
short time. But I have a few minor comments.

1. When kmem_cache_zalloc() is called to allocate objs for refilling and
   fewer than one batch of objs are obtained, all of them can be pushed
   to the local CPU, that is, by calling pcpu_free() on them one by one
   (see the small sketch below). This way the number of free objs cached
   by pool_global and pool_to_free is always an integer multiple of
   ODEBUG_BATCH_SIZE.

2. Member tot_cnt of struct global_pool can be deleted. We can derive it
   simply and quickly from (slot_idx * ODEBUG_BATCH_SIZE), which avoids
   maintaining it redundantly.

3. debug_objects_pool_min_level also needs to be adjusted accordingly,
   so that it expresses the minimum level as a number of batches.
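
To make comment 1 concrete, here is a minimal user-space toy model of that
invariant. All identifiers in it (toy_pcpu_free(), toy_refill(), BATCH_SIZE,
...) are invented for illustration and are not the names used in your patch;
it only shows that routing partial fills through the per-CPU pool keeps the
global pool an exact multiple of the batch size.

/*
 * Toy model of comment 1: objects that do not fill a whole batch stay
 * in the per-CPU pool, so the global pool only ever holds full batches.
 * All names here are invented for illustration; they are not the
 * identifiers used in the debugobjects patch.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE	16		/* stand-in for ODEBUG_BATCH_SIZE */

struct toy_obj {
	struct toy_obj	*next;
};

/* Per-CPU style pool: may hold any number of single objects. */
static struct toy_obj	*pcpu_head;
static int		pcpu_cnt;

/* Global style pool: holds whole batches only, counted in batches. */
static struct toy_obj	*global_head;
static int		global_batches;

static void toy_pcpu_free(struct toy_obj *obj)
{
	obj->next = pcpu_head;
	pcpu_head = obj;

	/* Once a full batch has accumulated, splice it into the global pool. */
	if (++pcpu_cnt == BATCH_SIZE) {
		struct toy_obj *tail = pcpu_head;

		while (tail->next)
			tail = tail->next;
		tail->next = global_head;
		global_head = pcpu_head;
		global_batches++;

		pcpu_head = NULL;
		pcpu_cnt = 0;
	}
}

/*
 * Refill path: try to allocate @count objects. Leftovers that do not
 * form a full batch simply stay in the per-CPU pool, so the global
 * object count is always global_batches * BATCH_SIZE.
 */
static void toy_refill(int count)
{
	for (int i = 0; i < count; i++) {
		struct toy_obj *obj = calloc(1, sizeof(*obj));

		if (!obj)
			break;
		toy_pcpu_free(obj);
	}
}

int main(void)
{
	toy_refill(BATCH_SIZE + 3);	/* one full batch plus 3 singles */

	printf("global: %d batches (%d objs), per-CPU singles: %d\n",
	       global_batches, global_batches * BATCH_SIZE, pcpu_cnt);

	/* The global pool never contains a partial batch. */
	assert(pcpu_cnt < BATCH_SIZE);
	return 0;
}
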
>
> See the completely untested combo patch against tip core/debugobjects
> below.
>
> Thanks,
>
>         tglx

-- 
Regards,
  Zhen Lei