On 2024/9/10 19:44, Thomas Gleixner wrote:
> On Tue, Sep 10 2024 at 12:00, Leizhen wrote:
>> On 2024/9/10 2:41, Thomas Gleixner wrote:
>>> All related functions have this problem and all of this code is very
>>> strict about boundaries. Instead of accurately doing the refill, purge
>>> etc. we should look into proper batch mode mechanisms. Let me think
>>> about it.
>> It may be helpful to add several arrays to record the first node of each batch
>> in each free list. Take 'percpu_pool' as an example:
>>
>> struct debug_percpu_free {
>> 	struct hlist_head	free_objs;
>> 	int			obj_free;
>> +	int			batch_idx;
>> +	struct hlist_node	*batch_first[4];	// ODEBUG_POOL_PERCPU_SIZE / ODEBUG_BATCH_SIZE
>> };
>>
>> A new free node is added to the header of the list, and the batch is cut from the tail
>> of the list.
>> NodeA<-->...<-->NodeB<-->...<-->NodeC<-->NodeD<--> free_objs
>>                |---one batch---|---one batch---|
>>                |               |
>>          batch_first[0]   batch_first[1]
> The current data structures are not fit for the purpose. Glueing
> workarounds into the existing mess makes it just worse.
>
> So the data structures need to be redesigned from ground up to be fit
> for the purpose.
>
> allocation:
>
>  1) Using the global pool for single object allocations is wrong
>
>     During boot this can be a completely disconnected list, which does
>     not need any accounting, does not need pool_lock and can be just
>     protected with irqsave like the per CPU pools. It's effectivly a
>     per CPU pool because at that point there is only one CPU and
>     everything is single threaded.
>
>  2) The per CPU pool handling is backwards
>
>     If the per CPU pool is empty, then the pool needs to be refilled
>     with a batch from the global pool and allocated from there.
>
>     Allocation then always happens from the active per CPU batch slot.
>
> free:
>
>  1) Early boot
>
>     Just put it back on the dedicated boot list and be done
>
>  2) After obj_cache is initialized
>
>     Put it back to the per CPU pool into the active batch slot. If
>     the slot becomes full then make the next slot the active slot. It
>     the full slot was the top most slot then move that slot either
>     into the global pool when there is a free slot, or move it to the
>     to_free pool.
>
> That means the per CPU pool is different from the global pools as it can
> allocate/free single objects, while the global pools are strictly stacks
> of batches. Any movement between per CPU pools and global pools is batch
> based and just moves lists from one head to another.
>
> That minimizes the pool lock contention and the cache foot print. The
> global to free pool must have an extra twist to accomodate non-batch
> sized drops and to handle the all slots are full case, but that's just a
> trivial detail.

That's great. I really admire you for completing the refactor in such a
short time. But I have a few minor comments.

1. When kmem_cache_zalloc() is called to allocate objs for refilling and
   fewer than one batch of objs are obtained, all of them can be pushed
   to the local CPU, that is, by calling pcpu_free() on them one by one
   (see the small sketch below). This way the number of free objs cached
   by pool_global and pool_to_free is always an integer multiple of
   ODEBUG_BATCH_SIZE.

2. Member tot_cnt of struct global_pool can be deleted. We can derive it
   simply and quickly from (slot_idx * ODEBUG_BATCH_SIZE), which avoids
   maintaining it redundantly.

3. debug_objects_pool_min_level also needs to be adjusted accordingly,
   so that it expresses the minimum level as a number of batches.
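
To make comment 1 concrete, here is a minimal user-space toy model of that
invariant. All identifiers in it (toy_pcpu_free(), toy_refill(), BATCH_SIZE,
...) are invented for illustration and are not the names used in your patch;
it only shows that routing partial fills through the per-CPU pool keeps the
global pool an exact multiple of the batch size.

/*
 * Toy model of comment 1: objects that do not fill a whole batch stay
 * in the per-CPU pool, so the global pool only ever holds full batches.
 * All names here are invented for illustration; they are not the
 * identifiers used in the debugobjects patch.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE	16		/* stand-in for ODEBUG_BATCH_SIZE */

struct toy_obj {
	struct toy_obj	*next;
};

/* Per-CPU style pool: may hold any number of single objects. */
static struct toy_obj	*pcpu_head;
static int		pcpu_cnt;

/* Global style pool: holds whole batches only, counted in batches. */
static struct toy_obj	*global_head;
static int		global_batches;

static void toy_pcpu_free(struct toy_obj *obj)
{
	obj->next = pcpu_head;
	pcpu_head = obj;

	/* Once a full batch has accumulated, splice it into the global pool. */
	if (++pcpu_cnt == BATCH_SIZE) {
		struct toy_obj *tail = pcpu_head;

		while (tail->next)
			tail = tail->next;
		tail->next = global_head;
		global_head = pcpu_head;
		global_batches++;

		pcpu_head = NULL;
		pcpu_cnt = 0;
	}
}

/*
 * Refill path: try to allocate @count objects. Leftovers that do not
 * form a full batch simply stay in the per-CPU pool, so the global
 * object count is always global_batches * BATCH_SIZE.
 */
static void toy_refill(int count)
{
	for (int i = 0; i < count; i++) {
		struct toy_obj *obj = calloc(1, sizeof(*obj));

		if (!obj)
			break;
		toy_pcpu_free(obj);
	}
}

int main(void)
{
	toy_refill(BATCH_SIZE + 3);	/* one full batch plus 3 singles */

	printf("global: %d batches (%d objs), per-CPU singles: %d\n",
	       global_batches, global_batches * BATCH_SIZE, pcpu_cnt);

	/* The global pool never contains a partial batch. */
	assert(pcpu_cnt < BATCH_SIZE);
	return 0;
}
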
>
> See the completely untested combo patch against tip core/debugobjects
> below.
>
> Thanks,
>
>         tglx

-- 
Regards,
  Zhen Lei