Re: [PATCH v2 bpf-next 09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 28 Jun 2023 10:38:39 -0700

On Wed, Jun 28, 2023 at 04:09:14PM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/28/2023 8:59 AM, Alexei Starovoitov wrote:
> > On 6/26/23 12:16 AM, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> >>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> >>>>> From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >>>>>
> >>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid
> >>>>> unnecessary kmalloc().
> >>>>>
> >>>>> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> >>>>> ---
> >>>>>   kernel/bpf/memalloc.c | 9 +++++++++
> >>>>>   1 file changed, 9 insertions(+)
> >>>>>
> SNIP
> >>        // free A (from c1), ..., last free X (allocated from c0)
> >>      P3: unit_free(c1)
> >>          // the last freed element X is from c0
> >>          c1->tgt = c0
> >>          c1->free_llist->first -> X -> Y -> ... -> A
> >>      P3: free_bulk(c1)
> >>          enque_to_free(c0)
> >>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
> >>          __llist_add_batch(c0->waiting_for_gp_ttrace)
> >>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
> >
> > In theory that's possible, but for this to happen one cpu needs
> > to be a thousand times slower than all the others, and since there is no
> > preemption in llist_del_first I don't think we need to worry about it.
> 
> I'm not sure whether such a case is possible in a VM. After all,
> CPU X is just a thread on the host, and it may be preempted at any time
> and for any duration.

vCPU preemption can happen even with guest-OS interrupts disabled, and
such preemption can persist for hundreds of milliseconds, or even for
several seconds.  So admittedly quite rare, but also quite possible.

							Thanx, Paul

> > Also, with the removal of the _tail optimization, the above
> > llist_add_batch(waiting_for_gp_ttrace)
> > will become a loop, so the reused element will be at the very end
> > instead of the top, so one cpu would need to be a million times slower,
> > which is not realistic.
> 
> It is still possible that A will be added back as
> waiting_for_gp_ttrace->first after switching to llist_add(), as shown
> below. My question is: how much benefit is there in reusing from
> waiting_for_gp_ttrace?
> 
>     // free A (from c1), ..., last free X (allocated from c0) 
>     P3: unit_free(c1)
>         // the last freed element X is allocated from c0
>         c1->tgt = c0
>         c1->free_llist->first -> A -> ... -> Y
>         c1->free_llist_extra -> X
> 
>     P3: free_bulk(c1)
>         enque_to_free(c0) 
>             c0->free_by_rcu_ttrace->first -> Y -> ... -> A
>             c0->free_by_rcu_ttrace->first -> X -> Y -> ... -> A
> 
>         llist_add(c0->waiting_for_gp_ttrace)
>             c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
> 
> >
> >> P1:
> >>      // A is added back as first again
> >>      // but llist_del_first() didn't know
> >>      try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
> >>      // c0->waiting_for_gp_ttrace is corrupted
> >>      c0->waiting_for_gp_ttrace->first = B
> >>
>
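
For readers following the quoted P1/P3 interleaving, here is a minimal
userspace sketch of the ABA hazard being discussed. It is not the kernel's
llist code: it uses C11 atomics, the names (struct node, struct list, push,
aba_corrupts_list) are hypothetical, and the scenario is simplified to three
elements. The stalled deleter is replayed in a single thread so the outcome
is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a lock-less singly linked list. */
struct node { struct node *next; };
struct list { _Atomic(struct node *) first; };

/* Plain push; enough here because the race is replayed in one thread. */
static void push(struct list *l, struct node *n)
{
	n->next = atomic_load(&l->first);
	atomic_store(&l->first, n);
}

/* Replays the interleaving; returns true if the list ends up corrupted. */
static bool aba_corrupts_list(void)
{
	struct node A = { NULL }, B = { NULL }, X = { NULL };
	struct list l = { NULL };

	/* Initial state: first -> A -> B */
	push(&l, &B);
	push(&l, &A);

	/* P1: first half of a del_first: snapshot head and its successor. */
	struct node *entry = atomic_load(&l.first);	/* A */
	struct node *next = entry->next;		/* B */

	/* P1 stalls here (for example, its vCPU is preempted). */

	/* Other CPUs drain the list; A and B are handed out elsewhere. */
	atomic_store(&l.first, NULL);

	/* A is freed again and ends up back at the head: first -> A -> X */
	push(&l, &X);
	push(&l, &A);

	/* P1 resumes: the compare-exchange succeeds since first == A again. */
	bool swapped = atomic_compare_exchange_strong(&l.first, &entry, next);

	/* first now points at B, which is not on the list, and X is lost. */
	return swapped && atomic_load(&l.first) == &B;
}
```

The compare-exchange only checks that the head pointer still equals A; it
cannot tell that A left the list and came back with a different successor,
which is why the length of the stall matters in the discussion above.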