Re: [PATCH v2 bpf-next] bpf: devmap: move drop error path to devmap for XDP_REDIRECT

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> On Mon, 1 Mar 2021 13:23:06 +0200
> Shay Agroskin <shayagr@xxxxxxxxxx> wrote:
> 
> > Jesper Dangaard Brouer <brouer@xxxxxxxxxx> writes:
> > 
> > > On Sun, 28 Feb 2021 23:27:25 +0100
> > > Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx> wrote:
> > >  
> > >> > >  	drops = bq->count - sent;
> > >> > > -out:
> > >> > > -	bq->count = 0;
> > >> > > +	if (unlikely(drops > 0)) {
> > >> > > +		/* If not all frames have been 
> > >> > > transmitted, it is our
> > >> > > +		 * responsibility to free them
> > >> > > +		 */
> > >> > > +		for (i = sent; i < bq->count; i++)
> > >> > > + 
> > >> > > xdp_return_frame_rx_napi(bq->q[i]);
> > >> > > +	}    
> > >> > 
> > >> > Wouldn't the logic above be the same even w/o the 'if' 
> > >> > condition ?    
> > >> 
> > >> it is just an optimization to avoid the for loop instruction if 
> > >> sent = bq->count  
> > >
> > > True, and I like this optimization.
> > > It will affect how the code layout is (and thereby I-cache 
> > > usage).  
> > 
> > I'm not sure what I-cache optimization you mean here. Compiling 
> > the following C code:
> > 
> > # define unlikely(x)	__builtin_expect(!!(x), 0)
> > 
> > extern void xdp_return_frame_rx_napi(int q);
> > 
> > struct bq_stuff {
> >     int q[4];
> >     int count;
> > };
> > 
> > int test(int sent, struct bq_stuff *bq) {
> >     int i;
> >     int drops;
> > 
> >     drops = bq->count - sent;
> >     if(unlikely(drops > 0))
> >         for (i = sent; i < bq->count; i++)
> >             xdp_return_frame_rx_napi(bq->q[i]);
> > 
> >     return 2;
> > }
> > 
> > with x86_64 gcc 10.2 with -O3 flag in https://godbolt.org/ (which 
> > provides the assembly code for different compilers) yields the 
> > following assembly:
> > 
> > test:
> >         mov     eax, DWORD PTR [rsi+16]
> >         mov     edx, eax
> >         sub     edx, edi
> >         test    edx, edx
> >         jg      .L10
> > .L6:
> >         mov     eax, 2
> >         ret
> 
> This exactly shows my point.  Notice how 'ret' happens earlier in this
> function.  This is the common case, thus the CPU don't have to load the
> asm instruction below.
> 
> > .L10:
> >         cmp     eax, edi
> >         jle     .L6
> >         push    rbp
> >         mov     rbp, rsi
> >         push    rbx
> >         movsx   rbx, edi
> >         sub     rsp, 8
> > .L3:
> >         mov     edi, DWORD PTR [rbp+0+rbx*4]
> >         add     rbx, 1
> >         call    xdp_return_frame_rx_napi
> >         cmp     DWORD PTR [rbp+16], ebx
> >         jg      .L3
> >         add     rsp, 8
> >         mov     eax, 2
> >         pop     rbx
> >         pop     rbp
> >         ret
> > 
> > 
> > When dropping the 'if' completely I get the following assembly 
> > output
> > test:
> >         cmp     edi, DWORD PTR [rsi+16]
> >         jge     .L6
> 
> Jump to .L6 which is the common case.  The code in between is not used
> in common case, but the CPU will likely load this into I-cache, and
> then jumps over the code in common case.
> 
> >         push    rbp
> >         mov     rbp, rsi
> >         push    rbx
> >         movsx   rbx, edi
> >         sub     rsp, 8
> > .L3:
> >         mov     edi, DWORD PTR [rbp+0+rbx*4]
> >         add     rbx, 1
> >         call    xdp_return_frame_rx_napi
> >         cmp     DWORD PTR [rbp+16], ebx
> >         jg      .L3
> >         add     rsp, 8
> >         mov     eax, 2
> >         pop     rbx
> >         pop     rbp
> >         ret
> > .L6:
> >         mov     eax, 2
> >         ret
> > 
> > which exits earlier from the function if 'drops > 0' compared to 
> > the original code (the 'for' loop looks a little different, but 
> > this shouldn't affect icache).
> >
> > When removing the 'if' and surrounding the 'for' condition with 
> > 'unlikely' statement:
> > 
> > for (i = sent; unlikely(i < bq->count); i++)
> > 
> > I get the following assembly code:
> > 
> > test:
> >         cmp     edi, DWORD PTR [rsi+16]
> >         jl      .L10
> >         mov     eax, 2
> >         ret
> > .L10:
> >         push    rbx
> >         movsx   rbx, edi
> >         sub     rsp, 16
> > .L3:
> >         mov     edi, DWORD PTR [rsi+rbx*4]
> >         mov     QWORD PTR [rsp+8], rsi
> >         add     rbx, 1
> >         call    xdp_return_frame_rx_napi
> >         mov     rsi, QWORD PTR [rsp+8]
> >         cmp     DWORD PTR [rsi+16], ebx
> >         jg      .L3
> >         add     rsp, 16
> >         mov     eax, 2
> >         pop     rbx
> >         ret
> > 
> > which is shorter than the other two (one line compared to the 
> > second and 7 lines compared the original code) and seems as 
> > optimized as the second.
> 
> You are also using unlikely() and get the earlier return, with less
> instructions, which is great.  Perhaps we can use this type of
> unlikely() in the for-statement?  WDYT Lorenzo?

sure, we can do it..I will address it in v3. Thanks.

Regards,
Lorenzo

>  
>  
> > I'm far from being an assembly expert, and I tested a code snippet 
> > I wrote myself rather than the kernel's code (for the sake of 
> > simplicity only).
> > Can you please elaborate on what makes the original 'if' essential 
> > (I took the time to do the assembly tests, please take the time on 
> > your side to prove your point, I'm not trying to be grumpy here).
> > 
> > Shay
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux