Re: [PATCH] usbnet: jump to rx_cleanup case instead of calling skb_queue_tail

Leesoo Ahn <lsahn@xxxxxxxxxx> · Mon, 19 Dec 2022 17:09:21 +0900

On 22. 12. 19. 16:50, Greg KH wrote:
On Mon, Dec 19, 2022 at 04:41:16PM +0900, Leesoo Ahn wrote:
On 22. 12. 18. 17:55, Greg KH wrote:
On Sun, Dec 18, 2022 at 01:18:51AM +0900, Leesoo Ahn wrote:
The current source pushes skb into dev->done queue by calling
skb_queue_tail() and then, call skb_dequeue() to pop for rx_cleanup state
to free urb and skb next in usbnet_bh().
It wastes CPU resource with extra instructions. Instead, use return values
jumping to rx_cleanup case directly to free them. Therefore calling
skb_queue_tail() and skb_dequeue() is not necessary.

The follows are just showing difference between calling skb_queue_tail()
and using return values jumping to rx_cleanup state directly in usbnet_bh()
in Arm64 instructions with perf tool.

----------- calling skb_queue_tail() -----------
         │     if (!(dev->driver_info->flags & FLAG_RX_ASSEMBLE))
    7.58 │248:   ldr     x0, [x20, #16]
    2.46 │24c:   ldr     w0, [x0, #8]
    1.64 │250: ↑ tbnz    w0, #14, 16c
         │     dev->net->stats.rx_errors++;
    0.57 │254:   ldr     x1, [x20, #184]
    1.64 │258:   ldr     x0, [x1, #336]
    2.65 │25c:   add     x0, x0, #0x1
         │260:   str     x0, [x1, #336]
         │     skb_queue_tail(&dev->done, skb);
    0.38 │264:   mov     x1, x19
         │268:   mov     x0, x21
    2.27 │26c: → bl      skb_queue_tail
    0.57 │270: ↑ b       44    // branch to call skb_dequeue()

----------- jumping to rx_cleanup state -----------
         │     if (!(dev->driver_info->flags & FLAG_RX_ASSEMBLE))
    1.69 │25c:   ldr     x0, [x21, #16]
    4.78 │260:   ldr     w0, [x0, #8]
    3.28 │264: ↑ tbnz    w0, #14, e4    // jump to 'rx_cleanup' state
         │     dev->net->stats.rx_errors++;
    0.09 │268:   ldr     x1, [x21, #184]
    2.72 │26c:   ldr     x0, [x1, #336]
    3.37 │270:   add     x0, x0, #0x1
    0.09 │274:   str     x0, [x1, #336]
    0.66 │278: ↑ b       e4    // branch to 'rx_cleanup' state
Interesting, but does this even really matter given the slow speed of
the USB hardware?
It doesn't if USB hardware has slow speed but in software view, it's still
worth avoiding calling skb_queue_tail() and skb_dequeue() which work with
spinlock, if possible.
But can you actually measure that in either CPU load or in increased
transfer speeds?

thanks,

greg k-h

I think the follows are maybe what you would be interested in. I have 
tested both case with perf on the same machine and environments, also 
modified driver code a bit to go to rx_cleanup case, not to net stack in 
a specific packet.

----- calling skb_queue_tail() -----
-   11.58%     0.26%  swapper          [k] usbnet_bh
   - 11.32% usbnet_bh
      - 6.43% skb_dequeue
           6.34% _raw_spin_unlock_irqrestore
      - 2.21% skb_queue_tail
           2.19% _raw_spin_unlock_irqrestore
      - 1.68% consume_skb
         - 0.97% kfree_skbmem
              0.80% kmem_cache_free
           0.53% skb_release_data

----- jump to rx_cleanup directly -----
-    7.62%     0.18%  swapper          [k] usbnet_bh
   - 7.44% usbnet_bh
      - 4.63% skb_dequeue
           4.57% _raw_spin_unlock_irqrestore
      - 1.76% consume_skb
         - 1.03% kfree_skbmem
              0.86% kmem_cache_free
           0.56% skb_release_data
        0.54% smsc95xx_rx_fixup

The first case takes CPU resource a bit much by the result.

Thank you for reviewing, by the way.

Best regards,
Leesoo