On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@xxxxxxxxxxxxx> wrote:
>>
>> On 07.09.22 at 18:06, Eric Dumazet wrote:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@xxxxxxxxxxxxx> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring a performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted that change, this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 parallel connections, we measure a 13% difference in throughput
>>>> for our PCI-based network interfaces since 472c2e07eef0.
>>>> (Our IOMMU is sensitive to the number of mapped pages.)
>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in the queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a workaround in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory into which you can copy small skbs (skb->head included).
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>> FWIW, this regression was with the standard Mellanox driver (nothing s390 specific).
>
> I did not claim this was s390 specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes the TCP stack slower
> (more tests in the fast path) for the majority of us _not_ using an IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if it contains TCP payload,
> thus adding yet another corner case to their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back; it is not worth it.
>
> Again, a driver knows better whether it is using an IOMMU and whether
> pathological layouts can be optimized to non-SG ones, and using a
> pre-dma-mapped zone will also benefit pure TCP ACK packets (which do not
> have any payload).
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
[...]

Saeed,
as discussed at LPC, could you please consider adding a workaround to the
Mellanox driver to use non-SG skbs for small messages?
As mentioned above, we are seeing a 13% throughput degradation when 2 pages
need to be mapped instead of 1.
While Eric's ideas sound very promising, just using non-SG skbs in these
cases should be enough to mitigate the performance regression we see.

Thank you in advance.
Alexandra
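
For illustration, a minimal sketch of the pre-dma-mapped TX bounce area
Eric describes above. All names (tx_ring, SLOT_SIZE, tx_try_bounce) are
hypothetical; this is not the mlx5 or GQ implementation, and descriptor
posting and doorbell handling are omitted.

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define SLOT_SIZE 256	/* one bounce slot per TX descriptor, as suggested */

struct tx_ring {
	struct device *dev;
	void *bounce_cpu;	/* coherent bounce area, mapped once at init */
	dma_addr_t bounce_dma;
	unsigned int size;	/* number of descriptors */
};

/* Map the bounce area once, so small packets need no per-skb DMA mapping. */
static int tx_ring_bounce_init(struct tx_ring *ring)
{
	ring->bounce_cpu = dma_alloc_coherent(ring->dev,
					      ring->size * SLOT_SIZE,
					      &ring->bounce_dma, GFP_KERNEL);
	return ring->bounce_cpu ? 0 : -ENOMEM;
}

/*
 * Copy a small skb (pure ACKs included) into the premapped slot and
 * return the DMA address to put in the descriptor; 0 means "too big,
 * fall back to the regular per-skb mapping / SG path".
 */
static dma_addr_t tx_try_bounce(struct tx_ring *ring, unsigned int slot,
				struct sk_buff *skb)
{
	void *dst = ring->bounce_cpu + slot * SLOT_SIZE;

	if (skb->len > SLOT_SIZE)
		return 0;

	/* Linearizes the skb into the slot, whatever its page layout. */
	skb_copy_bits(skb, 0, dst, skb->len);
	return ring->bounce_dma + slot * SLOT_SIZE;
}

Because the area is coherent and mapped once at ring setup, the IOMMU sees
no per-packet mappings, and a single non-SG descriptor can be used no matter
how the stack laid out the skb.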