On Thu, Aug 31, 2017 at 11:25 PM, Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
>
> On 2017年08月31日 22:30, Willem de Bruijn wrote:
>>>
>>> Incomplete results at this stage, but I do see this correlation between
>>> flows. It occurs even while not running out of zerocopy descriptors,
>>> which I cannot yet explain.
>>>
>>> Running two threads in a guest, each with a udp socket, each
>>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>>
>>> Sender A sends 1B datagrams.
>>> Sender B sends VHOST_GOODCOPY_LEN-byte datagrams, which is enough
>>> to trigger zcopy_used in vhost net.
>>>
>>> A local receive process on the host receives both flows. To avoid
>>> a deep copy when looping the packet onto the receive path, I
>>> changed skb_orphan_frags_rx to always return false (gross hack).
>>>
>>> The flow with the larger packets is redirected through netem on ifb0:
>>>
>>>   modprobe ifb
>>>   ip link set dev ifb0 up
>>>   tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>>
>>>   tc qdisc add dev tap0 ingress
>>>   tc filter add dev tap0 parent ffff: protocol ip \
>>>       u32 match ip dport 8000 0xffff \
>>>       action mirred egress redirect dev ifb0
>>>
>>> For a 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>>
>>> no filter
>>>   rx.A: ~840,000
>>>   rx.B: ~840,000
>>>
>>> limit 1
>>>   rx.A: ~500,000
>>>   rx.B: ~3100
>>>   ifb0: 3273 sent, 371141 dropped
>>>
>>> limit 100
>>>   rx.A: ~9000
>>>   rx.B: ~4200
>>>   ifb0: 4630 sent, 1491 dropped
>>>
>>> limit 1000
>>>   rx.A: ~6800
>>>   rx.B: ~4200
>>>   ifb0: 4651 sent, 0 dropped
>>>
>>> Sender B is always correctly rate limited to 1 Mbit/s or less. With a
>>> short queue, it ends up dropping a lot and sending even less.
>>>
>>> When a queue builds up for sender B, sender A throughput is strongly
>>> correlated with queue length. With queue length 1, it can send almost
>>> at unthrottled speed. But even at limit 100 its throughput is on the
>>> same order as sender B.
>>>
>>> What is surprising to me is that this happens even though the number
>>> of ubuf_info in use at limit 100 is around 100 at all times. In other
>>> words, it does not exhaust the pool.
>>>
>>> When forcing zcopy_used to be false for all packets, this effect of
>>> sender A throughput being correlated with sender B does not happen.
>>>
>>> no filter
>>>   rx.A: ~850,000
>>>   rx.B: ~850,000
>>>
>>> limit 100
>>>   rx.A: ~850,000
>>>   rx.B: ~4200
>>>   ifb0: 4518 sent, 876182 dropped
>>>
>>> Also relevant is that with zerocopy, the sender processes back off
>>> and report the same count as the receiver. Without zerocopy,
>>> both senders send at full speed, even if only 4200 packets from flow
>>> B arrive at the receiver.
>>>
>>> This is with the default virtio_net driver, so without napi-tx.
>>>
>>> It appears that the zerocopy notifications are pausing the guest.
>>> Will look at that now.
>>
>> It was indeed as simple as that. With 256 descriptors, queuing even
>> a hundred or so packets causes the guest to stall the device as soon
>> as the qdisc is installed.
>>
>> Adding this check
>>
>> +       in_use = nvq->upend_idx - nvq->done_idx;
>> +       if (nvq->upend_idx < nvq->done_idx)
>> +               in_use += UIO_MAXIOV;
>> +
>> +       if (in_use > (vq->num >> 2))
>> +               zcopy_used = false;
>>
>> has the desired behavior of reverting zerocopy requests to copying.
>>
>> Without this change, the result is, as previously reported, throughput
>> dropping to hundreds of packets per second on both flows.
>>
>> With the change, pps as observed for a few seconds at handle_tx is
>>
>>   zerocopy=165 copy=168435
>>   zerocopy=0 copy=168500
>>   zerocopy=65 copy=168535
>>
>> Both flows continue to send at a more or less normal rate, with only
>> sender B observing massive drops at the netem.
>>
>> With the queue removed the rate reverts to
>>
>>   zerocopy=58878 copy=110239
>>   zerocopy=58833 copy=110207
>>
>> This is not a 50/50 split, which implies that some packets from the
>> large packet flow are still converted to copying. Without the change,
>> the rate without the queue was 80k zerocopy vs 80k copy, so this
>> choice of (vq->num >> 2) appears too conservative.
>>
>> However, testing with (vq->num >> 1) was not as effective at mitigating
>> stalls. I did not save that data, unfortunately. I can run more tests on
>> fine-tuning this variable, if the idea sounds good.
>
>
> Looks like there are still two cases left:

To be clear, this patch is not intended to fix all issues. It is a small
improvement to avoid HoL blocking due to queued zerocopy skbs. (A rough
sketch of how the check could be folded into handle_tx is appended at
the end of this mail.)

The trade-off is that reverting to copying in these cases increases
cycle cost. I think that is a trade-off worth making compared to the
alternative drop in throughput. It would probably be good to be able to
measure this without kernel instrumentation: export counters similar to
net->tx_zcopy_err and net->tx_packets (though without the reset to zero
done in vhost_net_tx_packet).

> 1) sndbuf is not INT_MAX

You mean the case where the device stalls, later zerocopy notifications
are queued, but these are never cleaned in free_old_xmit_skbs, because
that requires a start_xmit and by now the (only) socket is out of
descriptors?

A watchdog would help somewhat. With tx-napi, this case cannot occur
either, as free_old_xmit_skbs no longer depends on a call to start_xmit.

> 2) tx napi is used for virtio-net

I am not aware of any issue specific to the use of tx-napi?

> 1) could be a corner case, and for 2) what you suggest here may not
> solve the issue since it still does in-order completion.

Somewhat tangential, but it might also help to break the in-order
completion processing in vhost_zerocopy_signal_used: complete all
finished descriptors between done_idx and upend_idx, and advance
done_idx only up to the oldest descriptor that has not yet completed.

In the test I ran, where the oldest descriptors are held in a queue and
all newer ones are tail-dropped, this would avoid blocking a full ring
of completions when only a small number (or just one) is actually
delayed.

Dynamic switching between copy and zerocopy using zcopy_used already
returns completions out of order, so this is not a huge leap.
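To make that last point concrete, below is a rough, untested sketch of
what an out-of-order vhost_zerocopy_signal_used could look like. The DMA
state macros and helpers (VHOST_DMA_IN_PROGRESS, VHOST_DMA_IS_DONE,
VHOST_DMA_FAILED_LEN, VHOST_DMA_CLEAR_LEN, vhost_net_tx_err,
vhost_add_used_and_signal_n) are the ones the current in-order code
already uses, paraphrased from memory, so treat this as an illustration
of the bookkeeping rather than a patch:

/* Sketch: signal every finished zerocopy descriptor between done_idx
 * and upend_idx, instead of stopping at the first entry that is still
 * in flight. done_idx only advances over the contiguous prefix of
 * already-signalled entries, so the slots of in-flight buffers stay
 * reserved.
 */
static void vhost_zerocopy_signal_used(struct vhost_net *net,
                                       struct vhost_virtqueue *vq)
{
        struct vhost_net_virtqueue *nvq =
                container_of(vq, struct vhost_net_virtqueue, vq);
        bool hole = false;
        int i;

        for (i = nvq->done_idx; i != nvq->upend_idx;
             i = (i + 1) % UIO_MAXIOV) {
                if (vq->heads[i].len == VHOST_DMA_IN_PROGRESS) {
                        /* Still in flight: newer entries may still be
                         * signalled, but done_idx must not move past
                         * this slot yet.
                         */
                        hole = true;
                        continue;
                }
                if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
                        if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
                                vhost_net_tx_err(net);
                        vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
                        /* Complete this descriptor even though older
                         * entries may still be outstanding.
                         */
                        vhost_add_used_and_signal_n(&vq->dev, vq,
                                                    &vq->heads[i], 1);
                }
                /* The entry is now (or was already) signalled. Advance
                 * done_idx only while no hole has been seen.
                 */
                if (!hole)
                        nvq->done_idx = (i + 1) % UIO_MAXIOV;
        }
}

Compared to the current code this signals each completion individually
rather than in one batched call; coalescing contiguous runs of completed
entries would be an obvious refinement.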
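And for reference, the sketch below shows how the in_use check quoted
above might be folded into the zcopy_used decision in handle_tx. The
helper name is made up and the surrounding condition is paraphrased
from memory from drivers/vhost/net.c, so again this is illustrative
only:

/* Hypothetical helper: count outstanding zerocopy descriptors and
 * report whether more than a quarter of the virtqueue is in flight,
 * so that a throttled flow falls back to copying instead of
 * head-of-line blocking the whole ring.
 */
static bool vhost_net_zcopy_backlogged(struct vhost_net_virtqueue *nvq,
                                       struct vhost_virtqueue *vq)
{
        int in_use = nvq->upend_idx - nvq->done_idx;

        if (nvq->upend_idx < nvq->done_idx)
                in_use += UIO_MAXIOV;

        return in_use > (vq->num >> 2);
}

/* In handle_tx(), roughly: */
        zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
                     && !vhost_net_zcopy_backlogged(nvq, vq)
                     && (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
                     && vhost_net_tx_select_zcopy(net);

The (vq->num >> 2) threshold is just the value from the experiment
above; as noted, it may turn out to be too conservative and needs more
tuning.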