Hi Jason, On 2018/06/29 18:30, Jason Wang wrote: > On 2018年06月29日 16:09, Toshiaki Makita wrote: ... >> To fix this, poll the work instead of enabling notification when >> busypoll is interrupted by something. IMHO signal_pending() and >> vhost_has_work() are kind of interruptions rather than signals to >> completely cancel the busypoll, so let's run busypoll after the >> necessary work is done. In order to avoid too long busyloop due to >> interruption, save the endtime in vq field and use it when reentering >> the work function. > > I think we don't care long busyloop unless e.g tx can starve rx? I just want to keep it user-controllable. Unless memorizing it busypoll can run unexpectedly long. >> I observed this problem makes tx performance poor but does not rx. I >> guess it is because rx notification from socket is not that heavy so >> did not impact the performance even when we failed to mask the >> notification. > > I think the reason is: > > For tx, we may only process used ring under handle_tx(). Busy polling > code can not recognize this case. > For rx, we call peek_head_len() after exiting busy loop, so if the work > is for rx, it can be processed in handle_rx() immediately. Sorry but I don't see the difference. Tx busypoll calls vhost_get_vq_desc() immediately after busypoll exits in vhost_net_tx_get_vq_desc(). The scenario I described above (in 2nd paragraph) is a bit simplified. To be clearer what I think is happening is: 1. handle_tx() runs busypoll with notification enabled (due to zerocopy callback or something). 2. Guest produces a packet in vring. 3. handle_tx() busypoll detects the produced packet and get it. 4. While vhost is processing the packet, guest kicks vring because it produced the packet. Vring notification is disabled automatically by event_idx and tx work is queued. 5. After processing the packet vhost enters busyloop again, but detects tx work and exits busyloop immediately. Then handle_tx() exits after enabling the notification. 6. Because the queued work was tx, handle_tx() is called immediately again. handle_tx() runs busypoll with notification enabled. 7. Repeat 2-6. > >> Anyway for consistency I fixed rx routine as well as tx. >> >> Performance numbers: >> >> - Bulk transfer from guest to external physical server. >> [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server] > > Just to confirm in this case since zerocopy is enabled, we are in fact > use the generic XDP datapath? For some reason zerocopy was not applied for most packets, so in most cases driver XDP was used. I was going to dig into it but not yet. > >> - Set 10us busypoll. >> - Guest disables checksum and TSO because of host XDP. >> - Measured single flow Mbps by netperf, and kicks by perf kvm stat >> (EPT_MISCONFIG event). >> >> Before After >> Mbps kicks/s Mbps kicks/s >> UDP_STREAM 1472byte 247758 27 >> Send 3645.37 6958.10 >> Recv 3588.56 6958.10 >> 1byte 9865 37 >> Send 4.34 5.43 >> Recv 4.17 5.26 >> TCP_STREAM 8801.03 45794 9592.77 2884 > > Impressive numbers. > >> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@xxxxxxxxxxxxx> >> --- ... >> +static bool vhost_busy_poll_interrupted(struct vhost_dev *dev) >> +{ >> + return unlikely(signal_pending(current)) || vhost_has_work(dev); >> } >> static void vhost_net_disable_vq(struct vhost_net *n, >> @@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct >> vhost_net *net, >> if (r == vq->num && vq->busyloop_timeout) { >> preempt_disable(); >> - endtime = busy_clock() + vq->busyloop_timeout; >> - while (vhost_can_busy_poll(vq->dev, endtime) && >> - vhost_vq_avail_empty(vq->dev, vq)) >> + if (vq->busyloop_endtime) { >> + endtime = vq->busyloop_endtime; >> + vq->busyloop_endtime = 0; > > Looks like endtime may be before current time. So we skip busy loop in > this case. Sure, I'll add a condition. > >> + } else { >> + endtime = busy_clock() + vq->busyloop_timeout; >> + } >> + while (vhost_can_busy_poll(endtime)) { >> + if (vhost_busy_poll_interrupted(vq->dev)) { >> + vq->busyloop_endtime = endtime; >> + break; >> + } >> + if (!vhost_vq_avail_empty(vq->dev, vq)) >> + break; >> cpu_relax(); >> + } >> preempt_enable(); >> r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), >> out_num, in_num, NULL, NULL); ... >> @@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct >> vhost_net_virtqueue *nvq) >> static int vhost_net_rx_peek_head_len(struct vhost_net *net, >> struct sock *sk) >> { >> - struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX]; >> - struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX]; >> - struct vhost_virtqueue *vq = &nvq->vq; >> + struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX]; >> + struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX]; >> + struct vhost_virtqueue *rvq = &rnvq->vq; >> + struct vhost_virtqueue *tvq = &tnvq->vq; > > NIt: I admit the variable name is confusing. It would be better to do > the renaming in a separate patch to ease review. OK. > >> unsigned long uninitialized_var(endtime); >> - int len = peek_head_len(rvq, sk); >> + int len = peek_head_len(rnvq, sk); >> - if (!len && vq->busyloop_timeout) { >> + if (!len && tvq->busyloop_timeout) { >> /* Flush batched heads first */ >> - vhost_rx_signal_used(rvq); >> + vhost_rx_signal_used(rnvq); >> /* Both tx vq and rx socket were polled here */ >> - mutex_lock_nested(&vq->mutex, 1); >> - vhost_disable_notify(&net->dev, vq); >> + mutex_lock_nested(&tvq->mutex, 1); >> + vhost_disable_notify(&net->dev, tvq); >> preempt_disable(); >> - endtime = busy_clock() + vq->busyloop_timeout; >> + if (rvq->busyloop_endtime) { >> + endtime = rvq->busyloop_endtime; >> + rvq->busyloop_endtime = 0; >> + } else { >> + endtime = busy_clock() + tvq->busyloop_timeout; >> + } >> - while (vhost_can_busy_poll(&net->dev, endtime) && >> - !sk_has_rx_data(sk) && >> - vhost_vq_avail_empty(&net->dev, vq)) >> + while (vhost_can_busy_poll(endtime)) { >> + if (vhost_busy_poll_interrupted(&net->dev)) { >> + rvq->busyloop_endtime = endtime; >> + break; >> + } >> + if (sk_has_rx_data(sk) || >> + !vhost_vq_avail_empty(&net->dev, tvq)) >> + break; >> cpu_relax(); > > Actually, I plan to poll guest rx refilling as well here to avoid rx > kicks. Want to draft another patch to do it as well? It's as simple as > add a vhost_vq_avail_empty for rvq above. OK. will try it. > >> + } >> preempt_enable(); >> - if (!vhost_vq_avail_empty(&net->dev, vq)) >> - vhost_poll_queue(&vq->poll); >> - else if (unlikely(vhost_enable_notify(&net->dev, vq))) { >> - vhost_disable_notify(&net->dev, vq); >> - vhost_poll_queue(&vq->poll); >> + if (!vhost_vq_avail_empty(&net->dev, tvq)) { >> + vhost_poll_queue(&tvq->poll); >> + } else if (unlikely(vhost_enable_notify(&net->dev, tvq))) { >> + vhost_disable_notify(&net->dev, tvq); >> + vhost_poll_queue(&tvq->poll); >> } >> - mutex_unlock(&vq->mutex); >> + mutex_unlock(&tvq->mutex); >> - len = peek_head_len(rvq, sk); >> + len = peek_head_len(rnvq, sk); >> } >> return len; ... >> @@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net) >> goto out; >> } >> } >> - vhost_net_enable_vq(net, vq); >> + if (unlikely(vq->busyloop_endtime)) { >> + /* Busy poll is interrupted. */ >> + vhost_poll_queue(&vq->poll); >> + } else { >> + vhost_net_enable_vq(net, vq); >> + } > > This is to reduce the rx wake ups. Better with another patch. > > So I suggest to split the patches into: > > 1 better naming of variable of vhost_net_rx_peek_head_len(). > 2 avoid tx kicks (most of what this patch did) > 3 avoid rx wakeups (exactly the above 6 lines did) > 4 avoid rx kicks OK, I'll reorganize the patch. Thank you for your feedback! > > Btw, tonghao is doing some refactoring of busy polling as well. Depends > on the order of being merged, one of you may need rebasing. Thanks for sharing. I'll take a look. -- Toshiaki Makita