Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

On Mon, 2017-08-21 at 23:10 -0400, Willem de Bruijn wrote:
> > > > Interesting, deadlock could be treated as a radical case of the
> > > > discussion here https://patchwork.kernel.org/patch/3787671/.
> > > > 
> > > > git grep tells more similar skb_orphan() cases. Do we need to change
> > > > them
> > > > all (or part)?
> > > 
> > > Most skb_orphan calls are not relevant to the issue of transmit delay.
> > 
> > 
> > Yes, but at least we should audit the ones in drivers/net.
> 
> Do you mean other virtual device driver transmit paths, like xen,
> specifically?
> 
> > > > Actually, we may meet similar issues at many other places (e.g netem).
> > > 
> > > Netem is an interesting case. Because it is intended to mimic network
> > > delay, at least in the case where it calls skb_orphan, it may make
> > > sense to release all references, including calling skb_zcopy_clear.
> > > 
> > > In general, zerocopy reverts to copy on all paths that may cause
> > > unbounded delay due to another process. Guarding against delay
> > > induced by the administrator is infeasible. It is always possible to
> > > just pause the nic. Netem is one instance of that, and not unbounded.
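Just to illustrate the netem point (a rough sketch only; I have not checked
the exact call site in sch_netem.c, and which value to pass as the second
argument when completing early is itself part of the open question):

#include <linux/skbuff.h>

/* Sketch: where netem orphans an skb to emulate delay, also drop the
 * zerocopy state so the original sender gets its completion right away
 * instead of after the artificial delay. */
static void netem_orphan_sketch(struct sk_buff *skb)
{
	/* "true" here reports the data as sent zerocopy; whether that is
	 * the right report when notifying early is an open question */
	skb_zcopy_clear(skb, true);
	skb_orphan(skb);	/* existing behaviour */
}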
> > 
> > 
> > The problem is, the admin may only delay the traffic on e.g. one interface,
> > but it actually delays or stalls all traffic inside a VM.
> 
> Understood. Ideally we can remove the HoL blocking cause of this,
> itself.
> 
> > > > Need
> > > > to consider a complete solution for this. Figuring out all places that
> > > > could
> > > > delay a packet is a method.
> > > 
> > > The issue described in the referenced patch seems like head of line
> > > blocking between two flows. If one flow delays zerocopy descriptor
> > > release from the vhost-net pool, it blocks all subsequent descriptors
> > > in that pool from being released, also delaying other flows that use
> > > the same descriptor pool. If the pool is empty, all transmission stops.
> > > 
> > > Reverting to copy tx when the pool reaches a low watermark, as the
> > > patch does, fixes this.
> > 
> > 
> > An issue of the referenced patch is that sndbuf could be smaller than low
> > watermark.
We cannot determine a proper low watermark, not only because of the sndbuf
size issue, but also because the upper vhost-net cannot directly see how many
descriptors are currently available in the virtio-net tx queue. That number
depends on the multiqueue settings and on other senders sharing the same tx
queue. Note that in the latter case, if those senders are transmitting
constantly, the deadlock cannot occur(*); but if they have only temporarily
filled some portion of the pool in the meantime, the low watermark does not
help.
(*: that is because the trigger is then reliable enough, in the sense I
mention below.)
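To be concrete about what such a watermark would have to check, here is a
rough sketch (the threshold and the helper are made up for illustration; the
real bookkeeping is the UIO_MAXIOV-sized upend_idx/done_idx ring in vhost-net):

#define VHOST_ZCOPY_LOW_WM	8	/* made-up threshold */

/* Illustrative only: report whether enough zerocopy slots remain, so that
 * handle_tx() could fall back to copy before stuck completions exhaust the
 * whole pool. */
static bool zcopy_slots_above_watermark(struct vhost_net_virtqueue *nvq)
{
	int in_flight = (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx)
			% UIO_MAXIOV;

	return UIO_MAXIOV - in_flight > VHOST_ZCOPY_LOW_WM;
}

The trouble is that no fixed threshold here can tell how many descriptors the
virtio-net tx queue underneath actually has left.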

With this in mind, let me briefly describe the possible deadlock I mentioned:
(1). vhost-net on the L1 guest cannot do any further sendmsg until the upper
layer posts new descriptors, which in turn depends only on the vhost-net
zerocopy callback firing and the used descriptors being added back.
(2). the vhost-net zerocopy callback depends only on the skb being freed on
the xmit path.
(3). the xmit path depends (possibly only) on vhost-net sendmsg.
As you can see, for this situation to arise it is enough that the L1
virtio-net reaches its limit before the L0 host processing catches up. The
vhost-net pool can then be almost full or almost empty; it does not matter.
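To spell out (2) a bit more: the only thing that returns a descriptor to the
vhost-net pool is the ubuf callback, and that callback only runs when the xmit
path finally frees the skb. Roughly (a simplified sketch, not the exact kernel
call chain, which goes through skb_release_data() and friends):

#include <linux/skbuff.h>

/* Sketch of dependency (2): descriptor release is tied solely to the skb
 * being freed on the xmit path. */
static void xmit_path_frees_skb(struct sk_buff *skb)
{
	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;

	if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) && uarg) {
		/* -> vhost_zerocopy_callback(), which marks the head done
		 *    and lets handle_tx() add it back as used */
		uarg->callback(uarg, true);
	}
}

If the xmit path never runs again, because without tx napi old skbs are only
freed from the xmit path itself, which in turn is driven by new vhost-net
sendmsg calls (3), the skb is never freed and nothing breaks the cycle.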

Also note that relaxing restriction (2) (and thus removing the dependency (3)
-> (1)) seems quite natural, but if it is based on interrupts it has to be
reliable enough to never miss the last trigger. We might need to modify the
virtio interrupt mitigation for the non-tx-napi case, but that is too costly,
and it is not fair to make it accommodate an upper-layer peculiarity.

So I think drivers that have a characteristic like (2) are worth auditing to
see whether we should revise them.
> > 
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > 
> > Yes, but as replied in the referenced thread, the Windows driver may treat
> > out-of-order completion as a bug.
> 
> Interesting. I missed that. Perhaps the zerocopy optimization
> could be gated on guest support for out of order completions.
> 
> > > On the point of counting copy vs zerocopy: the new msg_zerocopy
> > > variant of ubuf_info has a field to record whether a deep copy was
> > > made. This can be used with vhost-net zerocopy, too.
> > 
> > 
> > Just to make sure I understand: it's still not clear to me how to reuse
> > this for vhost-net, e.g. the zerocopy flag is in a union which is not used
> > by vhost_net.
> 
> True, but that is not set in stone. I went back and forth on that when
> preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
> with msg_zerocopy"). The field can be moved outside the union and
> initialized in the other zerocopy paths.
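
Thanks. Just to make sure we mean the same thing, roughly like the following,
I suppose (a sketch only, other fields omitted; the actual layout and which
paths need to initialize the flag would be for the real patch to decide):

/* include/linux/skbuff.h, abridged sketch: pull the zerocopy flag out of
 * the msg_zerocopy-specific union member so that vhost-net, which uses the
 * desc/ctx member, can also record whether a deep copy was made. */
struct ubuf_info {
	void (*callback)(struct ubuf_info *, bool zerocopy_success);
	union {
		struct {
			unsigned long desc;
			void *ctx;
		};
		struct {
			u32 id;
			u16 len;
			u32 bytelen;
		};
	};
	u8 zerocopy:1;		/* moved out of the union; initialized in
				 * the other zerocopy paths as well */
	refcount_t refcnt;
	/* ... */
};

That way vhost-net could count copied vs. zerocopied completions the same way
msg_zerocopy does.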

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization



