Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type

On Mon, Mar 22, 2021 at 9:51 AM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
>
> On Thu, Mar 18, 2021 at 10:59:20AM -0700, Jiang Wang . wrote:
> > On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 16, 2021 at 09:56:44PM +0000, jiang.wang wrote:
> > > > Add supports for datagram type for virtio-vsock. Datagram
> > > > sockets are connectionless and unreliable. To avoid contention
> > > > with stream and other sockets, add two more virtqueues and
> > > > a new feature bit to identify if those two new queues exist or not.
> > > >
> > > > Also add descriptions for resource management of datagram, which
> > > > does not use the existing credit update mechanism associated with
> > > > stream sockets.
> > > >
> > > > Signed-off-by: Jiang Wang <jiang.wang@xxxxxxxxxxxxx>
> > > > ---
> > > >  virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> > > >  1 file changed, 62 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> > > > index da7e641..a2ff0a4 100644
> > > > --- a/virtio-vsock.tex
> > > > +++ b/virtio-vsock.tex
> > > > @@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> > > >
> > > >  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
> > > >  \begin{description}
> > > > +\item[0] stream rx
> > > > +\item[1] stream tx
> > > > +\item[2] dgram rx
> > > > +\item[3] dgram tx
> > > > +\item[4] event
> > > > +\end{description}
> > > > +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DGRAM is set. Otherwise, it
> > > > +only uses 3 queues, as follows.
> > > > +
> > > > +\begin{description}
> > > >  \item[0] rx
> > > >  \item[1] tx
> > > >  \item[2] event
> > > >  \end{description}
> > > >
> > > > +
> > > >  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
> > > >
> > > > -There are currently no feature bits defined for this device.
> > > > +\begin{description}
> > > > +\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for the datagram socket type.
> > > > +\end{description}
> > >
> > > By convention feature bits are named VIRTIO_<device>_F_<feature>. Please
> > > add the "_F_".
> > >
> > Sure.
> >
> > > >
> > > >  \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
> > > >
> > > > @@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
> > > >       le32 flags;
> > > >       le32 buf_alloc;
> > > >       le32 fwd_cnt;
> > > > +     le64 timestamp;
> > >
> > > Adding this field breaks old devices and drivers. Please make this field
> > > conditional on the dgram socket type or the VIRTIO_VSOCK_F_DGRAM feature
> > > bit.
> > >
> > > Also, are all the other fields still used with dgram sockets? Maybe you
> > > can use a union instead to reuse some space?
> >
> > I will make this new field depending on the dgram socket type.
> >
> > For the union idea, could I change the order of those fields? Dgram does not
> > use the flags and fwd_cnt fields, and I want to put them together in a union
> > with a le64 timestamp.
> > Something like:
> >
> > struct virtio_vsock_hdr {
> >         le64 src_cid;
> >         le64 dst_cid;
> >         le32 src_port;
> >         le32 dst_port;
> >         le32 len;
> >         le16 type;
> >         le16 op;
> >         le32 buf_alloc;
> >         union {
> >                 struct {
> >                         le32 flags;
> >                         le32 fwd_cnt;
> >                 } stream;
> >                 le64 dgram_timestamp;
> >         } internal; /* or a better name */
> > };
>
> Unfortunately reordering the fields would break existing devices and
> drivers since they access flags and fwd_cnt at a specific offset in
> struct virtio_vsock_hdr.
>
> Thinking more about the union idea, I think it's more trouble than it is
> worth. You could just write dgram_timestamp accessor functions that
> split/join the le64 value into the existing le32 flags and fwd_cnt
> fields.

Sure. Maybe we don't need this field at all. Will discuss it at the
end of the email.
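
If we do keep it, a minimal sketch of the split/join accessors you describe
might look like this (user-space style; the helper names are made up, and
kernel code would use cpu_to_le32()/le32_to_cpu() instead of endian.h):

#include <stdint.h>
#include <endian.h>

/* Assumes flags and fwd_cnt in struct virtio_vsock_hdr are 32-bit fields
 * holding little-endian values, as in the spec. */
static inline void vsock_hdr_set_dgram_timestamp(struct virtio_vsock_hdr *hdr,
                                                 uint64_t ts)
{
        hdr->flags   = htole32((uint32_t)ts);         /* low 32 bits  */
        hdr->fwd_cnt = htole32((uint32_t)(ts >> 32)); /* high 32 bits */
}

static inline uint64_t vsock_hdr_get_dgram_timestamp(const struct virtio_vsock_hdr *hdr)
{
        return ((uint64_t)le32toh(hdr->fwd_cnt) << 32) | le32toh(hdr->flags);
}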

> >
> > > >  };
> > > >  \end{lstlisting}
> > > >
> > > > @@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
> > > >  packets.  With additional resources, it becomes possible to process incoming
> > > >  packets even when outgoing packets cannot be sent.
> > > >
> > > > -Eventually even the additional resources will be exhausted and further
> > > > +For stream type, eventually even the additional resources will be exhausted and further
> > > >  processing is not possible until the other side processes the virtqueue that
> > > >  it has neglected.  This stop to processing prevents one side from causing
> > > >  unbounded resource consumption in the other side.
> > > >
> > > > +For datagram type, the additional resources have a fixed size limit to prevent
> > > > +unbounded resource consumption.
> > > > +
> > > >  \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> > > >
> > > >  The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> > > > @@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
> > > >  consists of a (cid, port number) tuple. The header fields used for this are
> > > >  \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> > > >
> > > > -Currently only stream sockets are supported. \field{type} is 1 for stream
> > > > -socket types.
> > > > +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
> > > > +socket types. \field{type} is 3 for dgram socket types.
> > > >
> > > >  Stream sockets provide in-order, guaranteed, connection-oriented delivery
> > > >  without message boundaries.
> > > >
> > > > +Datagram socekts provide connectionless unreliable messages of
> > >
> > > s/socekts/sockets/
> > >
> > > > +a fixed maximum length.
> > > > +
> > > >  \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
> > > >  \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> > > >  stream sockets. The guest and the device publish how much buffer space is
> > > > @@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
> > > >  previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
> > > >  communicating updates any time a change in buffer space occurs.
> > > >
> > > > +For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream
> > > > +sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or
> > > > +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check if
> > > > +the \field{buf_alloc} of the peer is full or not. If it is full, the packet
> > > > +will be dropped. To limit resouce usage of the sender, \field{dgram_tx_bytes}
> > >
> > > s/resouce/resource/
> > >
> > > \field{dgram_tx_bytes} is not included in any struct definition?
> >
> > dgram_tx_bytes is a field in struct vhost_vsock and struct virtio_vsock.
> > But I don't see it mentioned in the spec.
>
> The VIRTIO specification does not depend on the internals of device
> implementations. Someone reading the spec must be able to implement a
> driver or a device without knowledge of the Linux virtio_vsock or
> vhost_vsock implementations.
>
> If dgram_tx_bytes is a new concept that device implementations must
> incorporate, then please describe it fully in the spec.
>

Sure. Will likely drop this field. Will discuss more at the end of the email.

> >
> > > > +is used for each VM and host. Only payload bytes are counted and header bytes are not
> > >
> > > Please use the VIRTIO specification terms "driver" and "device" instead
> > > of VM and host.
> > Sure.
> >
> > > It's not clear to me what "used" means. What exactly do the driver and
> > > device need to do?
> >
> > In the Linux and KVM case, I added dgram_tx_bytes to vhost_vsock and
> > virtio_vsock. Then in virtio_transport_send_pkt() and
> > vhost_transport_send_pkt(), the code first increases and checks
> > dgram_tx_bytes. If dgram_tx_bytes is no less than
> > VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE, the function frees the pkt with
> > virtio_transport_free_pkt() and then returns. dgram_tx_bytes is
> > decreased in vhost_transport_do_send_pkt(), just before each
> > virtio_transport_free_pkt() call. It is similar for the device case.
> >
> > The goal is to limit the memory usage of vsock. Since dgram does not
> > use credit, the sender can send far more packets than the receiver can
> > handle. If we don't add the above check, the sender can use lots of
> > memory and cause an OOM in the Linux kernel.
> >
> > I am not sure whether these details are too much for a spec. But I
> > think it would be good to add some description. Or we can just say
> > that the sender (device or driver) must not use unlimited resources.
>
> I'm not sure why this mechanism is needed since the virtqueue has a
> fixed size and the receiver has socket buffers (rcvbuf) of fixed sizes.
> Memory usage is bounded so I don't understand how OOM can occur.
>
> I would have expected the following:
> 1. When sending, drop VIRTIO_VSOCK_OP_RW packets instead of sending if
>    the virtqueue is full.
> 2. When receiving, drop VIRTIO_VSOCK_OP_RW packets if the socket rcvbuf
>    is full.
>
> That's all. No additional accounting mechanism is necessary for
> unreliable delivery.
>

You are right. I checked again and the OOM problem is likely a problem in my
implementation. I will drop this additional accounting.

Following are more details about my implementation problem. I reused the
stream send_pkt functions for datagram too. When a virtqueue is full, the
code does not drop dgram packets, but adds them back to the send_pkt_list
(because the stream type does that). I think I should fix that part to drop
the dgram packets instead of adding a new accounting mechanism, as in the
sketch below. Will give it a try.
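
A rough sketch of the intended fix (hypothetical helper names, not the
actual virtio_transport code):

/* Dgram send path: if the tx virtqueue is full, drop the packet instead
 * of re-queuing it on send_pkt_list the way the stream code does. */
static int dgram_send_pkt(struct vsock_dev *dev, struct vsock_pkt *pkt)
{
        if (virtqueue_full(dev->dgram_tx_vq)) {
                vsock_pkt_free(pkt); /* unreliable delivery: drop on full */
                return 0;            /* dropping is not an error for dgram */
        }
        return virtqueue_add_pkt(dev->dgram_tx_vq, pkt);
}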

> > > > +included. If \field{dgram_tx_bytes} is equal to VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> > > > +the send will fail.
> > >
> > > Does the device reject dgram packets or does the driver refuse to send
> > > more dgram packets?
> >
> > The sender will drop the packets. This is independent of the rx side.
> > If the device is the sender, the device will drop the packets; if the
> > driver is the sender, the driver will drop them.
> >
> > > > +
> > > > +For dgram, since \field{dgram_tx_byte} is shared within a VM or host, a tx_timeout and
> > > > +a timer are used for each outgoing packet. If a packet is not delivered within
> > > > +tx_timeout, it will be dropped to make free space for other dgram sockets.
> > >
> > > What does this mean? Does the timeout cause the device to drop packets
> > > that haven't been handled yet (e.g. read by a host userspace process)?
> >
> > This is on the sender side (regardless of whether it is the driver or
> > the device). It is related to the above-mentioned dgram_tx_bytes, which
> > only ensures that once memory usage reaches a limit, the sender will
> > not consume more memory. But since dgram_tx_bytes is shared among
> > different dgram sockets for a device or the driver, a misbehaving or
> > malicious dgram socket can potentially use all the memory and block
> > other dgram sockets from sending any new packets. To handle this case,
> > the tx_timeout is used to free some space from the sender's memory
> > pool, so that other dgram sockets can continue to send packets.
>
> Why is the timestamp included in the header? It seems to only be used by
> the side that is sending so it's not obvious to me that it needs to be
> sent to the other side.
>
> How does the timestamp ensure fairness? It seems like a malicious sender
> could simply send more packets in order to exhaust all available memory
> again when its old packets expire.
>
Since I will drop the additional accounting, I will probably drop this
timestamp too. Will describe more at the end of the email.

> >
> > > POSIX Sockets have a receive socket buffer (rcvbuf) that is used for
> > > memory accounting and dropping packets. Operating systems implementing
> > > POSIX Sockets would use that to decide when incoming packets are
> > > dropped. It seems like dgram_tx_byte does something similar at the
> > > device level and I'm not sure why it's necessary?
> >
> > dgram_tx_byte is for the tx (sender) side. The receive buffer you mentioned
> > is on the receiving side.
>
> POSIX Sockets have a send socket buffer (sndbuf) for that. The sender
> should use a similar approach to how UDP sockets work.
>
> > Thanks for all the comments. I will fix those spelling errors and use
> > a spell check
> > next time. I hope I answered all your questions. Please let me know if I missed
> > any questions or anything still not clear.
>
> To summarize:
>
> 1. Why is dgram_tx_bytes needed since the virtqueue already has a finite
>    size?

The problem is in my implementation as mentioned above. Will drop this.

>
> 2. Does the timestamp field provide fairness between senders? I don't
>    understand its purpose.
>

The timestamp is associated with my additional accounting. I will drop
this field.

> 3. vsock dgram should probably follow the UDP approach to memory
>    accounting instead of inventing new mechanisms. Applications and
>    developers will probably be happiest with UDP-like behavior.
>

Sure. I briefly checked UDP. I see SO_SNDBUF, which is per socket.
Also, I found net.ipv4.udp_mem, which is a global limit.
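
For reference, the per-socket limit is just the standard setsockopt() knob
(minimal sketch; Linux internally doubles the configured value to account
for bookkeeping overhead):

#include <sys/socket.h>

/* Cap how much unsent data a single socket may hold. */
static int set_sndbuf(int fd, int bytes)
{
        return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
}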

After dropping my additional accounting, I think there is still a question
of whether we want to protect the shared dgram virtqueue against bad dgram
sockets, and if so, how to do it and what to write in the spec. For example,
if a bad dgram socket keeps sending lots of data to a port that no socket is
receiving on, then those packets will only be dropped by the receiver (device
or driver), or when the virtqueue is full. Other good dgram sockets will
compete with this bad one on the tx side. In my current implementation, it
depends on how the Linux scheduler schedules those processes. The bad one is
unlikely to keep the virtqueue full all the time and completely block other
good dgram sockets, because the other end is still receiving and cleaning the
virtqueue. But it will waste a lot of resources. I think that is fine and we
don't need to add strict requirements about it in the spec.

I don't know whether UDP has a situation similar to the shared virtqueue.
The net.ipv4.udp_mem setting looks like just global accounting. If you have
any suggestions about this, please let me know.

Thank you!

> Stefan
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


