For some reason your mails in this thread only appear in the Gmail web UI and not in the IMAP view of my mailbox (my own and Michael's mails are fine).
So I'm replying via the web interface; sorry for the inevitable formatting mess :-/
I've CC'd another mailbox in the hope of getting your mails into that IMAP folder instead/as well, so I can avoid this next time.
On 12 April 2016 at 14:59, Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > >
> > > 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
How did you trigger this sequence? I'd like to reproduce it.
Nothing magic. I've written some logging into my backend and captured the result for a simple backend initiated connection.
In the log, "TX" and "RX" indicate the thread doing the processing ("TX" being the one which processes the guest's TX ring, i.e. data going from the guest to the host). "<=" marks a buffer going from guest to host and "=>" one going from host to guest. Note that guest-to-host replies are queued synchronously by the TX thread onto the RX ring, which is why the somewhat odd-looking "TX: =>" combination can occur; a host-initiated connection is also set up from the TX thread in the same way.
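For reference, the fields printed in the log map onto the packet header from the spec proposal. A rough sketch of the layout (taken from my reading of the draft, so field names and exact layout are assumptions against v5, not authoritative):

```c
#include <stdint.h>

/* Sketch of the virtio-vsock packet header the log fields map onto.
 * SRC/DST in the log are cid.port pairs; LEN, TYPE, OP, FLAGS,
 * BUF_ALLOC and FWD_CNT correspond directly to the fields below.
 * All fields are little-endian on the wire. */
struct virtio_vsock_hdr {
    uint64_t src_cid;
    uint64_t dst_cid;
    uint32_t src_port;
    uint32_t dst_port;
    uint32_t len;        /* payload bytes following the header (0 for control ops) */
    uint16_t type;       /* 1 = stream socket */
    uint16_t op;         /* operation, see below */
    uint32_t flags;      /* for OP_SHUTDOWN: bit 0 = SHUTDOWN_RX, bit 1 = SHUTDOWN_TX */
    uint32_t buf_alloc;  /* sender's total receive buffer space */
    uint32_t fwd_cnt;    /* bytes the sender has consumed from its receive buffer */
} __attribute__((packed));

/* Operation codes as they appear in the trace below. */
enum {
    VIRTIO_VSOCK_OP_REQUEST       = 1,
    VIRTIO_VSOCK_OP_RESPONSE      = 2,
    VIRTIO_VSOCK_OP_RST           = 3,
    VIRTIO_VSOCK_OP_SHUTDOWN      = 4,
    VIRTIO_VSOCK_OP_RW            = 5,
    VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6,
};
```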
The trace is of a simple request/response exchange (each fits in one buffer in its direction); the lines without an "?X:" prefix are my annotations/guesses as to what is going on:
TX: =>SRC:00000002.00010002 DST:00000003.00000948
TX: LEN:00000000 TYPE:0001 OP:1=REQUEST
TX: FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:2=RESPONSE
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:00000000
REQUEST + RESPONSE == channel opened successfully
RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX: LEN:0000005e TYPE:0001 OP:5=RW
RX: FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000
Host sends a request to the guest
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:6=CREDIT_UPDATE
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest replies with a credit update
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000091 TYPE:0001 OP:5=RW
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest replies with the answer to the request
RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
RX: FLAGS:00000002 BUF_ALLOC:00008000 FWD_CNT:00000091
The host has sent its only request, so I suppose the host app did shutdown(SHUT_WR) and the host therefore sends SHUTDOWN_TX.
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000001 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest SHUTDOWN_RX. I'm not sure whether this is a direct kernel response to the SHUTDOWN_TX or whether the application inside the guest saw EOF when reading the socket and did the corresponding shutdown(SHUT_RD).
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000002 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest SHUTDOWN_TX. I presume that, having sent the only response it is going to send, it then does shutdown(SHUT_WR).
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000003 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest shuts down both directions.
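As an aside on the BUF_ALLOC/FWD_CNT fields in the trace: as I understand the proposal they implement credit-based flow control, where a sender may have at most buf_alloc - (tx_cnt - fwd_cnt) bytes outstanding. A minimal sketch of the bookkeeping (struct and names are mine, not from the spec or driver):

```c
#include <stdint.h>

/* Flow-control bookkeeping sketch (names are mine, not from the spec).
 * Each side advertises buf_alloc (its receive buffer size) and fwd_cnt
 * (bytes consumed from that buffer so far).  A sender may have at most
 *   buf_alloc - (tx_cnt - fwd_cnt)
 * bytes in flight, where tx_cnt counts bytes it has sent on the connection. */
struct credit {
    uint32_t tx_cnt;          /* bytes we have sent so far */
    uint32_t peer_buf_alloc;  /* peer's advertised receive buffer size */
    uint32_t peer_fwd_cnt;    /* peer's advertised consumed-byte count */
};

/* Bytes we may still send without overrunning the peer's buffer. */
static uint32_t credit_available(const struct credit *c)
{
    return c->peer_buf_alloc - (c->tx_cnt - c->peer_fwd_cnt);
}
```

In the trace, after the host's 0x5e-byte RW the guest's CREDIT_UPDATE carries FWD_CNT 0x5e, which restores the host's available credit to the guest's full BUF_ALLOC of 0x40000.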
Perhaps the guest end is turning shutdown(foo) directly into a vsock message without OR-ing in the current state?
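If that guess is right, the fix on the guest side would be to accumulate the sticky shutdown state and send the OR of all bits seen so far, rather than translating each shutdown() argument directly. A sketch of the idea (helper and struct names are hypothetical, not the actual driver code):

```c
#include <stdint.h>

/* Flag values matching the trace: FLAGS:00000001 = RX, 00000002 = TX. */
#define VIRTIO_VSOCK_SHUTDOWN_RX  1u  /* we will not receive any more data */
#define VIRTIO_VSOCK_SHUTDOWN_TX  2u  /* we will not send any more data */

struct vsock_conn {
    uint32_t shutdown_flags;  /* sticky: bits are only ever added, never cleared */
};

/* OR the new direction(s) into the connection's accumulated state and
 * return the flags to put in the OP_SHUTDOWN packet. */
static uint32_t vsock_shutdown(struct vsock_conn *c, uint32_t how)
{
    c->shutdown_flags |= how;
    return c->shutdown_flags;
}
```

With the accumulated value, the observed RX, TX, RX|TX sequence would instead be RX then RX|TX, and a packet could even be suppressed entirely when the bits are unchanged.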
> > > I originally had my backend close things down at #2, however this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > > If these bits are set and there are no more virtqueue buffers
> > > pending the socket is disconnected.
> > >
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > >
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> >
> > #2 is odd. The shutdown bits are sticky so they cannot be cleared once
> > set. I would have expected just #1 and #3. The behavior you observe
> > looks like a bug.
> >
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection. "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application. If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.
>
> Yes but you also wrote:
> If these bits are set and there are no more virtqueue buffers
> pending the socket is disconnected.
>
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now? Maybe destination
> should send RST at that point?
You are right, the source/destination address could be reused while the
remote still has the connection in their table. Connection
establishment would then fail with a RST reply.
I can think of two solutions:
1. Implementations must remove connections from their table as soon as
SHUTDOWN_TX|SHUTDOWN_RX is received. This way the source/destination
address tuple can be reused immediately, i.e. new connections with
the same source/destination would be possible while an application is
still draining the receive buffers of an old connection.
2. Extend the connection lifecycle so that an A->B
SHUTDOWN_TX|SHUTDOWN_RX must be followed by a B->A RST to close
a connection. This way the source/destination address is only in
use by one connection at a time.
Option #2 seems safer because there is no overlap in source/destination
address usage.
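For what it's worth, option #2 would look something like this in an implementation's connection table (state and function names are mine, purely illustrative):

```c
/* Connection lifecycle sketch for option #2 (states/names are mine).
 * The (src, dst) address tuple stays in the table, and thus cannot be
 * reused, until the peer's RST confirms the close. */
enum conn_state {
    CONN_ESTABLISHED,
    CONN_CLOSING,   /* sent SHUTDOWN_TX|SHUTDOWN_RX, waiting for peer's RST */
    CONN_CLOSED,    /* RST received; tuple may now be removed and reused */
};

static enum conn_state on_send_shutdown_both(enum conn_state s)
{
    return s == CONN_ESTABLISHED ? CONN_CLOSING : s;
}

static enum conn_state on_recv_rst(enum conn_state s)
{
    /* Only now is it safe to remove the tuple from the table. */
    (void)s;
    return CONN_CLOSED;
}
```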
I agree (I hadn't seen this when I replied to Michael's similar suggestion).
Ian. (sorry again for the formatting)
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization