> -----Original Message-----
> From: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Sent: Friday, May 10, 2019 8:57 PM
> To: Sunil Muthuswamy <sunilmut@xxxxxxxxxxxxx>; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>;
> Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>; Sasha Levin <sashal@xxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>;
> Michael Kelley <mikelley@xxxxxxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close
>
> > From: Sunil Muthuswamy <sunilmut@xxxxxxxxxxxxx>
> > Sent: Wednesday, May 8, 2019 4:11 PM
> >
> > Currently, when a hvsock socket is closed, the socket is shutdown and
> > immediately a RST is sent. There is no wait for the FIN packet to arrive
> > from the other end. This can lead to data loss since the connection is
> > terminated abruptly. This can manifest easily in cases of a fast guest
> > hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> > not following the proper STREAM(TCP) closing handshake mechanism.
>
> Hi Sunil,
> It looks to me the above description is inaccurate.
>
> In the upstream Linux kernel, closing a hv_sock file descriptor may hang
> in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
> the connection is closed. This is bad and should be fixed, but I don't think
> the current code can cause data loss: when Linux calls hvs_destruct() ->
> vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
> -> vmbus_close() to close the channel, Linux knows the host app has
> already called close(), and normally that means the host app has
> received all the data from the connection.
>
> BTW, technically speaking, in hv_sock there is no RST packet, while there
> is indeed a payload_len==0 packet, which is similar to TCP FIN.
>
> I think by saying "a RST is sent" you mean Linux VM closes the channel.
>
> > The fix involves adding support for the delayed close of hvsock, which is
> > in-line with other socket providers such as virtio.
>
> With this "delayed close" patch, Linux's close() won't hang until the host
> also closes the connection. This is good!
>

The next version of the patch will only focus on implementing the delayed
(or background) close logic. I will update the title and the description of
the next version of the patch to reflect the change more accurately.

> > While closing, the
> > socket waits for a constant timeout, for the FIN packet to arrive from the
> > other end. On timeout, it will terminate the connection (i.e. a RST).
>
> As I mentioned above, I suppose the "RST" means Linux closes the channel.
>
> When Linux closes a connection, the FIN packet is written into the shared
> guest-to-host channel ringbuffer immediately, so the host is able to see it
> immediately, but the real question is: what if the host kernel and/or host app
> can not (timely) receive the data from the ringbuffer, including the FIN?
>
> Does the host kernel guarantee it *always* timely fetches/caches all the
> data from a connection, even if the host app has not accept()'d the
> connection, or the host app is reading from the connection too slowly?
>
> If the host doesn't guarantee that, then even with this patch there is still
> a chance Linux can time out, and close the channel before the host
> finishes receiving all the data.
>

The TCP protocol does not guarantee that all the data gets delivered,
especially during close. Applications are required to mitigate this by using
a combination of shutdown() and a subsequent read() on both the client and
server side.
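
For illustration, here is a minimal userspace sketch of that pattern (this is
not part of the patch; the descriptor is assumed to be any connected stream
socket, e.g. an AF_VSOCK/hv_sock connection, and the buffer size is arbitrary):

/*
 * Graceful application-level close: half-close the send side, then drain
 * the socket until the peer's EOF arrives so that no in-flight data is
 * silently dropped, and only then close the descriptor.
 */
#include <sys/socket.h>
#include <unistd.h>

static int graceful_close(int fd)
{
	char buf[256];

	if (shutdown(fd, SHUT_WR) < 0)	/* send our "FIN" */
		return close(fd);

	/* read() returning 0 means the peer has closed its side. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	return close(fd);
}

The reader side does the mirror image: it keeps calling read() until it sees
EOF, and only then calls close(), so the writer's pending data is consumed.
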
> I'm curious how Windows guest implements the "async close"?
> Does Windows guest also use the same timeout strategy here? If yes,
> what's the timeout value used?
>

Windows also implements the delayed close logic in a similar fashion. You
can look up the MSDN article on 'graceful shutdown'.

> > diff --git a/net/vmw_vsock/hyperv_transport.c
> > b/net/vmw_vsock/hyperv_transport.c
> > index a827547..62b986d 100644
>
> Sorry, I need more time to review the rest of the patch. Will try to reply ASAP.
>
> > -static int hvs_update_recv_data(struct hvsock *hvs)
> > +static int hvs_update_recv_data(struct vsock_sock *vsk)
> >  {
> >  	struct hvs_recv_buf *recv_buf;
> >  	u32 payload_len;
> > +	struct hvsock *hvs = vsk->trans;
> >
> >  	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> >  	payload_len = recv_buf->hdr.data_size;
> > @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
> >  	if (payload_len > HVS_MTU_SIZE)
> >  		return -EIO;
> >
> > -	if (payload_len == 0)
> > +	/* Peer shutdown */
> > +	if (payload_len == 0) {
> > +		struct sock *sk = sk_vsock(vsk);
> >  		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
> > +		sk->sk_state_change(sk);
> > +	}
>
> Can you please explain why we need to call this sk->sk_state_change()?
>
> When we call hvs_update_recv_data(), we hold the lock_sock(sk) lock, and we
> know there is at least one byte to read. Since we hold the lock, the other
> code paths, which normally are also required to acquire the lock before
> checking vsk->peer_shutdown, can not race with us.
>

This was to wake up any waiters on a socket state change. Since the updated
patch now only focuses on adding the delayed close logic, I will remove this
in the next version. Thanks for the review so far.

> Thanks,
> -- Dexuan