On Wed, May 29, 2019 at 03:04:46AM +0800, Yao Liu wrote:
> On Tue, May 28, 2019 at 12:57:59PM -0400, Josef Bacik wrote:
> > On Tue, May 28, 2019 at 02:07:43AM +0800, Yao Liu wrote:
> > > On Fri, May 24, 2019 at 09:07:42AM -0400, Josef Bacik wrote:
> > > > On Fri, May 24, 2019 at 05:43:54PM +0800, Yao Liu wrote:
> > > > > Some I/O requests that have been sent successfully but have not yet been
> > > > > replied to won't be resubmitted after reconnecting because of server restart,
> > > > > so we add a list to track them.
> > > > >
> > > > > Signed-off-by: Yao Liu <yotta.liu@xxxxxxxxx>
> > > >
> > > > Nack, this is what the timeout stuff is supposed to handle. The commands will
> > > > time out and we'll resubmit them if we have alive sockets. Thanks,
> > > >
> > > > Josef
> > > >
> > >
> > > On the one hand, if num_connections == 1 and the only sock is dead,
> > > then we do nbd_genl_reconfigure to reconnect within dead_conn_timeout,
> > > nbd_xmit_timeout will not resubmit commands that have been sent
> > > successfully but have not yet been replied to.
> > > The log is as follows:
> > >
> > > [270551.108746] block nbd0: Receive control failed (result -104)
> > > [270551.108747] block nbd0: Send control failed (result -32)
> > > [270551.108750] block nbd0: Request send failed, requeueing
> > > [270551.116207] block nbd0: Attempted send on invalid socket
> > > [270556.119584] block nbd0: reconnected socket
> > > [270581.161751] block nbd0: Connection timed out
> > > [270581.165038] block nbd0: shutting down sockets
> > > [270581.165041] print_req_error: I/O error, dev nbd0, sector 5123224 flags 8801
> > > [270581.165149] print_req_error: I/O error, dev nbd0, sector 5123232 flags 8801
> > > [270581.165580] block nbd0: Connection timed out
> > > [270581.165587] print_req_error: I/O error, dev nbd0, sector 844680 flags 8801
> > > [270581.166184] print_req_error: I/O error, dev nbd0, sector 5123240 flags 8801
> > > [270581.166554] block nbd0: Connection timed out
> > > [270581.166576] print_req_error: I/O error, dev nbd0, sector 844688 flags 8801
> > > [270581.167124] print_req_error: I/O error, dev nbd0, sector 5123248 flags 8801
> > > [270581.167590] block nbd0: Connection timed out
> > > [270581.167597] print_req_error: I/O error, dev nbd0, sector 844696 flags 8801
> > > [270581.168021] print_req_error: I/O error, dev nbd0, sector 5123256 flags 8801
> > > [270581.168487] block nbd0: Connection timed out
> > > [270581.168493] print_req_error: I/O error, dev nbd0, sector 844704 flags 8801
> > > [270581.170183] print_req_error: I/O error, dev nbd0, sector 5123264 flags 8801
> > > [270581.170540] block nbd0: Connection timed out
> > > [270581.173333] block nbd0: Connection timed out
> > > [270581.173728] block nbd0: Connection timed out
> > > [270581.174135] block nbd0: Connection timed out
> > >
> > > On the other hand, if we wait for nbd_xmit_timeout to handle resubmission,
> > > the I/O requests will have a big delay.
> > > For example, if timeout time is 30s,
> > > and from sock dead to nbd_genl_reconfigure returned OK we only spend
> > > 2s, the I/O requests will still be handled by nbd_xmit_timeout after 30s.
> >
> > We have to wait for the full timeout anyway to know that the socket went down,
> > so it'll be re-submitted right away and then we'll wait on the new connection.
> >
> > Now we could definitely have requests that were submitted well after the first
> > thing that failed, so their timeout would be longer than simply retrying them,
> > but we have no way of knowing which ones timed out and which ones didn't. This
> > way lies pain, because we have to match up tags with handles. This is why we
> > rely on the generic timeout infrastructure, so everything is handled correctly
> > without ending up with duplicate submissions/replies. Thanks,
> >
> > Josef
> >
>
> But as I mentioned before, if num_connections == 1, nbd_xmit_timeout won't re-submit
> commands and I/O error will occur. Should we change the condition
> 	if (config->num_connections > 1)
> to
> 	if (config->num_connections >= 1)
> ?

Only if you don't have patch 3 in place though, right? If you fix patch 3 to
allow requeuing if you have a dead connection timer set then you can requeue and
everything is a-ok. Thanks,

Josef
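
For reference, the three requeue conditions discussed above can be compared
side by side in a small standalone sketch. This is not the kernel code: struct
model_config and the three helper names are hypothetical, and the third helper
is only one possible reading of "allow requeuing if you have a dead connection
timer set", not the contents of patch 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of the two values discussed in the thread.
 * This is NOT the kernel's struct nbd_config.
 */
struct model_config {
	int num_connections;             /* number of configured sockets */
	unsigned long dead_conn_timeout; /* nonzero when a dead connection timer is set */
};

/* Condition quoted in the thread: requeue only with more than one connection. */
bool requeue_quoted(const struct model_config *config)
{
	assert(config != NULL);
	return config->num_connections > 1;
}

/* Change proposed in the question: requeue with one or more connections. */
bool requeue_proposed(const struct model_config *config)
{
	assert(config != NULL);
	return config->num_connections >= 1;
}

/*
 * Assumed reading of the reply: keep the quoted condition, but also
 * requeue whenever a dead connection timer is set.
 */
bool requeue_with_dead_conn_timer(const struct model_config *config)
{
	assert(config != NULL);
	return config->num_connections > 1 || config->dead_conn_timeout != 0;
}
```

In this model the second and third variants differ only for a single
connection with no dead connection timer: ">= 1" requeues the request, while
the timer-based check does not requeue it.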