This also reminds me ... perhaps we should increase the echo timeout when using multichannel, so we don't unnecessarily increase traffic to the server. Thoughts? Currently SMB_ECHO_INTERVAL_DEFAULT is 60 seconds On Thu, Apr 29, 2021 at 10:48 AM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote: > > Attaching updated patch with can_echo() instead of referencing echoes directly. > I'll submit another patch for the other changes. > > Regards, > Shyam > > On Thu, Apr 29, 2021 at 5:59 PM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote: > > > > Hi Aurélien, > > > > Replies below... > > > > On Thu, Apr 29, 2021 at 5:16 PM Aurélien Aptel <aaptel@xxxxxxxx> wrote: > > > > > > > > > Hi Shyam, > > > > > > Ok so I ended up looking at a lot of code to check this... And I'm still > > > unsure. > > > > > > First, it looks like server->echoes is only used for smb2 and there is > > > a generic server->ops->can_echo() you can use that just returns > > > server->echoes it for smb2. > > Agree. I can use can_echo() instead. > > > > > > > > Shyam Prasad N <nspmangalore@xxxxxxxxx> writes: > > > > Hi, > > > > > > > > Recently, some xfstests and later some manual testing on multi-channel > > > > revealed that we detect unresponsive servers on some of the channels > > > > to the same server. > > > > > > > > The issue is seen when a channel is setup and sits idle without any > > > > traffic. Generally, we enable echoes and oplocks on a channel during > > > > the first request, based on the number of credits available on the > > > > channel. So on idle channels, we trip in our logic to check server > > > > unresponsiveness. > > > > > > That makes sense but while looking at the code I see we always queue > > > echo request in cifs_get_tcp_session(), which is called when adding a > > > channel. > > > > > > cifs_ses_add_channel() { > > > ctx.echo_interval = ses->server->echo_interval / HZ; > > > > > > chan->server = cifs_get_tcp_session(&ctx); > > > > > > rc = cifs_negotiate_protocol(xid, ses) { > > > server->tcpStatus = CifsGood; > > > } > > > > > > rc = cifs_setup_session(xid, ses, cifs_sb->local_nls); > > > } > > > > > > cifs_get_tcp_session() { > > > > > > INIT_DELAYED_WORK(&tcp_ses->echo, cifs_echo_request); > > > > > > tcp_ses->tcpStatus = CifsNeedNegotiate; > > > > > > tcp_ses->lstrp = jiffies; > > > > > > if (ctx->echo_interval >= SMB_ECHO_INTERVAL_MIN && > > > ctx->echo_interval <= SMB_ECHO_INTERVAL_MAX) > > > tcp_ses->echo_interval = ctx->echo_interval * HZ; > > > else > > > tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ; > > > > > > queue_delayed_work(cifsiod_wq, &tcp_ses->echo, tcp_ses->echo_interval); > > > } > > > > > > cifs_echo_request() { > > > > > > if (server->tcpStatus == CifsNeedNegotiate) > > > echo_interval = 0; <=== branch taken > > > else > > > echo_interval = server->echo_interval; > > > > > > if (server->tcpStatus not in {NeedReconnect, Exiting, New} > > > && server->can_echo() > > > && jiffies > server->lstrp + echo_interval - HZ) > > > { > > > server->echo(); > > > } > > > > > > queue_delayed_work(cifsiod_wq, &server->echo, server->echo_interval); > > > ===> echo_interval = 0 so calls itself immediatly > > > } > > > > > > SMB2_echo() { > > > if (server->tcpStatus == CifsNeedNegotiate) { > > > /* No need to send echo on newly established connections */ > > > mod_delayed_work(cifsiod_wq, &server->reconnect, 0); > > > ====> calls smb2_reconnect_server() immediatly if NeedNego > > > return rc; > > > } > > > > > > } > > > > > > smb2_reconnect_server() { > > > // channel has no TCONs so it does essentially nothing > > > } > > > > > > server_unresponsive() { > > > if (server->tcpStatus in {Good, NeedNegotiate} > > > && jiffies > server->lstrp + 3 * server->echo_interval) > > > { > > > cifs_server_dbg(VFS, "has not responded in %lu seconds. Reconnecting...\n", > > > (3 * server->echo_interval) / HZ); > > > cifs_reconnect(server); > > > return true; > > > } > > > return false; > > > } > > > > > > So it looks to me that cifs_get_tcp_session() starts the > > > cifs_echo_request() work, which calls itself with no delay in an > > > infinite loop doing nothing (that's probably bad...) until session_setup > > > succeeds, after which the delay between the self-call is set. > > > > Perhaps we think that 1 sec is too much for us to complete negotiate? > > But echo_interval of 0 looks bad. > > > > Ideally, we should queue echo work only when session setup is complete. > > What do you think? > > > > > > > > During session_setup(): > > > > > > * last response time (lstrp) gets updated > > > > > > * sending/recv requests should interact with the server > > > credits and call change_conf(), which should enable server->echoes > > > ==> is that part not working? > > It's working. But since these channels are idle, change_conf is not > > called for these channels. > > So echoes=false till there's some activity on the channel. > > > > > > > > Once enabled, the echo_request workq will finally send the echo on the > > > wire, which should -/+ 1 credit and update lstrp. > > > > > > > Attached a one-line fix for this. Have tested it in my environment. > > > > Another approach to fix this could be to enable echoes during > > > > initialization of a server struct. Or soon after the session setup. > > > > But I felt that this approach is better. Let me know if you feel > > > > otherwise. > > > > > > I think the idea of your change is ok but there's probably also an issue > > > in crediting in session_setup()/change_conf() if echoes is not enabled > > > at this point no? > > > > > > Cheers, > > > -- > > > Aurélien Aptel / SUSE Labs Samba Team > > > GPG: 1839 CB5F 9F5B FB9B AA97 8C99 03C8 A49B 521B D5D3 > > > SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, DE > > > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 247165 (AG München) > > > > > > > > > -- > > Regards, > > Shyam > > > > -- > Regards, > Shyam -- Thanks, Steve