Re: [EXT] [PATCH v3 3/5] Revert "scsi: qla2xxx: Fix unbound sleep in fcport delete path."

Martin Wilck <mwilck@xxxxxxxx> · Thu, 02 Apr 2020 11:39:20 +0200

Hello Arun,

On Wed, 2020-04-01 at 10:12 -0700, Arun Easi wrote:
> On Fri, 27 Mar 2020, 9:47am, mwilck@xxxxxxxx wrote:
> 
> > External Email
> > 
> > -----------------------------------------------------------------
> > -----
> > From: Martin Wilck <mwilck@xxxxxxxx>
> > 
> > This reverts commit c3b6a1d397420a0fdd97af2f06abfb78adc370df.
> > Aborting the sleep was risky, because after return from
> > qlt_free_session_done() the driver starts freeing resources,
> > which is dangerous while we know that there's pending IO.
> > 
> > The previous patch "scsi: qla2xxx: check UNLOADING before posting
> > async
> > work" avoids sending this IO in the first place, and thus obsoletes
> > the dangerous timeout.
> > 
> > Signed-off-by: Martin Wilck <mwilck@xxxxxxxx>
> > ---
> >  drivers/scsi/qla2xxx/qla_target.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/drivers/scsi/qla2xxx/qla_target.c
> > b/drivers/scsi/qla2xxx/qla_target.c
> > index 622e733..eec1338 100644
> > --- a/drivers/scsi/qla2xxx/qla_target.c
> > +++ b/drivers/scsi/qla2xxx/qla_target.c
> > @@ -1019,7 +1019,6 @@ void qlt_free_session_done(struct work_struct
> > *work)
> >  
> >  	if (logout_started) {
> >  		bool traced = false;
> > -		u16 cnt = 0;
> >  
> >  		while (!READ_ONCE(sess->logout_completed)) {
> >  			if (!traced) {
> > @@ -1029,9 +1028,6 @@ void qlt_free_session_done(struct work_struct
> > *work)
> >  				traced = true;
> >  			}
> >  			msleep(100);
> > -			cnt++;
> > -			if (cnt > 200)
> > -				break;
> 
> By taking this code out, it would leave a stuck FC target delete
> thread 
> and thus preventing the module unload itself, in case of a bug in
> this 
> logic (which was seen in some instances).
> 
> How about increasing the wait time to say 25 seconds (typical worst
> case 
> is 2 * RA_TOV = 20 seconds) and then alerting user with a "WARN",
> but 
> still break out?
> 

I've seen the kernel crash because of this breakout. A crash happens if
another code path was blocked because of either the stopped FW or
anything else, and would time out after this code. 

The actual crash I've seen is already been fixed by patch 1 of this
series, because qla2x00_terminate_rport_io() checks UNLOADING before
sending IO. So if it was just about that, we could keep the breakout.

I suspect that the "unbound sleep" occurred only because of the 
hanging access to stopped FW (note that c3b6a1d39742 was made just ~2
months after 45235022da99 "scsi: qla2xxx: Fix driver unload by shutting
down chip"). If that's true, the hang wouldn't occur any more with
patch 1-2 of my series applied. If it's wrong and the hang does occur
again, IMO we should find out what is hanging, and fix it, rather than
blindly quit this wait loop and free resources.

But if you want to keep it in, I'm not going to insist; I'm more
interested in getting the first two patches merged.

Regards,
Martin

(Btw, I suppose you'll admit the combination of msleep() and counting
isn't the state of the art of implementing a timeout in the kernel).

> Regards,
> -Arun
> 
> >  		}
> >  
> >  		ql_dbg(ql_dbg_disc, vha, 0xf087,
> >