Re: Branch to use for most current Qlogic target code.

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Mon, 13 Feb 2012 14:28:24 -0800

On Fri, 2012-02-10 at 20:25 -0800, Nicholas A. Bellinger wrote:
> On Thu, 2012-02-09 at 14:01 +0800, Jim Barber wrote:
> > On 9/02/2012 11:29 AM, Nicholas A. Bellinger wrote:
> 
> <SNIP>
> 
> > > You should still definitely *not* be seeing NON_EXISTENT_LUN errors
> > > after a session reset, regardless of the unsupported bits mentioned
> > > above.
> > > 
> > > What does your running qla2xxx configuration look like..?  Also, what
> > > lio-core.git HEAD is this with again..?
> > 
> > The errors I posted were encountered with a kernel compiled from the latest qla_tgt-3.3 HEAD.
> > The first entry of a 'git log' shows:
> > 
> > 	commit 7a4dc04b51cf01d5cea3332db61c0ac609b09b13
> > 	Author: Roland Dreier <roland@xxxxxxxxxxxxxxx>
> > 	Date:   Wed Jan 11 16:58:05 2012 -0800
> > 
> > As for what my configuration looks like, I'll give you what I think you might need (but I'm just guessing).
> > From the targetcli command, an 'ls' at the root results in the following:
> > 
> 
> Hi again Jim,
> 
> Thanks for the additional information.  So looking at your configuration
> below, nothing appears out of the ordinary.
> 
> After looking at the problem output from the previous email a bit more,
> one thing that is strange is the fact that all of the NON_EXISTENT_LUN
> errors are for LUN > 0, when only LUN=0 has been configured on the
> individual tcm_qla2xxx ports.  Really quite strange for the ESX client to
> start hammering new LUNs that aren't registered during TMR handling..
> 
> I've tried reproducing this case with unsupported TMRs with a normal
> Linux FC qla2xxx client, but I've not had any luck reproducing a
> similar case thus far..  What I'm not sure about yet is if a single
> rejected ABORT is actually making the ESX client blow up (instead of
> falling back to LUN_RESET for example), or if it's the locally generated
> events (eg: Unknown task mgmt fn 0xfffd) on the qla_target.c side not
> being handled that is causing the hard failure.
> 
> If at all possible, it would be very helpful to try to reproduce using
> ql2xextended_error_logging so that I can get a better idea of what's
> going on.  This is enabled at module load time with:
> 
>     modprobe qla2xxx ql2xextended_error_logging=0xffffff
> 
> Since you've mentioned it only seems to happen occasionally, you'll
> likely end up with gigs of logs with debug enabled.
> 
> I'm still looking at how to reproduce this case, so any more logs would
> be very helpful to that end.
> 

Hi Jim,

So with the patches from this morning to add support for TMR_ABORT_TASK,
I'd like you to try re-testing with the ESXv5 client.

At this point I'm pretty sure the reason you are seeing ABORT_TASK is
because your backend device is taking a long time (> 30 secs) to return
outstanding I/O, causing SCSI timeouts on the FC initiator to fire..

In this scenario the explicit TMR_ABORT_TASK will wait for the
outstanding I/O descriptor in question to complete, before returning the
referenced tag with SAM_STAT_TASK_ABORTED, and completing TMR_ABORT_TASK
with TMR_FUNCTION_COMPLETE.
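
To make that ordering concrete, here is a rough userspace sketch of the
flow described above. It is not the code from the patch series: every
sketch_* name is hypothetical, the response enum values are illustrative
only, and the "task does not exist" branch for an unknown tag is my
assumption rather than something covered above. Only the
SAM_STAT_TASK_ABORTED value (0x40) comes from SAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SAM_STAT_TASK_ABORTED 0x40	/* SAM status code: TASK ABORTED */

/* Illustrative response codes; numeric values are assumptions. */
enum sketch_tmr_rsp {
	SKETCH_TMR_FUNCTION_COMPLETE,
	SKETCH_TMR_TASK_DOES_NOT_EXIST,
};

/* Hypothetical stand-in for an outstanding I/O descriptor. */
struct sketch_cmd {
	uint32_t tag;		/* tag referenced by the ABORT_TASK */
	bool in_backend;	/* still outstanding to the backend device */
	uint8_t scsi_status;	/* status returned to the initiator */
};

/*
 * Hypothetical stand-in for blocking until the backend returns the I/O.
 * A real target would sleep on a completion here.
 */
static void sketch_wait_for_backend(struct sketch_cmd *cmd)
{
	cmd->in_backend = false;
}

/*
 * 1) find the descriptor matching the referenced tag,
 * 2) wait for the backend to complete it,
 * 3) return that tag with SAM_STAT_TASK_ABORTED,
 * 4) complete the TMR itself with FUNCTION_COMPLETE.
 */
static enum sketch_tmr_rsp sketch_abort_task(struct sketch_cmd *cmds,
					     size_t n, uint32_t ref_tag)
{
	assert(cmds != NULL);

	for (size_t i = 0; i < n; i++) {
		if (cmds[i].tag != ref_tag)
			continue;
		if (cmds[i].in_backend)
			sketch_wait_for_backend(&cmds[i]);
		cmds[i].scsi_status = SAM_STAT_TASK_ABORTED;
		return SKETCH_TMR_FUNCTION_COMPLETE;
	}
	/* Assumption: an unknown tag is reported as a non-existent task. */
	return SKETCH_TMR_TASK_DOES_NOT_EXIST;
}
```

The point of the ordering is that the abort response never goes back to
the initiator while the backend still owns the I/O.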

I've just pushed this series for testing into lio-core.git/master, so
please go ahead and pull at your earliest convenience and retest with the
following master HEAD:

commit c851cb5a4043e11ae7844dc8a01d4787a4cc9573
Author: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date:   Mon Feb 13 01:26:48 2012 -0800

    qla_target/tcm_qla2xxx: Fix TMR_ABORT_TASK with target mode usage

Also, I've added some debug code, so you should not have to run with
'modprobe qla2xxx ql2xextended_error_logging=0xffffff' anymore.

Thanks!

--nab
