On Fri, 2012-02-10 at 20:25 -0800, Nicholas A. Bellinger wrote:
> On Thu, 2012-02-09 at 14:01 +0800, Jim Barber wrote:
> > On 9/02/2012 11:29 AM, Nicholas A. Bellinger wrote:
> > <SNIP>
> >
> > > You should still definitely *not* be seeing NON_EXISTENT_LUN errors
> > > after a session reset, regardless of the unsupported bits mentioned
> > > above.
> > >
> > > What does your running qla2xxx configuration look like..? Also, what
> > > lio-core.git HEAD is this with again..?
> >
> > The errors I posted were encountered with a kernel compiled from the latest qla_tgt-3.3 HEAD.
> > The first entry of a 'git log' shows:
> >
> > commit 7a4dc04b51cf01d5cea3332db61c0ac609b09b13
> > Author: Roland Dreier <roland@xxxxxxxxxxxxxxx>
> > Date: Wed Jan 11 16:58:05 2012 -0800
> >
> > For your question regarding what does my configuration look like, I'll give you what I think you might need (but I'm just guessing).
> > From the targetcli command, an 'ls' at the root results in the following:
> >
>
> Hi again Jim,
>
> Thanks for the additional information. So looking at your configuration
> below, nothing appears out of the ordinary.
>
> After looking at the problem output from the previous email a bit more,
> one thing that is strange is the fact that all of the NON_EXISTENT_LUN
> errors are for LUN > 0, when only LUN=0 has been configured on the
> individual tcm_qla2xxx ports. Really quite strange for the ESX client to
> start hammering new LUNs that aren't registered during TMR handling..
>
> I've tried reproducing this case with unsupported TMRs with a normal
> Linux FC qla2xxx client, but I've not had any luck reproducing a
> similar case thus far.. What I'm not sure about yet is if a single
> rejected ABORT is actually making ESX client blow up (instead of falling
> back to LUN_RESET for example), or if it's the locally generated events
> (eg: Unknown task mgmt fn 0xfffd) on the qla_target.c side not being
> handled that is causing the hard failure.
>
> If at all possible, it would be very helpful to try to reproduce using
> ql2xextended_error_logging so that I can get a better idea of what's
> going on. This is enabled at module load time with:
>
> modprobe qla2xxx ql2xextended_error_logging=0xffffff
>
> Since you've mentioned it only seems to happen occasionally, you'll
> likely end up with gigs of logs with debug enabled.
>
> I'm still looking at how to reproduce this case, so any more logs would
> be very helpful to that end.
>

Hi Jim,

So with the patches from this morning to add support for TMR_ABORT_TASK,
I'd like you to try re-testing with the ESXv5 client.

At this point I'm pretty sure the reason you are seeing ABORT_TASK is
because your backend device is taking a long time (> 30 secs) to return
outstanding I/O, causing SCSI timeouts on the FC initiator to fire..

In this scenario the explicit TMR_ABORT_TASK will wait for the
outstanding I/O descriptor in question to complete, before returning the
referenced tag with SAM_STAT_TASK_ABORTED, and completing TMR_ABORT_TASK
with TMR_FUNCTION_COMPLETE.

I've just pushed this series for testing into lio-core.git/master, so
please go ahead and pull at your earliest convenience and retest with
the following master HEAD:

commit c851cb5a4043e11ae7844dc8a01d4787a4cc9573
Author: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date: Mon Feb 13 01:26:48 2012 -0800

    qla_target/tcm_qla2xxx: Fix TMR_ABORT_TASK with target mode usage

Also, I've added some debug code, so you should not have to run with
'modprobe qla2xxx ql2xextended_error_logging=0xffffff' anymore.

Thanks!

--nab
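
For readers following the thread, the TMR_ABORT_TASK ordering described
in this message can be illustrated with a small toy model. This is only
a sketch in Python, not the qla_target.c or target core code: ToyTarget,
submit, backend_complete, abort_task and the 50 ms backend delay are all
hypothetical names and values, the status strings are plain labels
rather than the kernel constants, and the response for an unknown tag is
an assumption that this message does not cover.

```python
import threading

# Plain string labels standing in for the status/response codes named in
# the message; they are not the kernel's numeric values.
SAM_STAT_TASK_ABORTED = "SAM_STAT_TASK_ABORTED"
TMR_FUNCTION_COMPLETE = "TMR_FUNCTION_COMPLETE"
TMR_TASK_DOES_NOT_EXIST = "TMR_TASK_DOES_NOT_EXIST"  # assumption for unknown tags


class ToyTarget:
    """Hypothetical stand-in for a target port (illustration only)."""

    def __init__(self):
        self.outstanding = {}  # tag -> Event set when the backend returns the I/O
        self.log = []          # ordered record of what the toy target did

    def submit(self, tag):
        """Register an outstanding I/O descriptor for this tag."""
        self.outstanding[tag] = threading.Event()
        self.log.append(("submitted", tag))

    def backend_complete(self, tag):
        """Called when the (slow) backend finally hands the I/O back."""
        self.log.append(("backend_done", tag))
        self.outstanding[tag].set()

    def abort_task(self, tag):
        """Wait for the referenced I/O, then report the abort as complete."""
        done = self.outstanding.get(tag)
        if done is None:
            return TMR_TASK_DOES_NOT_EXIST
        done.wait()  # block until the outstanding descriptor completes
        del self.outstanding[tag]
        # First the referenced tag is returned with TASK_ABORTED status...
        self.log.append(("status", tag, SAM_STAT_TASK_ABORTED))
        # ...then the ABORT_TASK request itself is completed.
        self.log.append(("tmr_response", tag, TMR_FUNCTION_COMPLETE))
        return TMR_FUNCTION_COMPLETE


def demo():
    target = ToyTarget()
    target.submit(0x10)
    # Simulate a slow backend that returns the I/O 50 ms later.
    threading.Timer(0.05, target.backend_complete, args=(0x10,)).start()
    response = target.abort_task(0x10)
    return response, target.log
```

Calling demo() returns the TMR response together with the ordered log,
which shows the backend completion landing before the aborted status
and the TMR response.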