Re: target crashes with vSphere 6 hosts

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Sun, 07 Feb 2016 21:04:47 -0800

On Sun, 2016-02-07 at 20:56 -0800, Nicholas A. Bellinger wrote:
> On Sun, 2016-02-07 at 22:58 -0500, Dan Lane wrote:
> > Nicholas,
> >      Okay, I have updated my server to the latest stable kernel
> > (4.4.1).  It looked promising for a little while but then it died...
> > Once again this happened with zero traffic from the vSphere 5.5 host
> > other than starting it up.  There aren't even any VMs on that host!
> > 
> > Please see the attached kernel log.  Let me know if there's anything
> > else you would like me to try.  I have a large number of
> > hosts that I can use to test various versions of ESXi if needed as
> > well (plus I have access to /all/ versions of ESX/ESXi).
> > 
> 
> The BUG_ON in your log is:
> 
>    kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3099!
> 
> which means you're still running a stock v4.4.1 build, and not the
> target-pending/4.4-stable branch that I prepared for you two weeks ago.
> 
> To repeat, the series to address the bug you're hitting is not in upstream
> yet, and you need to pull in target-pending/4.4-stable as per the
> instructions here:
> 
> http://www.spinics.net/lists/target-devel/msg11721.html
> 
> It looks like you forgot the last 'git pull $ORIGIN $BRANCH' part.
> 
> So from within your linux-stable.git tree, do a:
> 
>   git pull git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git refs/heads/4.4-stable
> 
> To verify, HEAD should be commit 'c375cb3':
> 
> root@haakon3:/usr/src/target-pending.git# git log --oneline -15
> c375cb3 qla2xxx: Fix interaction issue between qla2xxx and Target Core Module
> b5f5656 Merge branch 'queue-fixes' into 4.4-stable
> 570d3f8 target: Fix TAS handling for multi-session se_node_acls
> c768173 target: Fix LUN_RESET active TMR descriptor handling
> fd0a32e target: Fix LUN_RESET active I/O handling for ACK_KREF
> 718a6b3 qla2xxx: Fix warning reported by static checker
> c96d32e target: convert DISCARD limits to/from linux 512b sectors
> fab683e scsi: qla2xxxx: avoid type mismatch in comparison
> 20c08b3 target/user: Make sure netlink would reach all network namespaces
> 21aaa23 target: Obtain se_node_acl->acl_kref during get_initiator_node_acl
> d36ad77 target: Convert ACL change queue_depth se_session reference usage
> 26a99c1 iscsi-target: Fix potential dead-lock during node acl delete
> f246c94 ib_srpt: Convert acl lookup to modern get_initiator_node_acl usage
> 7fbef3d tcm_fc: Convert acl lookup to modern get_initiator_node_acl usage
> afd2ff9 Linux 4.4
> 
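
The HEAD check above can also be scripted.  What follows is only a
minimal sketch, assuming a POSIX shell and that it is run from inside
linux-stable.git; the helper names head_matches and check_head are
hypothetical, and 'c375cb3' is the abbreviated commit from the log above.

```shell
#!/bin/sh
# head_matches PREFIX HASH: succeed when HASH begins with the
# abbreviated commit id PREFIX (an empty PREFIX never matches).
head_matches() {
    [ -n "$1" ] || return 1
    case "$2" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# check_head PREFIX: compare HEAD of the current tree against PREFIX.
# Hypothetical helper; run it from inside linux-stable.git.
check_head() {
    head=$(git rev-parse HEAD 2>/dev/null) || {
        echo "not a git tree, or no commits yet" >&2
        return 2
    }
    if head_matches "$1" "$head"; then
        echo "HEAD is $1, tree has the target-pending fixes"
    else
        echo "HEAD is $head, expected $1: redo the git pull" >&2
        return 1
    fi
}
```

Running 'check_head c375cb3' before building should catch a tree that
is still at stock v4.4.1.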

Also, beyond this bug, your log still contains ABORT_TASKs, and as you
mentioned, they occur at idle with no VMs present.

This means that beyond the specific bug you're hitting, there is still
something else going on with your setup.

To repeat, you should not be seeing ABORT_TASKs at idle or during normal
operation.  If you are, then you need to seriously consider looking into
the ESX host side tunables mentioned earlier, and/or figure out why your
target backend is taking so long to respond to I/O.
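
To keep an eye on this after each change, something like the sketch
below can count ABORT_TASK lines in a kernel log.  It assumes the
relevant log lines contain the literal string ABORT_TASK; the helper
name count_abort_tasks is hypothetical.

```shell
#!/bin/sh
# count_abort_tasks: read kernel log text on stdin and print the
# number of lines that mention ABORT_TASK.  grep -c exits non-zero
# when the count is 0, so "|| true" keeps the helper's status clean.
count_abort_tasks() {
    grep -c 'ABORT_TASK' || true
}
```

For example, 'dmesg | count_abort_tasks' on the target; at idle the
count should stay at 0.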

WRT the ESX host side tunables, the default queue depth of the qlogic
driver on ESX is 64:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1267

and the ql2xcmdtimeout is 20 seconds, as listed here:

http://support.qlogic.com/SupportCenter/servlet/fileField?id=0BE80000000CaoA
