On Tue, 2016-01-26 at 09:11 -0500, Dan Lane wrote:
> Thanks for all the information Nicholas, there's only one part that I
> think is really in question - the idea that the backend can't keep up
> with the workload.  You may remember I had similar problems in the
> past that I contacted the mailing list for.  After talking about the
> problem quite a bit I decided I needed a better backend, so I built
> out this new system that should have absolutely no problem keeping up.
> I have 20x 10k enterprise raptor drives in RAID6 with an SSD acting as
> read and write cache via LSI Cachecade.
>
> BTW, I get the "ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for
> ref_tag" error with vsphere5, but LIO doesn't crash in that case.
> Also, I seem to get the error regardless of the system load.
>

To clarify, an ABORT_TASK means the default SCSI timeout threshold has
been reached by ESX.

What I meant earlier by 'if the backend is able to keep up' is that you
should not be hitting constant ABORT_TASKs during normal operation.
If you are, your ESX hosts are keeping more I/Os in flight than the
backend storage is able to process within the default FC LLD SCSI
timeout.

For example, using iSCSI with ESX >= v5.x, the SCSI timeout is hard-coded
to 5 seconds.  This means ABORT_TASKs will be generated when ESX iSCSI
does not receive an I/O response from the target within this period, as
long as the iSCSI session I_T nexus (and associated ESX multipath state)
is still active.

With iSCSI, this queue depth is controlled with the default_cmdsn_depth
value on the target side, or with explicit NodeACLs to define it on a
per-initiator login basis.  The ESX iSCSI initiator honors this limit on
the outstanding I/O per session, separate from some well-known ESX
scheduling fairness issues with multi-LUN sessions.

However, with FC and qla2xxx target mode, the only (target side) limit is
the hardcoded qla2xxx hw frame limit of ATIO_ENTRY_CNT_24XX=4096.

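As a rough illustration of why the in-flight I/O count matters, the
arithmetic can be sketched in Python using Little's law (time to drain =
outstanding I/Os / backend throughput).  The 500 IOPS backend figure is
purely hypothetical, the 5 second timeout is the ESX iSCSI value
mentioned above and is only borrowed here as an example, and the
function names are made up for this sketch:

```python
def worst_case_latency_s(outstanding_ios: int, backend_iops: float) -> float:
    """Approximate time (seconds) for the last queued I/O to complete.

    Little's law: time in system = items in system / throughput.
    """
    if backend_iops <= 0:
        raise ValueError("backend_iops must be positive")
    return outstanding_ios / backend_iops


def max_safe_queue_depth(backend_iops: float, timeout_s: float) -> int:
    """Largest number of in-flight I/Os that can still drain within timeout_s."""
    if backend_iops <= 0 or timeout_s <= 0:
        raise ValueError("backend_iops and timeout_s must be positive")
    return int(backend_iops * timeout_s)


if __name__ == "__main__":
    ASSUMED_BACKEND_IOPS = 500.0  # hypothetical sustained backend rate
    TIMEOUT_S = 5.0               # ESX iSCSI SCSI timeout quoted in this thread
    FC_TARGET_LIMIT = 4096        # ATIO_ENTRY_CNT_24XX from this thread

    drain = worst_case_latency_s(FC_TARGET_LIMIT, ASSUMED_BACKEND_IOPS)
    depth = max_safe_queue_depth(ASSUMED_BACKEND_IOPS, TIMEOUT_S)
    print(f"time to drain {FC_TARGET_LIMIT} I/Os: {drain:.2f} s")
    print(f"aborts expected: {drain > TIMEOUT_S}")
    print(f"max in-flight I/Os that fit in {TIMEOUT_S:.0f} s: {depth}")
```

With these assumed numbers, a full 4096-entry queue takes about 8.2
seconds to drain, which is longer than a 5 second timeout, so the
initiator would start sending ABORT_TASKs.  Real backends are not this
linear (caches, RAID6 write penalty, mixed I/O sizes), so treat this as
back-of-the-envelope reasoning only.
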
Which means (as a work-around) you can consider changing the following
for your ESX FC host driver:

  - lowering the default FC LLD host queue_depth, and/or
  - increasing the FC LLD SCSI timeout

Last time I checked, these settings were configurable for the ESX FC LLD
with qla2xxx, but you'll still need to look up the actual parameter names
if you want to consider changing them.

> I really appreciate the time you put into the kernel update information!
>

Most certainly.

I'll be updating the 4.4-stable branch with Quinn's additional qla2xxx
patch shortly.

> Thanks,
> Dan
>
> On Tue, Jan 26, 2016 at 1:55 AM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > Hi Dan,
> >
> > (Adding Quinn + Giri CC')
> >
> > On Mon, 2016-01-25 at 16:08 -0500, Dan Lane wrote:
> >> Update: If it matters, I tried loading a host with ESXi 5.5 u3b and
> >> that also crashed the filer.
> >
> > Thanks for your bug report.
> >
> > Note this bug is not specific to ESXi 6.0, and the scenario occurs when
> > backend driver exports are unable to keep up with host workload,
> > resulting in internal ESX ABORT_TASK + LUN_RESET to trigger across
> > multiple local+remote ports.
> >
> > The following WIP series is for addressing this bug:
> >
> > http://www.spinics.net/lists/target-devel/msg11691.html
> >
> > I've been verifying this on iscsi-target exports over the last weeks,
> > and the specific bug you're hitting is AFAICT not qla2xxx driver specific.
> >
> >> Still waiting on an answer about the
> >> applying of upstream commits to an OS like fedora or any other ideas
> >> about the cause of this.
> >>
> >
> > So you'll want to build a v4.4 kernel until these patches are merged for
> > v4.5-rc, and eventually back-ported into v4.2.y stable.
> >
> > Note the recent qla2xxx target fixes from v4.5-rc1 are something you'll
> > want too.
> >
> > A v4.4 based '$ORIGIN $BRANCH' with everything is here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git 4.4-stable
> >
> > If you're not familiar with git cloning + building kernel source, have
> > a look at:
> >
> > http://kernelnewbies.org/KernelBuild
> > https://fedoraproject.org/wiki/BuildingUpstreamKernel
> >
> > So do a fresh 'git clone' of linux-stable.git and then:
> >
> >     git checkout --track -b linux-4.4.y origin/linux-4.4.y
> >
> > to switch to a new local branch, and then:
> >
> >     git pull '$ORIGIN $BRANCH'
> >
> > to do a remote merge using the full target-pending.git 4.4-stable path
> > from above.
> >
> > You can do a 'make defconfig' or use the local 4.2.y
> > /boot/config-$VERSION, and build vmlinux + modules + initrd
> > from there.
> >
> > Please let the list know your progress.
> >
> >> On Sun, Jan 24, 2016 at 8:11 PM, Roland Dreier <roland@xxxxxxxxxxxxxxx> wrote:
> >> >> I have tried a large number of other hosts and they all act the same
> >> >> way regardless of hardware.  ESXi <6 is no problem, but 6 and newer
> >> >> crash the filer very quickly.
> >> >
> >> > You're crashing because of
> >> >
> >> > Jan 24 10:02:09 dracofiler kernel: kernel BUG at
> >> > drivers/scsi/qla2xxx/qla_target.c:3105!
> >> >
> >> > which is the BUG_ON in
> >> >
> >> > void qlt_free_cmd(struct qla_tgt_cmd *cmd)
> >> > {
> >> >         struct qla_tgt_sess *sess = cmd->sess;
> >> >
> >> >         ql_dbg(ql_dbg_tgt, cmd->vha, 0xe074,
> >> >             "%s: se_cmd[%p] ox_id %04x\n",
> >> >             __func__, &cmd->se_cmd,
> >> >             be16_to_cpu(cmd->atio.u.isp24.fcp_hdr.ox_id));
> >> >
> >> >         BUG_ON(cmd->cmd_in_wq);
> >> >
> >> > It seems we're freeing a command before we process it.
> >> >
> >> > What logging do you have from target or qla2xxx before you hit the
> >> > crash?  I'm wondering why the initiator is aborting commands (although
> >> > we still shouldn't crash even if it does abort commands).
> >> >
> >> > You could try applying upstream commit 193b50b9d54a ("qla2xxx: Replace
> >> > QLA_TGT_STATE_ABORTED with a bit.") which seems like it might be
> >> > related, though I'm not sure whether it really will help.
> >> >
> >> >  - R.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html