Roland,

The BUG_ON in this code area is a sign of an early free and/or double
free that is triggered in the LUN Reset path.  The cmd_kref is being
decremented too fast.  In this case it looks like the cmd is on its way
to being freed (i.e. the 2nd free) while a new incarnation of the
command is being scheduled to do new work.  The cmd_in_wq flag tells us
that the work element was scheduled, or placed in the work queue, by the
new incarnation.  The early free of the command was triggered in TCM.

Please take a look at the following patches that Nicholas has submitted
and that QLogic attempted to submit in the past.  We were asked to drop
the "qla2xxx: Fix interaction issue between qla2xxx and Target Core
Module" patch because it did not fit Bart's implementation.

QLogic's patches:

[PATCH v2 08/16] qla2xxx: Replace QLA_TGT_STATE_ABORTED with a bit.
http://marc.info/?l=linux-scsi&m=145038473432645&w=2

[PATCH 10/20] qla2xxx: Fix interaction issue between qla2xxx and Target Core Module
http://marc.info/?l=linux-scsi&m=144953818431730&w=2

Nicholas's patch series, or just patch 1 in the series:

[PATCH 0/2] target: Fix LUN_RESET active I/O + TMR handling
http://marc.info/?l=linux-scsi&m=145259439425172&w=2

[PATCH 1/2] target: Fix LUN_RESET active I/O handling for ACK_KREF
http://marc.info/?l=linux-scsi&m=145259452525227&w=2

----
Nicholas,

Please consider the "[PATCH 10/20] qla2xxx: Fix interaction issue
between qla2xxx and Target Core Module" patch for the older (pre-Bart)
branch.

Thanks.

Regards,
Quinn Tran

On 1/26/16, 6:11 AM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:

>Thanks for all the information Nicholas, there's only one part that I
>think is really in question - the idea that the backend can't keep up
>with the workload.  You may remember I had similar problems in the
>past that I contacted the mailing list for.  After talking about the
>problem quite a bit I decided I needed a better backend, so I built
>out this new system that should have absolutely no problem keeping up.
>I have 20x 10k enterprise raptor drives in RAID6 with an SSD acting as
>read and write cache via LSI Cachecade.
>
>BTW, I get the "ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for
>ref_tag" error with vsphere5, but LIO doesn't crash in that case.
>Also, I seem to get the error regardless of the system load.
>
>I really appreciate the time you put into the kernel update information!
>
>Thanks,
>Dan
>
>On Tue, Jan 26, 2016 at 1:55 AM, Nicholas A. Bellinger
><nab@xxxxxxxxxxxxxxx> wrote:
>> Hi Dan,
>>
>> (Adding Quinn + Giri CC')
>>
>> On Mon, 2016-01-25 at 16:08 -0500, Dan Lane wrote:
>>> Update: If it matters, I tried loading a host with ESXi 5.5 u3b and
>>> that also crashed the filer.
>>
>> Thanks for your bug report.
>>
>> Note this bug is not specific to ESXi 6.0, and the scenario occurs when
>> backend driver exports are unable to keep up with host workload,
>> resulting in internal ESX ABORT_TASK + LUN_RESET to trigger across
>> multiple local+remote ports.
>>
>> The following WIP series is for addressing this bug:
>>
>> http://www.spinics.net/lists/target-devel/msg11691.html
>>
>> I've been verifying this on iscsi-target exports over the last weeks,
>> and the specific bug you're hitting is AFAICT not qla2xxx driver specific.
>>
>>> Still waiting on an answer about the
>>> applying of upstream commits to an OS like fedora or any other ideas
>>> about the cause of this.
>>>
>>
>> So you'll want to build a v4.4 kernel until these patches are merged for
>> v4.5-rc, and eventually back-ported into v4.2.y stable.
>>
>> Note the recent qla2xxx target fixes from v4.5-rc1 are something you'll
>> want too.
>>
>> A v4.4 based '$ORIGIN $BRANCH' with everything is here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git 4.4-stable
>>
>> If you're not familiar with git cloning + building kernel source, have a
>> look at:
>>
>> http://kernelnewbies.org/KernelBuild
>> https://fedoraproject.org/wiki/BuildingUpstreamKernel
>>
>> So do a fresh 'git clone' of linux-stable.git and then:
>>
>>    git checkout --track -b linux-4.4.y origin/linux-4.4.y
>>
>> to switch to a new local branch, and then:
>>
>>    git pull '$ORIGIN $BRANCH'
>>
>> to do a remote merge using the full target-pending.git 4.4-stable path
>> from above.
>>
>> You can do a 'make defconfig' or use the local 4.2.y
>> /boot/config-$VERSION, and build vmlinux + modules + initrd
>> from there.
>>
>> Please let the list know your progress.
>>
>>> On Sun, Jan 24, 2016 at 8:11 PM, Roland Dreier <roland@xxxxxxxxxxxxxxx> wrote:
>>> >> I have tried a large number of other hosts and they all act the same
>>> >> way regardless of hardware.  ESXi <6 is no problem, but 6 and newer
>>> >> crash the filer very quickly.
>>> >
>>> > You're crashing because of
>>> >
>>> > Jan 24 10:02:09 dracofiler kernel: kernel BUG at
>>> > drivers/scsi/qla2xxx/qla_target.c:3105!
>>> >
>>> > which is the BUG_ON in
>>> >
>>> > void qlt_free_cmd(struct qla_tgt_cmd *cmd)
>>> > {
>>> >         struct qla_tgt_sess *sess = cmd->sess;
>>> >
>>> >         ql_dbg(ql_dbg_tgt, cmd->vha, 0xe074,
>>> >             "%s: se_cmd[%p] ox_id %04x\n",
>>> >             __func__, &cmd->se_cmd,
>>> >             be16_to_cpu(cmd->atio.u.isp24.fcp_hdr.ox_id));
>>> >
>>> >         BUG_ON(cmd->cmd_in_wq);
>>> >
>>> > It seems we're freeing a command before we process it.
>>> >
>>> > What logging do you have from target or qla2xxx before you hit the
>>> > crash?  I'm wondering why the initiator is aborting commands (although
>>> > we still shouldn't crash even if it does abort commands).
>>> >
>>> > You could try applying upstream commit 193b50b9d54a ("qla2xxx: Replace
>>> > QLA_TGT_STATE_ABORTED with a bit.") which seems like it might be
>>> > related, though I'm not sure whether it really will help.
>>> >
>>> >  - R.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe target-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html