Re: Kernel Oops while closing iSCSI connection [transport_free_dev_tasks]

Tregaron Bayly <tbayly@xxxxxxxxxxxx> · Wed, 25 Jul 2012 16:23:06 +0000 (UTC)

Nicholas A. Bellinger <nab <at> linux-iscsi.org> writes:

> 
> On Sun, 2012-05-06 at 18:31 +0200, Henning Becker wrote:
> > Am Samstag, 14. April 2012, 14:37:47 schrieb Nicholas A. Bellinger:
> > > On Sat, 2012-04-14 at 18:35 +0200, Henning Becker wrote:
> > > > Am Dienstag, 10. April 2012, 23:54:06 schrieb Nicholas A. Bellinger:
> 
> <SNIP>
> 
> > > Hi Henning,
> > > 
> > > Ok, I think I've identified the cause of this oops within iscsi-target.
> > > 
> > > It has to do with the ordering in which your scripts are tearing down
> > > the configfs layout.  Looking at the inotify log again I see the
> > > following ordering:
> > > 
> > > # Tear down LUN=0 from TPG=1
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/tpgt_1/lun/lun_0/ DELETE,ISDIR statistics
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/tpgt_1/lun/lun_0/ DELETE_SELF
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/tpgt_1/lun/ DELETE,ISDIR lun_0
> > > 
> > > # Release IBLOCK backend device
> > > /sys/kernel/config/target/core/iblock_0/iscsiLUNTest/ DELETE_SELF
> > > /sys/kernel/config/target/core/iblock_0/ DELETE,ISDIR iscsiLUNTest
> > > 
> > > # Echo '0 > enable' to disable TPG
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/ CLOSE_NOWRITE,CLOSE,ISDIR
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/tpgt_1/ MODIFY enable
> > > /sys/kernel/config/target/iscsi/iqn.2012-04.lan.storage:iscsi.storage/tpgt_1/ OPEN enable
> > > 
> > > So it appears with your custom scripts that the LUN=0 + IBLOCK backend
> > > is being released *before* explicitly disabling the TPG and forcing all
> > > of the active sessions to shut down.
> > > 
> > > The oops itself is being caused by the removal of the IBLOCK backend, as
> > > there is code in the iscsi_cmd descriptor release path that depends upon
> > > the backend being in place (although removing the TPG LUN is OK)..  This
> > > is a genuine bug, for which I'll need to think some more to best resolve
> > > in order to avoid extra overhead within the existing data I/O fast
> > > path..
> > > 
> > > That said, the work-around for this bug is to change your custom scripts
> > > to follow what rtslib/lio-utils currently does for TPG removal.  That
> > > is:
> > > 
> > > 1: Echo '0 > enable' to disable TPG
> > > 2: Tear down NodeACLs+MappedLUNs from TPG
> > > 3: Tear down LUN from TPG
> > > 4: Tear down entire TPG
> > > 5: Release IBLOCK backend device
> > 
> > Hi Nicholas,
> > I have been running my cluster according to your specs for 3 weeks now,
> > and the problem has not occurred again.
> > > 
> 
> Hi Henning,
> 
> Thanks for confirmation that the backend device shutdown ordering is the
> root cause trigger for the bug you've seen..  As mentioned, I still need
> to think some more about what the proper resolution should actually be
> here..
> 
> > > I'm quite certain this will avoid the bug in question by forcing
> > > shutdown of all active sessions at step #1, instead of doing this part
> > > at the end of the sequence as done in your current setup.
> > > 
> > > Please give it a shot and let me know if you have problems getting your
> > > scripts to sync with what the official userspace code is doing here.
> > 
> > Which official userspace code does that? I'm currently just calling
> > lio_node, and it didn't refuse to release an iblock that is still
> > connected to a portal.
> > 
> 
> What I meant here is that the important part is currently disabling the
> TPG before bringing down the TPG LUN associated with the backend with
> active IO, ahead of the backend itself.  This will shut down all active
> iSCSI sessions (and hence outstanding I/Os) to underlying backend
> devices, and after it's completed it will be safe to remove an
> associated backend device.
> 
> So the main issue is still the final release of the backend device
> (after it's been released from TPG LUN) to ensure that any remaining
> outstanding I/O that is still referencing se_device memory is allowed to
> complete before 'rmdir /sys/kernel/config/target/core/$HBA/$DEV'
> releases se_device.
> 
> --nab
> 
> 

We are experiencing what I believe to be this same oops on kernel version 3.4.1
during removal of an iSCSI target:

BUG: unable to handle kernel paging request at 000000066474e5d9
IP: [<ffffffffa049e278>] transport_free_dev_tasks+0xf8/0x120 [target_core_mod]
PGD 0 
Oops: 0000 [#1] PREEMPT SMP 
CPU 1 
Modules linked in: md5 ip6table_filter ip6_tables ebtable_nat ebtables 
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter 
ip_tables x_tables 8021q garp bridge stp llc target_core_file target_core_iblock 
iscsi_target_mod sunrpc af_packet ipv6 binfmt_misc target_core_mod configfs 
vhost_net macvtap macvlan tun kvm container coretemp microcode serio_raw pcspkr 
i2c_i801 iTCO_wdt iTCO_vendor_support ixgbe(O) i5000_edac edac_core i5k_amb 
ioatdma dca sg ses enclosure e1000e shpchp pci_hotplug ext4 mbcache jbd2 sd_mod 
crc_t10dif ahci libahci qla2xxx scsi_transport_fc scsi_tgt megaraid_sas button 
radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash 
dm_log dm_mod [last unloaded: mperf]

Pid: 3525, comm: iscsi_ttx Tainted: G           O 3.4.1-1.BH #3 Supermicro X7DB8/X7DB8
RIP: 0010:[<ffffffffa049e278>]  [<ffffffffa049e278>] transport_free_dev_tasks+0xf8/0x120 [target_core_mod]
RSP: 0018:ffff88040b515e00  EFLAGS: 00010246
RAX: 000000066474e551 RBX: ffff88041941c270 RCX: ffff88041941c440
RDX: ffff88040b515e00 RSI: 0000000000000286 RDI: ffff8804181449c0
RBP: ffff88040b515e30 R08: ffff88041941c470 R09: dead000000200200
R10: dead000000100100 R11: 0000000000000001 R12: ffff88040b515e00
R13: ffff8804181449c0 R14: 0000000000000000 R15: ffff8803fe2c04c0
FS:  0000000000000000(0000) GS:ffff88042fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000066474e5d9 CR3: 0000000400ed5000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process iscsi_ttx (pid: 3525, threadinfo ffff88040b514000, task ffff8803fe2c04c0)
Stack:
ffff88040b515e00 ffff88040b515e00 0000000000000000 ffff88041941c270
ffff88041941c128 ffff88041941c040 ffff88040b515e50 ffffffffa04a26f1
ffff88040b515e60 ffff880419dd1c00 ffff88040b515e60 ffffffffa0594071
Call Trace:
[<ffffffffa04a26f1>] transport_generic_free_cmd+0x61/0x90 [target_core_mod]
[<ffffffffa0594071>] iscsit_free_cmd+0x21/0x50 [iscsi_target_mod]
[<ffffffffa059acc7>] iscsi_target_tx_thread+0x497/0x680 [iscsi_target_mod]
[<ffffffffa059a830>] ? iscsit_send_text_rsp+0x330/0x330 [iscsi_target_mod]
[<ffffffffa059a830>] ? iscsit_send_text_rsp+0x330/0x330 [iscsi_target_mod]
[<ffffffff8105f626>] kthread+0x96/0xa0
[<ffffffff815125e4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105f590>] ? kthread_freezable_should_stop+0x60/0x60
[<ffffffff815125e0>] ? gs_change+0x13/0x13
Code: 00 00 ad de 49 b9 00 02 20 00 00 00 ad de 4c 89 ef 48 89 42 08 48 89 10 4d 89 55 30 4d 89 4d 38 48 8b 43 78 48 8b 80 88 01 00 00 <ff> 90 88 00 00 00 48 8b 45 d0 4c 39 e0 75 99 48 83 c4 18 5b 41
RIP  [<ffffffffa049e278>] transport_free_dev_tasks+0xf8/0x120 [target_core_mod]
RSP <ffff88040b515e00>
CR2: 000000066474e5d9
---[ end trace 30b5ec5ccc64a33d ]---
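
As a sanity check on the trace, the faulting address can be reproduced from
the register dump.  The bytes leading up to the marked <ff> in the Code: line
decode to mov 0x78(%rbx),%rax; mov 0x188(%rax),%rax; call *0x88(%rax), so the
fault should sit at RAX + 0x88.  The Python sketch below checks that
arithmetic and flags the list-poison values left in R09/R10.  The register
values are copied by hand from the oops, the helper names are made up for
illustration, and reading this as an indirect call through a stale backend
pointer is my interpretation rather than something the trace proves.

```python
# Register values copied by hand from the oops above.
REGS = {
    "RAX": 0x000000066474E551,
    "R09": 0xDEAD000000200200,
    "R10": 0xDEAD000000100100,
    "CR2": 0x000000066474E5D9,
}

# Faulting instruction bytes: ff 90 88 00 00 00 -> call *0x88(%rax)
CALL_DISPLACEMENT = 0x88

# Values list_del() stores into a removed entry on x86_64, assuming the
# default poison pointer delta of 0xdead000000000000.
LIST_POISON = {
    0xDEAD000000100100: "LIST_POISON1",
    0xDEAD000000200200: "LIST_POISON2",
}


def fault_address(regs, disp=CALL_DISPLACEMENT):
    """Address the indirect call would read its target from."""
    return regs["RAX"] + disp


def poisoned_registers(regs):
    """Map register name -> poison constant for registers holding one."""
    return {name: LIST_POISON[val] for name, val in regs.items() if val in LIST_POISON}


if __name__ == "__main__":
    print("computed:", hex(fault_address(REGS)), "CR2:", hex(REGS["CR2"]))
    print("poison values:", poisoned_registers(REGS))
```

The computed address matching CR2 suggests RAX itself, loaded two
dereferences away from RBX, already held garbage when the call was made.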

We do use a custom script, but it wraps targetcli rather than manipulating
configfs directly.  Our process is:

1) targetcli <target iqn>/tpg1/luns delete lunX
2) targetcli /iscsi delete <target iqn> (if this was the last lun being exported)
3) targetcli /backstores/block delete <backstore name>
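
To make the ordering explicit, here is a rough Python sketch of what our
wrapper does.  It only builds the argument lists in the order listed above
and does not run anything; the function and parameter names are hypothetical,
and the targetcli arguments are copied from the three steps as written rather
than checked against any particular targetcli version.

```python
def current_teardown_commands(target_iqn, lun, backstore, last_lun):
    """Return the targetcli invocations in the order our script issues them."""
    # 1) remove the LUN from the TPG
    cmds = [["targetcli", f"{target_iqn}/tpg1/luns", "delete", lun]]
    # 2) remove the whole target, only if that was its last exported LUN
    if last_lun:
        cmds.append(["targetcli", "/iscsi", "delete", target_iqn])
    # 3) remove the backstore; note the TPG is never disabled beforehand
    cmds.append(["targetcli", "/backstores/block", "delete", backstore])
    return cmds
```

When other LUNs remain exported, step 2 is skipped, so the backstore delete
runs while the TPG is still enabled and sessions are still logged in.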

According to the earlier response to Henning, the best approach is to disable
the TPG prior to teardown, but that is not desirable here because several
other LUNs may still be exported through it and we do not want to lose
communication with them.  If we delay the teardown of the backstore until
several minutes later and have the initiator issue a 'delete' in the
intervening period, would we still need to disable the TPG to safely delete
it?
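
For comparison, the ordering recommended to Henning earlier in the thread
could be sketched as a plan of configfs operations.  This Python sketch only
builds the list and touches nothing; the paths follow the inotify log quoted
above, the function name is hypothetical, and the NodeACL/MappedLUN
directories are passed in by the caller because their layout is not shown in
this thread.  A real teardown involves further configfs entries not covered
here.

```python
from pathlib import PurePosixPath

CONFIGFS = PurePosixPath("/sys/kernel/config/target")


def recommended_teardown(iqn, tpgt, lun, hba, dev, acl_dirs=()):
    """Return (action, path) pairs in the order suggested in the thread."""
    tpg = CONFIGFS / "iscsi" / iqn / f"tpgt_{tpgt}"
    # 1: disable the TPG first, forcing active sessions to shut down
    steps = [("write 0", tpg / "enable")]
    # 2: NodeACLs + MappedLUNs (caller-supplied paths, layout assumed)
    steps += [("rmdir", PurePosixPath(p)) for p in acl_dirs]
    # 3: LUN from the TPG
    steps.append(("rmdir", tpg / "lun" / f"lun_{lun}"))
    # 4: the TPG itself
    steps.append(("rmdir", tpg))
    # 5: backend device last
    steps.append(("rmdir", CONFIGFS / "core" / hba / dev))
    return steps
```

The point of contention for us is step 1, since it drops every session on
the TPG and not just the ones using the LUN being removed.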

Any further thoughts on addressing the bug in the iscsi_cmd descriptor release
path, which would eliminate the need for this workaround?
