Re: Kernel oops during iscsi target shutdown

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Martin,

On Sun, 2012-12-23 at 17:06 +0100, Martin Svec wrote:
> Hello,
> 
> my iscsi stress tests triggered a bug during target shutdown in latest stable
> kernels (3.7.1 and earlier). Turning off an iscsi target with active connections
> during heavy random write i/o sometimes leads to kernel oopses on different
> places, hung kernel tasks and other symptoms of memory corruption.

Thanks for the detailed bug report, and apologies for the holiday
delay..

>  This calltrace best shows what happens:
> 
> BUG: unable to handle kernel NULL pointer dereference at (null)
> Entering kdb (current=0xffff88022693d9d0, pid 15) on processor 1 Oops: (null)
> due to oops @ 0xffffffff812d6bb6
> Pid: 15, comm: kworker/1:0 Not tainted 3.7.1-00004-ga57fbc7-dirty #23 Dell Inc. PowerEdge R510/0DPRKF
> RIP: 0010:[<ffffffff812d6bb6>]  [<ffffffff812d6bb6>] __list_add+0x26/0xd0
> Call Trace:
> [<ffffffffa00952ba>] iscsit_add_cmd_to_response_queue+0x9a/0x100 [iscsi_target_mod]
> [<ffffffffa00a20f6>] lio_queue_status+0x26/0x30 [iscsi_target_mod]
> [<ffffffffa003d818>] target_complete_ok_work+0x198/0x200 [target_core_mod]
> [<ffffffff81056913>] process_one_work+0x133/0x410
> [<ffffffffa003d680>] ? target_qf_do_work+0x2a0/0x2a0 [target_core_mod]
> [<ffffffff81057800>] worker_thread+0x160/0x3c0
> [<ffffffff810576a0>] ? manage_workers+0x280/0x280
> [<ffffffff8105cfc6>] kthread+0xc6/0xd0
> [<ffffffff8105cf00>] ? flush_kthread_worker+0xb0/0xb0
> [<ffffffff8164c02c>] ret_from_fork+0x7c/0xb0
> [<ffffffff8105cf00>] ? flush_kthread_worker+0xb0/0xb0
> 
> I put few printks to the code and found that iscsit_add_cmd_to_response_queue
> was called _after_ iscsit_close_connection. So the iscsi termination logic seems
> to be broken in the way that iscsi connection is closed and kfree'd while there are
> still pending iscsi_cmds somewhere in the target referencing the deallocated
> connection. Below is a dmesg snippet where iscsi_conn with address
> ffff880224924800 is kfree'd (time 432.338664) but after successful termination
> of the target, two outstanding cmds use this connection address in
> iscsit_add_cmd_to_response_queue (time 432.479821).
> 
> Dec 22 21:35:16 goose2 kernel: [  432.338662] Moving to TARG_CONN_STATE_FREE.
> Dec 22 21:35:16 goose2 kernel: [  432.338664] MARTIN: freeing connection ffff880224924800.
> Dec 22 21:35:16 goose2 kernel: [  432.338665] Decremented iSCSI connection count to 0 from node: iqn.1998-01.com.vmware:hv-esx8
> Dec 22 21:35:16 goose2 kernel: [  432.338666] Moving to TARG_SESS_STATE_FAILED.
> Dec 22 21:35:16 goose2 kernel: [  432.338671] TARGET_CORE[iSCSI]: Deregistered fabric_sess
> Dec 22 21:35:16 goose2 kernel: [  432.338673] Moving to TARG_SESS_STATE_FREE.
> Dec 22 21:35:16 goose2 kernel: [  432.338674] Released iSCSI session from node: iqn.1998-01.com.vmware:hv-esx8
> Dec 22 21:35:16 goose2 kernel: [  432.338675] Decremented number of active iSCSI Sessions on iSCSI TPG: 1 to 0
> Dec 22 21:35:16 goose2 kernel: [  432.338677] Released 1 iSCSI Session(s) from Target Portal Group: 1
> Dec 22 21:35:16 goose2 kernel: [  432.338678] iSCSI_TPG[1] - Disabled iSCSI Target Portal Group
> Dec 22 21:35:16 goose2 kernel: [  432.338717] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san2 TPGT: 1 PORTAL: 192.168.224.10:3260
> Dec 22 21:35:16 goose2 kernel: [  432.338720] CORE[iqn.2011-07.com.zoner:inetstore.vmw0.san2] - Removed Network Portal: 192.168.224.10:3260,1 on TCP
> Dec 22 21:35:16 goose2 kernel: [  432.338788] CORE[0] - Removed Network Portal: 192.168.224.10:3260 on TCP
> Dec 22 21:35:16 goose2 kernel: [  432.338789] LIO_Target_ConfigFS: delnpfromtpg done!
> Dec 22 21:35:16 goose2 kernel: [  432.338838] iSCSI_TPG[1]_LUN[2] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 2
> Dec 22 21:35:16 goose2 kernel: [  432.338857] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 2
> Dec 22 21:35:16 goose2 kernel: [  432.338877] iSCSI_TPG[1]_LUN[1] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 1
> Dec 22 21:35:16 goose2 kernel: [  432.338894] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx9 Mapped LUN: 1
> Dec 22 21:35:16 goose2 kernel: [  432.338933] iSCSI_TPG[1] - Deleted ACL with TCQ Depth: 64 for iSCSI Initiator Node: iqn.1998-01.com.vmware:hv-esx9
> Dec 22 21:35:16 goose2 kernel: [  432.338965] iSCSI_TPG[1]_LUN[2] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 2
> Dec 22 21:35:16 goose2 kernel: [  432.338983] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 2
> Dec 22 21:35:16 goose2 kernel: [  432.339003] iSCSI_TPG[1]_LUN[1] - Removed ACL for InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 1
> Dec 22 21:35:16 goose2 kernel: [  432.339024] iSCSI_TPG[1] - Freeing ACL for iSCSI InitiatorNode: iqn.1998-01.com.vmware:hv-esx8 Mapped LUN: 1
> Dec 22 21:35:16 goose2 kernel: [  432.339063] iSCSI_TPG[1] - Deleted ACL with TCQ Depth: 64 for iSCSI Initiator Node: iqn.1998-01.com.vmware:hv-esx8
> Dec 22 21:35:16 goose2 kernel: [  432.339184] iSCSI_TPG[1]_LUN[2] - Deactivated iSCSI Logical Unit from device object
> Dec 22 21:35:16 goose2 kernel: [  432.339313] iSCSI_TPG[1]_LUN[1] - Deactivated iSCSI Logical Unit from device object
> Dec 22 21:35:16 goose2 kernel: [  432.339353] LIO_Target_ConfigFS: DEREGISTER -> Releasing TPG
> Dec 22 21:35:16 goose2 kernel: [  432.339355] Released 0 iSCSI Session(s) from Target Portal Group: 1
> Dec 22 21:35:16 goose2 kernel: [  432.339366] TARGET_CORE[Normal]: Deallocating iSCSI struct se_portal_group for endpoint: iqn.2011-07.com.zoner:inetstore.vmw0.san2 Portal Tag 1
> Dec 22 21:35:16 goose2 kernel: [  432.339398] CORE[iqn.2011-07.com.zoner:inetstore.vmw0.san2]_TPG[1] - Deleted iSCSI Target Portal Group
> ...
> Dec 22 21:35:16 goose2 lrmd: [3117]: info: operation stop[153] on tpgvmw0 for client 3120: pid 8900 exited with return code 0
> ...
> Dec 22 21:35:17 goose2 kernel: [  432.479815] Incremented dev->dev_cur_ordered_id: 42204 for SIMPLE: 42204
> Dec 22 21:35:17 goose2 kernel: [  432.479821] MARTIN: Adding ffff880210641a40 cmd to conn ffff880224924800 as qreq ffff88020fca4330
> Dec 22 21:35:17 goose2 kernel: [  432.483560] Incremented dev->dev_cur_ordered_id: 42205 for SIMPLE: 42241
> Dec 22 21:35:17 goose2 kernel: [  432.483564] MARTIN: Adding ffff88020e39eac0 cmd to conn ffff880224924800 as qreq ffff88020fca4230
> ...

Ok, so this is signaling that the connection is being allowed to fully
shutdown + release while se_cmd are still outstanding to TCM.
Definitely a regression bug..

Digging into this a bit more, I'm starting to think the root problem is
that the CMD_T_ACTIVE bit in se_cmd->transport_state is getting cleared
by transport_cmd_check_stop() during transport_generic_new_cmd() for
pending WRITEs before the cmd->se_tfo->write_pending() callback occurs,
but is not getting reset until way too late in target_complete_cmd().

This bit used to be set each time in transport_add_cmd_to_queue() when a
se_cmd was added to the legacy transport_processing_thread() logic
(including for iSCSI WRITEs), but this has since been removed starting
from v3.6 code.  (Here is the v3.5.x reference for good measure)

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=blob;f=drivers/target/target_core_transport.c;h=0981707216b7f937db1b7a6931b9df113048ecef;hb=refs/heads/linux-3.5.y#l600

So here is a quick patch to explicitly re-set CMD_T_ACTIVE within
target_execute_cmd(), which is the function that iscsi-target now calls
in >= v3.6.x to submit individual WRITEs once the full data payload has
been received.

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 978762d..1689c8b 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -1687,6 +1687,7 @@ void target_execute_cmd(struct se_cmd *cmd)
                return;
        }
 
+       cmd->transport_state |= CMD_T_ACTIVE;
        cmd->t_state = TRANSPORT_PROCESSING;
        spin_unlock_irq(&cmd->t_state_lock);
 
Please give it a shot, and let me know if your able to reproduce.

Thanks,

--nab
 
> Dec 22 21:35:17 goose2 kernel: [  432.510296] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san2
> Dec 22 21:35:17 goose2 kernel: [  432.510302] CORE[0] - Deleted iSCSI Target IQN: iqn.2011-07.com.zoner:inetstore.vmw0.san2
> Dec 22 21:35:17 goose2 kernel: [  432.510338] LIO_Target_ConfigFS: DEREGISTER -> iqn.2011-07.com.zoner:inetstore.vmw0.san1
> Dec 22 21:35:17 goose2 kernel: [  432.510340] CORE[0] - Deleted iSCSI Target IQN: iqn.2011-07.com.zoner:inetstore.vmw0.san1
> Dec 22 21:35:17 goose2 kernel: [  432.510485] Target_Core_ConfigFS: Calling se_free_virtual_device() for se_dev_ptr: ffff880211081c00
> Dec 22 21:35:17 goose2 kernel: [  432.510527] Target_Core_ConfigFS: Deallocating se_subsystem_dev_t: ffff8802102d1000
> Dec 22 21:35:17 goose2 kernel: [  432.510604] Target_Core_ConfigFS: Calling se_free_virtual_device() for se_dev_ptr: ffff88020fc71000
> Dec 22 21:35:17 goose2 kernel: [  432.510666] Target_Core_ConfigFS: Deallocating se_subsystem_dev_t: ffff8802103c3000
> Dec 22 21:35:17 goose2 kernel: [  432.510690] CORE_HBA[12] - Detached HBA from Generic Target Core
> Dec 22 21:35:17 goose2 kernel: [  432.510708] CORE_HBA[11] - Detached HBA from Generic Target Core
> Dec 22 21:35:17 goose2 kernel: [  432.510729] CORE_HBA[10] - Detached HBA from Generic Target Core
> ...
> Dec 22 21:35:44 goose2 kernel: [  460.265788] BUG: soft lockup - CPU#4 stuck for 22s! [prmount:8994]
> 
> Tested with 3.7.1 stable kernel, 3.6.x looks to have the same problem.
> I skipped v3.4 and v3.5 so I don't know if the bug was introduced in v3.6
> or earlier.
> 
> Martin
> 
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux