On 06/03/2015 12:06 PM, Alex Gorbachev wrote: > Hi Mike, > > On Tue, Jun 2, 2015 at 2:45 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote: >> On 06/02/2015 08:11 AM, Alex Gorbachev wrote: >>> Hi Mike, >>> >>> Thank you for your responses. I have sent last night's logs in a >>> separate email. Basically, in absence of any ceph related warnings, >>> there seems to be a large set of ESXi rescans, then a message such as: >>> >>> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login >>> timeout on Network Portal 10.70.2.211:3260 >> >> Yeah, it looks like some commands timed out, esx tried to >> abort and that failed. It then escalated to session level recovery and >> it could not relogin, because the login code was waiting for the >> outstanding tasks as we see this: > > Would you say there is no graceful mechanism for LIO and ESXi to just > give up on one login and try another one? I guess it is sort of being debated. SCST and I think tgt allow the new login even though a cmd/task is still hung. When discussing why LIO does not do the same, we thought there was something in the SCSI/iSCSI specs that said we had to do it how LIO does it. > >> >> >> Jun 1 22:56:56 roc-4r-scd212 kernel: [531817.070242] Unexpected ret: -32 send data 48 >> Jun 1 22:56:57 roc-4r-scd212 kernel: [531817.326345] Unexpected ret: -32 send data 48 >> Jun 1 22:57:01 roc-4r-scd212 kernel: [531821.680011] Unexpected ret: -32 send data 48 >> Jun 1 22:57:02 roc-4r-scd212 kernel: [531822.384283] Unexpected ret: -32 send data 48 >> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.376666] Unexpected ret: -32 send data 48 >> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login timeout on Network Portal 10.70.2.211:3260 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373632] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [iscsi_trx:23174] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373680] Modules linked in: rbd libceph libcrc32c iscsi_target_mod target_core_file target_core_pscsi target_core_iblock target_core_mod configfs xt_multiport iptable_filter ip_tables x_tables enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE) ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler 8250_fintek 8021q garp mrp stp shpchp llc wmi bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas ahci hid i2c_algo_bit mlx4_core dca libahci raid_class ptp scsi_transport_sas pps_core >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373742] CPU: 4 PID: 23174 Comm: iscsi_trx Tainted: G C OE 4.1.0-040100rc2-generic #201505032335 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373744] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373747] task: ffff88105912e4a0 ti: ffff880844a44000 task.ti: ffff880844a44000 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373749] RIP: 0010:[<ffffffff8180379d>] [<ffffffff8180379d>] _raw_spin_lock_bh+0x1d/0x70 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373758] RSP: 0018:ffff880844a47d08 EFLAGS: 00000286 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373760] RAX: 00000000abc6abc6 RBX: ffff8808446b5410 RCX: 00000000f7b8aa20 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373762] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8808446b5224 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373764] RBP: ffff880844a47d08 R08: ffff8808446b5410 R09: 0000000000000101 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373766] R10: 0000000000000004 R11: dead000000200200 R12: 0000000000000004 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373768] R13: dead000000200200 R14: 0000000000000004 R15: ffff880844a47ce8 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373770] FS: 0000000000000000(0000) GS:ffff88085fb00000(0000) knlGS:0000000000000000 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373774] CR2: 00007f19f67ff2d0 CR3: 0000000001e0f000 CR4: 00000000001407e0 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373776] Stack: >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373778] ffff880844a47d28 ffffffffc04194c7 ffff8808446b5140 ffff88085bfb7000 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373781] ffff880844a47d58 ffffffffc0424665 0000000064d064d0 ffff8808446b5140 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373784] 0000000000000001 0000000000000001 ffff880844a47d98 ffffffffc042473e >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373788] Call Trace: >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373804] [<ffffffffc04194c7>] iscsit_stop_dataout_timer+0x27/0x90 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373813] [<ffffffffc0424665>] __iscsit_free_cmd+0x85/0xa0 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373821] [<ffffffffc042473e>] iscsit_free_cmd+0xbe/0x160 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373829] [<ffffffffc042d8dc>] iscsit_close_connection+0x47c/0x770 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373837] [<ffffffffc0418c83>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373845] [<ffffffffc042c947>] iscsi_target_rx_thread+0x207/0x2b0 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373848] [<ffffffff817ff4fe>] ? __schedule+0x38e/0x8db >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373856] [<ffffffffc042c740>] ? iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod] >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373861] [<ffffffff8109cdc9>] kthread+0xc9/0xe0 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373865] [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373868] [<ffffffff81803fe2>] ret_from_fork+0x42/0x70 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373871] [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90 >> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373872] Code: 00 00 00 00 eb ce 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 81 05 0f 81 80 7e 00 02 00 00 48 89 e5 b8 00 00 02 00 f0 0f c1 07 <89> c2 c1 ea 10 66 39 c2 75 02 5d c3 0f b7 ca b8 00 80 00 00 44 >> >> >> >>> >>> Then more errors. I have attached the syslog from the LIO node, as >>> well as logs from one of the six ESXi nodes, but this one has the most >>> storage. >>> >>> Do you think there is an ESXi scan storm or such that breaks LIO? Or >> >> How are you detecting scans in the logs? The non existent lun messages in >> the logs right? If so, are the scans before or after the aborts? I thought it was >> after. In the logs I was looking at the issue at around Jun 1 22:41:23, so >> I'm not sure if there was a event before the log started or not. > > Right, the nonexistent LUN. There are scans before and after the > aborts (I sent you the logs via email from yesterday that show a lot > of these). In the log you sent before, the aborts happened first. -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html