Hi Mike,

I can also reproduce this behaviour. If I shut down a Ceph node, the delay while Ceph works out that the OSDs are down seems to trigger similar error messages. It seems fairly reliable that if an OSD is down for more than 10 seconds, LIO will hit this problem. Below is an excerpt from the kernel log showing the OSDs going down, timing out, and then coming back up. Even when the OSDs are back up, ESXi never seems to be able to resume IO to the LUN (at least not within 24 hours). However, as you may remember, I am working on the Active/Standby ALUA solution; when I promote the other LIO node to active for the LUN, access resumes immediately. This is a test/dev cluster for my ALUA resource agents, so please let me know if you need me to run any commands. Ideally it would be best to make LIO recover cleanly, but extending the timeout, if possible, would probably help in the majority of day-to-day cases.

Feb 9 16:33:34 ceph-iscsi1 kernel: [ 3656.412071] libceph: osd1 10.3.2.31:6800 socket closed (con state OPEN)
Feb 9 16:33:34 ceph-iscsi1 kernel: [ 3656.418621] libceph: osd0 10.3.2.31:6805 socket closed (con state OPEN)
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927352] ABORT_TASK: Found referenced iSCSI task_tag: 26731
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927356] ABORT_TASK: ref_tag: 26731 already complete, skipping
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927369] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26731
Feb 9 16:33:56 ceph-iscsi1 kernel: [ 3678.230684] libceph: osd0 down
Feb 9 16:33:56 ceph-iscsi1 kernel: [ 3678.230689] libceph: osd1 down
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279227] ABORT_TASK: Found referenced iSCSI task_tag: 26733
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279855] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 26733
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279861] ABORT_TASK: Found referenced iSCSI task_tag: 26735
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279862] ABORT_TASK: ref_tag: 26735 already complete, skipping
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279863] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26735
Feb 9 16:34:14 ceph-iscsi1 kernel: [ 3696.143515] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26739
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731706] ABORT_TASK: Found referenced iSCSI task_tag: 26745
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731711] ABORT_TASK: ref_tag: 26745 already complete, skipping
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731712] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26745
Feb 9 16:34:37 ceph-iscsi1 kernel: [ 3718.905787] libceph: osd1 up
Feb 9 16:34:40 ceph-iscsi1 kernel: [ 3721.904940] libceph: osd0 up
Feb 9 16:34:55 ceph-iscsi1 kernel: [ 3736.924123] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26754
Feb 9 16:35:22 ceph-iscsi1 kernel: [ 3764.102399] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26762
Feb 9 16:35:36 ceph-iscsi1 kernel: [ 3777.689640] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26768
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866891] ABORT_TASK: Found referenced iSCSI task_tag: 26777
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866896] ABORT_TASK: ref_tag: 26777 already complete, skipping
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866897] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26777
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011008] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011137] ABORT_TASK: Found referenced iSCSI task_tag: 26792
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011139] ABORT_TASK: ref_tag: 26792 already complete, skipping
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011141] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26792
Feb 9 16:36:44 ceph-iscsi1 kernel: [ 3845.613219] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000014
Feb 9 16:36:57 ceph-iscsi1 kernel: [ 3859.215023] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000001e
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791732] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000028
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791826] ABORT_TASK: Found referenced iSCSI task_tag: 26809
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791827] ABORT_TASK: ref_tag: 26809 already complete, skipping
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791828] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26809
Feb 9 16:37:24 ceph-iscsi1 kernel: [ 3886.378032] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000032
Feb 9 16:37:38 ceph-iscsi1 kernel: [ 3899.958060] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000003c
Feb 9 16:37:38 ceph-iscsi1 kernel: [ 3899.958144] ABORT_TASK: Found referenced iSCSI task_tag: 26819

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mike Christie
Sent: 06 February 2015 04:09
To: reistlin87; ceph-users@xxxxxxxxxxxxxx
Subject: Re: ISCSI LIO hang after 2-3 days of working

Not sure if there are multiple problems.

On 02/05/2015 04:46 AM, reistlin87 wrote:
> Feb 3 13:17:01 is-k13bi32e2s6vdi-002 CRON[10237]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
> Feb 3 13:25:01 is-k13bi32e2s6vdi-002 CRON[10242]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:35:01 is-k13bi32e2s6vdi-002 CRON[10247]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:45:01 is-k13bi32e2s6vdi-002 CRON[10252]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:55:01 is-k13bi32e2s6vdi-002 CRON[10258]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553713] Detected MISCOMPARE for addr: ffff88007989d000 buf: ffff88007b1ac800
> Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553969] Target/iblock: Send MISCOMPARE check condition and sense

For some reason COMPARE_AND_WRITE/ATS failed, indicating the data miscompared. I am not sure if that is related to the hang or not.

> Feb 3 14:05:01 is-k13bi32e2s6vdi-002 CRON[10263]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 14:05:17 is-k13bi32e2s6vdi-002 kernel: [699672.627202] ABORT_TASK: Found referenced iSCSI task_tag: 5216104
......

A couple of minutes later we see commands have timed out and ESXi is trying to abort them to figure out what is going on and clean them up.

> Feb 3 14:22:17 is-k13bi32e2s6vdi-002 kernel: [700693.927187] iSCSI Login timeout on Network Portal 10.1.1.7:3260

That did not go well. ESXi did not get the responses it wanted, and it looks like it escalated its error handling process and is trying to relogin. This hangs, because it looks like LIO is waiting for commands it has sent to RBD to complete.

> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613521] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [iscsi_ttx:2002]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613764] Modules linked in: ib_srpt(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) tcm_qla2xxx(E) qla2xxx(E) tcm_loop(E) tcm_fc(E) libfc(E) scsi_transport_fc(E) iscsi_target_mod(E) target_core_pscsi(E) target_core_file(E) target_core_iblock(E) target_core_mod(E) configfs(E) rbd(E) libceph(E) libcrc32c(E) ppdev(E) coretemp(E) microcode(E) serio_raw(E) i2c_piix4(E) shpchp(E) parport_pc(E) lp(E) parport(E) psmouse(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.614912] irq event stamp: 24644920
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615021] hardirqs last enabled at (24644919): [<ffffffff8105e5bd>] __local_bh_enable_ip+0x6d/0xd0
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615307] hardirqs last disabled at (24644920): [<ffffffff8173c8b7>] _raw_spin_lock_irq+0x17/0x50
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615560] softirqs last enabled at (24644918): [<ffffffffa01792d2>] iscsit_conn_all_queues_empty+0x72/0x90 [iscsi_target_mod]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615904] softirqs last disabled at (24644916): [<ffffffffa01792b8>] iscsit_conn_all_queues_empty+0x58/0x90 [iscsi_target_mod]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616212] CPU: 1 PID: 2002 Comm: iscsi_ttx Tainted: G E 3.18.0-ceph #1
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616429] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/09/2012
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616705] task: ffff88007a9921d0 ti: ffff88007a99c000 task.ti: ffff88007a99c000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616916] RIP: 0010:[<ffffffff8173cf91>] [<ffffffff8173cf91>] _raw_spin_unlock_irqrestore+0x41/0x70
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617173] RSP: 0018:ffff88007a99fc78 EFLAGS: 00000282
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617311] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000047e347e2
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617516] RDX: 00000000000047e3 RSI: 0000000000000001 RDI: ffffffff8173cf8f
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617720] RBP: ffff88007a99fc88 R08: 0000000000000001 R09: 0000000000000000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617924] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa01137bb
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618128] R13: 0000000000000000 R14: 0000000000000046 R15: ffff88007a99fc58
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618445] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618669] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618817] CR2: 00007f30e19f2000 CR3: 0000000036bfa000 CR4: 00000000000007e0
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619095] Stack:
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619167] ffff88007b6b6330 ffff88007b6b6480 ffff88007a99fcc8 ffffffffa011384b
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619388] ffff88007a99fcb8 ffffffff8173cf09 0000000000000001 ffff88007b6b6330
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619608] 0000000000000001 0000000000000001 ffff88007a99fcf8 ffffffffa0115285
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619830] Call Trace:
Is the LIO box still in this state, by any chance? If so, could you send the output of cat /sys/kernel/debug/ceph/<fsid>/osdc for the rbd devices being used by LIO? Could you also send the ESXi vmkernel log (I think it is in /var/log/vmkernel.log) for when this happened?
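
In case it is useful, here is a rough sketch of how that output could be gathered on the LIO node. This is purely illustrative, not an official tool: it assumes debugfs is mounted at /sys/kernel/debug (e.g. via "mount -t debugfs none /sys/kernel/debug") and simply loops over whatever libceph client directories exist there.

  # Illustrative sketch only: print the in-flight OSD request list (osdc)
  # for every libceph client instance found under debugfs.
  for d in /sys/kernel/debug/ceph/*/; do
      echo "=== ${d}osdc ==="
      cat "${d}osdc"
  done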