Hi Mike,

I can also reproduce this behaviour. If I shut down a Ceph node, the delay while Ceph works out that the OSDs are down seems to trigger similar error messages. It seems fairly reliable that if an OSD is down for more than 10 seconds, LIO will hit this problem. Below is an excerpt from the kernel log showing the OSDs going down, timing out, and then coming back up. Even when the OSDs are back up, ESXi never seems to be able to resume IO to the LUN (at least not within 24 hours). However, as you may remember, I am working on the Active/Standby ALUA solution; when I promote the other LIO node to active for the LUN, access resumes immediately. This is a test/dev cluster for my ALUA resource agents, so please let me know if you need me to run any commands. Ideally it would be best to make LIO recover cleanly, but extending the timeout, if possible, would probably help in the majority of day-to-day cases.

Feb 9 16:33:34 ceph-iscsi1 kernel: [ 3656.412071] libceph: osd1 10.3.2.31:6800 socket closed (con state OPEN)
Feb 9 16:33:34 ceph-iscsi1 kernel: [ 3656.418621] libceph: osd0 10.3.2.31:6805 socket closed (con state OPEN)
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927352] ABORT_TASK: Found referenced iSCSI task_tag: 26731
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927356] ABORT_TASK: ref_tag: 26731 already complete, skipping
Feb 9 16:33:45 ceph-iscsi1 kernel: [ 3666.927369] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26731
Feb 9 16:33:56 ceph-iscsi1 kernel: [ 3678.230684] libceph: osd0 down
Feb 9 16:33:56 ceph-iscsi1 kernel: [ 3678.230689] libceph: osd1 down
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279227] ABORT_TASK: Found referenced iSCSI task_tag: 26733
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279855] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 26733
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279861] ABORT_TASK: Found referenced iSCSI task_tag: 26735
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279862] ABORT_TASK: ref_tag: 26735 already complete, skipping
Feb 9 16:33:57 ceph-iscsi1 kernel: [ 3679.279863] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26735
Feb 9 16:34:14 ceph-iscsi1 kernel: [ 3696.143515] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26739
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731706] ABORT_TASK: Found referenced iSCSI task_tag: 26745
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731711] ABORT_TASK: ref_tag: 26745 already complete, skipping
Feb 9 16:34:28 ceph-iscsi1 kernel: [ 3709.731712] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26745
Feb 9 16:34:37 ceph-iscsi1 kernel: [ 3718.905787] libceph: osd1 up
Feb 9 16:34:40 ceph-iscsi1 kernel: [ 3721.904940] libceph: osd0 up
Feb 9 16:34:55 ceph-iscsi1 kernel: [ 3736.924123] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26754
Feb 9 16:35:22 ceph-iscsi1 kernel: [ 3764.102399] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26762
Feb 9 16:35:36 ceph-iscsi1 kernel: [ 3777.689640] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26768
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866891] ABORT_TASK: Found referenced iSCSI task_tag: 26777
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866896] ABORT_TASK: ref_tag: 26777 already complete, skipping
Feb 9 16:36:03 ceph-iscsi1 kernel: [ 3804.866897] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26777
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011008] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000000a
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011137] ABORT_TASK: Found referenced iSCSI task_tag: 26792
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011139] ABORT_TASK: ref_tag: 26792 already complete, skipping
Feb 9 16:36:30 ceph-iscsi1 kernel: [ 3832.011141] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26792
Feb 9 16:36:44 ceph-iscsi1 kernel: [ 3845.613219] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000014
Feb 9 16:36:57 ceph-iscsi1 kernel: [ 3859.215023] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000001e
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791732] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000028
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791826] ABORT_TASK: Found referenced iSCSI task_tag: 26809
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791827] ABORT_TASK: ref_tag: 26809 already complete, skipping
Feb 9 16:37:11 ceph-iscsi1 kernel: [ 3872.791828] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 26809
Feb 9 16:37:24 ceph-iscsi1 kernel: [ 3886.378032] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000032
Feb 9 16:37:38 ceph-iscsi1 kernel: [ 3899.958060] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000003c
Feb 9 16:37:38 ceph-iscsi1 kernel: [ 3899.958144] ABORT_TASK: Found referenced iSCSI task_tag: 26819

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mike Christie
Sent: 06 February 2015 04:09
To: reistlin87; ceph-users@xxxxxxxxxxxxxx
Subject: Re: ISCSI LIO hang after 2-3 days of working

Not sure if there are multiple problems.

On 02/05/2015 04:46 AM, reistlin87 wrote:
> Feb 3 13:17:01 is-k13bi32e2s6vdi-002 CRON[10237]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
> Feb 3 13:25:01 is-k13bi32e2s6vdi-002 CRON[10242]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:35:01 is-k13bi32e2s6vdi-002 CRON[10247]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:45:01 is-k13bi32e2s6vdi-002 CRON[10252]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 13:55:01 is-k13bi32e2s6vdi-002 CRON[10258]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553713] Detected MISCOMPARE for addr: ffff88007989d000 buf: ffff88007b1ac800
> Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553969] Target/iblock: Send MISCOMPARE check condition and sense

For some reason COMPARE_AND_WRITE/ATS failed, indicating the data miscompared. I am not sure if that is related to the hang or not.

> Feb 3 14:05:01 is-k13bi32e2s6vdi-002 CRON[10263]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Feb 3 14:05:17 is-k13bi32e2s6vdi-002 kernel: [699672.627202] ABORT_TASK: Found referenced iSCSI task_tag: 5216104
......

A couple of minutes later we see commands have timed out and ESXi is trying to abort them to figure out what is going on and clean them up.

> Feb 3 14:22:17 is-k13bi32e2s6vdi-002 kernel: [700693.927187] iSCSI Login timeout on Network Portal 10.1.1.7:3260

That did not go well. ESXi did not get the responses it wanted, and it looks like it escalated its error handling process and is trying to relogin. This hangs, because it looks like LIO is waiting for commands it has sent to RBD to complete.

> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613521] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [iscsi_ttx:2002]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613764] Modules linked in: ib_srpt(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) tcm_qla2xxx(E) qla2xxx(E) tcm_loop(E) tcm_fc(E) libfc(E) scsi_transport_fc(E) iscsi_target_mod(E) target_core_pscsi(E) target_core_file(E) target_core_iblock(E) target_core_mod(E) configfs(E) rbd(E) libceph(E) libcrc32c(E) ppdev(E) coretemp(E) microcode(E) serio_raw(E) i2c_piix4(E) shpchp(E) parport_pc(E) lp(E) parport(E) psmouse(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.614912] irq event stamp: 24644920
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615021] hardirqs last enabled at (24644919): [<ffffffff8105e5bd>] __local_bh_enable_ip+0x6d/0xd0
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615307] hardirqs last disabled at (24644920): [<ffffffff8173c8b7>] _raw_spin_lock_irq+0x17/0x50
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615560] softirqs last enabled at (24644918): [<ffffffffa01792d2>] iscsit_conn_all_queues_empty+0x72/0x90 [iscsi_target_mod]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615904] softirqs last disabled at (24644916): [<ffffffffa01792b8>] iscsit_conn_all_queues_empty+0x58/0x90 [iscsi_target_mod]
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616212] CPU: 1 PID: 2002 Comm: iscsi_ttx Tainted: G E 3.18.0-ceph #1
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616429] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/09/2012
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616705] task: ffff88007a9921d0 ti: ffff88007a99c000 task.ti: ffff88007a99c000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616916] RIP: 0010:[<ffffffff8173cf91>] [<ffffffff8173cf91>] _raw_spin_unlock_irqrestore+0x41/0x70
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617173] RSP: 0018:ffff88007a99fc78 EFLAGS: 00000282
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617311] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000047e347e2
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617516] RDX: 00000000000047e3 RSI: 0000000000000001 RDI: ffffffff8173cf8f
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617720] RBP: ffff88007a99fc88 R08: 0000000000000001 R09: 0000000000000000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617924] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa01137bb
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618128] R13: 0000000000000000 R14: 0000000000000046 R15: ffff88007a99fc58
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618445] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618669] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618817] CR2: 00007f30e19f2000 CR3: 0000000036bfa000 CR4: 00000000000007e0
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619095] Stack:
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619167] ffff88007b6b6330 ffff88007b6b6480 ffff88007a99fcc8 ffffffffa011384b
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619388] ffff88007a99fcb8 ffffffff8173cf09 0000000000000001 ffff88007b6b6330
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619608] 0000000000000001 0000000000000001 ffff88007a99fcf8 ffffffffa0115285
> Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619830] Call Trace:
Is the LIO box still in this state, by any chance? If so, could you send the output of cat /sys/kernel/debug/ceph/<fsid>/osdc for the rbd devices being used by LIO? Could you also send the ESXi vmkernel log (I think it is in /var/log/vmkernel.log) for when this happened?
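
In case it is useful, here is a rough sketch of how that output could be gathered on the LIO node. This is purely illustrative, not an official tool: it assumes debugfs is mounted at /sys/kernel/debug (e.g. via "mount -t debugfs none /sys/kernel/debug") and simply loops over whatever libceph client directories exist there.

  # Illustrative sketch only: print the in-flight OSD request list (osdc)
  # for every libceph client instance found under debugfs.
  for d in /sys/kernel/debug/ceph/*/; do
      echo "=== ${d}osdc ==="
      cat "${d}osdc"
  done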