Not sure if there are multiple problems. On 02/05/2015 04:46 AM, reistlin87 wrote: > Feb 3 13:17:01 is-k13bi32e2s6vdi-002 CRON[10237]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) > Feb 3 13:25:01 is-k13bi32e2s6vdi-002 CRON[10242]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) > Feb 3 13:35:01 is-k13bi32e2s6vdi-002 CRON[10247]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) > Feb 3 13:45:01 is-k13bi32e2s6vdi-002 CRON[10252]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) > Feb 3 13:55:01 is-k13bi32e2s6vdi-002 CRON[10258]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) > Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553713] Detected MISCOMPARE for addr: ffff88007989d000 buf: ffff88007b1ac800 > Feb 3 14:02:48 is-k13bi32e2s6vdi-002 kernel: [699523.553969] Target/iblock: Send MISCOMPARE check condition and sense For some reason COMPARE_AND_WRITE/ATS failed indicating data miscompared. I am not sure if that is related to the hang or not. > Feb 3 14:05:01 is-k13bi32e2s6vdi-002 CRON[10263]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) > Feb 3 14:05:17 is-k13bi32e2s6vdi-002 kernel: [699672.627202] ABORT_TASK: Found referenced iSCSI task_tag: 5216104 ...... A couple minutes later we see commands have timed out and ESXi is trying to abort them to figure out what is going on and clean them up. > Feb 3 14:22:17 is-k13bi32e2s6vdi-002 kernel: [700693.927187] iSCSI Login timeout on Network Portal 10.1.1.7:3260 That did not go well. ESXi did not get the responses it wanted and it looks like it escalated its error handling process and is trying to relogin. This hangs, because it looks like LIO is waiting for commands it has sent to RBD to complete. > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613521] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [iscsi_ttx:2002] > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.613764] Modules linked in: ib_srpt(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) tcm_qla2xxx(E) qla2xxx(E) tcm_loop( > E) tcm_fc(E) libfc(E) scsi_transport_fc(E) iscsi_target_mod(E) target_core_pscsi(E) target_core_file(E) target_core_iblock(E) target_core_mod(E) configfs(E) rbd(E) libceph(E) li > bcrc32c(E) ppdev(E) coretemp(E) microcode(E) serio_raw(E) i2c_piix4(E) shpchp(E) parport_pc(E) lp(E) parport(E) psmouse(E) floppy(E) vmw_pvscsi(E) vmxnet3(E) > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.614912] irq event stamp: 24644920 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615021] hardirqs last enabled at (24644919): [<ffffffff8105e5bd>] __local_bh_enable_ip+0x6d/0xd0 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615307] hardirqs last disabled at (24644920): [<ffffffff8173c8b7>] _raw_spin_lock_irq+0x17/0x50 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615560] softirqs last enabled at (24644918): [<ffffffffa01792d2>] iscsit_conn_all_queues_empty+0x72/0x90 [iscsi_target_mod] > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.615904] softirqs last disabled at (24644916): [<ffffffffa01792b8>] iscsit_conn_all_queues_empty+0x58/0x90 [iscsi_target_mod] > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616212] CPU: 1 PID: 2002 Comm: iscsi_ttx Tainted: G E 3.18.0-ceph #1 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616429] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/09/2012 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616705] task: ffff88007a9921d0 ti: ffff88007a99c000 task.ti: ffff88007a99c000 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.616916] RIP: 0010:[<ffffffff8173cf91>] [<ffffffff8173cf91>] _raw_spin_unlock_irqrestore+0x41/0x70 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617173] RSP: 0018:ffff88007a99fc78 EFLAGS: 00000282 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617311] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000047e347e2 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617516] RDX: 00000000000047e3 RSI: 0000000000000001 RDI: ffffffff8173cf8f > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617720] RBP: ffff88007a99fc88 R08: 0000000000000001 R09: 0000000000000000 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.617924] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa01137bb > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618128] R13: 0000000000000000 R14: 0000000000000046 R15: ffff88007a99fc58 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618445] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618669] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.618817] CR2: 00007f30e19f2000 CR3: 0000000036bfa000 CR4: 00000000000007e0 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619095] Stack: > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619167] ffff88007b6b6330 ffff88007b6b6480 ffff88007a99fcc8 ffffffffa011384b > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619388] ffff88007a99fcb8 ffffffff8173cf09 0000000000000001 ffff88007b6b6330 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619608] 0000000000000001 0000000000000001 ffff88007a99fcf8 ffffffffa0115285 > Feb 3 14:22:30 is-k13bi32e2s6vdi-002 kernel: [700706.619830] Call Trace: Is the LIO box in this state still by any chance? If so could you send the output of cat /sys/kernel/debug/ceph/<fsid>/osdc for the rbd devices being used by lio? Could you also send the ESXi vmkernel log (I think it is in /var/log/vmkernel.log) for when this happened? _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com