iscsi issues with ceph (Nautilus) + tcmu-runner

So, we've been running with iscsi enabled (tcmu-runner) on our Nautilus ceph
cluster for a couple of weeks, and started using it with our vsphere cluster.
Things looked good so we put it in production, but yesterday morning we
experienced a freeze of all iSCSI I/O on one of the ESXi nodes, and the only way
to recover was a reboot of the client node (and the VMs with it).

At first I thought this was a glitch on the VMware side, as we saw lots of
warnings, but after digging in the kernel logs on the ceph side, we saw quite
a few of the messages below, leading up to the freeze.

We've got two iscsi gateways enabled, and have set things up as per
https://docs.ceph.com/docs/mimic/rbd/iscsi-initiators/

Kernel is 4.19 / Debian 10. This is a single 10G link for the ESXi nodes, with a
separate vlan and a dedicated vmkernel nic on the storage vlan to talk to the
iscsi gateways on the ceph side. Ceph is 10 nodes with a mix of HDD and SSD.
The affected pool is replicated on HDD, with NVMe block.db devices in front.

No overruns / errors on the interfaces, and no mtu mismatch - not that I believe
either is connected to the error below. No errors/warnings on the ceph side.

Anyone seen this before? I can provide more details/logs as required. If this
happens again, we'll be moving the VMs to an NFS-backed vm store (even
if iscsi is probably preferred) until we can find a solution to go into full
prod.

Cheers,
Phil


[Tue May 12 05:15:45 2020] WARNING: CPU: 8 PID: 2448784 at kernel/workqueue.c:2917 __flush_work.cold.54+0x1f/0x29
[Tue May 12 05:15:45 2020] Modules linked in: fuse cbc ceph libceph libcrc32c crc32c_generic fscache target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_user uio target_core_mod configfs binfmt_misc 8021q garp stp mrp llc bonding intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ast cryptd ipmi_ssif ttm intel_cstate drm_kms_helper intel_uncore mei_me iTCO_wdt drm pcc_cpufreq intel_rapl_perf joydev pcspkr sg iTCO_vendor_support mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter evdev dm_mod sunrpc ip_tables x_tables autofs4 squashfs zstd_decompress xxhash loop overlay hid_generic usbhid hid sd_mod ahci xhci_pci ehci_pci libahci ehci_hcd xhci_hcd libata ixgbe igb nvme mxm_wmi lpc_ich
[Tue May 12 05:15:45 2020]  dca usbcore i2c_i801 scsi_mod mfd_core mdio usb_common nvme_core crc32c_intel i2c_algo_bit wmi button
[Tue May 12 05:15:45 2020] CPU: 8 PID: 2448784 Comm: kworker/u32:0 Tainted: G        W         4.19.0-8-amd64 #1 Debian 4.19.98-1
[Tue May 12 05:15:45 2020] Hardware name: Supermicro SYS-6018R-TD8/X10DDW-i, BIOS 3.2 12/16/2019
[Tue May 12 05:15:45 2020] Workqueue: tmr-user target_tmr_work [target_core_mod]
[Tue May 12 05:15:45 2020] RIP: 0010:__flush_work.cold.54+0x1f/0x29
[Tue May 12 05:15:45 2020] Code: 69 2c 04 00 0f 0b e9 4a d3 ff ff 48 c7 c7 d8 f9 83 be e8 56 2c 04 00 0f 0b e9 41 d6 ff ff 48 c7 c7 d8 f9 83 be e8 43 2c 04 00 <0f> 0b 45 31 ed e9 2b d6 ff ff 49 8d b4 24 b0 00 00 00 48 c7 c7 b8
[Tue May 12 05:15:45 2020] RSP: 0018:ffffa0180b537d38 EFLAGS: 00010246
[Tue May 12 05:15:45 2020] RAX: 0000000000000024 RBX: ffff94f619c3fad0 RCX: 0000000000000006
[Tue May 12 05:15:45 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff94f77fa166b0
[Tue May 12 05:15:45 2020] RBP: ffff94f619c3fad0 R08: 00000000000013ea R09: 0000000000aaaaaa
[Tue May 12 05:15:45 2020] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[Tue May 12 05:15:45 2020] R13: 0000000000000001 R14: ffffffffbda98400 R15: ffff94f5a3ecc088
[Tue May 12 05:15:45 2020] FS:  0000000000000000(0000) GS:ffff94f77fa00000(0000) knlGS:0000000000000000
[Tue May 12 05:15:45 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue May 12 05:15:45 2020] CR2: 0000555d72f0f000 CR3: 000000048240a005 CR4: 00000000003606e0
[Tue May 12 05:15:45 2020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue May 12 05:15:45 2020] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue May 12 05:15:45 2020] Call Trace:
[Tue May 12 05:15:45 2020]  ? __irq_work_queue_local+0x50/0x60
[Tue May 12 05:15:45 2020]  ? irq_work_queue+0x46/0x50
[Tue May 12 05:15:45 2020]  ? wake_up_klogd+0x30/0x40
[Tue May 12 05:15:45 2020]  ? vprintk_emit+0x215/0x270
[Tue May 12 05:15:45 2020]  ? get_work_pool+0x40/0x40
[Tue May 12 05:15:45 2020]  __cancel_work_timer+0x10a/0x190
[Tue May 12 05:15:45 2020]  ? printk+0x58/0x6f
[Tue May 12 05:15:45 2020]  core_tmr_abort_task+0xd6/0x130 [target_core_mod]
[Tue May 12 05:15:45 2020]  target_tmr_work+0xc4/0x140 [target_core_mod]
[Tue May 12 05:15:45 2020]  process_one_work+0x1a7/0x3a0
[Tue May 12 05:15:45 2020]  worker_thread+0x30/0x390
[Tue May 12 05:15:45 2020]  ? create_worker+0x1a0/0x1a0
[Tue May 12 05:15:45 2020]  kthread+0x112/0x130
[Tue May 12 05:15:45 2020]  ? kthread_bind+0x30/0x30
[Tue May 12 05:15:45 2020]  ret_from_fork+0x1f/0x40
[Tue May 12 05:15:45 2020] ---[ end trace 3835b5fe0aa99038 ]---
[Tue May 12 05:15:45 2020] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 94585265
[Tue May 12 05:15:45 2020] ABORT_TASK: Found referenced iSCSI task_tag: 94585266
[Tue May 12 05:15:45 2020] ------------[ cut here ]------------
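For anyone wanting to check whether their gateways are hitting the same thing: a quick sketch of filtering the gateway kernel log for ABORT_TASK task-management events, which precede the freeze in the trace above. In production you would pipe in `dmesg` or `journalctl -k`; here a sample of the log is inlined so the snippet is self-contained (the tag values are just the ones from the trace above).

```shell
# Inlined sample of the kernel log; replace with `dmesg` / `journalctl -k`
# output on a real gateway.
log='[Tue May 12 05:15:45 2020] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 94585265
[Tue May 12 05:15:45 2020] ABORT_TASK: Found referenced iSCSI task_tag: 94585266'

# Count ABORT_TASK events - a rising count suggests the initiator is
# repeatedly aborting commands that the target never completes.
printf '%s\n' "$log" | grep -c 'ABORT_TASK:'
```

On our gateways, watching this count alongside the ESXi-side latency warnings is how we correlated the two sides of the freeze.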
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
