Hello,

Below is the first trace of this problem; does anyone have an idea how to proceed?

[do sep 16 06:06:15 2021] WARNING: CPU: 5 PID: 12793 at net/ceph/osd_client.c:558 request_reinit+0x12f/0x150 [libceph]
[do sep 16 06:06:15 2021] Modules linked in: rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv3 nfs_acl nfs lockd grace fscache rbd libceph tun iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport dm_multipath xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sunrpc skx_edac crct10dif_pclmul crc32_pclmul nls_iso8859_1 ghash_clmulni_intel pcbc nls_cp437 vfat aesni_intel aes_x86_64 crypto_simd cryptd fat glue_helper dm_mod ipmi_si ipmi_devintf sg wdat_wdt ipmi_msghandler lpc_ich i2c_i801 acpi_power_meter ip_tables x_tables sd_mod hid_generic usbhid hid ahci libahci xhci_pci i40e(O) megaraid_sas(O)
[do sep 16 06:06:15 2021]  libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[do sep 16 06:06:15 2021] CPU: 5 PID: 12793 Comm: kworker/5:1 Tainted: G W O 4.19.0+1 #1
[do sep 16 06:06:15 2021] Workqueue: events handle_timeout [libceph]
[do sep 16 06:06:15 2021] RIP: e030:request_reinit+0x12f/0x150 [libceph]
[do sep 16 06:06:15 2021] Code: 89 f9 48 c7 c2 b1 f7 5f c0 48 c7 c6 96 2d 60 c0 48 c7 c7 98 db 61 c0 31 c0 e8 fd 24 dd c0 e9 37 ff ff ff 0f 0b e9 41 ff ff ff <0f> 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00 00 00 e9 42 ff ff ff
[do sep 16 06:06:15 2021] RSP: e02b:ffffc9004e863da0 EFLAGS: 00010202
[do sep 16 06:06:15 2021] RAX: 0000000000000002 RBX: ffff888214114400 RCX: 0000000000000000
[do sep 16 06:06:15 2021] RDX: ffff88823eb4c378 RSI: ffff888254ff0200 RDI: ffff88823eb4c0c0
[do sep 16 06:06:15 2021] RBP: ffff888214114d00 R08: 0000000000000000 R09: 0000000000000000
[do sep 16 06:06:15 2021] R10: ffff88823eb4c358 R11: ffff888249bd7760 R12: ffff88823eb4c0c0
[do sep 16 06:06:15 2021] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[do sep 16 06:06:15 2021] FS:  00007fa0bfce8880(0000) GS:ffff8882ae140000(0000) knlGS:0000000000000000
[do sep 16 06:06:15 2021] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[do sep 16 06:06:15 2021] CR2: 00007fd99b97e8d0 CR3: 000000027dda0000 CR4: 0000000000040660
[do sep 16 06:06:15 2021] Call Trace:
[do sep 16 06:06:15 2021]  handle_timeout+0x398/0x6f0 [libceph]
[do sep 16 06:06:15 2021]  process_one_work+0x165/0x370
[do sep 16 06:06:15 2021]  worker_thread+0x49/0x3e0
[do sep 16 06:06:15 2021]  kthread+0xf8/0x130
[do sep 16 06:06:15 2021]  ? rescuer_thread+0x310/0x310
[do sep 16 06:06:15 2021]  ? kthread_bind+0x10/0x10
[do sep 16 06:06:15 2021]  ret_from_fork+0x35/0x40
[do sep 16 06:06:15 2021] ---[ end trace 9611d2cd27856122 ]---
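One way to gather more context the next time this happens: the modules list shows rbd/libceph, so the kernel client's debugfs state should be available. The following is only a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug on the XCP-ng hosts (run as root; adjust the poll interval as needed). It periodically dumps the osdc/monc/osdmap files, so a stall can be matched against the requests the client had in flight at that moment:

#!/usr/bin/env python3
"""Snapshot kernel Ceph client state while the rbd stalls are happening.

Minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the
host uses the kernel rbd client, so /sys/kernel/debug/ceph/<fsid>.client<id>/
contains 'osdc', 'monc' and 'osdmap'. Run as root and redirect stdout to a file.
"""
import glob
import pathlib
import time

DEBUG_DIR = "/sys/kernel/debug/ceph"
FILES = ("osdc", "monc", "osdmap")  # in-flight OSD requests, mon session, osdmap epoch

def snapshot():
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for client_dir in glob.glob(f"{DEBUG_DIR}/*"):
        for name in FILES:
            path = pathlib.Path(client_dir, name)
            try:
                data = path.read_text()
            except OSError:
                continue  # client went away or file not present on this kernel
            print(f"--- {stamp} {path} ---")
            print(data.rstrip())

if __name__ == "__main__":
    # Poll every 10 seconds; a non-empty 'osdc' during a stall shows which
    # OSDs the hung requests are targeting, which can then be correlated
    # with the OSD and network logs on the Ceph side.
    while True:
        snapshot()
        time.sleep(10)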
On Thu, 9 Sept 2021 at 12:30, Leon Ruumpol <l.ruumpol@xxxxxxxxx> wrote:
> Hello,
>
> We have a Ceph cluster with CephFS and RBD images enabled; from XCP-ng we
> connect directly to the RBD images. Several times a day the VMs suffer from
> a high load/iowait which makes them temporarily inaccessible (around 10-30
> seconds). In the logs on the XCP-ng hosts I find this:
>
> [Thu Sep 9 02:16:06 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 02:17:47 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 02:18:55 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 02:19:54 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 02:49:39 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 03:47:25 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 03:48:07 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 04:47:30 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 04:47:55 2021] rbd: rbd4: encountered watch error: -107
>
> Xen version: XCP-ng release 8.2.0 (xenenterprise) / kernel 4.19.0+1 /
> running on 4 physical nodes.
>
> The Ceph cluster consists of 6 physical nodes with 48 OSDs (NVMe), 3 mgr,
> 3 mon and 3 mds services, connected with 2x10Gbps trunks from all hosts.
> Ceph status/detail is OK, with no iowait, high CPU or network spikes. We
> have looked in the logs for a reason, but we are unable to match these
> events with anything. Sometimes a scrub is in progress during these watch
> errors, but not always. Where is the best place to continue the search?
>
> ceph.conf:
>
> [global]
> fsid = ******
> mon_initial_members = ceph-c01-mon-n1, ceph-c01-mon-n2, ceph-c01-mon-n3
> mon_host = *.*.*.170,*.*.*.171,*.*.*.172
> public network = *.*.*.0/19
> cluster network = *.*.*.0/19
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Ceph versions:
>
> {
>     "mon": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.22 () nautilus (stable)": 48
>     },
>     "mds": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "overall": {
>         "ceph version 14.2.22 () nautilus (stable)": 57
>     }
> }
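For reference, error -107 in those rbd messages is -ENOTCONN: the kernel client noticed that its watch on the image header object was disconnected and has to re-register it. A follow-up check from a cluster node is to see whether each image's watch comes back after such an error. Below is a minimal sketch only, assuming the Nautilus python3-rados/python3-rbd bindings are installed and expose Image.watchers_list() (the `rbd status <pool>/<image>` CLI command shows the same information); POOL and IMAGES are placeholders to be replaced with the real pool and image names used by the XCP-ng hosts:

#!/usr/bin/env python3
"""List the clients currently holding a watch on the affected RBD images.

Minimal sketch; POOL and IMAGES are placeholders, and Image.watchers_list()
is assumed to be available in the installed python-rbd bindings.
"""
import rados
import rbd

POOL = "rbd"            # placeholder: pool holding the XCP-ng images
IMAGES = ["vm-disk-1"]  # placeholder: image names to check

with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
    with cluster.open_ioctx(POOL) as ioctx:
        for name in IMAGES:
            with rbd.Image(ioctx, name, read_only=True) as image:
                print(f"watchers on {POOL}/{name}:")
                for watcher in image.watchers_list():
                    # each entry identifies the watching client (address,
                    # global id and watch cookie in current bindings)
                    print(f"  {watcher}")

An image whose watcher list stays empty for a while after a -107 message, or a watcher address that is not the expected XCP-ng host, would be worth correlating with the OSD and network logs from that moment.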