Hello,

Below is the first trace of this problem; does anyone have an idea how to proceed?

[do sep 16 06:06:15 2021] WARNING: CPU: 5 PID: 12793 at net/ceph/osd_client.c:558 request_reinit+0x12f/0x150 [libceph]
[do sep 16 06:06:15 2021] Modules linked in: rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv3 nfs_acl nfs lockd grace fscache rbd libceph tun iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport dm_multipath xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sunrpc skx_edac crct10dif_pclmul crc32_pclmul nls_iso8859_1 ghash_clmulni_intel pcbc nls_cp437 vfat aesni_intel aes_x86_64 crypto_simd cryptd fat glue_helper dm_mod ipmi_si ipmi_devintf sg wdat_wdt ipmi_msghandler lpc_ich i2c_i801 acpi_power_meter ip_tables x_tables sd_mod hid_generic usbhid hid ahci libahci xhci_pci i40e(O) megaraid_sas(O)
[do sep 16 06:06:15 2021]  libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[do sep 16 06:06:15 2021] CPU: 5 PID: 12793 Comm: kworker/5:1 Tainted: G W O 4.19.0+1 #1
[do sep 16 06:06:15 2021] Workqueue: events handle_timeout [libceph]
[do sep 16 06:06:15 2021] RIP: e030:request_reinit+0x12f/0x150 [libceph]
[do sep 16 06:06:15 2021] Code: 89 f9 48 c7 c2 b1 f7 5f c0 48 c7 c6 96 2d 60 c0 48 c7 c7 98 db 61 c0 31 c0 e8 fd 24 dd c0 e9 37 ff ff ff 0f 0b e9 41 ff ff ff <0f> 0b e9 60 ff ff ff 0f 0b 0f 1f 84 00 00 00 00 00 e9 42 ff ff ff
[do sep 16 06:06:15 2021] RSP: e02b:ffffc9004e863da0 EFLAGS: 00010202
[do sep 16 06:06:15 2021] RAX: 0000000000000002 RBX: ffff888214114400 RCX: 0000000000000000
[do sep 16 06:06:15 2021] RDX: ffff88823eb4c378 RSI: ffff888254ff0200 RDI: ffff88823eb4c0c0
[do sep 16 06:06:15 2021] RBP: ffff888214114d00 R08: 0000000000000000 R09: 0000000000000000
[do sep 16 06:06:15 2021] R10: ffff88823eb4c358 R11: ffff888249bd7760 R12: ffff88823eb4c0c0
[do sep 16 06:06:15 2021] R13: fffffffffffffffe R14: 0000000000000000 R15: 0000000000000001
[do sep 16 06:06:15 2021] FS:  00007fa0bfce8880(0000) GS:ffff8882ae140000(0000) knlGS:0000000000000000
[do sep 16 06:06:15 2021] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[do sep 16 06:06:15 2021] CR2: 00007fd99b97e8d0 CR3: 000000027dda0000 CR4: 0000000000040660
[do sep 16 06:06:15 2021] Call Trace:
[do sep 16 06:06:15 2021]  handle_timeout+0x398/0x6f0 [libceph]
[do sep 16 06:06:15 2021]  process_one_work+0x165/0x370
[do sep 16 06:06:15 2021]  worker_thread+0x49/0x3e0
[do sep 16 06:06:15 2021]  kthread+0xf8/0x130
[do sep 16 06:06:15 2021]  ? rescuer_thread+0x310/0x310
[do sep 16 06:06:15 2021]  ? kthread_bind+0x10/0x10
[do sep 16 06:06:15 2021]  ret_from_fork+0x35/0x40
[do sep 16 06:06:15 2021] ---[ end trace 9611d2cd27856122 ]---
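One way to gather more context the next time this happens: the modules list shows rbd/libceph, so the kernel client's debugfs state should be available. The following is only a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug on the XCP-ng hosts (run as root; adjust the poll interval as needed). It periodically dumps the osdc/monc/osdmap files, so a stall can be matched against the requests the client had in flight at that moment:

#!/usr/bin/env python3
"""Snapshot kernel Ceph client state while the rbd stalls are happening.

Minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the
host uses the kernel rbd client, so /sys/kernel/debug/ceph/<fsid>.client<id>/
contains 'osdc', 'monc' and 'osdmap'. Run as root and redirect stdout to a file.
"""
import glob
import pathlib
import time

DEBUG_DIR = "/sys/kernel/debug/ceph"
FILES = ("osdc", "monc", "osdmap")  # in-flight OSD requests, mon session, osdmap epoch

def snapshot():
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for client_dir in glob.glob(f"{DEBUG_DIR}/*"):
        for name in FILES:
            path = pathlib.Path(client_dir, name)
            try:
                data = path.read_text()
            except OSError:
                continue  # client went away or file not present on this kernel
            print(f"--- {stamp} {path} ---")
            print(data.rstrip())

if __name__ == "__main__":
    # Poll every 10 seconds; a non-empty 'osdc' during a stall shows which
    # OSDs the hung requests are targeting, which can then be correlated
    # with the OSD and network logs on the Ceph side.
    while True:
        snapshot()
        time.sleep(10)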
On Thu, 9 Sept 2021 at 12:30, Leon Ruumpol <l.ruumpol@xxxxxxxxx> wrote:
> Hello,
>
> We have a Ceph cluster with CephFS and RBD images enabled; from XCP-ng we
> connect directly to the RBD images. Several times a day the VMs suffer from
> a high load/iowait which makes them temporarily inaccessible (around 10-30
> seconds). In the logs on the XCP-ng hosts I find this:
>
> [Thu Sep 9 02:16:06 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 02:17:47 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 02:18:55 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 02:19:54 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 02:49:39 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 03:47:25 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 03:48:07 2021] rbd: rbd4: encountered watch error: -107
> [Thu Sep 9 04:47:30 2021] rbd: rbd3: encountered watch error: -107
> [Thu Sep 9 04:47:55 2021] rbd: rbd4: encountered watch error: -107
>
> Xen version: XCP-ng release 8.2.0 (xenenterprise) / kernel 4.19.0+1 /
> running on 4 physical nodes.
>
> The Ceph cluster consists of 6 physical nodes with 48 OSDs (NVMe), 3 mgr,
> 3 mon and 3 mds services, connected with 2x10Gbps trunks from all hosts.
> Ceph status/detail is OK, with no iowait, high CPU or network spikes. We
> have looked in the logs for a reason, but we are unable to match these
> events with anything. Sometimes a scrub is in progress during these watch
> errors, but not always. Where is the best place to continue the search?
>
> ceph.conf:
>
> [global]
> fsid = ******
> mon_initial_members = ceph-c01-mon-n1, ceph-c01-mon-n2, ceph-c01-mon-n3
> mon_host = *.*.*.170,*.*.*.171,*.*.*.172
> public network = *.*.*.0/19
> cluster network = *.*.*.0/19
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Ceph versions:
>
> {
>     "mon": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.22 () nautilus (stable)": 48
>     },
>     "mds": {
>         "ceph version 14.2.22 () nautilus (stable)": 3
>     },
>     "overall": {
>         "ceph version 14.2.22 () nautilus (stable)": 57
>     }
> }
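For reference, error -107 in those rbd messages is -ENOTCONN: the kernel client noticed that its watch on the image header object was disconnected and has to re-register it. A follow-up check from a cluster node is to see whether each image's watch comes back after such an error. Below is a minimal sketch only, assuming the Nautilus python3-rados/python3-rbd bindings are installed and expose Image.watchers_list() (the `rbd status <pool>/<image>` CLI command shows the same information); POOL and IMAGES are placeholders to be replaced with the real pool and image names used by the XCP-ng hosts:

#!/usr/bin/env python3
"""List the clients currently holding a watch on the affected RBD images.

Minimal sketch; POOL and IMAGES are placeholders, and Image.watchers_list()
is assumed to be available in the installed python-rbd bindings.
"""
import rados
import rbd

POOL = "rbd"            # placeholder: pool holding the XCP-ng images
IMAGES = ["vm-disk-1"]  # placeholder: image names to check

with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
    with cluster.open_ioctx(POOL) as ioctx:
        for name in IMAGES:
            with rbd.Image(ioctx, name, read_only=True) as image:
                print(f"watchers on {POOL}/{name}:")
                for watcher in image.watchers_list():
                    # each entry identifies the watching client (address,
                    # global id and watch cookie in current bindings)
                    print(f"  {watcher}")

An image whose watcher list stays empty for a while after a -107 message, or a watcher address that is not the expected XCP-ng host, would be worth correlating with the OSD and network logs from that moment.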