After passing the stage where the CVE patch (CVE-2021-20288: Unauthorized
global_id reuse in cephx) behind mon_warn_on_insecure_global_id_reclaim
came into play, and after doing further rolling upgrades up to the latest
version, we are seeing weird behavior when restarting ceph.target on a
single node: all VMs (on the updated node and also on the remaining
ones), depending on the IOPS workload, sooner or later stop responding to
write requests and hit the kernel message "blocked for more than x
seconds". The virtio-blk device timeout is way higher than the
virtio-scsi one, but it doesn't really matter; RBD-backed VMs become
stale with both.
If the IOPS workload is high, it happens almost immediately; if the
workload is low, it can take up to 4 hours until the bug is triggered.
The only way to recover from these unresponsive, hung block devices is to
restart every single VM. After rebooting, the VMs complain about
"orphan cleanup", but luckily I have never seen a failed filesystem yet,
since fsck seems to be able to repair the inode corruption.
After restarting every single VM they continue to work properly until
we restart ceph.target again.
The logs don't show any special issues except these:
2021-07-27 11:40:36.904 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3
handle_auth_request failed to assign global_id
2021-07-27 11:40:36.924 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3
handle_auth_request failed to assign global_id
2021-07-27 11:40:36.928 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3
handle_auth_request failed to assign global_id
2021-07-27 11:40:37.304 7f9c3b4f5700 1 mon.xxx-infra3@2(probing) e3
handle_auth_request failed to assign global_id
2021-07-27 11:45:21.161 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:21.161 7f1400cc2700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 auth: could not find secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=10715
2021-07-27 11:45:51.184 7f1400cc2700 0 auth: could not find secret_id=10715
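If I read the "could not find secret_id" lines correctly, the OSD is
being handed a cephx ticket signed with a rotating service key it no
longer holds. In case that points anywhere, this is roughly how I would
check monitor clock sync and the service ticket lifetime (just a sketch,
defaults assumed):

  # clock skew as seen by the monitors
  ceph time-sync-status

  # lifetime of the rotating cephx service tickets (default 3600 s)
  ceph config get osd auth_service_ticket_ttl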
Node specs (3 nodes):
Proxmox 6.4, KVM 5.2.0-6, using RBD images (mixed: virtio-blk / virtio-scsi)
2 to 4 SSD OSDs per node
3 mons, 3 mgrs
Affected releases:
- Ceph Nautilus 14.2.22 (prior version: 14.2.20)
- Ceph Octopus 15.2.13 (prior version: 15.2.11)
Any ideas what is causing these issues?
Kind Regards
Jules