RBD stale after ceph rolling upgrade

After getting past the stage where the patch for CVE-2021-20288 (unauthorized global_id reuse in cephx) and its mon_warn_on_insecure_global_id_reclaim warning came into play, and after further rolling upgrades up to the latest version, we are running into weird behavior whenever we restart ceph.target on a single node.
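
To be explicit about what I mean by "restart ceph.target", this is roughly what I run on the node, plus the CVE-related settings I know of to check (just a sketch, I am not making any claim about their values here):

systemctl restart ceph.target                               # restarts all Ceph daemon units (mon/mgr/osd) on this one node
ceph config get mon auth_allow_insecure_global_id_reclaim   # the CVE-2021-20288 related setting
ceph config get mon mon_warn_on_insecure_global_id_reclaim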

All VMs (on the updated node and also on the remaining ones) will sooner or later stop responding to write requests, depending on the IOPS workload, and hit the kernel message "blocked for more than x seconds". The virtio block device timeout is way higher than the virtio scsi one, but it doesn't really matter: the RBD-backed VMs become stale either way. If the IOPS workload is high it happens right away; if the workload is low it can take up to 4 hours until the bug is triggered.

The only way to resolve these unresponsive, hung block devices is to restart every single VM. After rebooting, the VMs complain about "orphan cleanup", but fortunately I have never seen the storage actually fail, since fsck seems able to repair the inode corruption. Once every VM has been restarted they continue to work properly until we restart ceph.target again.

The logs don't show anything special except these:

2021-07-27 11:40:36.904 7f9c3b4f5700  1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.924 7f9c3b4f5700  1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:36.928 7f9c3b4f5700  1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id
2021-07-27 11:40:37.304 7f9c3b4f5700  1 mon.xxx-infra3@2(probing) e3 handle_auth_request failed to assign global_id

2021-07-27 11:45:21.161 7f1400cc2700  0 auth: could not find secret_id=10715
2021-07-27 11:45:21.161 7f1400cc2700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700  0 auth: could not find secret_id=10715
2021-07-27 11:45:36.173 7f1400cc2700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=10715
2021-07-27 11:45:51.184 7f1400cc2700  0 auth: could not find secret_id=10715
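
In case it points someone in the right direction: to my understanding the "could not find secret_id" lines are about the cephx rotating service keys, so the checks I know of around them are roughly these (a sketch, standard Ceph CLI as far as I know):

ceph health detail      # would list AUTH_INSECURE_GLOBAL_ID_RECLAIM* warnings if clients/daemons still reclaim global_ids insecurely
ceph time-sync-status   # clock skew between mons can invalidate rotating cephx keys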

Node specs (3-node cluster):
Proxmox 6.4, KVM 5.2.0-6, using RBD images (mixed: virtio block / virtio scsi)
2 to 4 x SSD OSDs per Node
3 mons, 3 mgr

Affected releases:
- Ceph Nautilus 14.2.22 (prior version: 14.2.20)
- Ceph Octopus 15.2.13 (prior version: 15.2.11)


Any ideas what is causing these issues?


Kind Regards
Jules

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



