I have a feeling that this could be related to the drives, but I have
no real proof. I drained the SSD OSDs yesterday; hours later I wanted
to remove the OSDs (no PGs were left on them) via a for loop with the
orchestrator (ceph orch osd rm ID --force --zap). The first one was
removed quite quickly, but the others disappeared from the queue. I
tried again and watched the iostat output on the node the queued OSD
was running on. The drive had no IO at all but was 100% utilized for
several minutes until the OSD was eventually removed.
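For reference, the removal loop looked roughly like this (a minimal
sketch; the OSD IDs 10-12 are placeholders, not the real ones):

  # queue the removals back to back; --zap wipes the device afterwards
  for id in 10 11 12; do
      ceph orch osd rm "$id" --force --zap
  done

  # on the affected node, watch drive activity while the removal is queued
  iostat -x 5

  # on the cluster, check the removal queue
  ceph orch osd rm status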
I find that very weird, especially since we’re currently helping a
customer rebuild OSDs on Pacific as well and I haven’t seen such
behavior there yet, even though we have already redeployed 132 OSDs.
Quoting Eugen Block <eblock@xxxxxx>:
Hi,
I'm not sure if this has been asked before, or if there's an
existing tracker issue for it. It's difficult to reproduce on my
lab clusters.
I'm testing some new SSD OSDs on a Pacific cluster (16.2.15) and
noticed that if we instruct the orchestrator to remove two or three
OSDs (issuing the command 'ceph orch osd rm {ID}' a couple of
times), it eventually only removes the first in the queue. I've been
watching 'ceph orch osd rm status' to see the progress, and then the
rest of the queued OSDs suddenly vanish from the status and never
get removed. Then I have to issue the command again. If I remove the
OSDs one by one, so that no others are in the queue, they are all
removed successfully. Why is this happening? Is this a bug in Pacific?
Has someone seen this in newer releases?
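To make that concrete, the reproduction looks roughly like this (the
OSD IDs are placeholders):

  # queue two or three removals in a row
  ceph orch osd rm 5
  ceph orch osd rm 6
  ceph orch osd rm 7

  # watch the queue; after the first OSD is removed, the remaining
  # entries vanish from this output and the OSDs stay in place
  watch ceph orch osd rm status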
Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx