Hi, thanks for chiming in. I believe there's a slot limit of 10 for the
removal queue (at least I recall reading that somewhere a while ago), which
would explain the 10 parallel drains you mention. I also don't see such
issues on customer clusters, which is why I still suspect the drives...
Anyway, they're gone now. When we get new drives, I'm going to test this
again, just to rule out anything else.
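In case anyone wants to check the slot behavior themselves: queue more than
10 removals and keep an eye on the status output; if the limit is really 10,
at most 10 should show up as draining at any time. A rough sketch (the OSD
IDs are just placeholders, not from this thread):

  for id in {0..11}; do ceph orch osd rm $id; done
  watch -n 10 'ceph orch osd rm status'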
Quoting Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>:
Hi Eugen,
I removed 12 OSDs with 'ceph orch osd rm ID --replace' last week on
Pacific, and even though only 10 OSDs started draining their PGs at a time
(the other 2 obviously waiting for an available 'slot'), all 12 OSDs were
removed successfully in the end.
Cheers,
Frédéric.
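A minimal sketch of that kind of batch removal with --replace (the OSD IDs
are placeholders; as far as I understand, --replace marks the OSDs as
destroyed so their IDs stay reserved for the replacement drives):

  for id in 0 1 2; do ceph orch osd rm $id --replace; done
  ceph orch osd rm status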
----- On 16 Nov 24, at 15:08, Eugen Block eblock@xxxxxx wrote:
I have a feeling that this could be related to the drives, but I have no
real proof. I drained the SSD OSDs yesterday; hours later I wanted to
remove them (no PGs were left on them) via a for loop with the
orchestrator (ceph orch osd rm ID --force --zap). The first one was
removed quite quickly, but the others disappeared from the queue. I tried
again and looked at the iostat output on the node where the queued OSD was
running. The drive had no IO at all but was 100% utilized for several
minutes until it was eventually removed.
I find that very weird, especially since we're currently helping a
customer rebuild OSDs on Pacific as well, and I haven't seen such behavior
there, even though we've already redeployed 132 OSDs so far.
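For reference, a sketch of that removal loop and the iostat check (OSD IDs
are placeholders, the iostat interval is arbitrary):

  for id in 3 4 5; do ceph orch osd rm $id --force --zap; done
  # on the node hosting the queued OSD:
  iostat -x 5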
Quoting Eugen Block <eblock@xxxxxx>:
Hi,
I'm not sure if this has been asked before, or if there's already an
existing tracker issue. It's difficult to reproduce on my lab clusters.
I'm testing some new SSD OSDs on a Pacific cluster (16.2.15) and noticed
that if we instruct the orchestrator to remove two or three OSDs (issuing
the command 'ceph orch osd rm {ID}' a couple of times), it eventually only
removes the first one in the queue. I've been watching 'ceph orch osd rm
status' to see the progress, and at some point the rest of the queued OSDs
suddenly vanish from the status and never get removed. Then I have to
issue the command again. If I remove the OSDs one by one, with no others
in the queue, they are all removed successfully. Why is this happening? Is
this a bug in Pacific? Has anyone seen this in newer releases?
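For reference, the sequence is basically this (OSD IDs are just examples):

  ceph orch osd rm 1
  ceph orch osd rm 2
  ceph orch osd rm 3
  ceph orch osd rm status   # remaining queued entries vanish once the first removal finishes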
Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx