Adding new OSDs - slow_ops and other issues.

jskr@xxxxxxxxxx · Mon, 11 Mar 2024 08:49:31 -0000

Hi. 

We have a cluster that has been working very nicely since it was set up more than a year ago. Now we needed to add more NVMe drives to expand it.

After setting all the "no" flags, we added them using

$ ceph orch osd add .... 

The twist is that we have ended up with the weight set to 1 on all our disks, not 7.68 (the default chosen by the ceph orch command).

Thus we did a subsequent reweight to change the weight -- and then removed the "no" flags.
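For reference, the sequence described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not say which "no" flags were set (norebalance, nobackfill and norecover are assumed here), "reweight" is assumed to mean "ceph osd crush reweight", and the helper names are made up for this sketch. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints commands unless DRY_RUN=0 is exported.
# ASSUMPTION: the exact set of "no" flags is not given in the post.
FLAGS="norebalance nobackfill norecover"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

set_flags() {
    for f in $FLAGS; do run ceph osd set "$f"; done
}

unset_flags() {
    for f in $FLAGS; do run ceph osd unset "$f"; done
}

reweight_osd() {
    # $1 = numeric OSD id, $2 = target CRUSH weight
    # ASSUMPTION: "reweight" in the post means the CRUSH weight.
    run ceph osd crush reweight "osd.$1" "$2"
}
```

A run would then be: set_flags, add the OSD with the orchestrator, reweight_osd <id> 1, unset_flags. The OSD id is whatever the orchestrator assigned; none is given in the post.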

As a consequence we had a bunch of OSDs reporting slow_ops and -- after we manually restarted those OSDs to get rid of them -- the system returned to normal.
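The manual restart step could be scripted roughly like this. It is a sketch, not what was actually run: it assumes the affected daemons show up as "osd.N" in the slow ops lines of "ceph health detail" and that the OSDs are managed by the orchestrator. The function names are hypothetical, and the restart commands are only printed so they can be reviewed first.

```shell
#!/bin/sh
# Extract unique "osd.N" names from text on stdin.
slow_osds() {
    grep -o 'osd\.[0-9][0-9]*' | sort -u
}

# Print (do not run) one restart command per daemon name on stdin.
restart_cmds() {
    while read -r name; do
        echo "ceph orch daemon restart $name"
    done
}
```

Possible usage: ceph health detail | grep -i 'slow ops' | slow_osds | restart_cmds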

... second try... 

Same drill - but somehow the ceph orch command failed to bring the new OSD online before we ran the reweight command ... and it worked flawlessly.

... third try ... 

Same drill - but now ceph orch brought the new OSD into the system - and we saw exactly the same problem again. Being a bit wiser, we forcefully restarted the new OSD, and everything went back to normal again.

Thus it seems like running the "reweight" command on online OSDs has a bad effect on our setup - causing major service disruption.

1) Is it possible to "bulk" change the default weights on all OSDs without a huge data movement going on?
2) Or is it possible to instruct "ceph orch osd add" to set the default weight before putting the new OSD into the system?
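On question 2, one direction that may be worth testing is the osd_crush_initial_weight option, which sets the CRUSH weight a newly created OSD starts with. Whether OSDs deployed through the orchestrator on 17.2.6 honour it has not been verified here, so treat this as a sketch. The function name is hypothetical, and it only prints the commands.

```shell
#!/bin/sh
# Sketch: prints the commands; nothing is executed against the cluster.
initial_weight_cmds() {
    # $1 = CRUSH weight that newly created OSDs should start with
    echo "ceph config set osd osd_crush_initial_weight $1"
    # Read the value back to confirm what the cluster stored.
    echo "ceph config get osd osd_crush_initial_weight"
}
```

For example, initial_weight_cmds 1 prints the two commands that would make new OSDs start at weight 1, matching the existing disks.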

I would not expect the above to be normal behaviour - if anyone has ideas about what is going on beyond the above, please share.

Setup: 
# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

43 x 7.68 TB NVMe drives across 12 OSD hosts - all connected using 2x 100GbitE

Thanks, Jesper
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx