Hello List,

I am running Proxmox on top of Ceph 14.2.20 on the nodes, with pools of size 3, min_size 2. Last week I wanted to swap the HDDs for SSDs on one node. Since I have 3 nodes and size 3, min_size 2, I did the following:

1.) ceph osd set noout
2.) Stopped all OSDs on that one node
3.) Set the OSDs on that node to out ("ceph osd out")
4.) Removed/destroyed the OSDs
5.) Physically pulled the disks
6.) Plugged my SSDs in and started adding them as OSDs.

Recovery was active and running, but some PGs did not serve IO and were stuck, and VMs started to complain about IO problems. It looked like some writes could not complete. I had "pgs stuck" and "slow osds blocking" warnings, but "ceph osd perf" and "iostat -dx 3" showed bored/idle OSDs.

I restarted the OSDs that seemed to be involved, which did not help. After an hour or so I started to restart ALL the OSDs in the whole cluster, one by one. After restarting the last OSD, on a very different node, the blocking error went away and everything recovered smoothly.

I wonder what I did wrong. I did steps 1-6 within 5 minutes (so pretty fast). Maybe I should have taken more time? Was it too rough to replace all OSDs on one node at once? Should I have replaced them one by one?

Any hints are welcome.

Cheers,
Mario
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
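
For comparison, a one-OSD-at-a-time variant of the steps above might look like the sketch below. This is illustrative, not a definitive procedure: the OSD id, device path, and the health-based wait loop are assumptions, and it needs a live cluster to run. The commands themselves (noout, out, purge, ceph-volume) are standard Ceph CLI available in Nautilus:

```shell
#!/bin/bash
# Sketch: replace OSDs one at a time instead of a whole node at once.
# OSD_ID and the device path are illustrative; repeat per OSD and let
# the cluster settle in between.
set -e
OSD_ID=12

ceph osd set noout                    # keep the cluster from auto-marking OSDs out
systemctl stop ceph-osd@${OSD_ID}     # stop just this one OSD
ceph osd out ${OSD_ID}                # let its PGs re-map to the remaining OSDs

# Wait until the cluster is healthy again before touching the next OSD.
# (Checking for HEALTH_OK is a coarse test; "ceph pg stat" gives more detail.)
until ceph health | grep -q HEALTH_OK; do
    sleep 10
done

ceph osd purge ${OSD_ID} --yes-i-really-mean-it   # remove the old OSD from the map

# ...swap the physical disk, then create the replacement OSD, e.g.:
# ceph-volume lvm create --data /dev/sdX

# Only after the last OSD on the node is done:
ceph osd unset noout
```

The point of the loop is to avoid the window where several PGs lose copies at once, which is the likely trigger for the stuck/blocked IO described above.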