Hi,
A word of caution for Ceph operators out there. Be careful with the
"ceph osd primary-temp" command. TL;DR: with primary_temp active, a
CRUSH change might CRASH your OSDs ... and they won't come back online
after a restart (in almost all cases).
The bug is described in this tracker [1], and fixed with this PR [2]
(thanks Igor!).
The longer story is that we were inspired by the work done on the
read-balancer [3] and wondered if we could leverage it on older
clusters. It turned out this is indeed possible by using the
"primary-temp" command instead of "pg-primary-temp", which will be
available from Reef onward. We compiled a main version of
"osdmaptool", fed it OSD maps from Pacific clusters and had it
calculate the optimal primary PG distributions. We then replaced
"pg-primary-temp" with "primary-temp" in the generated commands and
applied them.
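A rough sketch of that workflow, in case anyone is curious. The file
names, the pool name and the osdmaptool read-balancer options are
assumptions / placeholders (check "osdmaptool --help" of your own
build), and given the bug described below you should not run this
against a production cluster until the fix [2] is in your release:

  # Dump the current OSD map from the (Pacific) cluster
  ceph osd getmap -o ./osdmap.bin

  # Have an osdmaptool built from main calculate a better primary
  # distribution for a pool. The read-balancer options shown here are
  # an assumption; verify them against your own build.
  ./osdmaptool ./osdmap.bin --read primaries.txt --read-pool <poolname>

  # The tool emits "ceph osd pg-primary-temp <pgid> <osd>" commands.
  # On Pacific we rewrote those to "primary-temp" before applying them:
  sed 's/pg-primary-temp/primary-temp/' primaries.txt | sh -x
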
That worked as expected. However, we hit a bug [1] in the Ceph code:
it could not handle changes to the CRUSH map while primary_temp
entries were active. We added a new storage node to the cluster,
which triggered this condition as soon as we put it in the proper
failure domain. Ceph tried to make an OSD primary that was no longer
in the acting set, and hence crashed (most often with a segmentation
fault, sometimes with an abort). This resulted in many OSD crashes
across the failure domains and basically took down the whole cluster.
Whether the Reef / main read-balancer code can suffer from the same
bug is as yet unknown (at least to us). We will try to build a Reef
test cluster and find out.
Those of you who want to know how we handled this incident can read
the RFO ([4] in Dutch, [5] in English).
Gr. Stefan
[1]: https://tracker.ceph.com/issues/59491
[2]: https://github.com/ceph/ceph/pull/51160
[3]: https://github.com/ljflores/ceph_read_balancer_2023
[4]: https://www.bit.nl/uploads/images/PDF-Files/RFO-20230314-185335.pdf
[5]: https://www.bit.nl/uploads/images/PDF-Files/2023.04.20%20RFO_Ceph Cluster_185335_EN.pdf