Re: [ceph-users] Be careful with primary-temp to balance primaries ...

There was a lot of interest expressed at Cephalocon in backporting the read balancer code and new commands to Quincy and Pacific. Until I have evaluated the possibility of backporting the feature, I would recommend using the read balancer on Reef only, as that is where the feature has been tested.

The main concern lies in the new commands we added, “pg-upmap-primary” and “rm-pg-upmap-primary”, which are only available in Reef. Past Ceph versions have a “primary-temp” command, as Stefan mentioned. However, that command only changes the primary within the acting set, and it is not maintained.
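
For reference, the Reef-only command forms look roughly like this (the PG
id 1.0 and OSD id 3 are placeholders; please check the exact syntax
against the Reef documentation):

   # Reef only: pin the primary of PG 1.0 to OSD 3
   ceph osd pg-upmap-primary 1.0 3

   # Reef only: remove that pinning again
   ceph osd rm-pg-upmap-primary 1.0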

If you would like to test the read balancer code on older versions, please do so with caution. Until the Reef commands are backported (if this proves possible), we cannot guarantee intended behavior with primary-temp.

Thank you,
Laura

On Thu, Apr 20, 2023 at 9:31 AM Stefan Kooman <stefan@xxxxxx> wrote:
Hi,

A word of caution for Ceph operators out there. Be careful with the "ceph
osd primary-temp" command. TL;DR: with primary_temp active, a CRUSH
change might CRASH your OSDs ... and they won't come back online after a
restart (in almost all cases).
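
If you want to check whether any primary-temp mappings are currently
active before you touch the CRUSH map, something along these lines
should work, assuming "ceph osd dump" lists primary_temp entries the
same way it lists pg_temp entries:

   # count primary_temp entries in the current osdmap (0 means none active)
   ceph osd dump | grep -c '^primary_temp'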

The bug is described in this tracker [1], and fixed with this PR [2]
(thanks Igor!).

The longer story is that we were inspired by the work done on the
read-balancer [3] and wondered if we could leverage it on older
clusters. It turned out this is indeed possible by using the
"primary-temp" command instead of the "pg-upmap-primary" command that
will be available from Reef onward. We compiled a main version of the
"osdmaptool", fed it OSD maps from Pacific clusters and had it
calculate the optimal primary PG distributions. Then we replaced the
pg-upmap-primary commands in its output with "primary-temp" commands
and applied them. That worked as expected. However, we hit a bug [1] in
the Ceph code: it cannot handle changes in the CRUSH map while
primary_temp mappings are active. We added a new storage node to the
cluster, which triggered this condition as soon as we put it in the
proper failure domain. Ceph tried to make an OSD primary that was no
longer in the acting set, and hence crashed (most often with a
segmentation fault, sometimes with an abort). This resulted in many OSD
crashes across the failure domains and basically took down the whole
cluster.
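
For illustration only -- do NOT repeat this until the fix [2] has landed
in your release -- the workflow described above looks roughly like this.
The --read/--read-pool osdmaptool options and the pool name "rbd" are
assumptions based on the Reef read-balancer work, and the last step is
exactly the substitution that got us into trouble:

   # grab the current osdmap from the (Pacific) cluster
   ceph osd getmap -o om

   # have a main/Reef build of osdmaptool compute primary mappings per pool
   osdmaptool om --read out.txt --read-pool rbd

   # rewrite the Reef-only pg-upmap-primary commands into primary-temp
   # commands and apply them (the risky step)
   sed 's/pg-upmap-primary/primary-temp/' out.txt | sh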

Whether the Reef / main read-balancer code can suffer from the same bug
is as yet unknown (at least to us). We will try to build a Reef test
cluster and find out.

Those of you who want to know how we handled this incident can read the
RFO ([4] in Dutch, [5] in English).

Gr. Stefan

[1]: https://tracker.ceph.com/issues/59491?next_issue_id=59490
[2]: https://github.com/ceph/ceph/pull/51160
[3]: https://github.com/ljflores/ceph_read_balancer_2023
[4]: https://www.bit.nl/uploads/images/PDF-Files/RFO-20230314-185335.pdf
[5]: https://www.bit.nl/uploads/images/PDF-Files/2023.04.20%20RFO_Ceph%20Cluster_185335_EN.pdf

--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Chicago, IL

lflores@xxxxxxx | lflores@xxxxxxxxxx
M: +17087388804



