Out of curiosity, what is your require_osd_release set to?
(ceph osd dump | grep require_osd_release)

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block <eblock@xxxxxx> wrote:
>
> I'm not so sure anymore if that could really help here. The dump-keys
> output from the mon contains 42 million osd_snap prefix entries, 39
> million of them are "purged_snap" keys. I also compared with other
> clusters; those aren't tombstones but the expected "history" of
> purged snapshots. So I don't think removing a couple of hundred trash
> snapshots will actually reduce the number of osd_snap keys. At least
> doubling the payload_size seems to have a positive impact. The
> compaction during the sync has a negative impact, of course, same as
> not having the mon store on SSDs.
> I'm currently playing with a test cluster, removing all "purged_snap"
> entries from the mon db (not finished yet) to see what that does to
> the mon and whether it will even start correctly. But has anyone done
> that before, removing keys from the mon store? Not sure what to expect yet...
>
> Zitat von Dan van der Ster <dan.vanderster@xxxxxxxxx>:
>
> > Oh yes, sounds like purging the rbd trash will be the real fix here!
> > Good luck!
> >
> > ______________________________________________________
> > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >
> > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >> I got a customer response with payload size 4096, and that made things
> >> even worse: the mon startup time was now around 40 minutes. My doubts
> >> about decreasing the payload size seem confirmed. Then I read Dan's
> >> response again, which also mentions that the default payload size could
> >> be too small. So I asked them to double the default (2M instead of 1M)
> >> and am now waiting for a new result. I'm still wondering why this only
> >> happens when the mon is down for more than 5 minutes. Does anyone have
> >> an explanation for that time factor?
> >> Another thing they're going to do is to remove lots of snapshot
> >> tombstones (rbd mirroring snapshots in the trash namespace); maybe
> >> that will reduce the number of osd_snap keys in the mon db, which should
> >> then reduce the startup time. We'll see...
> >>
> >> Zitat von Eugen Block <eblock@xxxxxx>:
> >>
> >> > Thanks, Dan!
> >> >
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >> ceph config set mon mon_sync_max_payload_size 4096
> >> >
> >> > I actually searched for mon_sync_max_payload_keys, not bytes, so I
> >> > missed your thread, it seems. Thanks for pointing that out. The
> >> > defaults in Octopus seem to be these:
> >> >
> >> > "mon_sync_max_payload_keys": "2000",
> >> > "mon_sync_max_payload_size": "1048576",
> >> >
> >> >> So it could be that in your case the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >> >
> >> > I'm confused: if the payload size is too small, why would decreasing
> >> > it help? Or am I misunderstanding something? But it probably won't
> >> > hurt to try it with 4096 and see if anything changes. If not, we can
> >> > still turn on debug logs and take a closer look.
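> >> > For the record, setting those knobs would look roughly like this (the
> >> > concrete values are only examples, nothing I have verified on this
> >> > cluster yet):
> >> >
> >> >   # the old workaround: shrink the sync payload
> >> >   ceph config set mon mon_sync_max_payload_size 4096
> >> >
> >> >   # or go the other way and double the Octopus defaults
> >> >   ceph config set mon mon_sync_max_payload_size 2097152
> >> >   ceph config set mon mon_sync_max_payload_keys 4000
> >> >
> >> >   # raise mon/paxos debug levels before the next restart if needed
> >> >   ceph config set mon debug_mon 10
> >> >   ceph config set mon debug_paxos 10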
> >> >
> >> >> And in addition to Dan's suggestion: HDDs are not a good choice for
> >> >> RocksDB, which is most likely the reason for this thread. I think that
> >> >> from the third time on, the database just goes into compaction
> >> >> maintenance.
> >> >
> >> > Believe me, I know... but there's not much they can currently do
> >> > about it, quite a long story... I have been telling them that
> >> > for months now. Anyway, I will make some suggestions and report back
> >> > whether it worked in this case as well.
> >> >
> >> > Thanks!
> >> > Eugen
> >> >
> >> > Zitat von Dan van der Ster <dan.vanderster@xxxxxxxxx>:
> >> >
> >> >> Hi Eugen!
> >> >>
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >>
> >> >> Check this old thread:
> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> >> (that thread is truncated, but I can tell you that it worked for Frank).
> >> >> Also the even older referenced thread:
> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >> >>
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >> ceph config set mon mon_sync_max_payload_size 4096
> >> >>
> >> >> That said, that sync issue was supposed to be fixed by the new option
> >> >> mon_sync_max_payload_keys, which has been around since Nautilus.
> >> >>
> >> >> So it could be that in your case the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys. Using debug_paxos and debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >> >>
> >> >> Good luck!
> >> >>
> >> >> Dan
> >> >>
> >> >> ______________________________________________________
> >> >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >> >>
> >> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >>
> >> >>> Hi *,
> >> >>>
> >> >>> I'm investigating an interesting issue on two customer clusters (used
> >> >>> for mirroring) that I haven't solved yet, but today we finally made
> >> >>> some progress. Maybe someone has an idea where to look next; I'd
> >> >>> appreciate any hints or comments.
> >> >>> These are two (latest) Octopus clusters, whose main usage currently is
> >> >>> RBD mirroring in snapshot mode (around 500 RBD images are synced every
> >> >>> 30 minutes). They noticed very long startup times of MON daemons after
> >> >>> a reboot, between 10 and 30 minutes (reboot time already subtracted).
> >> >>> These delays are present on both sites. Today we got a maintenance
> >> >>> window and started to check in more detail: just restarting the MON
> >> >>> service (joins quorum within seconds), then stopping the MON service
> >> >>> and waiting a few minutes (still joins quorum within seconds). Then we
> >> >>> stopped the service and waited for more than 5 minutes, simulating a
> >> >>> reboot, and we were able to reproduce the issue. The sync then takes
> >> >>> around 15 minutes; we verified this with other MONs as well. The MON
> >> >>> store is around 2 GB in size (on HDD). I understand that the sync
> >> >>> itself can take some time, but what is the threshold here? I tried to
> >> >>> find a hint in the MON config, searching for timeouts of 300 seconds;
> >> >>> there were only a few matches (mon_session_timeout is one of them),
> >> >>> but I'm not sure they can explain this behavior.
> >> >>> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
> >> >>> more than 42 million osd_snap keys, which is quite a lot and would
> >> >>> explain the size of the MON store. But I'm also not sure whether it's
> >> >>> related to the long syncing process.
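> >> >>> In concrete terms, those checks were along these lines (mon ID and
> >> >>> store path are assumed here, and the store was only inspected on a
> >> >>> stopped mon / a copy of it):
> >> >>>
> >> >>>   # grep the running mon's config for 300-second timeouts
> >> >>>   ceph daemon mon.$(hostname -s) config show | grep -w 300
> >> >>>
> >> >>>   # count osd_snap keys in the (offline) mon store
> >> >>>   ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys \
> >> >>>     | grep -c osd_snap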
> >> >>>
> >> >>> Does that sound familiar to anyone?
> >> >>>
> >> >>> Thanks,
> >> >>> Eugen

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx