Re: MON sync time depends on outage duration


 



Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)
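(On a fully upgraded Octopus cluster that should print something like
"require_osd_release octopus".)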

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block <eblock@xxxxxx> wrote:
>
> I'm not so sure anymore if that could really help here. The dump-keys
> output from the mon contains 42 million osd_snap prefix entries; 39
> million of them are "purged_snap" keys. I compared with other clusters
> as well: those aren't tombstones but the expected "history" of purged
> snapshots. So I don't think removing a couple of hundred trash
> snapshots will actually reduce the number of osd_snap keys. At least
> doubling the payload_size seems to have a positive impact. The
> compaction during the sync has a negative impact, of course, as does
> not having the mon store on SSDs.
> I'm currently playing with a test cluster, removing all "purged_snap"
> entries from the mon db (not finished yet) to see what that will do
> to the mon and whether it will even start correctly. But has anyone
> done that before, removing keys from the mon store? Not sure what to
> expect yet...
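> For reference, counting those prefixes can be done with something
> along these lines (with the mon stopped, or on a copy of the store;
> the path is just an example):
>
>    ceph-monstore-tool /var/lib/ceph/mon/ceph-<ID> dump-keys > keys.txt
>    grep -c '^osd_snap' keys.txt      # all osd_snap prefix entries
>    grep -c 'purged_snap' keys.txt    # the purged_snap subset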
>
> Zitat von Dan van der Ster <dan.vanderster@xxxxxxxxx>:
>
> > Oh yes, sounds like purging the rbd trash will be the real fix here!
> > Good luck!
> >
> > ______________________________________________________
> > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >
> >
> >
> >
> > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >> I got a customer response with payload size 4096; that made things
> >> even worse, with the mon startup time now around 40 minutes. My doubts
> >> about decreasing the payload size seem confirmed. Then I read Dan's
> >> response again, which also mentions that the default payload size could
> >> be too small. So I asked them to double the default (2M instead of 1M)
> >> and am now waiting for a new result. I'm still wondering why this only
> >> happens when the mon is down for more than 5 minutes. Does anyone have
> >> an explanation for that time factor?
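> >> (For reference, doubling the default would be something like:
> >>    ceph config set mon mon_sync_max_payload_size 2097152
> >> i.e. 2 MiB instead of the default 1 MiB.)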
> >> Another thing they're going to do is remove lots of snapshot
> >> tombstones (rbd mirroring snapshots in the trash namespace); maybe
> >> that will reduce the number of osd_snap keys in the mon db, which
> >> should then decrease the startup time. We'll see...
> >>
> >> Zitat von Eugen Block <eblock@xxxxxx>:
> >>
> >> > Thanks, Dan!
> >> >
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >
> >> > I actually did search for mon_sync_max_payload_keys, not bytes, so I
> >> > missed your thread, it seems. Thanks for pointing that out. So the
> >> > defaults seem to be these in Octopus:
> >> >
> >> >     "mon_sync_max_payload_keys": "2000",
> >> >     "mon_sync_max_payload_size": "1048576",
> >> >
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >> >
> >> > I'm confused: if the payload size is too small, why would decreasing
> >> > it help? Or am I misunderstanding something? But it probably won't
> >> > hurt to try it with 4096 and see if anything changes. If not, we can
> >> > still turn on debug logs and take a closer look.
> >> >
> >> >> And in addition to Dan's suggestion: HDD is not a good choice for
> >> >> RocksDB, which is most likely the reason for this thread. I think
> >> >> that from the 3rd time the database just goes into compaction
> >> >> maintenance
> >> >
> >> > Believe me, I know... but there's not much they can currently do
> >> > about it (quite a long story). But I have been telling them that
> >> > for months now. Anyway, I will make some suggestions and report back
> >> > whether it worked in this case as well.
> >> >
> >> > Thanks!
> >> > Eugen
> >> >
> >> > Zitat von Dan van der Ster <dan.vanderster@xxxxxxxxx>:
> >> >
> >> >> Hi Eugen!
> >> >>
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >>
> >> >> Check this old thread:
> >> >>
> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> >> (that thread is truncated but I can tell you that it worked for Frank).
> >> >> Also the even older referenced thread:
> >> >>
> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >> >>
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >>
> >> >> That said, that sync issue was supposed to be fixed by way of adding the
> >> >> new option mon_sync_max_payload_keys, which has been around since nautilus.
> >> >>
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
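> >> >> A rough sketch of what that could look like (values are just examples):
> >> >>
> >> >>   ceph config set mon debug_mon 10
> >> >>   ceph config set mon debug_paxos 10
> >> >>   ceph config set mon mon_sync_max_payload_keys 4096
> >> >>   ceph config set mon mon_sync_max_payload_size 4194304
> >> >>
> >> >> and then watch the mon log during the sync to see how many keys per
> >> >> chunk actually get transferred.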
> >> >>
> >> >> Good luck!
> >> >>
> >> >> Dan
> >> >>
> >> >> ______________________________________________________
> >> >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
> >> >>
> >> >>> Hi *,
> >> >>>
> >> >>> I'm investigating an interesting issue on two customer clusters (used
> >> >>> for mirroring) that I haven't solved yet, but today we finally made
> >> >>> some progress. Maybe someone has an idea where to look next; I'd
> >> >>> appreciate any hints or comments.
> >> >>> These are two (latest) Octopus clusters; the main usage currently is
> >> >>> RBD mirroring in snapshot mode (around 500 RBD images are synced every
> >> >>> 30 minutes). They noticed very long startup times of the MON daemons
> >> >>> after a reboot, between 10 and 30 minutes (reboot time already
> >> >>> subtracted). These delays are present on both sites. Today we got a
> >> >>> maintenance window and started to check in more detail: just
> >> >>> restarting the MON service (it joins quorum within seconds), then
> >> >>> stopping the MON service and waiting a few minutes (it still joins
> >> >>> quorum within seconds). Then we stopped the service and waited for
> >> >>> more than 5 minutes, simulating a reboot, and were able to reproduce
> >> >>> the issue. The sync then takes around 15 minutes; we verified this
> >> >>> with other MONs as well. The MON store is around 2 GB in size (on
> >> >>> HDD). I understand that the sync itself can take some time, but what
> >> >>> is the threshold here? I tried to find a hint in the MON config,
> >> >>> searching for timeouts of 300 seconds; there were only a few matches
> >> >>> (mon_session_timeout is one of them), but I'm not sure if they can
> >> >>> explain this behavior.
> >> >>> Investigating the MON store (ceph-monstore-tool dump-keys), I noticed
> >> >>> that there were more than 42 million osd_snap keys, which is quite a
> >> >>> lot and would explain the size of the MON store. But I'm not sure
> >> >>> whether it's related to the long syncing process.
> >> >>> Does that sound familiar to anyone?
> >> >>>
> >> >>> Thanks,
> >> >>> Eugen
> >> >>>
> >>
> >>
> >>
> >>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



