Re: MON sync time depends on outage duration

Oh yes, sounds like purging the rbd trash will be the real fix here!
Good luck!
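
For reference, a minimal sketch of what that cleanup could look like (the pool and image names are placeholders; note that mirror snapshots still referenced by rbd-mirror may not be purgeable until mirroring releases them):

```shell
# List images sitting in the trash of the pool (pool name is an example)
rbd trash ls --all rbd-mirror-pool

# Purge everything in the trash that is eligible for removal
rbd trash purge rbd-mirror-pool

# Mirror snapshot tombstones live in the "trash" snapshot namespace of
# an image; they can be inspected per image (image name is an example):
rbd snap ls --all rbd-mirror-pool/some-image
```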

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 10, 2023 at 6:10 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
> I got a customer response with payload size 4096; that made things
> even worse. The mon startup time was now around 40 minutes, so my
> doubts about decreasing the payload size seem confirmed. Then I read
> Dan's response again, which also mentions that the default payload
> size could be too small. So I asked them to double the default (2M
> instead of 1M) and am now waiting for a new result. I'm still
> wondering why this only happens when the mon is down for more than 5
> minutes. Does anyone have an explanation for that time factor?
> Another thing they're going to do is to remove lots of snapshot
> tombstones (rbd mirroring snapshots in the trash namespace); maybe
> that will reduce the number of osd_snap keys in the mon db, which
> should in turn reduce the startup time. We'll see...
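
For reference, the change described above can be made like this (2097152 bytes = 2 MiB, i.e. double the 1048576-byte Octopus default; the values are from this thread, not a general recommendation):

```shell
# Double the sync payload size from the 1 MiB default
ceph config set mon mon_sync_max_payload_size 2097152

# Verify what the mons will actually use
ceph config get mon mon_sync_max_payload_size
ceph config get mon mon_sync_max_payload_keys
```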
>
> Zitat von Eugen Block <eblock@xxxxxx>:
>
> > Thanks, Dan!
> >
> >> Yes that sounds familiar from the luminous and mimic days.
> >> The workaround for zillions of snapshot keys at that time was to use:
> >>   ceph config set mon mon_sync_max_payload_size 4096
> >
> > I actually did search for mon_sync_max_payload_keys, not bytes, so
> > it seems I missed your thread. Thanks for pointing that out. The
> > defaults in Octopus seem to be these:
> >
> >     "mon_sync_max_payload_keys": "2000",
> >     "mon_sync_max_payload_size": "1048576",
> >
> >> So it could be in your case that the sync payload is just too small to
> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> debug_mon
> >> you should be able to understand what is taking so long, and tune
> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >
> > I'm confused: if the payload size is too small, why would
> > decreasing it help? Or am I misunderstanding something? But it
> > probably won't hurt to try it with 4096 and see if anything
> > changes. If not, we can still turn on debug logs and take a closer
> > look.
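
If it comes to that, the debug logging Dan mentioned can be enabled at runtime, e.g. like this (the levels shown are illustrative; 1/5 is the usual default for both options):

```shell
# Raise mon/paxos debug levels while reproducing the slow sync
ceph config set mon debug_mon 10
ceph config set mon debug_paxos 10

# ... reproduce, collect logs, then restore the defaults
ceph config set mon debug_mon 1/5
ceph config set mon debug_paxos 1/5
```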
> >
> >> And in addition to Dan's suggestion: HDD is not a good choice for
> >> RocksDB, which is most likely the reason for this thread. I think
> >> that by the third sync attempt the database just goes into
> >> compaction maintenance
> >
> > Believe me, I know... but there's not much they can currently do
> > about it, quite a long story... But I have been telling them that
> > for months now. Anyway, I will make some suggestions and report back
> > if it worked in this case as well.
> >
> > Thanks!
> > Eugen
> >
> > Zitat von Dan van der Ster <dan.vanderster@xxxxxxxxx>:
> >
> >> Hi Eugen!
> >>
> >> Yes that sounds familiar from the luminous and mimic days.
> >>
> >> Check this old thread:
> >>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> (that thread is truncated but I can tell you that it worked for Frank).
> >> Also the even older referenced thread:
> >>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >>
> >> The workaround for zillions of snapshot keys at that time was to use:
> >>   ceph config set mon mon_sync_max_payload_size 4096
> >>
> >> That said, that sync issue was supposed to be fixed by way of adding the
> >> new option mon_sync_max_payload_keys, which has been around since
> nautilus.
> >>
> >> So it could be in your case that the sync payload is just too small to
> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> debug_mon
> >> you should be able to understand what is taking so long, and tune
> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
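
To get a feel for the scale involved, here is a back-of-the-envelope estimate of the sync round trips (the 42 million figure is from this thread; the ~64 bytes per key is a pure assumption for illustration):

```shell
# Lower bound on sync rounds with the Octopus defaults:
total_keys=42000000
max_payload_keys=2000                         # mon_sync_max_payload_keys default
rounds=$((total_keys / max_payload_keys))
echo "$rounds"                                # -> 21000 round trips at minimum

# With mon_sync_max_payload_size forced down to 4096 bytes and an
# assumed ~64 bytes per key, each round carries only ~64 keys:
keys_per_round_small=$((4096 / 64))
rounds_small=$((total_keys / keys_per_round_small))
echo "$rounds_small"                          # -> 656250 round trips
```

Which would explain why shrinking the payload made the sync slower, not faster.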
> >>
> >> Good luck!
> >>
> >> Dan
> >>
> >> ______________________________________________________
> >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >>
> >>
> >>
> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >>> Hi *,
> >>>
> >>> I'm investigating an interesting issue on two customer clusters (used
> >>> for mirroring) that I haven't solved yet, but today we finally made
> >>> some progress. Maybe someone has an idea where to look next; I'd
> >>> appreciate any hints or comments.
> >>> These are two (latest) Octopus clusters whose main usage currently is
> >>> RBD mirroring in snapshot mode (around 500 RBD images are synced
> >>> every 30 minutes). The customer noticed very long startup times of
> >>> MON daemons after a reboot, between 10 and 30 minutes (reboot time
> >>> already subtracted). These delays are present on both sites. Today we
> >>> got a
> >>> maintenance window and started to check in more detail: first by just
> >>> restarting the MON service (it joins quorum within seconds), then by
> >>> stopping the MON service and waiting a few minutes (it still joins
> >>> quorum within seconds). Then we stopped the service and waited for
> >>> more than 5 minutes, simulating a reboot, and were able to reproduce
> >>> the issue. The sync then takes around 15 minutes; we verified this
> >>> with other MONs as well. The MON store is around 2 GB in size (on
> >>> HDD). I understand that the sync itself can take some time, but what
> >>> is the threshold here? I tried to find a hint in the MON config,
> >>> searching for timeouts of 300 seconds; there were only a few matches
> >>> (mon_session_timeout is one of them), but I'm not sure they can
> >>> explain this behavior.
> >>> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
> >>> that there were more than 42 million osd_snap keys, which is quite a
> >>> lot and would explain the size of the MON store. But I'm also not
> >>> sure whether it's related to the long syncing process.
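
For the record, the key count can be obtained with something like the following (the mon must be stopped first, the store path is an example, and as far as I can tell dump-keys prints one prefix/key pair per line):

```shell
# Count osd_snap keys in the mon store (path is an example)
ceph-monstore-tool /var/lib/ceph/mon/ceph-a dump-keys | grep -c '^osd_snap'
```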
> >>> Does that sound familiar to anyone?
> >>>
> >>> Thanks,
> >>> Eugen
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>
>
>
>
>



