Re: MON sync time depends on outage duration

Hi,

I think we found an explanation for the behaviour; we still need to verify it, though. I just wanted to write it up for posterity.

We already knew that the large number of "purged_snap" keys in the mon store is responsible for the long synchronization. Removing them didn't seem to have a negative impact in my test cluster, but I don't want to try that in production. They also tried a couple of variations of mon_sync_max_payload_size, but it didn't have a significant impact (it affected a few other keys, but not the osd_snap keys). We seemed to hit the mon_sync_max_payload_keys limit (default 2000), so we'll suggest increasing it and hopefully find a suitable value. But that still didn't explain the variations in the sync duration.

So we looked deeper (and dived into the code) and finally got some debug logs we could analyse. The paxos versions determine whether a "full sync" is required or a "recent sync" is sufficient:

// the peer's oldest available paxos version is newer than our newest
// committed version -> a "recent" sync is impossible, do a full sync
if (paxos->get_version() < m->paxos_first_version &&
    m->paxos_first_version > 1) {
  dout(10) << " peer paxos first versions [" << m->paxos_first_version
           << "," << m->paxos_last_version << "]"
           << " vs my version " << paxos->get_version()
           << " (too far ahead)"
           << dendl;
...

So if the current paxos version of the to-be-synced mon is lower than the first version available on the peer, a full sync is started; otherwise a recent sync is sufficient. In one of the tests (simulating a mon reboot) the difference between the paxos versions was 628. I checked the available mon config options and found "paxos_min" (default 500). This will be the next suggestion: increase paxos_min to 1000 so the cluster doesn't require a full sync after a regular reboot and only does a full sync if a mon is down for a longer period of time. I'm not sure what other impact that could have except for some more storage consumption, but we'll let them test it.

But this still doesn't explain the variations in the startup times. My current theory is that the duration depends on the timing of the reboot/daemon shutdown: rbd-mirror is currently configured with a 30-minute schedule. This means that every full and half hour new snapshots are created and synced and older snapshots are deleted, which impacts the osdmap (and with it the paxos version). So if a MON goes down during such a window, it's very likely that its paxos version will be lower than the first one available on the peer(s), and if a reboot is scheduled right after a snapshot run, the mon synchronisation time will probably decrease. This also needs verification; I'm still waiting for the results.
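
To make the suggestion concrete, a sketch of the paxos_min change (the value 1000 is just our starting point derived from the observed difference of ~628, not a validated recommendation):

    # keep more paxos versions in the mon store so a mon that fell a few
    # hundred versions behind during a reboot can still do a "recent" sync
    # instead of a full store sync (default: 500)
    ceph config set mon paxos_min 1000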

From my perspective, those two config options (mon_sync_max_payload_keys, paxos_min) and rebooting a MON server at the right time are the most promising approaches for now. Having the mon store on SSDs would help as well, of course, but unfortunately that's currently not an option.
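
For the sync tuning itself, a similar sketch; the key limit below is just an example value, the suitable number still has to be found by testing (the payload size was already doubled from the 1 MiB default to 2 MiB in an earlier test):

    # allow more keys per sync chunk (default: 2000); 4000 is a placeholder
    ceph config set mon mon_sync_max_payload_keys 4000
    # payload size in bytes, already raised from 1 MiB to 2 MiB earlier
    ceph config set mon mon_sync_max_payload_size 2097152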

I'll update this thread when we have more results; maybe my theory is garbage, but I'm confident. :-) If you have comments or objections regarding those config options, I'd appreciate your input.

Thanks,
Eugen

Quoting Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>:

Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block <eblock@xxxxxx> wrote:

I'm not so sure anymore if that could really help here. The dump-keys
output from the mon contains 42 million osd_snap prefix entries; 39
million of them are "purged_snap" keys. I compared with other
clusters as well; those aren't tombstones but the expected "history" of
purged snapshots. So I don't think removing a couple of hundred trash
snapshots will actually reduce the number of osd_snap keys. At least
doubling the payload_size seems to have a positive impact. The
compaction during the sync has a negative impact, of course, same as
not having the mon store on SSDs.
I'm currently playing with a test cluster, removing all "purged_snap"
entries from the mon db (not finished yet) to see what that will do
with the mon and if it will even start correctly. But has anyone done
that, removing keys from the mon store? Not sure what to expect yet...
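
In case anyone wants to reproduce the key counting, something along these lines should work (the mon store path is just an example, and dump-keys should be run against a stopped mon or a copy of the store):

    # dump all keys and count the osd_snap / purged_snap prefixes
    ceph-monstore-tool /var/lib/ceph/mon/ceph-<id> dump-keys > keys.txt
    grep -c osd_snap keys.txt
    grep -c purged_snap keys.txt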

Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:

> Oh yes, sounds like purging the rbd trash will be the real fix here!
> Good luck!
>
> ______________________________________________________
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
>
> On Mon, Jul 10, 2023 at 6:10 AM Eugen Block <eblock@xxxxxx> wrote:
>
>> Hi,
>> I got a customer response with payload size 4096; that made things
>> even worse. The mon startup time was now around 40 minutes. My doubts
>> wrt decreasing the payload size seem confirmed. Then I read Dan's
>> response again which also mentions that the default payload size could
>> be too small. So I asked them to double the default (2M instead of 1M)
>> and am now waiting for a new result. I'm still wondering why this only
>> happens when the mon is down for more than 5 minutes. Does anyone have
>> an explanation for that time factor?
>> Another thing they're going to do is to remove lots of snapshot
>> tombstones (rbd mirroring snapshots in the trash namespace), maybe
>> that will reduce the osd_snap keys in the mon db, which then should
>> reduce the startup time. We'll see...
>>
>> Quoting Eugen Block <eblock@xxxxxx>:
>>
>> > Thanks, Dan!
>> >
>> >> Yes that sounds familiar from the luminous and mimic days.
>> >> The workaround for zillions of snapshot keys at that time was to use:
>> >>   ceph config set mon mon_sync_max_payload_size 4096
>> >
>> > I actually did search for mon_sync_max_payload_keys, not bytes so I
>> > missed your thread, it seems. Thanks for pointing that out. So the
>> > defaults seem to be these in Octopus:
>> >
>> >     "mon_sync_max_payload_keys": "2000",
>> >     "mon_sync_max_payload_size": "1048576",
>> >
>> >> So it could be in your case that the sync payload is just too small to
>> >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
>> >> you should be able to understand what is taking so long, and tune
>> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>> >
>> > I'm confused, if the payload size is too small, why would decreasing
>> > it help? Or am I misunderstanding something? But it probably won't
>> > hurt to try it with 4096 and see if anything changes. If not we can
>> > still turn on debug logs and take a closer look.
>> >
>> >> And in addition to Dan's suggestion, HDD is not a good choice for
>> >> RocksDB, which is most likely the reason for this thread. I think
>> >> that from the 3rd time the database just goes into compaction
>> >> maintenance.
>> >
>> > Believe me, I know... but there's not much they can currently do
>> > about it, quite a long story... But I have been telling them that
>> > for months now. Anyway, I will make some suggestions and report back
>> > if it worked in this case as well.
>> >
>> > Thanks!
>> > Eugen
>> >
>> > Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:
>> >
>> >> Hi Eugen!
>> >>
>> >> Yes that sounds familiar from the luminous and mimic days.
>> >>
>> >> Check this old thread:
>> >>
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
>> >> (that thread is truncated but I can tell you that it worked for Frank).
>> >> Also the even older referenced thread:
>> >>
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
>> >>
>> >> The workaround for zillions of snapshot keys at that time was to use:
>> >>   ceph config set mon mon_sync_max_payload_size 4096
>> >>
>> >> That said, that sync issue was supposed to be fixed by way of adding the
>> >> new option mon_sync_max_payload_keys, which has been around since nautilus.
>> >>
>> >> So it could be in your case that the sync payload is just too small to
>> >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
>> >> you should be able to understand what is taking so long, and tune
>> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>> >>
>> >> Good luck!
>> >>
>> >> Dan
>> >>
>> >> ______________________________________________________
>> >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>> >>
>> >>
>> >>
>> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
>> >>
>> >>> Hi *,
>> >>>
>> >>> I'm investigating an interesting issue on two customer clusters (used
>> >>> for mirroring) I've not solved yet, but today we finally made some
>> >>> progress. Maybe someone has an idea where to look next, I'd appreciate
>> >>> any hints or comments.
>> >>> These are two (latest) Octopus clusters, main usage currently is RBD
>> >>> mirroring with snapshot mode (around 500 RBD images are synced every
>> >>> 30 minutes). They noticed very long startup times of MON daemons after
>> >>> reboot, times between 10 and 30 minutes (reboot time already
>> >>> subtracted). These delays are present on both sites. Today we got a
>> >>> maintenance window and started to check in more detail by just
>> >>> restarting the MON service (joins quorum within seconds), then
>> >>> stopping the MON service and waiting a few minutes (still joins quorum
>> >>> within seconds). And then we stopped the service and waited for more
>> >>> than 5 minutes, simulating a reboot, and then we were able to
>> >>> reproduce it. The sync then takes around 15 minutes, we verified with
>> >>> other MONs as well. The MON store is around 2 GB in size (on HDD); I
>> >>> understand that the sync itself can take some time, but what is the
>> >>> threshold here? I tried to find a hint in the MON config, searching
>> >>> for timeouts with 300 seconds, there were only a few matches
>> >>> (mon_session_timeout is one of them), but I'm not sure if they can
>> >>> explain this behavior.
>> >>> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
>> >>> that there were more than 42 million osd_snap keys, which is quite a
>> >>> lot and would explain the size of the MON store. But I'm also not sure
>> >>> if it's related to the long syncing process.
>> >>> Does that sound familiar to anyone?
>> >>>
>> >>> Thanks,
>> >>> Eugen
>> >>> _______________________________________________
>> >>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >>>
>>
>>
>>
>>


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

