Re: MON sync time depends on outage duration

I'm not so sure anymore that this could really help here. The dump-keys output from the mon contains 42 million osd_snap prefix entries, 39 million of which are "purged_snap" keys. I compared with other clusters as well; those aren't tombstones but the expected "history" of purged snapshots. So I don't think removing a couple of hundred trash snapshots will actually reduce the number of osd_snap keys.

At least doubling the payload_size seems to have a positive impact. The compaction during the sync has a negative impact, of course, as does not having the mon store on SSDs.

I'm currently experimenting with a test cluster, removing all "purged_snap" entries from the mon db (not finished yet), to see what that does to the mon and whether it will even start correctly. Has anyone done that before, removing keys from the mon store? Not sure what to expect yet...
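
For reference, a rough sketch of how I'm counting the keys (mon stopped first; the store path is the default for a mon with ID <id> and may differ in containerized deployments):

  ceph-monstore-tool /var/lib/ceph/mon/ceph-<id> dump-keys > /tmp/mon-keys.txt
  grep -c osd_snap /tmp/mon-keys.txt      # all osd_snap prefix entries
  grep -c purged_snap /tmp/mon-keys.txt   # the purged_snap subset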

Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:

Oh yes, sounds like purging the rbd trash will be the real fix here!
Good luck!

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 10, 2023 at 6:10 AM Eugen Block <eblock@xxxxxx> wrote:

Hi,
I got a customer response with payload size 4096; that made things even worse. The mon startup time was now around 40 minutes. My doubts regarding decreasing the payload size seem confirmed. Then I read Dan's response again, which also mentions that the default payload size could be too small. So I asked them to double the default (2M instead of 1M) and am now waiting for a new result. I'm still wondering why this only happens when the mon is down for more than 5 minutes. Does anyone have an explanation for that time factor?
Another thing they're going to do is remove lots of snapshot tombstones (rbd mirroring snapshots in the trash namespace); maybe that will reduce the number of osd_snap keys in the mon db, which should then decrease the startup time. We'll see...
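
A minimal sketch of the two steps, assuming the config database is used and with pool/image as placeholders (2097152 bytes = 2M, i.e. double the 1M default):

  # double the mon sync payload size
  ceph config set mon mon_sync_max_payload_size 2097152

  # inspect snapshots in all namespaces (incl. trash) for a mirrored image
  rbd snap ls --all <pool>/<image>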

Quoting Eugen Block <eblock@xxxxxx>:

> Thanks, Dan!
>
>> Yes that sounds familiar from the luminous and mimic days.
>> The workaround for zillions of snapshot keys at that time was to use:
>>   ceph config set mon mon_sync_max_payload_size 4096
>
> I actually searched for mon_sync_max_payload_keys, not the size in
> bytes, so I missed your thread, it seems. Thanks for pointing that
> out. These seem to be the defaults in Octopus:
>
>     "mon_sync_max_payload_keys": "2000",
>     "mon_sync_max_payload_size": "1048576",
>
>> So it could be in your case that the sync payload is just too small to
>> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
>> you should be able to understand what is taking so long, and tune
>> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>
> I'm confused: if the payload size is too small, why would decreasing
> it help? Or am I misunderstanding something? But it probably won't
> hurt to try it with 4096 and see if anything changes. If not, we can
> still turn on debug logs and take a closer look.
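>
> A minimal sketch of what turning on the debug logs could look like
> (assuming the config database is used; revert afterwards):
>
>   ceph config set mon debug_mon 10/10
>   ceph config set mon debug_paxos 10/10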
>
>> And in addition to Dan's suggestion: HDD is not a good choice for
>> RocksDB, which is most likely the reason for this thread. I think
>> that from the 3rd time on, the database just goes into compaction
>> maintenance
>
> Believe me, I know... but there's not much they can currently do
> about it, quite a long story... But I have been telling them that
> for months now. Anyway, I will make some suggestions and report back
> if it worked in this case as well.
>
> Thanks!
> Eugen
>
> Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:
>
>> Hi Eugen!
>>
>> Yes that sounds familiar from the luminous and mimic days.
>>
>> Check this old thread:
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
>> (that thread is truncated but I can tell you that it worked for Frank).
>> Also the even older referenced thread:
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
>>
>> The workaround for zillions of snapshot keys at that time was to use:
>>   ceph config set mon mon_sync_max_payload_size 4096
>>
>> That said, that sync issue was supposed to be fixed by way of adding the
>> new option mon_sync_max_payload_keys, which has been around since nautilus.
>>
>> So it could be in your case that the sync payload is just too small to
>> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
>> you should be able to understand what is taking so long, and tune
>> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>>
>> Good luck!
>>
>> Dan
>>
>> ______________________________________________________
>> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>>
>>
>>
>> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
>>
>>> Hi *,
>>>
>>> I'm investigating an interesting issue on two customer clusters (used
>>> for mirroring) that I haven't solved yet, but today we finally made
>>> some progress. Maybe someone has an idea where to look next; I'd
>>> appreciate any hints or comments.
>>> These are two (latest) Octopus clusters, currently used mainly for RBD
>>> mirroring in snapshot mode (around 500 RBD images are synced every
>>> 30 minutes). They noticed very long startup times of the MON daemons
>>> after a reboot, between 10 and 30 minutes (reboot time already
>>> subtracted). These delays are present on both sites. Today we got a
>>> maintenance window and checked in more detail: just restarting the
>>> MON service (it joins quorum within seconds), then stopping the MON
>>> service and waiting a few minutes (it still joins quorum within
>>> seconds). Then we stopped the service and waited for more than 5
>>> minutes, simulating a reboot, and that reproduced the issue: the sync
>>> then takes around 15 minutes, which we verified with other MONs as
>>> well. The MON store is around 2 GB in size (on HDD); I understand
>>> that the sync itself can take some time, but what is the threshold
>>> here? I tried to find a hint in the MON config, searching for
>>> timeouts of 300 seconds. There were only a few matches
>>> (mon_session_timeout is one of them), but I'm not sure they can
>>> explain this behavior.
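>>> A quick way to grep the running config for such timeouts (a sketch via
>>> the admin socket; the mon ID is a placeholder):
>>>
>>>   ceph daemon mon.<id> config show | grep -i timeout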
>>> Investigating the MON store (ceph-monstore-tool dump-keys), I noticed
>>> that there were more than 42 million osd_snap keys, which is quite a
>>> lot and would explain the size of the MON store. But I'm also not
>>> sure whether it's related to the long syncing process.
>>> Does that sound familiar to anyone?
>>>
>>> Thanks,
>>> Eugen
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



