Re: MON sync time depends on outage duration

I forgot to add one question.
@Konstantin, you wrote:

I think that from the 3rd time on, the database just goes into compaction maintenance

Can you share some more details about what exactly you mean? Do you mean that if I restart a MON three times, it goes into compaction maintenance and that it's not related to timing? We tried the same on a different MON and only did two tests (rough commands for the second test follow below):
- stopping a MON for less than 5 minutes and starting it again: the sync happens immediately
- stopping a MON for more than 5 minutes and starting it again: the sync takes 15 minutes
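The second test was essentially the following; the exact systemd unit name depends on how the MONs are deployed (cephadm vs. packages), so take this as a sketch:

  systemctl stop ceph-mon@<id>.service     # or ceph-<fsid>@mon.<host>.service with cephadm
  sleep 360                                # wait for more than 5 minutes
  systemctl start ceph-mon@<id>.service
  ceph -s                                  # watch until the MON rejoins quorum (~15 minutes here)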

This doesn't feel related to the payload size or keys options, but rather to some timing option.

Quoting Eugen Block <eblock@xxxxxx>:

Thanks, Dan!

Yes that sounds familiar from the luminous and mimic days.
The workaround for zillions of snapshot keys at that time was to use:
  ceph config set mon mon_sync_max_payload_size 4096

I actually searched for mon_sync_max_payload_keys, not bytes, so it seems I missed your thread. Thanks for pointing that out. So these seem to be the defaults in Octopus:

    "mon_sync_max_payload_keys": "2000",
    "mon_sync_max_payload_size": "1048576",

So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.

I'm confused: if the payload size is too small, why would decreasing it help? Or am I misunderstanding something? But it probably won't hurt to try it with 4096 and see if anything changes. If not, we can still turn on debug logs and take a closer look.
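Something like this is what I have in mind for the next attempt (the debug levels are just my guess, please correct me if higher levels are needed):

  # apply the workaround from the old thread
  ceph config set mon mon_sync_max_payload_size 4096
  # if that doesn't change anything, raise MON/Paxos logging before the next sync test
  ceph config set mon debug_mon 10
  ceph config set mon debug_paxos 10
  # ... stop a MON for more than 5 minutes, start it, watch the sync ...
  # afterwards revert to the defaults
  ceph config rm mon debug_mon
  ceph config rm mon debug_paxos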

And in addition to Dan's suggestion: an HDD is not a good choice for RocksDB, which is most likely the reason for this thread. I think that from the 3rd time on, the database just goes into compaction maintenance

Believe me, I know... but there's not much they can currently do about it, quite a long story... I have been telling them that for months now. Anyway, I will make some suggestions and report back whether it helped in this case as well.
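One suggestion will probably be to compact the MON stores and see whether the 2 GB shrink noticeably; from memory the commands are roughly these, please double-check before running them:

  # one-off compaction of a single MON's store
  ceph tell mon.<id> compact
  # or have every MON compact its store on startup
  ceph config set mon mon_compact_on_start true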

Thanks!
Eugen

Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:

Hi Eugen!

Yes that sounds familiar from the luminous and mimic days.

Check this old thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
(that thread is truncated but I can tell you that it worked for Frank).
Also the even older referenced thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/

The workaround for zillions of snapshot keys at that time was to use:
  ceph config set mon mon_sync_max_payload_size 4096

That said, that sync issue was supposed to be fixed by way of adding the
new option mon_sync_max_payload_keys, which has been around since Nautilus.

So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
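For example (the numbers here are only placeholders; which direction to tune should follow from what the debug logs show):

  ceph config set mon mon_sync_max_payload_size 4096
  ceph config set mon mon_sync_max_payload_keys 4096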

Good luck!

Dan

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com



On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:

Hi *,

I'm investigating an interesting issue on two customer clusters (used
for mirroring) which I haven't solved yet, but today we finally made
some progress. Maybe someone has an idea where to look next, I'd
appreciate any hints or comments.

These are two (latest) Octopus clusters, their main usage currently is
RBD mirroring in snapshot mode (around 500 RBD images are synced every
30 minutes). They noticed very long startup times of the MON daemons
after a reboot, between 10 and 30 minutes (reboot time already
subtracted). These delays are present on both sites.

Today we got a maintenance window and started to check in more detail:
just restarting the MON service (it joins quorum within seconds), then
stopping the MON service and waiting a few minutes (it still joins
quorum within seconds). Then we stopped the service and waited for
more than 5 minutes, simulating a reboot, and with that we were able
to reproduce it. The sync then takes around 15 minutes; we verified
this with other MONs as well.

The MON store is around 2 GB in size (on HDD). I understand that the
sync itself can take some time, but what is the threshold here? I
tried to find a hint in the MON config by searching for timeouts of
300 seconds; there were only a few matches (mon_session_timeout is one
of them), but I'm not sure if they can explain this behavior.

Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
that there were more than 42 million osd_snap keys, which is quite a
lot and would explain the size of the MON store. But I'm also not sure
whether it's related to the long syncing process.
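For reference, I got to that number roughly like this, with the MON stopped (the store path and the awk field are from memory, and the path differs for cephadm deployments):

  ceph-monstore-tool /var/lib/ceph/mon/ceph-<id> dump-keys > keys.txt
  awk '{ print $1 }' keys.txt | sort | uniq -c | sort -rn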
Does that sound familiar to anyone?

Thanks,
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx






