I forgot to add one question.
@Konstantin, you wrote:
I think that from the 3rd time the database just goes into
compaction maintenance
Can you share some more details about what exactly you mean? Do you
mean that if I restart a MON three times, it goes into compaction
maintenance and that it's not related to timing? We tried the same
on a different MON and only did two tests:
- stopping a MON for less than 5 minutes, starting it again, sync
happens immediately
- stopping a MON for more than 5 minutes, starting it again, sync
takes 15 minutes
This doesn't feel related to the payload size or keys options, but
rather to a timing option.
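For the next test we could also watch the sync state via the admin
socket while the MON is catching up; a minimal sketch (the mon ID is
just a placeholder):

ceph daemon mon.<id> mon_status | grep state

It should report "synchronizing" until the store sync has finished
and the MON joins quorum again.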
Quoting Eugen Block <eblock@xxxxxx>:
Thanks, Dan!
Yes that sounds familiar from the luminous and mimic days.
The workaround for zillions of snapshot keys at that time was to use:
ceph config set mon mon_sync_max_payload_size 4096
I actually did search for mon_sync_max_payload_keys, not bytes, so I
missed your thread, it seems. Thanks for pointing that out. So the
defaults seem to be these in Octopus:
"mon_sync_max_payload_keys": "2000",
"mon_sync_max_payload_size": "1048576",
So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
I'm confused: if the payload size is too small, why would decreasing
it help? Or am I misunderstanding something? But it probably won't
hurt to try it with 4096 and see if anything changes. If not, we can
still turn on debug logs and take a closer look.
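Something along these lines should do for the debug logs, injected at
runtime and reverted once the slow sync has been reproduced (1/5 are
the defaults):

ceph tell mon.\* injectargs '--debug_mon 10 --debug_paxos 10'
# reproduce the slow sync, then revert:
ceph tell mon.\* injectargs '--debug_mon 1/5 --debug_paxos 1/5'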
And in addition to Dan's suggestion, an HDD is not a good choice for
RocksDB, which is most likely the reason for this thread. I think
that from the 3rd time the database just goes into compaction
maintenance
Believe me, I know... but there's not much they can currently do
about it, quite a long story... But I have been telling them that
for months now. Anyway, I will make some suggestions and report back
if it worked in this case as well.
Thanks!
Eugen
Quoting Dan van der Ster <dan.vanderster@xxxxxxxxx>:
Hi Eugen!
Yes that sounds familiar from the luminous and mimic days.
Check this old thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
(that thread is truncated but I can tell you that it worked for Frank).
Also the even older referenced thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
The workaround for zillions of snapshot keys at that time was to use:
ceph config set mon mon_sync_max_payload_size 4096
That said, that sync issue was supposed to be fixed by way of adding the
new option mon_sync_max_payload_keys, which has been around since nautilus.
So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
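For example, something like this could be a starting point (the
values here are only illustrative and should be verified against the
debug logs):

ceph config set mon mon_sync_max_payload_size 4096
ceph config set mon mon_sync_max_payload_keys 10000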
Good luck!
Dan
______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
On Thu, Jul 6, 2023 at 1:47 PM Eugen Block <eblock@xxxxxx> wrote:
Hi *,
I'm investigating an interesting issue on two customer clusters (used
for mirroring) that I haven't solved yet, but today we finally made
some progress. Maybe someone has an idea where to look next, I'd
appreciate any hints or comments.
These are two (latest) Octopus clusters; the main usage currently is
RBD mirroring in snapshot mode (around 500 RBD images are synced
every 30 minutes). They noticed very long startup times of MON
daemons after a reboot, between 10 and 30 minutes (reboot time
already subtracted). These delays are present on both sites. Today we
got a maintenance window and started to check in more detail by just
restarting the MON service (it joins quorum within seconds), then
stopping the MON service and waiting a few minutes (it still joins
quorum within seconds). Then we stopped the service and waited for
more than 5 minutes, simulating a reboot, and we were able to
reproduce it. The sync then takes around 15 minutes; we verified this
with other MONs as well. The MON store is around 2 GB in size (on
HDD). I understand that the sync itself can take some time, but what
is the threshold here? I tried to find a hint in the MON config,
searching for timeouts of 300 seconds; there were only a few matches
(mon_session_timeout is one of them), but I'm not sure if they can
explain this behavior.
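Such a search could look like this, for example (the mon ID is just a
placeholder):

ceph daemon mon.<id> config show | grep -E 'timeout|interval' | grep 300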
Investigating the MON store (ceph-monstore-tool dump-keys), I noticed
that there were more than 42 million osd_snap keys, which is quite a
lot and would explain the size of the MON store. But I'm also not
sure whether it's related to the long syncing process.
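For reference, counting the keys per prefix works against the store
path (with the MON stopped, or on a copy of the store), for example:

ceph-monstore-tool /var/lib/ceph/mon/ceph-<id> dump-keys | grep '^osd_snap' | wc -l

with /var/lib/ceph/mon/ceph-<id> being the default mon data path.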
Does that sound familiar to anyone?
Thanks,
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx