Re: Insane number of "osd_snap purged_snap" keys in monstore db due to rbd-mirror

On Sat, Jul 29, 2023 at 2:11 PM Mykola Golub <to.my.trociny@xxxxxxxxx> wrote:
>
> Hi,
>
> We have a customer with an abnormally large number of "osd_snap /
> purged_snap_{pool}_{snapid}" keys in the monstore db: almost 40
> million. Among other problems, it causes a very long mon
> synchronization on startup.
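
For reference, a quick way to gauge the key count would be something
like the below, run against an offline copy of the mon store so the
running mon isn't touched (the paths are just placeholders):

  # take the copy with the mon stopped, or use an existing backup
  cp -a /var/lib/ceph/mon/ceph-a/store.db /tmp/store.db.copy
  ceph-kvstore-tool rocksdb /tmp/store.db.copy list osd_snap \
    | grep -c purged_snap
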
>
> Our understanding is that the cause is that mirror snapshot creation
> is very frequently interrupted in their environment, most likely due
> to connectivity issues between the sites. The assumption is

Hi Mykola,

I'm missing how connectivity issues between the sites can lead to
mirror snapshot creation being interrupted on the primary cluster.
Isn't that operation local to the cluster?

Also, do you know who/what is actually interrupting it?  Even if mirror
snapshot creation ends up taking a while for some reason, I don't think
anything in RBD would interrupt it.

> based on the fact that they have a lot of rbd "trash" snapshots, which
> may happen when an rbd snapshot removal is interrupted. (Creating a
> mirror snapshot usually also involves removing an older snapshot to
> keep the total number of mirror snapshots for the image under the
> limit.)
>
> We removed all "trash" snapshots manually, so currently they have only
> a limited number of "expected" snapshots, but the number of
> purged_snap keys remains just as large.
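
As a side note, a rough way to double-check that the manual cleanup
really got everything in the trash namespace (the pool name "rbd"
below is a placeholder, and grepping the plain-text listing is
admittedly crude):

  # flag images that still report a snapshot in the trash namespace
  for img in $(rbd ls rbd); do
    rbd snap ls --all "rbd/$img" | grep -qw trash \
      && echo "$img: trash snapshot(s) left"
  done
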
>
> So, our understanding is that if rbd snapshot creation is frequently
> interrupted, there is a chance it will be interrupted in or just
> after SnapshotCreateRequest::send_allocate_snap_id [1], when it
> requests a new snap id from the mon. As a result, this id is never
> tracked by rbd and therefore never removed, and snap id holes like
> this mean that "purged_snap_{pool}_{snapid}" ranges can never merge.
>
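
If I follow the key encoding correctly, a concrete example would be:
snap ids 100..110 get allocated for an image, id 105 is lost to an
interrupted creation, and the rest are eventually deleted and reported
purged.  The mon then keeps one entry covering [100,105) and another
covering [106,111), and because 105 is never purged the two ranges can
never be coalesced into a single key.  The leftovers should be visible
directly in the store (same offline copy as above; pool id 2 is just
an example):

  # list the purged_snap ranges recorded for pool 2
  ceph-kvstore-tool rocksdb /tmp/store.db.copy list osd_snap \
    | grep 'purged_snap_2_'
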
> To confirm that this scenario is likely, I ran the following simple
> test, which interrupts rbd mirror snapshot creation at a random time:
>
>   for i in `seq 500`;do
>     rbd mirror image snapshot test&
>     PID=$!
>     sleep $((RANDOM % 5)).$((RANDOM % 10))
>     kill $PID && sleep 30
>   done
>
> Running this with debug_rbd=30 and checking the rbd client logs, I
> see that it was interrupted in send_allocate_snap_id 74 times, which
> is surprisingly often.

This applies to regular user (i.e. non-mirror) snapshots too.

>
> And after the experiment, and after removing the rbd image together
> with all of its tracked snapshots (i.e. leaving the pool with no
> known rbd snapshots), I see "purged_snap_{pool}_{snapid}" keys for
> ranges that I believe will never be merged.
>
> So the questions are:
>
> 1) Is there a way we could improve this to keep the monstore from
> growing so large?

Nothing simple comes to mind.  The issue is that getting a snap ID on
the monitor and registering a snapshot with the image on the OSDs are
fundamentally separate steps, with the latter requiring a snap ID from
the former.  Unless the process of allocating a snap ID itself becomes
two-step, where a freshly allocated snap ID is initially marked
inactive and later, after it gets persisted, it's switched to active
with a separate request to the monitor, one could always generate
"forgotten" snap IDs by trying hard enough.  (I'm assuming that in
such a two-step process, monitors would clean up inactive snap IDs
after a timeout.)
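
Just to make that two-step idea concrete, here is a toy model of the
flow (nothing like this exists in Ceph today; the file-backed "store"
and the helper names are made up purely for illustration):

  # toy model of the hypothetical two-step snap id allocation
  STORE=$(mktemp -d)

  # step 1: hand out a new id, but only mark it pending
  alloc_snap_id() {
    local id=$(( $(cat "$STORE/seq" 2>/dev/null || echo 0) + 1 ))
    echo "$id" > "$STORE/seq"
    date +%s > "$STORE/pending.$id"
    echo "$id"
  }

  # step 2: the snapshot got persisted on the client side, activate the id
  confirm_snap_id() {
    mv "$STORE/pending.$1" "$STORE/active.$1"
  }

  # monitor-side cleanup: drop pending ids older than $1 seconds
  reap_stale_ids() {
    local now=$(date +%s) f
    for f in "$STORE"/pending.*; do
      [ -e "$f" ] || continue
      [ $(( now - $(cat "$f") )) -gt "$1" ] && rm -f "$f"
    done
  }

  # a client killed between alloc and confirm leaves only a pending id
  # behind, which reap_stale_ids later discards
  id=$(alloc_snap_id)
  confirm_snap_id "$id"
  reap_stale_ids 600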

In general, I don't think we are resilient to these scenarios.
I suspect there are many similar "some piece of metadata is left behind
if the command is killed at the wrong moment" issues lurking there.

>
> 2) How can we fix the current situation in the cluster? Would it be
> safe enough to just run
> `ceph-kvstore-tool rocksdb store.db rm-prefix osd_snap`
> to remove all osd_snap keys (including purged_epoch keys)? Due to the
> large db size I don't think it would be feasible to remove keys
> selectively with the
> `ceph-kvstore-tool rocksdb store.db rm {prefix} {key}`
> command, so we could only use the `rm-prefix` command. Looking at the
> code, and after actually trying it in a test environment, it seems
> like it could work, but am I missing something dangerous here?
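
For what it's worth, I can't say whether dropping the whole osd_snap
prefix is safe, but if it turns out to be, mechanically I'd expect the
procedure to look roughly like this, with the mon stopped and a backup
taken first (mon id "a" and the paths are placeholders; how to
coordinate this across all the mons is part of what needs confirming):

  systemctl stop ceph-mon@a
  # back up the store before touching it
  cp -a /var/lib/ceph/mon/ceph-a/store.db /root/store.db.bak
  # drop every key under the osd_snap prefix (purged_snap and purged_epoch)
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db rm-prefix osd_snap
  systemctl start ceph-mon@a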

Adding Radek.

Thanks,

                Ilya
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
