rbd mirror snapshot trash

Eugen Block <eblock@xxxxxx> · Tue, 16 May 2023 07:47:36 +0000

Good morning,

I would be grateful if anybody could shed some light on this. I can't
reproduce it in my lab clusters, so I'm hoping the community can help.
A customer has two clusters with snapshot-based rbd mirroring enabled.
It seems to work fine: they run regular checks, and the images on the
remote site are synced correctly. While investigating a different
issue (which could be related, though), we noticed that there are quite
a lot of snapshots in the trash namespace, but for unknown reasons they
are not purged. This is some example output from the remote site (the
primary site looks similar):

---snip---
# rbd snap ls --all <pool>/<image>
SNAPID     NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE
 23962608  42c2e4c3-e1ab-480f-9d29-aab5555c751b                                                           20 GiB             Tue Sep 20 20:33:18 2022  trash (.mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.ef345a74-2585-48ac-8e25-09b15c20c877)
 96796437  24cc96d4-51c6-447e-870d-1545d8ec9308                                                           20 GiB             Tue Apr 18 12:15:28 2023  trash (.mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.a6ae46e4-cbb5-4264-b2e4-6b565820aa16)
110025019  .mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.0fc73e88-6671-4ba8-a118-8fcbcd422c22  20 GiB             Mon May 15 16:27:39 2023  mirror (non-primary peer_uuids:[] 18ac37b9-a6c6-4f64-ba43-a1f0d02b3f96:115986586 copied)
---snip---
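
To see how widespread the leftover entries are, the same listing can be
tallied by namespace. The following is only a minimal sketch: it assumes
that "rbd snap ls --all --format json" returns a list of objects with
"id", "name" and a "namespace" object carrying a "type" field (and
"original_name" for trash entries). Those field names are an assumption
and should be checked against the rbd version in use. The helper names
are made up for illustration, and the sample data is shortened from the
output above.

```python
import json
import subprocess
from collections import Counter

# Sample shaped like the listing in this post. The JSON field names are an
# ASSUMPTION about the --format json output; verify them on your cluster.
SAMPLE = """
[
  {"id": 23962608, "name": "42c2e4c3-e1ab-480f-9d29-aab5555c751b",
   "namespace": {"type": "trash",
                 "original_name": ".mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.ef345a74-2585-48ac-8e25-09b15c20c877"}},
  {"id": 96796437, "name": "24cc96d4-51c6-447e-870d-1545d8ec9308",
   "namespace": {"type": "trash",
                 "original_name": ".mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.a6ae46e4-cbb5-4264-b2e4-6b565820aa16"}},
  {"id": 110025019,
   "name": ".mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.0fc73e88-6671-4ba8-a118-8fcbcd422c22",
   "namespace": {"type": "mirror"}}
]
"""


def list_snapshots(image_spec):
    """Run `rbd snap ls --all --format json` for one image (needs a cluster)."""
    result = subprocess.run(
        ["rbd", "snap", "ls", "--all", "--format", "json", image_spec],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def namespace_type(snap):
    """Return the namespace type; entries without one count as 'user'."""
    return (snap.get("namespace") or {}).get("type", "user")


def count_by_namespace(snaps):
    """Tally snapshots per namespace type (trash, mirror, user, ...)."""
    return Counter(namespace_type(s) for s in snaps)


def trash_entries(snaps):
    """Return only the snapshots that sit in the trash namespace."""
    return [s for s in snaps if namespace_type(s) == "trash"]


if __name__ == "__main__":
    snaps = json.loads(SAMPLE)  # on a cluster: list_snapshots("<pool>/<image>")
    print(dict(count_by_namespace(snaps)))
    for snap in trash_entries(snaps):
        print(snap["id"], snap["name"])
```

Looping list_snapshots() over the images in the pool would show which
images carry trash entries and how old the oldest ones are.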

They're on the latest Octopus release and use all config defaults.
There are more than 500 images in that pool, each with a snapshot
schedule of 3 minutes.
A couple of questions arose, and hopefully at least some of them can
be answered:

- Why are there snapshots in the trash namespace dating back to
September 2022? How can they be cleaned up, and why aren't they cleaned
up automatically? Not all images have trash entries, though.
- Could disabling mirroring for those images and then enabling it
again help get rid of the trash?
- Is the default osd_max_trimming_pgs = 2 a bottleneck here? They have
week-old sst files in the MONs' store.db, leading to long startup times
for the MONs after a reboot/restart. Would increasing that value help
get rid of the trash entries and maybe also trim the mon store?
- Regarding the general snapshot mirroring procedure, the default of
rbd_mirroring_max_mirroring_snapshots is 5, but I assume that the
number of active snapshots would only grow in case of a disruption
between the two sites, correct? If the sync works, there's no need to
keep 5 snapshots, right?
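
For a sense of scale behind the snaptrim question, here is a rough
back-of-the-envelope calculation. It is only a sketch under stated
assumptions: exactly 500 images (the pool has more, so this is a lower
bound), every schedule firing on time every 3 minutes, and every mirror
snapshot eventually being removed again, so that removals keep pace with
creations. The function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def snapshots_per_day(images, interval_minutes):
    """Mirror snapshots created per day if every schedule fires on time."""
    return images * (MINUTES_PER_DAY // interval_minutes)


def max_kept_snapshots(images, kept_per_image):
    """Upper bound on live mirror snapshots if every image hits the limit."""
    return images * kept_per_image


if __name__ == "__main__":
    images = 500    # "more than 500 images", so treat this as a lower bound
    interval = 3    # snapshot schedule in minutes
    created = snapshots_per_day(images, interval)
    print(f"snapshots created per day:    {created}")
    print(f"snapshots created per second: {created / SECONDS_PER_DAY:.2f}")
    # Default rbd_mirroring_max_mirroring_snapshots as quoted in this post.
    print(f"live snapshots at 5 per image: {max_kept_snapshots(images, 5)}")
```

Under these assumptions that is about 240,000 snapshot creations per day
on the primary site, and in steady state a similar number of removals
that have to be trimmed.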

I'm still looking into these things myself but I'd appreciate anyone  
chiming in here.

Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx