On Sun, Jul 30, 2023 at 7:09 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

> I'm missing how connectivity issues between the sites can lead to
> mirror snapshot creation being interrupted on the primary cluster.
> Isn't that operation local to the cluster?
>
> Also, do you know who/what is actually interrupting it? Even if mirror
> snapshot creation ends up taking a while for some reason, I don't think
> anything in RBD would interrupt it.

Indeed, thinking about it more, it does not look like a connectivity
issue was interrupting it. The first thing noticed was a large number of
"purge" (former mirroring) snapshots, so my logical assumption was that
snapshot removal had been interrupted. We also knew from the customer
about "connectivity issues" between the sites, so it was just my
assumption that those interruptions were due to network issues.

At first I thought the snap id leak might have happened on snapshot
removal, and I tested by creating primary snapshots because that also
removes snapshots. But reviewing the code, I have not found any
suspicious place where we could leak the snap id on snapshot removal,
while testing showed that it is quite possible to leak it on snapshot
creation. So currently I think it happens on snapshot creation; I had
just forgotten to revisit my initial assumption about what could cause
the interruption.

Ok, then another suspect could be the rbd_support mgr module. They are
still running Octopus (latest), there are more than 500 mirrored images,
and the snapshot schedule was configured at 3 minutes for each image. I
expect this could cause considerable load and could somehow trigger the
interruption. Could it be due to blacklisting? (Recently we added to the
rbd_support module the ability to restart its rados connection when it
is blacklisted.) After our recommendation, I believe they have now
changed the schedule to a 30-minute interval.
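For reference, a sketch of how the schedule change could be applied with
the rbd CLI (the pool name "mirrored-pool" is my placeholder, and this
assumes a pool-level schedule rather than 500+ per-image ones; it needs
a live cluster, so treat it as illustrative only):

```shell
# List the currently configured mirror snapshot schedules,
# including per-image ones.
rbd mirror snapshot schedule ls --recursive

# Remove the aggressive 3-minute pool-level schedule.
rbd mirror snapshot schedule remove --pool mirrored-pool 3m

# Add the recommended 30-minute schedule instead.
rbd mirror snapshot schedule add --pool mirrored-pool 30m

# Verify when the next snapshots are due.
rbd mirror snapshot schedule status --pool mirrored-pool
```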
Unfortunately, communication with the customer is troublesome: I get
only limited secondhand information, and there are a lot of assumptions
here. Currently we are more interested in how to actually fix the large
number of purged_snap keys in the monstore, but I still thought it would
be useful to report some details on how it could have happened.

> This applies to regular user (i.e. non-mirror) snapshots too.

Sure, mirroring is just the case where you may hit it due to frequent
use.

Thanks,

--
Mykola Golub
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx