Hi Adrian,

I have also hit this recently and have since increased osd_snap_trim_sleep to try to stop it from happening again. I haven't yet had an opportunity to actually try to break it again, but your mail seems to suggest it might not be the silver bullet I was looking for.

I'm wondering if the problem is not the removal of the snapshot itself, but the volume of object deletes it triggers, as I see similar results when doing fstrims or deleting RBDs. Either way, I agree that a settable throttle to allow it to process more slowly would be a good addition.

Have you tried that value set higher than 1, maybe 10?

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adrian Saul
> Sent: 22 September 2016 05:19
> To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Snap delete performance impact
>
> Any guidance on this? I have osd_snap_trim_sleep set to 1 and it seems to have tempered some of the issues, but it's still bad enough that NFS storage off RBD volumes becomes unavailable for over 3 minutes.
>
> It seems that the activity triggered when the snapshot deletes are actioned causes massive disk load for around 30 minutes. The logs show OSDs marking each other out, OSDs complaining they are wrongly marked out, and blocked-request errors for around 10 minutes at the start of this activity.
>
> Is there any way to throttle snapshot deletes to make them much more of a background activity? It really should not make the entire platform unusable for 10 minutes.
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adrian Saul
> > Sent: Wednesday, 6 July 2016 3:41 PM
> > To: 'ceph-users@xxxxxxxxxxxxxx'
> > Subject: Snap delete performance impact
> >
> > I recently started a process of using rbd snapshots to set up a backup regime for a few file systems contained in RBD images.
> > While this generally works well, at the time of the snapshots there is a massive increase in latency (from 10ms to multiple seconds of rbd device latency) across the entire cluster. This has flow-on effects for some cluster timeouts, as well as general performance hits to applications.
> >
> > In my research I have found some references to osd_snap_trim_sleep being the way to throttle this activity, but no real guidance on values for it. I also see some other osd_snap_trim tunables (priority and cost).
> >
> > Are there any recommendations around setting these for a Jewel cluster?
> >
> > cheers,
> > Adrian
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
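
For anyone tuning these values later, a minimal sketch of what the thread is discussing, as a ceph.conf fragment on the OSD hosts. The value 10 here is only the suggestion floated in this thread, not a validated recommendation, and the right number will depend on the cluster:

```ini
# ceph.conf [osd] section -- throttle background snapshot trimming
[osd]
osd_snap_trim_sleep = 10      ; seconds to pause between snap trim operations
osd_snap_trim_priority = 1    ; scheduling priority given to snap trim work
```

The sleep can also be changed at runtime with `ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'`, though on some versions injectargs may report the option as unchangeable, in which case an OSD restart is needed for it to take effect.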