Hi Adrian,

I have also hit this recently and have since increased osd_snap_trim_sleep to try to stop it from happening again. I haven't yet had an opportunity to actually try to break it again, but your mail seems to suggest it might not be the silver bullet I was looking for.

I'm wondering if the problem is not the removal of the snapshot itself, but the volume of object deletes it triggers, as I see similar results when doing fstrims or deleting RBDs. Either way, I agree that a settable throttle to allow it to process more slowly would be a good addition.

Have you tried that value set higher than 1, maybe 10?

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adrian Saul
> Sent: 22 September 2016 05:19
> To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Snap delete performance impact
>
> Any guidance on this? I have osd_snap_trim_sleep set to 1 and it seems to have tempered some of the issues, but it's still bad enough that NFS storage off RBD volumes becomes unavailable for over 3 minutes.
>
> It seems that the activity triggered when the snapshot deletes are actioned causes massive disk load for around 30 minutes. The logs show OSDs marking each other out, OSDs complaining they are wrongly marked out, and blocked-request errors for around 10 minutes at the start of this activity.
>
> Is there any way to throttle snapshot deletes to make them much more of a background activity? It really should not make the entire platform unusable for 10 minutes.
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adrian Saul
> > Sent: Wednesday, 6 July 2016 3:41 PM
> > To: 'ceph-users@xxxxxxxxxxxxxx'
> > Subject: Snap delete performance impact
> >
> > I recently started a process of using rbd snapshots to set up a backup regime for a few file systems contained in RBD images.
> > While this generally works well, at the time of the snapshots there is a massive increase in latency (from 10ms to multiple seconds of rbd device latency) across the entire cluster. This has flow-on effects for some cluster timeouts, as well as general performance hits to applications.
> >
> > In my research I have found some references to osd_snap_trim_sleep being the way to throttle this activity, but no real guidance on values for it. I also see some other osd_snap_trim tunables (priority and cost).
> >
> > Are there any recommendations around setting these for a Jewel cluster?
> >
> > cheers,
> > Adrian
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
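
For anyone tuning these values later, a minimal sketch of what the thread is discussing, as a ceph.conf fragment on the OSD hosts. The value 10 here is only the suggestion floated in this thread, not a validated recommendation, and the right number will depend on the cluster:

```ini
# ceph.conf [osd] section -- throttle background snapshot trimming
[osd]
osd_snap_trim_sleep = 10      ; seconds to pause between snap trim operations
osd_snap_trim_priority = 1    ; scheduling priority given to snap trim work
```

The sleep can also be changed at runtime with `ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'`, though on some versions injectargs may report the option as unchangeable, in which case an OSD restart is needed for it to take effect.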