Re: osd_snap_trim_sleep keeps locks PG during sleep?

Nick Fisk <nick@xxxxxxxxxx> · Fri, 13 Jan 2017 20:37:30 -0000

We're on Jewel and you're right, I'm pretty sure the snap stuff is also now handled in the op thread.

The dump_historic_ops admin socket command showed a 10s delay at the "Reached PG" stage. Going by Greg's response [1], that would suggest that the OSD itself isn't blocked, but rather that the PG is locked whilst the snap trim is sleeping. I think in the former case (all op threads stuck sleeping) the op would instead show a high time on the "Started" part? Anyway, I will carry out some more testing with a higher number of osd op threads and see if that makes any difference. Thanks for the suggestion.
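
To illustrate the PG-lock case, here is a minimal Python sketch. It is a toy simulation and not Ceph code: `client_latency`, `trimmer` and `pg_lock` are hypothetical names, and a plain `threading.Lock` stands in for the PG lock. It measures how long a client op waits for the lock when the trimmer sleeps while holding it, compared with sleeping after releasing it.

```python
import threading
import time


def client_latency(trim_sleep, hold_lock_while_sleeping):
    """Return how long a simulated client op waits for the PG lock.

    Toy model only: a threading.Lock stands in for the PG lock and
    time.sleep() stands in for osd_snap_trim_sleep.
    """
    pg_lock = threading.Lock()
    trimmer_started = threading.Event()

    def trimmer():
        if hold_lock_while_sleeping:
            # Suspected behaviour: the sleep happens with the lock held.
            with pg_lock:
                trimmer_started.set()
                time.sleep(trim_sleep)
        else:
            # Alternative: do the work under the lock, then sleep unlocked.
            with pg_lock:
                trimmer_started.set()
            time.sleep(trim_sleep)

    worker = threading.Thread(target=trimmer)
    worker.start()
    trimmer_started.wait()  # make sure the trimmer got the lock first

    start = time.monotonic()
    with pg_lock:  # the client op needs the same PG lock
        waited = time.monotonic() - start

    worker.join()
    return waited


if __name__ == "__main__":
    for hold in (True, False):
        waited = client_latency(0.3, hold)
        print(f"hold_lock_while_sleeping={hold}: client waited {waited:.3f}s")
```

If the sleep really is taken with the lock held, the client wait in this model tracks the sleep value, which would be consistent with every op showing roughly 10s at "Reached PG" when the sleep is set to 10s. With the lock released first, the wait should be close to zero.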

Nick

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008652.html

> -----Original Message-----
> From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
> Sent: 13 January 2017 10:28
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?
> 
> Hammer or jewel? I've forgotten which thread pool is handling the snap trim nowadays -- is it the op thread yet? If so, perhaps all the
> op threads are stuck sleeping? Just a wild guess. (Maybe increasing # op threads would help?).
> 
> -- Dan
> 
> 
> On Thu, Jan 12, 2017 at 3:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Hi,
> >
> > I had been testing some higher values with the osd_snap_trim_sleep
> > variable to try and reduce the impact of removing RBD snapshots on our
> > cluster and I have come across what I believe to be a possible
> > unintended consequence. The value of the sleep seems to keep the lock
> > on the PG open so that no other IO can use the PG whilst the snap
> > removal operation is sleeping.
> >
> > I had set the variable to 10s to completely minimise the impact as I
> > had some multi TB snapshots to remove and noticed that suddenly all IO
> > to the cluster had a latency of roughly 10s as well, all the dumped
> > ops show waiting on PG for 10s as well.
> >
> > Is the osd_snap_trim_sleep variable only ever meant to be used up to
> > say a max of 0.1s and this is a known side effect, or should the lock
> > on the PG be removed so that normal IO can continue during the sleeps?
> >
> > Nick
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
