Hi Greg,

Thanks a lot for your work on this one. It really helps us right now.

Would it be easy to add the snaptrim speed to the ceph -s output, like "snaptrim io 144 MB/s, 721 objects/s" (or just objects/s if sizes are unknown)? It would help to see how the snaptrim speed changes along with the snap trimming options.

When a snapshot is removed, all primary OSDs seem to start trimming at the same time. Can we avoid this, or at least limit their number?

Best regards,
Frédéric Nass.

----- On Apr 26, 2017, at 20:24, Gregory Farnum gfarnum@xxxxxxxxxx wrote:

> Hey all,
>
> Resurrecting this thread because I just wanted to let you know that
> Sam's initial work in master has been backported to Jewel and will be
> in the next (10.2.8, I think?) release:
> https://github.com/ceph/ceph/pull/14492/
>
> Once upgraded, it will be safe to use the "osd snap trim sleep" option
> again. It also adds a new "osd max trimming pgs" option (default 2) that
> limits the number of PGs each primary will trim simultaneously, and adds
> "snaptrim" and "snaptrim_wait" to the list of reported PG states. :)
>
> (For those of you running Kraken, its backport hasn't merged yet but
> is at https://github.com/ceph/ceph/pull/14597)
> -Greg
>
> On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Yep, sure. Will try and present some figures at tomorrow's meeting again.
> >
> > From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> > Sent: 21 February 2017 18:14
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
> > non-blocking implementation) to that branch. Care to take it for a whirl?
> > -Sam
> >
> > On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Building now.
> >
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> > Sent: 09 February 2017 19:22
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> > master) passed a rados suite. It adds a configurable limit to the number of
> > PGs which can be trimming on any OSD (default: 2). PGs that are trimming
> > will be in the snaptrim state, and PGs waiting to trim will be in the
> > snaptrim_wait state. I suspect this will be adequate to throttle the amount
> > of trimming. If not, I can try to add an explicit limit to the rate at which
> > the work items trickle into the queue. Can someone test this branch? Tester
> > beware: this has not merged into master yet and should only be run on a
> > disposable cluster.
> > -Sam
> >
> > On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Yeah, it's probably just the fact that they have more PGs, so they hold more
> > data and thus serve more IO. As they have a fixed IO limit, they will always
> > hit that limit first and become the bottleneck.
> >
> > The main problem with reducing the filestore queue is that I believe you
> > will start to lose the benefit of having IOs queued up on the disk, so that
> > the scheduler can rearrange them and action them in the most efficient
> > manner as the disk head moves across the platters. You might possibly see up
> > to a 20% hit on performance, in exchange for more consistent client latency.
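For anyone who wants to try this, here is a minimal sketch of the settings discussed above (assuming Jewel 10.2.8+ with the backported non-blocking implementation; the values are purely illustrative, not recommendations):

    # ceph.conf, [osd] section
    # Seconds to sleep between snap trim work items (safe to use again post-backport)
    osd snap trim sleep = 0.1
    # Maximum number of PGs a primary OSD will trim at the same time (the new default is 2)
    osd max trimming pgs = 2

    # Or inject at runtime without restarting the OSDs:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_max_trimming_pgs 2'

    # Until something like a "snaptrim io" line exists in ceph -s, a rough way to
    # watch progress is to count PGs in the snaptrim / snaptrim_wait states:
    ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

Note that the simple grep counts both actively trimming and waiting PGs.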
> > From: Steve Taylor [mailto:steve.taylor@xxxxxxxxxxxxxxxx]
> > Sent: 07 February 2017 20:35
> > To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Thanks, Nick.
> >
> > One other data point that has come up is that nearly all of the blocked
> > requests that are waiting on subops are waiting for OSDs with more PGs than
> > the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> > OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> > individually have 33% more PGs than the others and are causing almost all of
> > the blocked requests. It appears that map updates are generally not blocking
> > long enough to show up as blocked requests.
> >
> > I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> > I'll test some more when the PG counts per OSD are more balanced and see
> > what I get. I'll also play with the filestore queue. I was telling some of
> > my colleagues yesterday that this looked likely to be related to buffer
> > bloat somewhere. I appreciate the suggestion.
> >
> > Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> >
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Tuesday, February 7, 2017 10:25 AM
> > To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Hi Steve,
> >
> > From what I understand, the issue is not with the queueing in Ceph, which is
> > correctly moving client IO to the front of the queue. The problem lies below
> > what Ceph controls, i.e. the scheduler and disk layer in Linux. Once the IOs
> > leave Ceph it's a bit of a free-for-all, and the client IOs tend to get lost
> > in large disk queues surrounded by all the snap trim IOs.
> >
> > The workaround Sam is working on will limit the number of snap trims that
> > are allowed to run, which I believe will have a similar effect to the sleep
> > parameters in pre-Jewel clusters, but without pausing the whole IO thread.
> >
> > Ultimately the solution requires Ceph to be able to control the queuing of
> > IOs at the lower levels of the kernel. Whether this is via some sort of
> > tagging per IO (currently CFQ is only per thread/process) or some other
> > method, I don't know. I was speaking to Sage, and he thinks the easiest
> > method might be to shrink the filestore queue so that you don't get buffer
> > bloat at the disk level. You should be able to test this out pretty easily
> > now by changing the parameter; a queue depth of around 5-10 would probably
> > be about right for spinning disks. It's a trade-off of peak throughput vs
> > queue latency, though.
> >
> > Nick
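A sketch of the two knobs discussed in the exchange above. I'm assuming the filestore queue parameter Nick means is filestore_queue_max_ops, and the OSD id and values below are only examples:

    # Shrink the filestore queue so fewer IOs pile up at the disk
    # (Nick's suggestion of roughly 5-10 outstanding ops for spinning disks):
    ceph tell osd.* injectargs '--filestore_queue_max_ops 10'

    # Equivalent persistent setting in ceph.conf, [osd] section:
    # filestore queue max ops = 10

    # Down-weight an over-full OSD so it carries fewer PGs, as Steve did with
    # the seven 4TB OSDs (osd 42 is a made-up example):
    ceph osd reweight 42 0.75

Reweighting triggers backfill, so expect extra recovery traffic while the PGs move.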
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Steve Taylor
> > Sent: 07 February 2017 17:01
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > As I look at more of these stuck ops, it looks like more of them are
> > actually waiting on subops than on osdmap updates, so maybe there is still
> > some headway to be made with the weighted priority queue settings. I do see
> > OSDs waiting for map updates all the time, but they aren't blocking things
> > as much as the subops are. Thoughts?
> >
> > Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> >
> > From: Steve Taylor
> > Sent: Tuesday, February 7, 2017 9:13 AM
> > To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Sorry, I lost the previous thread on this. I apologize for the resulting
> > incomplete reply.
> >
> > The issue that we're having with Jewel, as David Turner mentioned, is that
> > we can't seem to throttle snap trimming sufficiently to prevent it from
> > blocking I/O requests. On further investigation, I encountered
> > osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
> > used in conjunction with 'osd_op_queue = wpq' to govern the availability of
> > queue positions for various operations using costs. I'm testing with RBDs
> > using 4MB objects, so in order to leave plenty of room in the weighted
> > priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to
> > 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> > reserve 32MB in the queue for client I/O operations, which are prioritized
> > higher and therefore shouldn't get blocked.
> >
> > I still see blocked I/O requests, and when I dump in-flight ops, they show
> > 'op must wait for map.' I assume this means that what's blocking the I/O
> > requests at this point is all of the osdmap updates caused by snap trimming,
> > and not the actual snap trimming itself starving the ops of op threads.
> > Hammer is able to mitigate this with osd_snap_trim_sleep by directly
> > throttling snap trimming and therefore causing less frequent osdmap updates,
> > but there doesn't seem to be a good way to accomplish the same thing with
> > Jewel.
> >
> > First of all, am I understanding these settings correctly? If so, are there
> > other settings that could potentially help here, or do we just need
> > something like Sam already mentioned that can sort of reserve threads for
> > client I/O requests? Even then, it seems like we might have issues if we
> > can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> > daily basis, which we recognize is an extreme use case. Just wondering if
> > there's something else to try or if we need to start working toward
> > implementing something new ourselves to handle our use case better.
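The wpq experiment Steve describes would look roughly like this as config (a sketch only; 64MB = 67108864 bytes, 32MB+1 = 33554433, and osd.12 below is a made-up example):

    [osd]
    # Use the weighted priority queue so op cost/priority drives ordering
    osd op queue = wpq
    # 64 MB of tokens per priority level
    osd op pq max tokens per priority = 67108864
    # 32 MB + 1, so a snap trim item takes just over half of the tokens,
    # leaving roughly 32 MB of queue room for higher-priority client I/O
    osd snap trim cost = 33554433

    # Inspect what blocked requests are actually waiting on (run on the OSD host):
    ceph daemon osd.12 dump_ops_in_flight

In that dump, 'op must wait for map' entries point at osdmap churn, while ops waiting on subops point back at the slow peer OSDs discussed earlier in the thread.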
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com