Hey all,

Resurrecting this thread because I just wanted to let you know that Sam's
initial work in master has been backported to Jewel and will be in the next
(10.2.8, I think?) release: https://github.com/ceph/ceph/pull/14492/

Once upgraded, it will be safe to use the "osd snap trim sleep" option again.
It also adds a new "osd max trimming pgs" option (default 2) that limits the
number of PGs each primary will simultaneously trim on, and adds "snaptrim"
and "snaptrim_wait" to the list of reported PG states. :)

(For those of you running Kraken, its backport hasn't merged yet but is at
https://github.com/ceph/ceph/pull/14597)
-Greg
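
For anyone wanting to try this after upgrading, here is a minimal sketch of
what re-enabling the throttle might look like. The option names are the ones
mentioned above; the sleep value is purely illustrative and needs tuning for
your hardware and snapshot churn:

    # ceph.conf on the OSD hosts (illustrative values only)
    [osd]
    osd snap trim sleep = 0.1    # seconds; illustrative, tune per cluster
    osd max trimming pgs = 2     # the default; PGs trimming at once per primary

    # or injected at runtime:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_max_trimming_pgs 2'

    # PGs actively trimming or queued to trim should then show up in the
    # "snaptrim" / "snaptrim_wait" states, e.g. in the output of:
    ceph pg stat
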
On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Yep sure, will try and present some figures at tomorrow's meeting again.
>
> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> Sent: 21 February 2017 18:14
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
> non-blocking implementation) to that branch. Care to take it for a whirl?
>
> -Sam
>
> On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Building now
>
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> Sent: 09 February 2017 19:22
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> master) passed a rados suite. It adds a configurable limit to the number of
> PGs which can be trimming on any OSD (default: 2). PGs trimming will be in
> the snaptrim state, and PGs waiting to trim will be in the snaptrim_wait
> state. I suspect this'll be adequate to throttle the amount of trimming. If
> not, I can try to add an explicit limit to the rate at which the work items
> trickle into the queue. Can someone test this branch? Tester beware: this
> has not merged into master yet and should only be run on a disposable
> cluster.
>
> -Sam
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Yeah, it's probably just the fact that they have more PGs, so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IOs queued up on the disk, so that
> the scheduler can re-arrange them and action them in the most efficient
> manner as the disk head moves across the platters. You might see up to a
> 20% hit on performance, in exchange for more consistent client latency.
>
> From: Steve Taylor [mailto:steve.taylor@xxxxxxxxxxxxxxxx]
> Sent: 07 February 2017 20:35
> To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Thanks, Nick.
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that map updates are generally not
> blocking long enough to show up as blocked requests.
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I'll test some more when the PG counts per OSD are more balanced and see
> what I get. I'll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
>
> ________________________________
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
>
> ________________________________
>
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Tuesday, February 7, 2017 10:25 AM
> To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Hi Steve,
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once
> the IOs leave Ceph it's a bit of a free-for-all, and the client IOs tend to
> get lost in large disk queues surrounded by all the snap trim IOs.
>
> The workaround Sam is working on will limit the number of snap trims that
> are allowed to run, which I believe will have a similar effect to the sleep
> parameters in pre-Jewel clusters, but without pausing the whole IO thread.
>
> Ultimately the solution requires Ceph to be able to control the queuing of
> IOs at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don't know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don't get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter; a queue of around 5-10 would probably be
> about right for spinning disks. It's a trade-off of peak throughput vs
> queue latency though.
>
> Nick
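
(The "filestore queue" knob being discussed above is presumably
filestore_queue_max_ops. A minimal sketch of the kind of change being
suggested, with the value purely illustrative and taken from the 5-10 figure
above:

    # ceph.conf, [osd] section; illustrative only, and assumes the parameter
    # in question is filestore_queue_max_ops. Expect the trade-off described
    # above: lower on-disk queue latency at the cost of peak throughput.
    filestore queue max ops = 10

    # or injected at runtime (an OSD restart may be needed to fully apply):
    ceph tell osd.* injectargs '--filestore_queue_max_ops 10'
)
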
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Steve Taylor
> Sent: 07 February 2017 17:01
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> As I look at more of these stuck ops, it looks like more of them are
> actually waiting on subops than on osdmap updates, so maybe there is still
> some headway to be made with the weighted priority queue settings. I do see
> OSDs waiting for map updates all the time, but they aren't blocking things
> as much as the subops are. Thoughts?
>
> ________________________________
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
>
> ________________________________
>
> From: Steve Taylor
> Sent: Tuesday, February 7, 2017 9:13 AM
> To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Sorry, I lost the previous thread on this. I apologize for the resulting
> incomplete reply.
>
> The issue that we're having with Jewel, as David Turner mentioned, is that
> we can't seem to throttle snap trimming sufficiently to prevent it from
> blocking I/O requests. On further investigation, I encountered
> osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
> used in conjunction with 'osd_op_queue = wpq' to govern the availability of
> queue positions for various operations using costs. I'm testing with RBDs
> using 4MB objects, so in order to leave plenty of room in the weighted
> priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to
> 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> reserve 32MB in the queue for client I/O operations, which are prioritized
> higher and therefore shouldn't get blocked.
>
> I still see blocked I/O requests, and when I dump in-flight ops, they show
> 'op must wait for map.' I assume this means that what's blocking the I/O
> requests at this point is all of the osdmap updates caused by snap
> trimming, and not the actual snap trimming itself starving the ops of op
> threads. Hammer is able to mitigate this with osd_snap_trim_sleep by
> directly throttling snap trimming and therefore causing less frequent
> osdmap updates, but there doesn't seem to be a good way to accomplish the
> same thing with Jewel.
>
> First of all, am I understanding these settings correctly? If so, are there
> other settings that could potentially help here, or do we just need
> something like Sam already mentioned that can sort of reserve threads for
> client I/O requests? Even then it seems like we might have issues if we
> can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> daily basis, which we recognize is an extreme use case. Just wondering if
> there's something else to try or if we need to start working toward
> implementing something new ourselves to handle our use case better.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com