Re: osd_snap_trim_sleep keeps locks PG during sleep?

On Wed, Apr 26, 2017 at 1:38 PM, Frédéric Nass
<frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> Hi Greg,
>
> Thanks a lot for your work on this one. It really helps us right now.
>
> Would it be easy to add the snaptrim speed to ceph -s, like "snaptrim io 144 MB/s, 721 objects/s" (or just objects/s if sizes are unknown)?
> It would help to see how the snaptrim speed changes as the snap trimming options are adjusted.

I've added a ticket for this. I'm not sure whether exposing a
data-deletion rate will be useful, since it's mostly the IOPS that
users care about, and that's determined by the number of objects. But
the objects/s number definitely makes sense.
http://tracker.ceph.com/issues/19799

>
> When a snapshot is removed, all primary OSDs seem to start trimming at the same time. Can we avoid this or limit their number?

Assuming you have replica 3 and "osd max trimming pgs = 2", each OSD
will trim 6 PGs worth of stuff at once (roughly — the balance won't
quite be perfect). You could turn osd max trimming pgs down to 1 to
cut the load in half, and introduce a sleep of appropriate length for
the others if you want to throttle more precisely. Doing something
that requires cross-node coordination probably isn't a good choice,
though — in most clusters, a single snap trim won't do more than a
couple object deletes (or even none!) and having to go to remote OSDs
and get reservations would just slow everything down to no purpose.
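
For reference, a rough sketch of that kind of throttling (the option names are
the ones above; the values are only illustrative and would want tuning per
cluster):

  # ceph.conf, [osd] section
  osd max trimming pgs = 1
  osd snap trim sleep = 0.1

  # or at runtime:
  ceph tell osd.* injectargs '--osd_max_trimming_pgs 1 --osd_snap_trim_sleep 0.1'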
-Greg

>
> Best regards,
>
> Frédéric Nass.
>
> ----- On 26 Apr 17, at 20:24, Gregory Farnum gfarnum@xxxxxxxxxx wrote:
>
>> Hey all,
>
>> Resurrecting this thread because I just wanted to let you know that
>> Sam's initial work in master has been backported to Jewel and will be
>> in the next (10.2.8, I think?) release:
>> https://github.com/ceph/ceph/pull/14492/
>
>> Once upgraded, it will be safe to use the "osd snap trim sleep" option
>> again. It also adds a new "osd max trimming pgs" (default 2) that
>> limits the number of PGs each primary will simultaneously trim on, and
>> adds "snaptrim" and "snaptrim_wait" to the list of reported PG states.
>> :)
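>
>> Once on that release, one quick way to keep an eye on trimming is to watch
>> for those states, e.g. (rough sketch; output format varies a bit by release):
>
>>   ceph -s                                  # PG state summary includes snaptrim counts
>>   ceph pg dump pgs_brief | grep snaptrim   # PGs currently in snaptrim/snaptrim_wait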
>
>> (For those of you running Kraken, its backport hasn't merged yet but
>> is at https://github.com/ceph/ceph/pull/14597)
>> -Greg
>
>> On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Yep sure, will try and present some figures at tomorrow’s meeting again.
>
>
>
>> > From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>> > Sent: 21 February 2017 18:14
>
>
>> > To: Nick Fisk <nick@xxxxxxxxxx>
>> > Cc: ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?
>
>
>
>> > Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
>> > non-blocking implementation) to that branch. Care to take it for a whirl?
>
>> > -Sam
>
>
>
>> > On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> > Building now
>
>
>
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> > Samuel Just
>> > Sent: 09 February 2017 19:22
>> > To: Nick Fisk <nick@xxxxxxxxxx>
>> > Cc: ceph-users@xxxxxxxxxxxxxx
>
>
>> > Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?
>
>
>
>> > Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
>> > master) passed a rados suite. It adds a configurable limit to the number of
>> > PGs which can be trimming on any OSD (default: 2). PGs trimming will be in
>> > snaptrim state, PGs waiting to trim will be in snaptrim_wait state. I
>> > suspect this'll be adequate to throttle the amount of trimming. If not, I
>> > can try to add an explicit limit to the rate at which the work items trickle
>> > into the queue. Can someone test this branch? Tester beware: this has not
>> > merged into master yet and should only be run on a disposable cluster.
>
>> > -Sam
>
>
>
>> > On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
>> > Yeah it’s probably just the fact that they have more PGs, so they will hold
>> > more data and thus serve more IO. As they have a fixed IO limit, they will
>> > always hit the limit first and become the bottleneck.
>
>
>
>> > The main problem with reducing the filestore queue is that I believe you
>> > will start to lose the benefit of having IOs queued up on the disk, so that
>> > the scheduler can rearrange them and action them in the most efficient manner
>> > as the disk head moves across the platters. You might see up to a
>> > 20% hit on performance, in exchange for more consistent client latency.
>
>
>
>> > From: Steve Taylor [mailto:steve.taylor@xxxxxxxxxxxxxxxx]
>> > Sent: 07 February 2017 20:35
>> > To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>
>
>> > Subject: RE: Re:  osd_snap_trim_sleep keeps locks PG during
>> > sleep?
>
>
>
>> > Thanks, Nick.
>
>
>
>> > One other data point that has come up is that nearly all of the blocked
>> > requests that are waiting on subops are waiting for OSDs with more PGs than
>> > the others. My test cluster has 184 OSDs, 177 of which are 3TB, plus 7 4TB
>> > OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
>> > individually have 33% more PGs than the others and are causing almost all of
>> > the blocked requests. It appears that map updates are generally not
>> > blocking long enough to show up as blocked requests.
>
>
>
>> > I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
>> > I’ll test some more when the PG counts per OSD are more balanced and see
>> > what I get. I’ll also play with the filestore queue. I was telling some of
>> > my colleagues yesterday that this looked likely to be related to buffer
>> > bloat somewhere. I appreciate the suggestion.
>
>
>
>> > ________________________________
>
>> > Steve Taylor | Senior Software Engineer | StorageCraft Technology
>> > Corporation
>> > 380 Data Drive Suite 300 | Draper | Utah | 84020
>> > Office: 801.871.2799 |
>
>> > ________________________________
>
>> > If you are not the intended recipient of this message or received it
>> > erroneously, please notify the sender and delete it, together with any
>> > attachments, and be advised that any dissemination or copying of this
>> > message is prohibited.
>
>> > ________________________________
>
>> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
>> > Sent: Tuesday, February 7, 2017 10:25 AM
>> > To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
>> > Subject: RE: Re:  osd_snap_trim_sleep keeps locks PG during
>> > sleep?
>
>
>
>> > Hi Steve,
>
>
>
>> > From what I understand, the issue is not with the queueing in Ceph, which is
>> > correctly moving client IO to the front of the queue. The problem lies below
>> > what Ceph controls, i.e. the scheduler and disk layer in Linux. Once the IOs
>> > leave Ceph it’s a bit of a free-for-all, and the client IOs tend to get lost
>> > in large disk queues surrounded by all the snap trim IOs.
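>
>> > For what it’s worth, the kernel-side queue and scheduler those IOs end up in
>> > can be inspected per disk (sdX below is just a placeholder for an OSD’s data
>> > disk):
>
>> >   cat /sys/block/sdX/queue/scheduler     # active scheduler shown in brackets, e.g. [cfq]
>> >   cat /sys/block/sdX/queue/nr_requests   # depth of the block-layer request queue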
>
>
>
>> > The workaround Sam is working on will limit the number of snap trims that
>> > are allowed to run, which I believe will have a similar effect to the sleep
>> > parameters in pre-Jewel clusters, but without pausing the whole IO thread.
>
>
>
>> > Ultimately the solution requires Ceph to be able to control the queuing of
>> > IOs at the lower levels of the kernel. Whether this is via some sort of
>> > tagging per IO (currently CFQ is only per thread/process) or some other
>> > method, I don’t know. I was speaking to Sage and he thinks the easiest
>> > method might be to shrink the filestore queue so that you don’t get buffer
>> > bloat at the disk level. You should be able to test this out pretty easily
>> > now by changing the parameter; a queue depth of around 5-10 would probably
>> > be about right for spinning disks. It’s a trade-off of peak throughput vs
>> > queue latency though.
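>
>> > A sketch of what that test might look like, assuming the knob being referred
>> > to is the filestore_queue_max_ops limit (10 picked from the 5-10 range above,
>> > spinning disks only):
>
>> >   ceph tell osd.* injectargs '--filestore_queue_max_ops 10'
>
>> > or persistently in ceph.conf under [osd]: filestore queue max ops = 10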
>
>
>
>> > Nick
>
>
>
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> > Steve Taylor
>> > Sent: 07 February 2017 17:01
>> > To: ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?
>
>
>
>> > As I look at more of these stuck ops, it looks like more of them are
>> > actually waiting on subops than on osdmap updates, so maybe there is still
>> > some headway to be made with the weighted priority queue settings. I do see
>> > OSDs waiting for map updates all the time, but they aren’t blocking things
>> > as much as the subops are. Thoughts?
>
>
>
>
>> > ________________________________
>
>> > From: Steve Taylor
>> > Sent: Tuesday, February 7, 2017 9:13 AM
>> > To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
>> > Subject: Re:  osd_snap_trim_sleep keeps locks PG during sleep?
>
>
>
>> > Sorry, I lost the previous thread on this. I apologize for the resulting
>> > incomplete reply.
>
>
>
>> > The issue that we’re having with Jewel, as David Turner mentioned, is that
>> > we can’t seem to throttle snap trimming sufficiently to prevent it from
>> > blocking I/O requests. On further investigation, I encountered
>> > osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
>> > used in conjunction with ‘osd_op_queue = wpq’ to govern the availability of
>> > queue positions for various operations based on their costs. I’m
>> > testing with RBDs using 4MB objects, so in order to leave plenty of room in
>> > the weighted priority queue for client I/O, I set
>> > osd_op_pq_max_tokens_per_priority to 64MB and osd_snap_trim_cost to 32MB+1.
>> > I figured this should essentially reserve 32MB in the queue for client I/O
>> > operations, which are prioritized higher and therefore shouldn’t get
>> > blocked.
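>
>> > In ceph.conf terms that test looks roughly like this (values as described
>> > above: 64MB of tokens and a snap trim cost of 32MB+1):
>
>> >   osd op queue = wpq
>> >   osd op pq max tokens per priority = 67108864
>> >   osd snap trim cost = 33554433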
>
>
>
>> > I still see blocked I/O requests, and when I dump in-flight ops, they show
>> > ‘op must wait for map.’ I assume this means that what’s blocking the I/O
>> > requests at this point is all of the osdmap updates caused by snap trimming,
>> > and not the actual snap trimming itself starving the ops of op threads.
>> > Hammer is able to mitigate this with osd_snap_trim_sleep by directly
>> > throttling snap trimming and therefore causing less frequent osdmap updates,
>> > but there doesn’t seem to be a good way to accomplish the same thing with
>> > Jewel.
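>
>> > For reference, that in-flight dump comes from the OSD admin socket on the
>> > node hosting the OSD, e.g.:
>
>> >   ceph daemon osd.<id> dump_ops_in_flight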
>
>
>
>> > First of all, am I understanding these settings correctly? If so, are there
>> > other settings that could potentially help here, or do we just need
>> > something like Sam already mentioned that can sort of reserve threads for
>> > client I/O requests? Even then it seems like we might have issues if we
>> > can’t also throttle snap trimming. We delete a LOT of RBD snapshots on a
>> > daily basis, which we recognize is an extreme use case. Just wondering if
>> > there’s something else to try or if we need to start working toward
>> > implementing something new ourselves to handle our use case better.
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



