osd_snap_trim_sleep keeps locks PG during sleep?

sjust@xxxxxxxxxx (Samuel Just) · Tue, 21 Feb 2017 10:14:28 -0800

Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
non-blocking implementation) to that branch.  Care to take it for a whirl?
-Sam
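
For readers following along, here is a minimal sketch of the difference between a sleep taken while a PG lock is held and a non-blocking one that releases the lock first. It is an illustration only, not the code in the branch: the `PG` class, `trim_blocking`, `trim_nonblocking`, and `client_op_latency` are hypothetical names, and a Python `threading.Timer` stands in for whatever timer mechanism the OSD actually uses.

```python
import threading
import time


class PG:
    """Hypothetical stand-in for a placement group guarded by one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.trimmed = 0


def trim_blocking(pg, rounds, sleep_s):
    """Blocking variant: the sleep happens while the PG lock is held."""
    for _ in range(rounds):
        with pg.lock:
            pg.trimmed += 1
            time.sleep(sleep_s)  # anything else needing this PG waits here


def trim_nonblocking(pg, rounds, sleep_s, done):
    """Non-blocking variant (rounds >= 1): drop the lock, then wait on a timer."""
    with pg.lock:
        pg.trimmed += 1
    if rounds > 1:
        # The delay runs with the lock released; the next round is
        # rescheduled by the timer instead of sleeping in place.
        timer = threading.Timer(
            sleep_s, trim_nonblocking, args=(pg, rounds - 1, sleep_s, done)
        )
        timer.start()
    else:
        done.set()


def client_op_latency(pg):
    """Measure how long a client op waits to take the PG lock."""
    start = time.monotonic()
    with pg.lock:
        pass
    return time.monotonic() - start
```

In the blocking variant a client op that arrives mid-sleep waits for the remainder of the sleep; in the timer variant it only contends with the short trim step itself.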

On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick at fisk.me.uk> wrote:

> Building now
>
>
>
> *From:* ceph-users [mailto:ceph-users-bounces at lists.ceph.com] *On Behalf
> Of *Samuel Just
> *Sent:* 09 February 2017 19:22
> *To:* Nick Fisk <nick at fisk.me.uk>
> *Cc:* ceph-users at lists.ceph.com
>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep
> (based on master) passed a rados suite.  It adds a configurable limit to
> the number of pgs which can be trimming on any OSD (default: 2).  PGs
> trimming will be in snaptrim state, PGs waiting to trim will be in
> snaptrim_wait state.  I suspect this'll be adequate to throttle the amount
> of trimming.  If not, I can try to add an explicit limit to the rate at
> which the work items trickle into the queue.  Can someone test this branch?
>   Tester beware: this has not merged into master yet and should only be run
> on a disposable cluster.
>
> -Sam
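
The per-OSD limit described above can be pictured with a small sketch. This is not the branch's implementation; `SnapTrimThrottle` and its methods are hypothetical, and the "idle" state is invented here purely for PGs that are neither trimming nor waiting. Only the default limit of 2 and the snaptrim / snaptrim_wait state names come from the message.

```python
from collections import deque


class SnapTrimThrottle:
    """Hypothetical per-OSD limiter on how many PGs may trim at once."""

    def __init__(self, max_trimming=2):
        self.max_trimming = max_trimming  # default of 2, as in the message
        self.trimming = set()             # PGs in "snaptrim"
        self.waiting = deque()            # PGs in "snaptrim_wait", FIFO

    def request(self, pg):
        """A PG asks to start trimming; it runs now or joins the wait queue."""
        if pg in self.trimming or pg in self.waiting:
            return
        if len(self.trimming) < self.max_trimming:
            self.trimming.add(pg)
        else:
            self.waiting.append(pg)

    def finish(self, pg):
        """A PG is done trimming; promote the oldest waiter, if any."""
        self.trimming.discard(pg)
        if self.waiting and len(self.trimming) < self.max_trimming:
            self.trimming.add(self.waiting.popleft())

    def state(self, pg):
        if pg in self.trimming:
            return "snaptrim"
        if pg in self.waiting:
            return "snaptrim_wait"
        return "idle"  # placeholder state, not a Ceph PG state
```

With three PGs requesting a trim, the first two enter snaptrim and the third sits in snaptrim_wait until one of the others finishes.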
>
>
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick at fisk.me.uk> wrote:
>
> Yeah, it's probably just the fact that they have more PGs, so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
>
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IOs queued up on the disk, so
> that the scheduler can re-arrange them to action them in the most efficient
> manner as the disk head moves across the platters. You might possibly see up
> to a 20% hit on performance, in exchange for more consistent client
> latency.
>
>
>
> *From:* Steve Taylor [mailto:steve.taylor at storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* nick at fisk.me.uk; ceph-users at lists.ceph.com
>
>
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Thanks, Nick.
>
>
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that map updates are generally not
> blocking long enough to show up as blocked requests.
>
>
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I'll test some more when the PG counts per OSD are more balanced and see
> what I get. I'll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
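
The 33% figure and the 0.75 reweight can be checked with a few lines of arithmetic. This is a rough sketch under a stated assumption: the expected PG count of an OSD is treated as proportional to its capacity multiplied by its reweight value, which only approximates how placement really behaves. `relative_pg_share` is a hypothetical helper, not a Ceph function.

```python
def relative_pg_share(size_tb, reweight=1.0, baseline_tb=3.0):
    """Expected PG load of an OSD relative to a baseline 3TB OSD.

    Assumption: PG count scales with capacity * reweight.
    """
    return size_tb * reweight / baseline_tb


# A 4TB OSD at full weight carries about a third more PGs than a 3TB one.
extra_before = relative_pg_share(4.0) - 1.0

# With reweight 0.75 the 4TB OSD is expected to match a 3TB OSD.
share_after = relative_pg_share(4.0, reweight=0.75)
```

Here `extra_before` comes out at roughly 0.33 and `share_after` at 1.0, which is why 0.75 is the natural reweight for evening out the per-OSD PG counts in this cluster.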
>
>
> ------------------------------
>
>
> *Steve* *Taylor* | Senior Software Engineer | StorageCraft Technology
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office:* 801.871.2799
> ------------------------------
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
> ------------------------------
>
> *From:* Nick Fisk [mailto:nick at fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor <steve.taylor at storagecraft.com>;
> ceph-users at lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Hi Steve,
>
>
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once
> the IOs leave Ceph it's a bit of a free-for-all and the client IOs tend
> to get lost in large disk queues surrounded by all the snap trim IOs.
>
>
>
> The workaround Sam is working on will limit the number of snap trims that
> are allowed to run, which I believe will have a similar effect to the sleep
> parameters in pre-Jewel clusters, but without pausing the whole IO thread.
>
>
>
> Ultimately the solution requires Ceph to be able to control the queuing of
> IOs at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don't know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don't get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter; a queue of around 5-10 would probably be
> about right for spinning disks. It's a trade-off of peak throughput vs
> queue latency though.
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-bounces at lists.ceph.com
> <ceph-users-bounces at lists.ceph.com>] *On Behalf Of *Steve Taylor
> *Sent:* 07 February 2017 17:01
> *To:* ceph-users at lists.ceph.com
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> As I look at more of these stuck ops, it looks like more of them are
> actually waiting on subops than on osdmap updates, so maybe there is still
> some headway to be made with the weighted priority queue settings. I do see
> OSDs waiting for map updates all the time, but they aren't blocking things
> as much as the subops are. Thoughts?
>
>
> *From:* Steve Taylor
> *Sent:* Tuesday, February 7, 2017 9:13 AM
> *To:* 'ceph-users at lists.ceph.com' <ceph-users at lists.ceph.com>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during
> sleep?
>
>
>
> Sorry, I lost the previous thread on this. I apologize for the resulting
> incomplete reply.
>
>
>
> The issue that we're having with Jewel, as David Turner mentioned, is that
> we can't seem to throttle snap trimming sufficiently to prevent it from
> blocking I/O requests. On further investigation, I encountered
> osd_op_pq_max_tokens_per_priority, which should be able to be used in
> conjunction with "osd_op_queue = wpq" to govern the availability of queue
> positions for various operations using costs if I understand correctly. I'm
> testing with RBDs using 4MB objects, so in order to leave plenty of room in
> the weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority
> to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> reserve 32MB in the queue for client I/O operations, which are prioritized
> higher and therefore shouldn't get blocked.
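
The reasoning behind those two values can be written out as plain arithmetic. This sketch only mirrors the paragraph's own model, in which snap trims and client ops draw on one shared token budget and a client op costs as much as its 4MB object; whether the OSD queue really accounts for tokens that way is exactly the open question in this message. MB is taken as 1024 * 1024 bytes, and `ops_that_fit` is a hypothetical helper.

```python
MB = 1024 * 1024  # assumption: binary megabytes

max_tokens = 64 * MB          # osd_op_pq_max_tokens_per_priority, as set above
snap_trim_cost = 32 * MB + 1  # osd_snap_trim_cost, as set above
client_op_cost = 4 * MB       # assumption: cost equals the 4MB object size


def ops_that_fit(budget, cost):
    """How many ops of a given cost fit in a token budget at once."""
    return budget // cost


# The "+1" means two snap trims can never fit in the budget together.
concurrent_snap_trims = ops_that_fit(max_tokens, snap_trim_cost)

# Tokens left for client I/O while one snap trim is queued.
remaining = max_tokens - concurrent_snap_trims * snap_trim_cost
client_ops = ops_that_fit(remaining, client_op_cost)
```

Under this model one snap trim fits at a time, leaving 32MB minus one byte for client I/O, which is room for seven 4MB ops rather than eight.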
>
>
>
> I still see blocked I/O requests, and when I dump in-flight ops, they show
> "op must wait for map." I assume this means that what's blocking the I/O
> requests at this point is all of the osdmap updates caused by snap
> trimming, and not the actual snap trimming itself starving the ops of op
> threads. Hammer is able to mitigate this with osd_snap_trim_sleep by
> directly throttling snap trimming and therefore causing less frequent
> osdmap updates, but there doesn't seem to be a good way to accomplish the
> same thing with Jewel.
>
>
>
> First of all, am I understanding these settings correctly? If so, are
> there other settings that could potentially help here, or do we just need
> something like Sam already mentioned that can sort of reserve threads for
> client I/O requests? Even then it seems like we might have issues if we
> can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> daily basis, which we recognize is an extreme use case. Just wondering if
> there's something else to try or if we need to start working toward
> implementing something new ourselves to handle our use case better.
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>