Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
non-blocking implementation) to that branch. Care to take it for a whirl?
-Sam

On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick at fisk.me.uk> wrote:
> Building now
>
> *From:* ceph-users [mailto:ceph-users-bounces at lists.ceph.com] *On Behalf Of *Samuel Just
> *Sent:* 09 February 2017 19:22
> *To:* Nick Fisk <nick at fisk.me.uk>
> *Cc:* ceph-users at lists.ceph.com
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> master) passed a rados suite. It adds a configurable limit to the number
> of PGs that can be trimming on any OSD (default: 2). PGs that are trimming
> will be in the snaptrim state, and PGs waiting to trim will be in the
> snaptrim_wait state. I suspect this will be adequate to throttle the
> amount of trimming. If not, I can try to add an explicit limit to the rate
> at which the work items trickle into the queue. Can someone test this
> branch? Tester beware: this has not been merged into master yet and should
> only be run on a disposable cluster.
>
> -Sam
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick at fisk.me.uk> wrote:
>
> Yeah, it's probably just the fact that they have more PGs, so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IOs queued up on the disk, so
> that the scheduler can rearrange them and action them in the most
> efficient manner as the disk head moves across the platters. You might
> possibly see up to a 20% hit on performance, in exchange for more
> consistent client latency.
>
> *From:* Steve Taylor [mailto:steve.taylor at storagecraft.com]
> *Sent:* 07 February 2017 20:35
> *To:* nick at fisk.me.uk; ceph-users at lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Thanks, Nick.
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs
> than the others. My test cluster has 184 OSDs, 177 of which are 3TB, plus
> 7 4TB OSDs. The cluster is well balanced based on OSD capacity, so those 7
> OSDs individually have 33% more PGs than the others and are causing almost
> all of the blocked requests. It appears that map updates are generally not
> blocking long enough to show up as blocked requests.
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I'll test some more when the PG counts per OSD are more balanced and see
> what I get. I'll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
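>
> For anyone following along, the reweight step and the filestore queue
> tweak I plan to try look roughly like this (the OSD ids and the queue
> depth of 10 are illustrative placeholders, not values I've validated):
>
>     # set the reweight override on the seven 4TB OSDs so they end up
>     # carrying roughly the same number of PGs as the 3TB OSDs
>     for id in 10 11 12 13 14 15 16; do
>         ceph osd reweight $id 0.75
>     done
>     ceph osd df tree    # watch per-OSD PG counts and backfill settle
>
>     # Nick's suggestion: shrink the filestore queue to limit buffer
>     # bloat at the disk (this should be injectable at runtime)
>     ceph tell osd.* injectargs '--filestore_queue_max_ops 10'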
> ------------------------------
>
> *Steve Taylor* | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office:* 801.871.2799
>
> ------------------------------
>
> *From:* Nick Fisk [mailto:nick at fisk.me.uk]
> *Sent:* Tuesday, February 7, 2017 10:25 AM
> *To:* Steve Taylor <steve.taylor at storagecraft.com>; ceph-users at lists.ceph.com
> *Subject:* RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Hi Steve,
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once
> the IOs leave Ceph it's a bit of a free-for-all, and the client IOs tend
> to get lost in large disk queues surrounded by all the snap trim IOs.
>
> The workaround Sam is working on will limit the number of snap trims that
> are allowed to run, which I believe will have a similar effect to the
> sleep parameters in pre-Jewel clusters, but without pausing the whole IO
> thread.
>
> Ultimately the solution requires Ceph to be able to control the queueing
> of IOs at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don't know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don't get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter; a queue depth of around 5-10 would probably
> be about right for spinning disks. It's a trade-off of peak throughput vs
> queue latency, though.
>
> Nick
>
> *From:* ceph-users [mailto:ceph-users-bounces at lists.ceph.com] *On Behalf Of *Steve Taylor
> *Sent:* 07 February 2017 17:01
> *To:* ceph-users at lists.ceph.com
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> As I look at more of these stuck ops, it looks like more of them are
> actually waiting on subops than on osdmap updates, so maybe there is still
> some headway to be made with the weighted priority queue settings. I do
> see OSDs waiting for map updates all the time, but they aren't blocking
> things as much as the subops are. Thoughts?
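>
> For reference, this is roughly how I'm inspecting the stuck ops (osd.12
> is just a placeholder id; run the daemon commands on the node hosting
> that OSD):
>
>     # ops currently blocked/in flight on one OSD; each op's flag_point
>     # shows things like "waiting for subops" or "op must wait for map"
>     ceph daemon osd.12 dump_ops_in_flight
>
>     # recently completed slow ops, useful for after-the-fact analysis
>     ceph daemon osd.12 dump_historic_ops
>
>     # cluster-wide summary of requests blocked longer than 32 sec
>     ceph health detail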
> ------------------------------
>
> *Steve Taylor* | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> *Office:* 801.871.2799
>
> ------------------------------
>
> *From:* Steve Taylor
> *Sent:* Tuesday, February 7, 2017 9:13 AM
> *To:* 'ceph-users at lists.ceph.com' <ceph-users at lists.ceph.com>
> *Subject:* Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?
>
> Sorry, I lost the previous thread on this. I apologize for the resulting
> incomplete reply.
>
> The issue we're having with Jewel, as David Turner mentioned, is that we
> can't seem to throttle snap trimming sufficiently to prevent it from
> blocking I/O requests. On further investigation, I encountered
> osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can
> be used in conjunction with "osd_op_queue = wpq" to govern the
> availability of queue positions for various operations using costs. I'm
> testing with RBDs using 4MB objects, so in order to leave plenty of room
> in the weighted priority queue for client I/O, I set
> osd_op_pq_max_tokens_per_priority to 64MB and osd_snap_trim_cost to
> 32MB+1. I figured this should essentially reserve 32MB in the queue for
> client I/O operations, which are prioritized higher and therefore
> shouldn't get blocked.
>
> I still see blocked I/O requests, and when I dump in-flight ops, they
> show "op must wait for map." I assume this means that what's blocking the
> I/O requests at this point is all of the osdmap updates caused by snap
> trimming, and not the actual snap trimming itself starving the ops of op
> threads. Hammer is able to mitigate this with osd_snap_trim_sleep by
> directly throttling snap trimming and therefore causing less frequent
> osdmap updates, but there doesn't seem to be a good way to accomplish the
> same thing with Jewel.
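>
> For clarity, the relevant ceph.conf fragment I'm testing with looks
> roughly like this (the byte values are just 64MB and 32MB+1 written out;
> treat it as a sketch of the settings described above, not a
> recommendation):
>
>     [osd]
>         osd op queue = wpq
>         # 64MB of tokens available per priority in the weighted queue
>         osd op pq max tokens per priority = 67108864
>         # just over half the token budget, intended to leave ~32MB of
>         # queue capacity for higher-priority client I/O
>         osd snap trim cost = 33554433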
> First of all, am I understanding these settings correctly? If so, are
> there other settings that could potentially help here, or do we just need
> something like Sam already mentioned that can sort of reserve threads for
> client I/O requests? Even then, it seems like we might have issues if we
> can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> daily basis, which we recognize is an extreme use case. Just wondering if
> there's something else to try or if we need to start working toward
> implementing something new ourselves to handle our use case better.
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com