Hey all,

Resurrecting this thread because I just wanted to let you know that Sam's
initial work in master has been backported to Jewel and will be in the next
(10.2.8, I think?) release: https://github.com/ceph/ceph/pull/14492/

Once upgraded, it will be safe to use the "osd snap trim sleep" option again.
It also adds a new "osd max trimming pgs" option (default 2) that limits the
number of PGs each primary will simultaneously trim on, and adds "snaptrim"
and "snaptrim_wait" to the list of reported PG states. :)

(For those of you running Kraken, its backport hasn't merged yet but is at
https://github.com/ceph/ceph/pull/14597)
-Greg
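
For anyone wanting to try this after upgrading, here is a minimal sketch of
what re-enabling the throttle might look like. The option names are the ones
mentioned above; the sleep value is purely illustrative and needs tuning for
your hardware and snapshot churn:

    # ceph.conf on the OSD hosts (illustrative values only)
    [osd]
    osd snap trim sleep = 0.1    # seconds; illustrative, tune per cluster
    osd max trimming pgs = 2     # the default; PGs trimming at once per primary

    # or injected at runtime:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_max_trimming_pgs 2'

    # PGs actively trimming or queued to trim should then show up in the
    # "snaptrim" / "snaptrim_wait" states, e.g. in the output of:
    ceph pg stat
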
On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Yep sure, will try and present some figures at tomorrow's meeting again.
>
> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> Sent: 21 February 2017 18:14
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
> non-blocking implementation) to that branch. Care to take it for a whirl?
>
> -Sam
>
> On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Building now
>
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> Sent: 09 February 2017 19:22
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> master) passed a rados suite. It adds a configurable limit to the number of
> PGs which can be trimming on any OSD (default: 2). PGs trimming will be in
> the snaptrim state, and PGs waiting to trim will be in the snaptrim_wait
> state. I suspect this'll be adequate to throttle the amount of trimming. If
> not, I can try to add an explicit limit to the rate at which the work items
> trickle into the queue. Can someone test this branch? Tester beware: this
> has not merged into master yet and should only be run on a disposable
> cluster.
>
> -Sam
>
> On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Yeah, it's probably just the fact that they have more PGs, so they will
> hold more data and thus serve more IO. As they have a fixed IO limit, they
> will always hit the limit first and become the bottleneck.
>
> The main problem with reducing the filestore queue is that I believe you
> will start to lose the benefit of having IOs queued up on the disk, so that
> the scheduler can re-arrange them and action them in the most efficient
> manner as the disk head moves across the platters. You might see up to a
> 20% hit on performance, in exchange for more consistent client latency.
>
> From: Steve Taylor [mailto:steve.taylor@xxxxxxxxxxxxxxxx]
> Sent: 07 February 2017 20:35
> To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Thanks, Nick.
>
> One other data point that has come up is that nearly all of the blocked
> requests that are waiting on subops are waiting for OSDs with more PGs than
> the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> individually have 33% more PGs than the others and are causing almost all
> of the blocked requests. It appears that map updates are generally not
> blocking long enough to show up as blocked requests.
>
> I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> I'll test some more when the PG counts per OSD are more balanced and see
> what I get. I'll also play with the filestore queue. I was telling some of
> my colleagues yesterday that this looked likely to be related to buffer
> bloat somewhere. I appreciate the suggestion.
>
> ________________________________
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
>
> ________________________________
>
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Tuesday, February 7, 2017 10:25 AM
> To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Hi Steve,
>
> From what I understand, the issue is not with the queueing in Ceph, which
> is correctly moving client IO to the front of the queue. The problem lies
> below what Ceph controls, i.e. the scheduler and disk layer in Linux. Once
> the IOs leave Ceph it's a bit of a free-for-all, and the client IOs tend to
> get lost in large disk queues surrounded by all the snap trim IOs.
>
> The workaround Sam is working on will limit the number of snap trims that
> are allowed to run, which I believe will have a similar effect to the sleep
> parameters in pre-Jewel clusters, but without pausing the whole IO thread.
>
> Ultimately the solution requires Ceph to be able to control the queuing of
> IOs at the lower levels of the kernel. Whether this is via some sort of
> tagging per IO (currently CFQ is only per thread/process) or some other
> method, I don't know. I was speaking to Sage and he thinks the easiest
> method might be to shrink the filestore queue so that you don't get buffer
> bloat at the disk level. You should be able to test this out pretty easily
> now by changing the parameter; a queue of around 5-10 would probably be
> about right for spinning disks. It's a trade-off of peak throughput vs
> queue latency though.
>
> Nick
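
(The "filestore queue" knob being discussed above is presumably
filestore_queue_max_ops. A minimal sketch of the kind of change being
suggested, with the value purely illustrative and taken from the 5-10 figure
above:

    # ceph.conf, [osd] section; illustrative only, and assumes the parameter
    # in question is filestore_queue_max_ops. Expect the trade-off described
    # above: lower on-disk queue latency at the cost of peak throughput.
    filestore queue max ops = 10

    # or injected at runtime (an OSD restart may be needed to fully apply):
    ceph tell osd.* injectargs '--filestore_queue_max_ops 10'
)
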
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Steve Taylor
> Sent: 07 February 2017 17:01
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> As I look at more of these stuck ops, it looks like more of them are
> actually waiting on subops than on osdmap updates, so maybe there is still
> some headway to be made with the weighted priority queue settings. I do see
> OSDs waiting for map updates all the time, but they aren't blocking things
> as much as the subops are. Thoughts?
>
> ________________________________
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
>
> ________________________________
>
> From: Steve Taylor
> Sent: Tuesday, February 7, 2017 9:13 AM
> To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
>
> Sorry, I lost the previous thread on this. I apologize for the resulting
> incomplete reply.
>
> The issue that we're having with Jewel, as David Turner mentioned, is that
> we can't seem to throttle snap trimming sufficiently to prevent it from
> blocking I/O requests. On further investigation, I encountered
> osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
> used in conjunction with 'osd_op_queue = wpq' to govern the availability of
> queue positions for various operations using costs. I'm testing with RBDs
> using 4MB objects, so in order to leave plenty of room in the weighted
> priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to
> 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> reserve 32MB in the queue for client I/O operations, which are prioritized
> higher and therefore shouldn't get blocked.
>
> I still see blocked I/O requests, and when I dump in-flight ops, they show
> 'op must wait for map.' I assume this means that what's blocking the I/O
> requests at this point is all of the osdmap updates caused by snap
> trimming, and not the actual snap trimming itself starving the ops of op
> threads. Hammer is able to mitigate this with osd_snap_trim_sleep by
> directly throttling snap trimming and therefore causing less frequent
> osdmap updates, but there doesn't seem to be a good way to accomplish the
> same thing with Jewel.
>
> First of all, am I understanding these settings correctly? If so, are there
> other settings that could potentially help here, or do we just need
> something like Sam already mentioned that can sort of reserve threads for
> client I/O requests? Even then it seems like we might have issues if we
> can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> daily basis, which we recognize is an extreme use case. Just wondering if
> there's something else to try or if we need to start working toward
> implementing something new ourselves to handle our use case better.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com