Hi Greg,

Thanks a lot for your work on this one. It really helps us right now.

Would it be easy to add the snaptrim speed to the ceph -s output, like "snaptrim io 144 MB/s, 721 objects/s" (or just objects/s if sizes are unknown)? It would help to see how the snaptrim speed changes along with the snap trimming options.

When a snapshot is removed, all primary OSDs seem to start trimming at the same time. Can we avoid this, or at least limit their number?

Best regards,
Frédéric Nass.

----- On Apr 26, 2017, at 20:24, Gregory Farnum gfarnum@xxxxxxxxxx wrote:

> Hey all,
>
> Resurrecting this thread because I just wanted to let you know that
> Sam's initial work in master has been backported to Jewel and will be
> in the next (10.2.8, I think?) release:
> https://github.com/ceph/ceph/pull/14492/
>
> Once upgraded, it will be safe to use the "osd snap trim sleep" option
> again. It also adds a new "osd max trimming pgs" option (default 2) that
> limits the number of PGs each primary will trim simultaneously, and adds
> "snaptrim" and "snaptrim_wait" to the list of reported PG states. :)
>
> (For those of you running Kraken, its backport hasn't merged yet but
> is at https://github.com/ceph/ceph/pull/14597)
> -Greg
>
> On Tue, Feb 21, 2017 at 3:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Yep, sure. Will try and present some figures at tomorrow's meeting again.
> >
> > From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> > Sent: 21 February 2017 18:14
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Ok, I've added explicit support for osd_snap_trim_sleep (same param, new
> > non-blocking implementation) to that branch. Care to take it for a whirl?
> > -Sam
> >
> > On Thu, Feb 9, 2017 at 11:36 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Building now.
> >
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> > Sent: 09 February 2017 19:22
> > To: Nick Fisk <nick@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Ok, https://github.com/athanatos/ceph/tree/wip-snap-trim-sleep (based on
> > master) passed a rados suite. It adds a configurable limit to the number of
> > PGs which can be trimming on any OSD (default: 2). PGs that are trimming
> > will be in the snaptrim state, and PGs waiting to trim will be in the
> > snaptrim_wait state. I suspect this will be adequate to throttle the amount
> > of trimming. If not, I can try to add an explicit limit to the rate at which
> > the work items trickle into the queue. Can someone test this branch? Tester
> > beware: this has not merged into master yet and should only be run on a
> > disposable cluster.
> > -Sam
> >
> > On Tue, Feb 7, 2017 at 1:16 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Yeah, it's probably just the fact that they have more PGs, so they hold more
> > data and thus serve more IO. As they have a fixed IO limit, they will always
> > hit that limit first and become the bottleneck.
> >
> > The main problem with reducing the filestore queue is that I believe you
> > will start to lose the benefit of having IOs queued up on the disk, so that
> > the scheduler can rearrange them and action them in the most efficient
> > manner as the disk head moves across the platters. You might possibly see up
> > to a 20% hit on performance, in exchange for more consistent client latency.
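For anyone who wants to try this, here is a minimal sketch of the settings discussed above (assuming Jewel 10.2.8+ with the backported non-blocking implementation; the values are purely illustrative, not recommendations):

    # ceph.conf, [osd] section
    # Seconds to sleep between snap trim work items (safe to use again post-backport)
    osd snap trim sleep = 0.1
    # Maximum number of PGs a primary OSD will trim at the same time (the new default is 2)
    osd max trimming pgs = 2

    # Or inject at runtime without restarting the OSDs:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1 --osd_max_trimming_pgs 2'

    # Until something like a "snaptrim io" line exists in ceph -s, a rough way to
    # watch progress is to count PGs in the snaptrim / snaptrim_wait states:
    ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

Note that the simple grep counts both actively trimming and waiting PGs.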
> > From: Steve Taylor [mailto:steve.taylor@xxxxxxxxxxxxxxxx]
> > Sent: 07 February 2017 20:35
> > To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Thanks, Nick.
> >
> > One other data point that has come up is that nearly all of the blocked
> > requests that are waiting on subops are waiting for OSDs with more PGs than
> > the others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB
> > OSDs. The cluster is well balanced based on OSD capacity, so those 7 OSDs
> > individually have 33% more PGs than the others and are causing almost all of
> > the blocked requests. It appears that map updates are generally not blocking
> > long enough to show up as blocked requests.
> >
> > I set the reweight on those 7 OSDs to 0.75 and things are backfilling now.
> > I'll test some more when the PG counts per OSD are more balanced and see
> > what I get. I'll also play with the filestore queue. I was telling some of
> > my colleagues yesterday that this looked likely to be related to buffer
> > bloat somewhere. I appreciate the suggestion.
> >
> > Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> >
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Tuesday, February 7, 2017 10:25 AM
> > To: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Hi Steve,
> >
> > From what I understand, the issue is not with the queueing in Ceph, which is
> > correctly moving client IO to the front of the queue. The problem lies below
> > what Ceph controls, i.e. the scheduler and disk layer in Linux. Once the IOs
> > leave Ceph it's a bit of a free-for-all, and the client IOs tend to get lost
> > in large disk queues surrounded by all the snap trim IOs.
> >
> > The workaround Sam is working on will limit the number of snap trims that
> > are allowed to run, which I believe will have a similar effect to the sleep
> > parameters in pre-Jewel clusters, but without pausing the whole IO thread.
> >
> > Ultimately the solution requires Ceph to be able to control the queuing of
> > IOs at the lower levels of the kernel. Whether this is via some sort of
> > tagging per IO (currently CFQ is only per thread/process) or some other
> > method, I don't know. I was speaking to Sage, and he thinks the easiest
> > method might be to shrink the filestore queue so that you don't get buffer
> > bloat at the disk level. You should be able to test this out pretty easily
> > now by changing the parameter; a queue depth of around 5-10 would probably
> > be about right for spinning disks. It's a trade-off of peak throughput vs
> > queue latency, though.
> >
> > Nick
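A sketch of the two knobs discussed in the exchange above. I'm assuming the filestore queue parameter Nick means is filestore_queue_max_ops, and the OSD id and values below are only examples:

    # Shrink the filestore queue so fewer IOs pile up at the disk
    # (Nick's suggestion of roughly 5-10 outstanding ops for spinning disks):
    ceph tell osd.* injectargs '--filestore_queue_max_ops 10'

    # Equivalent persistent setting in ceph.conf, [osd] section:
    # filestore queue max ops = 10

    # Down-weight an over-full OSD so it carries fewer PGs, as Steve did with
    # the seven 4TB OSDs (osd 42 is a made-up example):
    ceph osd reweight 42 0.75

Reweighting triggers backfill, so expect extra recovery traffic while the PGs move.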
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Steve Taylor
> > Sent: 07 February 2017 17:01
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > As I look at more of these stuck ops, it looks like more of them are
> > actually waiting on subops than on osdmap updates, so maybe there is still
> > some headway to be made with the weighted priority queue settings. I do see
> > OSDs waiting for map updates all the time, but they aren't blocking things
> > as much as the subops are. Thoughts?
> >
> > Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> >
> > From: Steve Taylor
> > Sent: Tuesday, February 7, 2017 9:13 AM
> > To: 'ceph-users@xxxxxxxxxxxxxx' <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: osd_snap_trim_sleep keeps locks PG during sleep?
> >
> > Sorry, I lost the previous thread on this. I apologize for the resulting
> > incomplete reply.
> >
> > The issue that we're having with Jewel, as David Turner mentioned, is that
> > we can't seem to throttle snap trimming sufficiently to prevent it from
> > blocking I/O requests. On further investigation, I encountered
> > osd_op_pq_max_tokens_per_priority, which, if I understand correctly, can be
> > used in conjunction with 'osd_op_queue = wpq' to govern the availability of
> > queue positions for various operations using costs. I'm testing with RBDs
> > using 4MB objects, so in order to leave plenty of room in the weighted
> > priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority to
> > 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially
> > reserve 32MB in the queue for client I/O operations, which are prioritized
> > higher and therefore shouldn't get blocked.
> >
> > I still see blocked I/O requests, and when I dump in-flight ops, they show
> > 'op must wait for map.' I assume this means that what's blocking the I/O
> > requests at this point is all of the osdmap updates caused by snap trimming,
> > and not the actual snap trimming itself starving the ops of op threads.
> > Hammer is able to mitigate this with osd_snap_trim_sleep by directly
> > throttling snap trimming and therefore causing less frequent osdmap updates,
> > but there doesn't seem to be a good way to accomplish the same thing with
> > Jewel.
> >
> > First of all, am I understanding these settings correctly? If so, are there
> > other settings that could potentially help here, or do we just need
> > something like Sam already mentioned that can sort of reserve threads for
> > client I/O requests? Even then, it seems like we might have issues if we
> > can't also throttle snap trimming. We delete a LOT of RBD snapshots on a
> > daily basis, which we recognize is an extreme use case. Just wondering if
> > there's something else to try or if we need to start working toward
> > implementing something new ourselves to handle our use case better.
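The wpq experiment Steve describes would look roughly like this as config (a sketch only; 64MB = 67108864 bytes, 32MB+1 = 33554433, and osd.12 below is a made-up example):

    [osd]
    # Use the weighted priority queue so op cost/priority drives ordering
    osd op queue = wpq
    # 64 MB of tokens per priority level
    osd op pq max tokens per priority = 67108864
    # 32 MB + 1, so a snap trim item takes just over half of the tokens,
    # leaving roughly 32 MB of queue room for higher-priority client I/O
    osd snap trim cost = 33554433

    # Inspect what blocked requests are actually waiting on (run on the OSD host):
    ceph daemon osd.12 dump_ops_in_flight

In that dump, 'op must wait for map' entries point at osdmap churn, while ops waiting on subops point back at the slow peer OSDs discussed earlier in the thread.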
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com