Hi Josh,

> On Sep 16, 2017, at 3:13 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> (Sorry for top posting, this email client isn't great at editing)

Thanks for taking the time to respond. :)

> The mitigation strategy I mentioned before of forcing backfill could be
> backported to jewel, but I don't think it's a very good option for RBD
> users without SSDs.

Interestingly enough, we don’t see this problem on our pure SSD pool.

> In luminous there is a command (something like 'ceph pg force-recovery')
> that you can use to prioritize recovery of particular PGs (and thus rbd
> images with some scripting). This would at least let you limit the scope
> of affected images. A couple folks from OVH added it for just this purpose.

Uhm. I haven’t measured, but my impression is that for us it’s all over the map anyway. I don’t think we’d have many PGs that hold objects of only specific rbd images … why would that happen anyway? (I’ve put a rough sketch of how I’d script the image-to-PG mapping at the end of this mail.)

> Neither of these is an ideal workaround, but I haven't thought of a better
> one for existing versions.

I’ll discuss more strategies with Florian today. However, a few questions arise:

a) Do you have any idea whether certain settings (recovery/backfill limits, network/disk/CPU saturation, Ceph version) may be contributing in a way that makes this hurt us more than others?

I’m also surprised that a prioritized recovery causes 30-60 seconds of delay for a single IOP. I understand degraded throughput and latency during recovery, but what gets me are those extremely blocked individual operations.

After reviewing others’ settings, including last year’s CERN recommendations, we set the following “interesting” options. Did we maybe unintentionally hit a combination that worsens this behaviour? Could the “backfill scan” and “max chunk” options make it worse?

fd cache size = 2048
filestore max sync interval = 60    # fsync files every 60s
filestore op threads = 8            # more threads where needed
filestore queue max ops = 100       # allow more queued ops
filestore fiemap = true
osd backfill scan max = 128
osd backfill scan min = 32
osd max backfills = 5
osd recovery max active = 3
osd recovery max single start = 1
osd recovery op priority = 1
osd recovery threads = 1
osd recovery max chunk = 1048576
osd disk threads = 1
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 0
osd snap trim sleep = 0.5           # throttle some long-lived OSD ops
osd op threads = 4                  # more threads where needed

The full OSD config is here (for a week from now on): http://dpaste.com/35ABA0N

b) I just upgraded to hammer 0.94.10 (plus the segfault fix) in our development environment and _may_ have seen an improvement on this. Could this be http://tracker.ceph.com/issues/16128 ?

c) Are all Ceph users just silently happy with this, and we’re the only ones it makes uneasy? Or are we the only ones hit by it? (Well, I guess others are; Alibaba seems to have been working on async + partial recovery, too.)

d) With size=3/min_size=2 we like to perform _quick_ maintenance operations (i.e. a simple host reboot) without evacuating the host. However, with recovery having such a high impact, I’m now considering doing just that. Is everyone else just doing host evacuations all the time?

We’ve become edgy about this behaviour because we had a couple of weeks where we were fighting a CPU bug that caused host reboots every few days, triggering availability issues for our customers’ services due to slow requests.
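For reference, here is roughly how I’d imagine scripting the image-to-PG mapping once we’re on luminous. This is an untested sketch; “rbd.hdd” and “important-vm-image” are just placeholder names, and it assumes a format 2 image (rbd_data.* object prefix). Please correct me if the object naming or the force-recovery invocation is off:

#!/usr/bin/env python3
# Rough, untested sketch: find the PGs that hold a given RBD image's objects
# and ask luminous to prioritize their recovery via 'ceph pg force-recovery'.
import json
import subprocess


def rbd_info(pool, image):
    out = subprocess.check_output(
        ["rbd", "info", "--format", "json", "{}/{}".format(pool, image)])
    return json.loads(out)


def pgs_for_image(pool, image):
    info = rbd_info(pool, image)
    prefix = info["block_name_prefix"]   # e.g. "rbd_data.101a6b8b4567"
    num_objects = info["objects"]        # image size / object size
    pgids = set()
    for i in range(num_objects):
        # Format 2 object names are "<prefix>.<16 hex digit index>". The PG
        # mapping is computed from the name alone, so this should also work
        # for objects that were never written (thin provisioning).
        objname = "{}.{:016x}".format(prefix, i)
        out = subprocess.check_output(
            ["ceph", "osd", "map", pool, objname, "--format", "json"])
        pgids.add(json.loads(out)["pgid"])
    return sorted(pgids)


def force_recovery(pgids):
    # Luminous-only command; on our hammer cluster this does not exist yet.
    subprocess.check_call(["ceph", "pg", "force-recovery"] + pgids)


if __name__ == "__main__":
    pgs = pgs_for_image("rbd.hdd", "important-vm-image")  # placeholder names
    print("forcing recovery of {} PGs".format(len(pgs)))
    force_recovery(pgs)

If there is a smarter way to get from an image to its PGs than walking every object name through 'ceph osd map', I’d be glad to hear it.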
Kind regards,
Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick