On Mon, Sep 18, 2017 at 8:48 AM, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> Hi Josh,
>
>> On Sep 16, 2017, at 3:13 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>
>> (Sorry for top posting, this email client isn't great at editing)
>
> Thanks for taking the time to respond. :)
>
>> The mitigation strategy I mentioned before of forcing backfill could be
>> backported to jewel, but I don't think it's a very good option for RBD
>> users without SSDs.
>
> Interestingly enough, we don’t see this problem on our pure SSD pool.

I think it's been established before that for those at liberty to clobber
the problem with hardware, it's unlikely to be much of a hassle. The
problem is that for most cloud operators, throwing SSD/NVMe hardware at
*everything* is usually not a cost-effective option.

>> In luminous there is a command (something like 'ceph pg force-recovery')
>> that you can use to prioritize recovery of particular PGs (and thus rbd
>> images with some scripting). This would at least let you limit the scope
>> of affected images. A couple folks from OVH added it for just this
>> purpose.
>
> Uhm. I haven’t measured, but my impression is that for us it’s all over
> the map anyway. I don’t think we’d have many PGs that have objects of
> only specific rbd images … why would that happen anyway?

(A rough sketch of how that per-image force-recovery could be scripted is
at the bottom of this mail.)

>> Neither of these is an ideal workaround, but I haven't thought of a
>> better one for existing versions.
>
> I’ll discuss more strategies with Florian today, however, a few
> questions arise:
>
> a) Do you have any ideas whether certain settings (recovery / backfill
> limits, network / disk / cpu saturation, ceph version) may be
> contributing in a way that this seems to hurt us more than others?

For Josh's and others' benefit, I think you might want to share how many
nodes you operate, as that is quite relevant to the discussion. Generally,
the larger the *percentage* of OSDs recovering simultaneously, the more
likely recovery is to actually cause a problem. If you have, say, 100 OSD
nodes with 10 OSDs each, then only 1% of your 1,000 OSDs are affected by
the reboot of a node, and the slow-request problem is unlikely to be
terribly disruptive. But of course the issue remains relevant even in
significantly larger clusters, since those typically use CRUSH rulesets
that define racks, aisles, rooms etc. as failure domains. And while it's
great that the simultaneous failure of all nodes in a rack causes neither
data loss nor downtime while the failure is active, it's rather
problematic for it to bring VMs to a crawl after the failure has been
resolved.

> I’m also surprised that a prioritized recovery causes 30-60 seconds of
> delay for a single IOP. I mean, I understand degraded throughput and
> latency during recovery, but what gets me are those extremely blocked
> individual operations.

If I read Josh correctly, that would simply be a result of "everything"
having moved to the front of the queue. It's like having one priority lane
at airport security, and then giving everyone frequent flyer status.

> After we reviewed others’ settings, incl. last year’s CERN
> recommendations, we’ve set the following “interesting” options. Did we
> maybe unintentionally hit a combination that worsens this behaviour?
> Could the “backfill scan” and “max chunk” options make this worse?
>
> fd cache size = 2048
> filestore max sync interval = 60 # fsync files every 60s
> filestore op threads = 8 # more threads where needed
> filestore queue max ops = 100 # allow more queued ops
> filestore fiemap = true
> osd backfill scan max = 128
> osd backfill scan min = 32
> osd max backfills = 5
> osd recovery max active = 3
> osd recovery max single start = 1
> osd recovery op priority = 1
> osd recovery threads = 1
> osd recovery max chunk = 1048576
> osd disk threads = 1
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 0
> osd snap trim sleep = 0.5 # throttle some long lived OSD ops
> osd op threads = 4 # more threads where needed
>
> The full OSD config is here (for a week from now on):
> http://dpaste.com/35ABA0N
>
> b) I just upgraded to hammer 0.94.10 (+ segfault fix) in our development
> environment and _may_ have seen an improvement on this. Could this be
>
> http://tracker.ceph.com/issues/16128
>
> c) Are all Ceph users just silently happy with this, and are we the only
> ones who feel uneasy about it? Or are we the only ones hit by this?
> (Well, I guess others are. Alibaba seems to have been working on async +
> partial recovery, too.)
>
> d) With size=3/min_size=2 we like to perform _quick_ maintenance
> operations (i.e. a simple host reboot) without evacuating the host.
> However, with recovery having such a high impact, I’m now considering
> doing just that. Is everyone else just doing host evacuations all the
> time?
>
> We’ve become edgy about this behavior, as we had a couple of weeks where
> we were fighting against a CPU bug that caused host reboots every few
> days, triggering availability issues for our customer services due to
> slow requests.
>
> Kind regards,
> Christian
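
To come back to Josh's 'ceph pg force-recovery' suggestion: once you're on
luminous, something along the following lines could be scripted to bump the
PGs that currently hold a given image's objects. This is an untested sketch,
not a recipe — the pool name "vms" and image name "vm-disk-1" are made up,
and enumerating objects with "rados ls" walks the whole pool, so this is
only meant to illustrate the idea.

#!/bin/bash
# Untested sketch: prioritize recovery of all PGs holding objects of a
# single RBD image. Requires luminous or later for "ceph pg force-recovery".
# POOL and IMAGE are placeholder names -- adjust for your environment.
set -eu

POOL=vms
IMAGE=vm-disk-1

# All data objects of an RBD image share its block_name_prefix,
# e.g. "rbd_data.1234567890ab".
PREFIX=$(rbd info "$POOL/$IMAGE" | awk '/block_name_prefix/ {print $2}')

# Map each object to its PG, de-duplicate, and force-recover those PGs.
# Note: "rados ls" lists the whole pool, which can take a while; the JSON
# parsing assumes "ceph osd map" reports the mapped PG as "pgid".
rados -p "$POOL" ls \
    | grep "^$PREFIX" \
    | while read -r OBJ; do
        ceph osd map "$POOL" "$OBJ" --format json \
            | python -c 'import json,sys; print(json.load(sys.stdin)["pgid"])'
      done \
    | sort -u \
    | while read -r PG; do
        echo "force-recovering PG $PG"
        ceph pg force-recovery "$PG"
      done

As you say, a non-trivial image's objects are spread pseudo-randomly over
most PGs in the pool, so for large images this converges on "force-recover
everything"; it's probably only worthwhile for a handful of latency-critical
images.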