Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

Hi Josh,

> On Sep 16, 2017, at 3:13 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> 
> (Sorry for top posting, this email client isn't great at editing)

Thanks for taking the time to respond. :)

> The mitigation strategy I mentioned before of forcing backfill could be backported to jewel, but I don't think it's a very good option for RBD users without SSDs.

Interestingly enough, we don’t see this problem on our pure SSD pool.

> In luminous there is a command (something like 'ceph pg force-recovery') that you can use to prioritize recovery of particular PGs (and thus rbd images with some scripting). This would at least let you limit the scope of affected images. A couple folks from OVH added it for just this purpose.

Uhm. I haven’t measured, but my impression is that for us it’s all over the map anyway. I don’t think we’d have many PGs that contain objects from only a single rbd image … why would that happen anyway?
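(For the archives: once on luminous, the per-image scripting Josh mentions might look roughly like this. This is a hedged sketch only: pool “rbd” and image “myimage” are placeholders, ‘rados ls’ is expensive on large pools, and ‘ceph pg force-recovery’ only exists from luminous on.)

```shell
# Hedged sketch: force recovery of just the PGs holding one rbd image.
# "rbd" / "myimage" are placeholder names; requires luminous for
# 'ceph pg force-recovery'. Note: 'rados ls' is slow on big pools.
POOL=rbd
IMAGE=myimage

# All data objects of an image share its block_name_prefix.
PREFIX=$(rbd info "$POOL/$IMAGE" | awk '/block_name_prefix/ {print $2}')

# Map each of the image's objects to its placement group, deduplicate,
# then ask the cluster to recover those PGs first.
rados -p "$POOL" ls | grep "^$PREFIX" |
while read -r OBJ; do
    # 'ceph osd map' prints e.g.:
    #   ... object 'x' -> pg 1.6cf8deff (1.ff) -> up ([0,1], p0) ...
    # and the parenthesized id is the actual pgid.
    ceph osd map "$POOL" "$OBJ" |
        sed -n "s/.*-> pg [0-9a-f.]* (\([0-9a-f.]*\)).*/\1/p"
done | sort -u |
while read -r PGID; do
    ceph pg force-recovery "$PGID"
done
```

That still only narrows the blast radius to selected images, of course; it doesn’t address the blocked individual operations themselves.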

> Neither of these is an ideal workaround, but I haven't thought of a better one for existing versions.

I’ll discuss more strategies with Florian today, however, a few questions arise:

a) Do you have any idea whether certain settings (recovery / backfill limits, network / disk / CPU saturation, Ceph version) may be contributing to why this seems to hurt us more than others?

   I’m also surprised that even prioritized recovery can delay a single I/O operation by 30–60 seconds. I mean, I understand degraded throughput and latency during recovery, but what gets me are those extremely blocked individual operations.

   After reviewing others’ settings, including last year’s CERN recommendations, we’ve set the following “interesting” options. Did we perhaps unintentionally hit a combination that worsens this behaviour? Could the “backfill scan” and “max chunk” options make this worse?

   fd cache size = 2048
   filestore max sync interval = 60        # fsync files every 60s
   filestore op threads = 8                # more threads where needed
   filestore queue max ops = 100           # allow more queued ops
   filestore fiemap = true
   osd backfill scan max = 128
   osd backfill scan min = 32
   osd max backfills = 5
   osd recovery max active = 3
   osd recovery max single start = 1
   osd recovery op priority = 1
   osd recovery threads = 1
   osd recovery max chunk = 1048576
   osd disk threads = 1
   osd disk thread ioprio class = idle
   osd disk thread ioprio priority = 0
   osd snap trim sleep = 0.5               # throttle some long-lived OSD ops
   osd op threads = 4                      # more threads where needed

   The full OSD config is here (for a week from now on):
   http://dpaste.com/35ABA0N
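   (In case it helps others reading along: ceph.conf values only take effect on restart, so to see what a running OSD actually uses, and to throttle recovery at runtime, something like the following should work. “osd.0” is just an example id.)

```shell
# Hedged sketch: check the effective recovery/backfill settings on a
# running OSD (osd.0 is an example id) via the admin socket, since
# ceph.conf values only apply after a restart or injectargs.
ceph daemon osd.0 config show | \
    grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_max_chunk|osd_recovery_op_priority'

# Throttle backfill/recovery concurrency at runtime, without restarts:
ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
```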

b) I just upgraded to hammer 0.94.10 (plus the segfault fix) in our development environment and _may_ have seen an improvement on this. Could that be due to

http://tracker.ceph.com/issues/16128

c) Are all Ceph users just silently happy with this, or are we the only ones hit by it? Either way, it makes us uneasy. (Well, I guess others are hit too: Alibaba seems to have been working on async + partial recovery as well.)

d) With size=3/min_size=2 we like to perform _quick_ maintenance operations (e.g. a simple host reboot) without evacuating the host. However, given how high the impact of recovery is, I’m now considering doing just that. Is everyone else just doing host evacuations all the time?

We’ve become edgy about this behavior after a couple of weeks spent fighting a CPU bug that caused host reboots every few days, each time triggering availability issues for our customers’ services due to slow requests.

Kind regards,
Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
