Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

Christian Theune <ct@xxxxxxxxxxxxxxx> · Mon, 18 Sep 2017 14:02:08 +0200

Hi,

and here’s another update which others might find quite interesting.

Florian and I spent some time discussing the issue further, face to face. I brought up one switch again (--osd-recovery-start-delay) that I had looked at a few weeks ago; back then I concluded that its rules are underdocumented and that, from the appearance, it didn’t seem to do anything.

After stepping through what we had learned about prioritized recovery, we started to experiment with the switch further, and it turns out it might be quite helpful.

Here’s what I found; maybe others can chime in on whether this is going in the right direction or not:

1. Setting --osd-recovery-start-delay (e.g. 60 seconds) causes no PG
   to start its recovery when the OSD boots and goes from ‘down/in’ to
   ‘up/in’.

2. Client traffic starts getting processed immediately.

3. Writes from client traffic cause individual objects to require a
   (prioritized) recovery. As no other recovery is happening, everything
   is pretty relaxed: the recovery happens quickly and no slow
   requests appear (even with the complaint time set to 15s).

4. When an object from a PG gets recovered this way, the PG is marked as
   ‘active+recovering+degraded’. In my test cluster the number of such
   PGs went up to ~37, which made me wonder, because it exceeded my
   ‘--osd-recovery-max-active’ setting. Looking at the recovery rate
   you can see that almost no objects are recovering; only every now
   and then an object gets recovered.

5. After two minutes, no sudden “everyone else please start recovering”
   thundering happens. I scratch my head. I think.

   My conclusion is that the “active+recovering+degraded” marker is
   actually just that: a marker. The organic writes now (implicitly)
   signal Ceph that there is a certain amount of organic traffic that
   requires recovery, and they push the number of recovering PGs beyond
   the point where “real” recovery would start, because my limit is
   3 recovering PGs per OSD.

6. After a while your “hot set” of objects that get written to (I used
   two VMs with a random write fio[1]) is recovered by organic means and
   the ‘recovering’ PGs count goes down.

7. Once an OSD’s “recovering” count falls below the limit, it starts
   “real” recoveries. However, the hot set is now already
   recovered, so slow requests due to prioritized recoveries
   become unlikely.
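
To make the sequence in steps 1-7 easier to follow, here is a toy simulation of my mental model. It is not Ceph code and does not claim to match the real scheduler: the function simulate(), its parameters (num_pgs, hot_pgs, hot_objects, max_active, start_delay, cold_ticks) and the idea of discrete "ticks" are all made up for illustration. It only encodes the assumptions that a client write recovers one object and marks its PG as recovering, and that "real" recovery starts only after the delay and only while the recovering count is below the limit.

```python
def simulate(num_pgs=12, hot_pgs=6, hot_objects=4, max_active=3,
             start_delay=10, cold_ticks=5):
    """Toy model of one rejoining OSD (illustration only, not Ceph code).

    All parameters are hypothetical knobs of this sketch, measured in
    abstract ticks; they are not Ceph options.
    """
    hot_left = {pg: hot_objects for pg in range(hot_pgs)}  # hot objects still missing
    marked = set()                 # PGs flagged 'recovering' by client writes
    background = {}                # pg -> ticks of "real" recovery still to run
    pending = set(range(num_pgs))  # PGs still waiting for "real" recovery
    peak_marked = 0
    first_bg_tick = None
    hot_left_at_first_bg = None
    tick = 0

    while pending or background:
        # Client traffic: one write per tick, round-robin over the hot set.
        candidates = sorted(pg for pg, left in hot_left.items() if left > 0)
        if candidates:
            pg = candidates[tick % len(candidates)]
            hot_left[pg] -= 1       # prioritized recovery of that one object
            if hot_left[pg] > 0:
                marked.add(pg)      # PG shows up as recovering
            else:
                marked.discard(pg)  # hot objects done, marker drops
        peak_marked = max(peak_marked, len(marked))

        # "Real" recoveries that are already running make progress.
        for pg in list(background):
            background[pg] -= 1
            if background[pg] == 0:
                del background[pg]

        # After the delay, start "real" recovery while below the limit.
        while tick >= start_delay and len(marked | set(background)) < max_active:
            available = sorted(pending - marked)
            if not available:
                break
            pg = available[0]
            pending.remove(pg)
            background[pg] = cold_ticks
            if first_bg_tick is None:
                first_bg_tick = tick
                hot_left_at_first_bg = sum(hot_left.values())
        tick += 1

    return {
        "peak_marked": peak_marked,
        "first_background_tick": first_bg_tick,
        "hot_left_at_first_background": hot_left_at_first_bg,
        "total_ticks": tick,
    }


if __name__ == "__main__":
    print("with delay:   ", simulate(start_delay=10))
    print("without delay:", simulate(start_delay=0))
```

In this sketch the marked count climbs past max_active while the hot set is being written, and with a delay the first "real" recovery only starts once most hot objects are already recovered. Without the delay, "real" recovery grabs its slots right away and competes with the hot set.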

This actually feels like quite a nice way to handle this. Yes, recovery time will be longer, but with size=3/min_size=2 this still feels fast enough. (In my test setup it took about 1h to recover fully from a 30% failure with heavy client traffic.)

In my experiment I did see slow requests but none of those were ‘waiting for missing object’ or ‘waiting for degraded object’.
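
For anyone who wants to repeat that check, here is a small sketch for tallying slow requests by their reason. It assumes the warning lines contain the text "slow request" and end with "currently <reason>"; the helper names (tally_slow_request_reasons, recovery_blocked) and the sample lines are made up and abbreviated, so adjust the pattern to what your logs actually look like.

```python
import re
from collections import Counter

# Assumed line shape: "... slow request <age> seconds old, ... currently <reason>"
_REASON = re.compile(r"slow request .*? currently (?P<reason>.+?)\s*$")

# The two reasons that point at recovery blocking client I/O.
RECOVERY_REASONS = (
    "waiting for missing object",
    "waiting for degraded object",
)


def tally_slow_request_reasons(lines):
    """Count slow request lines per reason; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = _REASON.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts


def recovery_blocked(counts):
    """Number of slow requests that were waiting on object recovery."""
    return sum(counts[reason] for reason in RECOVERY_REASONS)


if __name__ == "__main__":
    # Hypothetical, abbreviated sample lines (not taken from a real log).
    sample = [
        "osd.3 [WRN] slow request 15.2 seconds old, osd_op(...) currently waiting for subops from 4,7",
        "osd.5 [WRN] slow request 16.0 seconds old, osd_op(...) currently waiting for degraded object",
        "osd.3 [INF] unrelated line",
    ]
    counts = tally_slow_request_reasons(sample)
    print(counts)
    print("blocked on recovery:", recovery_blocked(counts))
```

A result of zero from recovery_blocked() on the log of a test run would correspond to what I saw above.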

I consider this a success and wonder what you guys think.

Christian

[1] fio --rw=randwrite --name=test --size=50M --direct=1 --bs=4k-128k --numjobs=20 --iodepth=64 --group_reporting --runtime=6000 --time_based

Kind regards,
Christian Theune

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com