Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On Mon, Sep 18, 2017 at 2:02 PM, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
> Hi,
>
> and here’s another update which others might find quite interesting.
>
> Florian and I spent some time discussing the issue further, face to face. There was one switch I revisited (--osd-recovery-start-delay) which I had looked at a few weeks ago, but had concluded that its rules are underdocumented and that, on the face of it, it didn’t seem to do anything.
>
> After stepping through what we learned about prioritized recovery, I brought this up again and we started to experiment with this further - and it turns out this switch might be quite helpful.
>
> Here’s what I found; maybe others can chime in on whether this is going in the right direction:
>
> 1. Setting --osd-recovery-start-delay (e.g. 60 seconds) causes no PG
>    to start its recovery when the OSD boots and goes from ‘down/in’ to
>    ‘up/in’.

Just a minor point of correction here, for the people grabbing this
thread from the archives: the option is osd_recovery_delay_start.
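
For anyone who wants to try this, here's roughly how you'd set it (a
sketch; the 60-second value is just the one from Christian's tests,
not a recommendation):

    # persistently, in ceph.conf on the OSD hosts:
    [osd]
    osd recovery delay start = 60

    # or at runtime (injected values do not survive an OSD restart):
    ceph tell osd.\* injectargs '--osd_recovery_delay_start 60'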

> 2. Client traffic starts getting processed immediately.
>
> 3. Writes from client traffic cause individual objects to require a
>    (prioritized) recovery. As no other recovery is happening, everything
>    is pretty relaxed: the recovery happens quickly and no slow
>    requests appear. (Even when lowering the complaint time to 15s.)
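
Side note for anyone reproducing this: the "complaint time" Christian
mentions is the osd_op_complaint_time option, i.e. how long an op may
be outstanding before it is reported as a slow request. The default is
30 seconds, so 15s makes the test *more* sensitive to slow requests.
To match Christian's setting:

    ceph tell osd.\* injectargs '--osd_op_complaint_time 15'
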
>
> 4. When an object from a PG gets recovered this way, the PG is marked
>    as ‘active+recovering+degraded’. In my test cluster the number of
>    such PGs went up to ~37, which made me wonder, because it exceeded
>    my ‘--osd-recovery-max-active’ setting. Looking at the recovery
>    rate, however, you can see that hardly any recovery is happening:
>    only every now and then does an object get recovered.

Again, for future interested parties following this thread, I think
this is worth highlighting, as it's rather unexpected (albeit somewhat
logical): the PG will go to the "recovering" state even though it's
not undergoing a full recovery. There are just a handful of objects
*in* the PG that are being recovered, even though (a) recovery is
deferred via osd_recovery_delay_start, *and* (b) concurrent recovery
of that many PGs technically isn't allowed, thanks to
osd_recovery_max_active. A clear indicator of this situation is
"ceph -w" showing a nontrivial number of PGs recovering, but the
recovery rate being in the single-digit objects per second.
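
If you want to observe this yourself, the commands to watch are along
these lines (a sketch; output details vary by release):

    ceph -w                 # live cluster log incl. recovery rate
    ceph pg stat            # one-shot summary of PG states
    ceph pg ls recovering   # list PGs currently in "recovering"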

> 5. After two minutes, no sudden “everyone else please start recovering”
>    thundering herd happens. I scratch my head. I think.
>
>    My conclusion is that the “active+recovering+degraded” marker is
>    actually just that: a marker. The organic writes now (implicitly)
>    signal Ceph that there is a certain amount of organic traffic that
>    requires recovery, and they keep the number of recovering PGs
>    beyond the point where “real” recovery would start, because my
>    limit is 3 recovering PGs per OSD.

Josh, can you confirm this? And if so, can you elaborate on the
reasoning behind it?

> 6. After a while, your “hot set” of objects that get written to (I
>    used two VMs with a random-write fio workload[1]) is recovered by
>    organic means and the ‘recovering’ PG count goes down.
>
> 7. Once an OSD’s “recovering” count falls below the limit, it starts
>    “real” recoveries. However, the hot set is now already recovered,
>    so slow requests due to prioritized recoveries become unlikely.
>
> This actually feels like quite a nice way to handle this. Yes, recovery time will be longer, but with size=3/min_size=2 pools this still feels fast enough. (In my test setup it took about 1h to recover fully from a 30% failure with heavy client traffic.)

Again, for the benefit of third parties we should probably mention
that recovery otherwise completed in a matter of minutes, albeit at
the cost of making client I/O almost unworkably slow. Just so people
can decide for themselves whether or not they want to go down that
route.
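
For reference, here are the knobs involved in the strategy Christian
describes, in one place (values are the ones from his test cluster,
not tuning advice):

    [osd]
    # defer "real" recovery after an OSD comes back up/in; prioritized
    # per-object recoveries triggered by client writes still happen
    osd recovery delay start = 60
    # cap on concurrent recovery per OSD (the "3 per OSD" limit
    # Christian mentions; 3 is also the default)
    osd recovery max active = 3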

Also (Josh, please correct me if I'm wrong here), I think people need
to understand that using this strategy makes recovery last longer when
their clients are very busy, and wrap up quickly when they are not
doing much. Client activity is not something that the Ceph cluster
operator necessarily has much control over, so keeping tabs on average
and max recovery time would be a good idea here.
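
A crude way of keeping those tabs, as a sketch (it assumes "ceph pg
stat" mentions degraded/recovering states in its one-line summary,
which it does today):

    #!/bin/sh
    # Poll until no PGs are degraded or recovering, then report how
    # long the recovery took.
    start=$(date +%s)
    while ceph pg stat | grep -q -e degraded -e recovering; do
        sleep 60
    done
    echo "recovery took $(( $(date +%s) - start )) seconds"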

What do others think? In particular, does anyone think what Christian
is suggesting is a bad idea? It seems like a sound approach to me.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



