RE: Question about how rebuild works.

So the current algorithm optimizes for the shortest period of cluster degradation at the expense of MTTDL.

So in the 3x replication case, the MTTR of two-failure data is somewhere between 1x and 2x the MTTR of a single failure, depending on the phase alignment of the first and second rebuilds.

The average case would be 1.5x, and MTTDL varies inversely with this MTTR, i.e., this behavior roughly cuts the MTTDL in half.
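
For concreteness, here is one simple model consistent with those numbers; the assumption that the second failure lands at a uniformly distributed phase offset phi within the first rebuild is mine, not something stated in the thread. If the current algorithm restarts the second backfill from the beginning of the PG, then roughly:

    \[
      \mathrm{MTTR}_2(\phi) \;\approx\; (1+\phi)\,\mathrm{MTTR}_1,
      \qquad \phi \in [0,1],
    \]
    \[
      \mathrm{E}[\mathrm{MTTR}_2]
        \;=\; \mathrm{MTTR}_1 \int_0^1 (1+\phi)\,d\phi
        \;=\; 1.5\,\mathrm{MTTR}_1 .
    \]

So the doubly-degraded data spends, on average, 1.5x longer with a single surviving copy, and since MTTDL varies inversely with that exposure window, MTTDL drops accordingly.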


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
Sent: Friday, November 06, 2015 8:53 AM
To: Samuel Just <sjust@xxxxxxxxxx>
Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Question about how rebuild works.

Yeah, I'm more concerned about individual object durability. This seems like a good way (in ongoing flapping or whatever) for objects at the tail end of a PG to never get properly replicated even as we expend lots of IO repeatedly recovering earlier objects which are better-replicated. :/ Perhaps min_size et al make this a moot point, but...I don't think so. Haven't worked it all the way through.
-Greg

On Fri, Nov 6, 2015 at 8:48 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Nope, it's worse: there could be an arbitrary interleaving of backfilled
> and unbackfilled regions on any particular incomplete OSD.  We'd need a
> backfilled_regions field with a type like map<hobject_t, hobject_t> 
> mapping backfilled regions begin->end.  It's pretty tedious, but 
> doable provided that we bound how large the mapping gets.  I'm 
> skeptical about how large an effect this would actually have on 
> overall durability (how frequent is this case?).  Once Allen does the 
> math, we'll have a better idea :) -Sam
>
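
A minimal sketch of the backfilled_regions idea Sam describes: a begin->end map of already-backfilled intervals, merged on insert so its size stays bounded. The type names and merging policy below are illustrative stand-ins, not Ceph's actual PG metadata.

    #include <iterator>
    #include <map>
    #include <string>

    // Illustrative stand-in for hobject_t's sort order.
    using hobject_key = std::string;

    struct BackfilledRegions {
      // begin -> end, half-open, non-overlapping, sorted by begin.
      std::map<hobject_key, hobject_key> regions;

      // Has this object already been backfilled to the incomplete OSD?
      bool contains(const hobject_key& obj) const {
        auto it = regions.upper_bound(obj);   // first region starting after obj
        if (it == regions.begin())
          return false;
        --it;                                 // region starting at or before obj
        return obj < it->second;              // inside [begin, end)?
      }

      // Record a newly backfilled interval, merging with any neighbours it
      // overlaps or touches so the number of entries stays bounded.
      void add(hobject_key begin, hobject_key end) {
        auto it = regions.lower_bound(begin);
        if (it != regions.begin() && std::prev(it)->second >= begin)
          --it;                               // previous region touches us
        while (it != regions.end() && it->first <= end) {
          if (it->first < begin) begin = it->first;
          if (it->second > end)  end = it->second;
          it = regions.erase(it);
        }
        regions.emplace(std::move(begin), std::move(end));
      }
    };

Keeping the intervals merged means the map never holds more entries than there are disjoint gaps on that OSD, which is the kind of bound Sam says the approach would need.
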
> On Fri, Nov 6, 2015 at 8:43 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> Argh, I guess I was wrong. Sorry for the misinformation, all! :(
>>
>> If we were to try and do this, Sam, do you have any idea how much work
>> it would take? Presumably we'd have to add a backfill_begin marker to
>> bookend with last_backfill_started, and then everywhere we send over 
>> object ops we'd have to compare against both of those values. But I'm 
>> not sure how many sites that's likely to be, what other kinds of 
>> paths rely on last_backfill_started, or if I'm missing something.
>> -Greg
>>
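
A sketch of the bookended check Greg is asking about: the primary would track both where the current backfill pass started and how far it has gotten, and any site that decides whether an op must also go to the backfill target would test membership in that range. Only backfill_begin and last_backfill_started come from the thread; the struct and function names are illustrative.

    #include <string>

    // Illustrative stand-in for hobject_t's sort order.
    using hobject_key = std::string;

    struct BackfillWindow {
      hobject_key backfill_begin;         // where this backfill pass started
      hobject_key last_backfill_started;  // how far it has gotten

      // Would the backfill target already hold this object?
      // (Covers only the non-wrapped case; a circle-back pass that has
      // wrapped past the end of the PG ordering would need extra handling.)
      bool object_backfilled(const hobject_key& hoid) const {
        return backfill_begin <= hoid && hoid < last_backfill_started;
      }
    };
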
>> On Fri, Nov 6, 2015 at 8:30 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>> What it actually does is rebuild 3 until it catches up with 2 and 
>>> then it rebuilds them in parallel (to minimize reads).  Optimally, 
>>> we'd start 3 from where 2 left off and then circle back, but we'd 
>>> have to complicate the metadata we use to track backfill.
>>> -Sam
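
To make the contrast concrete, here is a rough sketch of the two scan orders Sam describes for the second backfill target; the positions and names are illustrative, not Ceph code.

    #include <utility>
    #include <vector>

    using Range = std::pair<int, int>;  // [begin, end) in the PG's object ordering

    // Current behavior: the second target restarts from the beginning of the
    // PG and scans forward until it catches up with the first target, so
    // doubly-degraded objects past the catch-up point wait for a full pass.
    std::vector<Range> current_scan(int pg_begin, int pg_end, int /*first_target_pos*/) {
      return { {pg_begin, pg_end} };
    }

    // Alternative Sam mentions: start the second target where the first one
    // already is, run to the end alongside it, then circle back for the
    // skipped prefix.  This is what would require the richer backfill
    // metadata discussed above.
    std::vector<Range> circle_back_scan(int pg_begin, int pg_end, int first_target_pos) {
      return { {first_target_pos, pg_end}, {pg_begin, first_target_pos} };
    }
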