RE: Pyramid erasure codes and replica hinted recovery

Hi all, 
a few points from my side:

In the case of three data centers, to protect against losing 1 out of 3 data centers one has to fulfill 2M = K (each data center holds (K+M)/3 chunks, so M must be at least that many to survive the loss of one data center), e.g. (12,6):

K=12 M=6

kkkkmm kkkkmm kkkkmm

Then one can add three local parities (one per data center) to optimize recovery from data center local failures:

kkkkmml kkkkmml kkkkmml

(12:6:(6:1)) => 175% space overhead

(6:3:(3:1)) => 200% space overhead

(18:9:(9:1)) => 166% space overhead

... so if you boost K it converges to 150% space overhead, but if K/3 is too big the traffic overhead for reconstruction gets high. Since we have a systematic code, it is important with respect to CPU requirements to distinguish K and M stripes on the global level, i.e. for 3 data centers K and M should each be divisible by 3, which e.g. (8:4) is not.
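To make the convergence concrete, a quick back-of-the-envelope calculation (plain Python, not Ceph code; it only assumes 2M = K and one local parity chunk per data center):

# space overhead of (K : K/2) global RS plus one local parity per data
# center, spread over 3 data centers
for K in (6, 12, 18, 48, 300):
    M = K // 2                          # 2M = K to survive the loss of one data center
    local_parities = 3                  # one local parity chunk per data center
    overhead = (K + M + local_parities) / float(K)
    print("(%d:%d) -> %.1f%% space overhead" % (K, M, overhead * 100))

which prints 200.0%, 175.0%, 166.7%, 156.2% and 151.0%, creeping towards 150%.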

In the current BPC implementation the global RS stripes are not included in the local parity computation; however, if M is high they should be included to optimize the traffic. We should have an option to specify this, or to select the best strategy automatically. For (6:3:(3:1)) the traffic is the same whether or not global stripes are included in the local parity, for (8:4:(4:1)) it is better not to include global stripes, and for (12:6:(6:1)) it is better to include them.
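To illustrate the (12:6:(6:1)) case, here is a rough sketch under a simplified repair model of my own, not how BPC counts traffic: a single chunk fails; a chunk covered by the local parity is rebuilt from the survivors of its local group, anything else needs a full K-chunk RS decode:

# average reads to repair one lost chunk in (12:6) with one local parity
# per data center, under the simplified model described above
K, M, DCS = 12, 6, 3
data_per_dc = K // DCS                  # 4 data chunks per data center
gpar_per_dc = M // DCS                  # 2 global parity chunks per data center
group = data_per_dc + gpar_per_dc       # 6 chunks per data center

# local parity over data chunks only: a lost data chunk costs 4 local reads,
# a lost global parity chunk costs a full 12-chunk decode
avg_excluded = (data_per_dc * data_per_dc + gpar_per_dc * K) / float(group)
# local parity over data + global parity chunks: any lost chunk costs 6 local reads
avg_included = float(group)

print("global stripes excluded: %.1f reads" % avg_excluded)    # ~6.7
print("global stripes included: %.1f reads" % avg_included)    # 6.0

Under this toy model including the global stripes comes out ahead for (12:6); where exactly the break-even point sits of course depends on the traffic model one assumes.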

In the current implementation we can only use simple parity (k:1) as the local parity; it is not possible to use something like RS(4:2) as local parity. In my opinion there is no need to add this feature, since the probability of having a double disk failure within the repair period of a single disk failure is extremely low, and if it happens it is acceptable to use global reconstruction instead of adding more space requirements for additional local stripes.

I am not 100% sure, but from your conversation I understand that recovery is done on a primary OSD, which is quite unfortunate with respect to local parities. The volume is reduced but locality is not given anymore ...

Cheers Andreas.
________________________________________
From: ceph-devel-owner@xxxxxxxxxxxxxxx [ceph-devel-owner@xxxxxxxxxxxxxxx] on behalf of Loic Dachary [loic@xxxxxxxxxxx]
Sent: 13 January 2014 09:38
To: Kyle Bader
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Pyramid erasure codes and replica hinted recovery

On 13/01/2014 03:35, Kyle Bader wrote:
>> How is it different from what is described above? There must be something I fail to understand.
>
> No misunderstanding on your part, on second look that does achieve the
> desired placement. Could you please help walk me through the following
> scenarios:
>
> Can data or local parity chunks that have been lost (erasures) be
> recovered locally, with no inter-dc backfill traffic?

If the primary happens to be located in the same data center as the lost chunk and the layout is as described previously, then it will be recovered without the need for inter-dc traffic. If the primary is not in the same datacenter, it may be possible to move it to the datacenter where the lost chunk is located. When the primary OSD is lost, another must be chosen. It would be nice to change the primary not only when it is lost but also when doing so helps recovery.
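For what it is worth, a layout along those lines (3 data centers, 7 chunks in each, enough for e.g. (12:6:(6:1))) could be expressed with a CRUSH rule of roughly this shape; this is only a sketch, the rule name, bucket names and chunk counts are my assumptions and not the rule from the earlier discussion:

rule ec_three_dc {
        ruleset 1
        type erasure
        min_size 3
        max_size 21
        step take default
        # pick 3 datacenter buckets, then 7 OSDs (via distinct hosts) in each
        step choose indep 3 type datacenter
        step chooseleaf indep 7 type host
        step emit
}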

> Global parity chunks that are lost require reading....6x data or
> global parity chunks (effectively 1x the original write)?

From the point of view of recovery, global parity chunks are treated in the same way as data chunks. If you have RS(6,3,3), you will need to read 6 chunks out of 9 (6 data chunks + 3 global parity chunks) to be able to recover from the loss of 2 or 3 chunks (data or parity, it does not matter). In other words, to recover from the loss of more chunks than local parity allows, you need to read 1x the original write.
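A tiny numeric illustration of that 1x figure (plain arithmetic, the 6 MiB object size is made up):

# read amplification when recovering lost chunks in RS(K=6, M=3)
K, M = 6, 3
object_size = 6 * 2 ** 20               # hypothetical 6 MiB object
chunk_size = object_size // K           # 1 MiB per chunk
chunks_read = K                         # any K of the surviving K + M chunks
bytes_read = chunks_read * chunk_size   # 6 MiB, i.e. 1x the original write
print("read %d MiB to repair a %d MiB object" % (bytes_read // 2 ** 20, object_size // 2 ** 20))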

> Would placement groups containing a data or local parity chunk that
> have been remapped backfill from the local chunk (member of previous
> acting set)?

David is working on multiple backfill at the moment https://github.com/ceph/ceph/pull/931 and will have a definitive answer. The data flows from the primary OSD to the OSDs supporting the other chunks; there is no peer-to-peer communication between the OSDs participating in a placement group.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre




