Re: why the erasure code pool not support random write?

On 21/10/2014 09:31, Nicheal wrote:
> 2014-10-21 7:40 GMT+08:00 Lionel Bouton <lionel+ceph@xxxxxxxxxxx>:
>> Hi,
>>
>> On 21/10/2014 01:10, 池信泽 wrote:
>>
>> Thanks.
>>
>>    Another reason is that the checksum stored in the object's attrs, used
>> for deep scrub in EC pools, has to be recomputed whenever the object is
>> modified. To support random writes, we would have to recompute the checksum
>> over the whole object, even if only a single bit changed. If only append
>> writes are supported, we can derive the new checksum from the previous
>> checksum and the appended data, which is much faster.
>>
>>    Am I right?
>>
>>
>> From what I understand, the deep scrub doesn't use a Ceph checksum but
>> compares data between OSDs (and probably uses a "majority wins" rule for
>> repair). If you are using Btrfs, it will report an I/O error, because it
>> checksums data internally by default, and that will force Ceph to use
>> other OSDs for repair.
>> I'd be glad to be proven wrong on this subject though.
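[Editor's note: the incremental-checksum property described above is easy to sketch with Python's zlib.crc32, which accepts a running value. This is an illustration of the principle only, not what Ceph actually stores.]

```python
import zlib

# A running CRC lets an append-only store update the object checksum
# from the previous checksum plus the newly appended bytes alone.
obj = b""
crc = 0
for chunk in (b"first-write ", b"second-write"):
    obj += chunk
    crc = zlib.crc32(chunk, crc)   # cost is O(len(chunk)), not O(len(obj))

assert crc == zlib.crc32(obj)      # matches a full recompute from scratch

# With this naive scheme, an in-place overwrite in the middle of the
# object invalidates the running value: the whole object has to be
# re-read and re-hashed to obtain the new checksum.
```

With a simple running checksum like this, a random overwrite forces a full recompute, which is exactly the asymmetry between append-only and random writes described above.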
> No. When deep scrubbing, the whole 4M object (I mean, if we set the object
> size to 4M) is not compared byte by byte between replicas. That would
> introduce a high load on the network if whole 4M objects were transmitted,
> even if the object content were compressed. Instead, the whole 4M object
> content is hashed into a 64-bit digest, and comparing the digests confirms
> whether the content is consistent. This still requires reading the whole
> 4M object content, which is why a scrub without "deep" only compares the
> metadata of each object.
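[Editor's note: the digest-comparison scheme Nicheal describes can be condensed into a small sketch. The function names and the choice of SHA-256 here are illustrative, not Ceph's actual implementation.]

```python
import hashlib

def object_digest(data: bytes) -> bytes:
    # Each OSD reads its whole local copy, but only this short
    # digest has to cross the network for comparison.
    return hashlib.sha256(data).digest()

def deep_scrub(replicas: dict[str, bytes]) -> bool:
    # The object is consistent if every replica produced the same digest.
    digests = {object_digest(data) for data in replicas.values()}
    return len(digests) == 1

obj = b"\x42" * (4 * 1024 * 1024)           # a 4 MB object
assert deep_scrub({"osd.0": obj, "osd.1": obj, "osd.2": obj})

corrupt = bytearray(obj)
corrupt[1000] ^= 1                           # a single flipped bit
assert not deep_scrub({"osd.0": obj, "osd.1": bytes(corrupt)})
```

A plain (non-deep) scrub skips the full read entirely and compares only object metadata, which is why it is so much cheaper.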

What I meant is that I believe the data being compared does not come from
a checksum already stored on disk at write time (which is what I understood
by "checksum in the attr of object" in the original post) and able to
detect bit rot by itself. The fact that there is a network usage
optimization using dynamically computed hashes doesn't change the point:
corruption detection is done by comparing the data between peers, not by
using a checksum stored locally at write time, which would bring additional
integrity guarantees (for example, it would allow repair to choose the
correct replica out of 2 in pools configured with size 2).
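[Editor's note: to make the distinction concrete, here is a sketch of what a write-time checksum stored alongside the object would buy you in a two-replica pool. It is purely illustrative; the names are invented and this is not how Ceph stored replicated objects at the time.]

```python
import zlib

def write_object(data: bytes) -> tuple[bytes, int]:
    # Hypothetical scheme: store a checksum in the object's
    # attrs at write time, alongside the data.
    return data, zlib.crc32(data)

def self_consistent(data: bytes, stored_crc: int) -> bool:
    # A replica can verify its own copy without talking to peers.
    return zlib.crc32(data) == stored_crc

def repair_pick(replicas):
    # With only two copies there is no majority to vote with, but a
    # write-time checksum lets each replica prove its own integrity,
    # so repair can still pick the good copy.
    for data, crc in replicas:
        if self_consistent(data, crc):
            return data
    raise RuntimeError("no replica passes its stored checksum")

good = write_object(b"payload")
rotted = (bytes(b ^ 0xFF for b in good[0]), good[1])  # bit rot after write
assert repair_pick([rotted, good]) == b"payload"
```

Without such a stored checksum, two disagreeing replicas in a size-2 pool are indistinguishable, which is the extra guarantee being pointed out above.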

Best regards,

Lionel Bouton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




