Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

On 03/07/2013 18:55, Andreas-Joachim Peters wrote:
> Dear Loic et al.,
> 
> I have/had some questions about the ideas behind the Erasure Encoding plans and OSD scalability.
> Please forgive me that I didn't study the source code or the details of the current CEPH implementation too closely (yet).
> 
> I found some of my questions already answered here,
> 
> ( https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst )
> 
> but they also created some more ;-)
> 
> *ERASURE ENCODING*
> 
> 1.) I understand that you will cover only OSD outages with the implementation and will delegate block corruption to be discovered by the file system implementation (like BTRFS would do). Is that correct?

Ceph also does scrubbing to detect block ( I assume you mean chunk ) corruption. The idea is to adapt the logic, which currently assumes replicas, so that it also detects when an object is no longer recoverable ( for instance more than K chunks missing or corrupted when an M+K code is used ).
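As a trivial illustration of that rule ( my own sketch, not the actual scrub code ):

    #include <stdbool.h>

    /* With M data chunks and K coding chunks, an object can still be
     * rebuilt as long as at least M of its M+K chunks are healthy,
     * i.e. at most K chunks are missing or corrupted. */
    static bool object_is_recoverable(const bool chunk_healthy[], int m, int k)
    {
        int available = 0;
        for (int i = 0; i < m + k; i++)
            if (chunk_healthy[i])
                available++;
        return available >= m;
    }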

> 2.) Blocks would be assembled always on the OSDs (?)

Yes. 

> 3.) I understood that the (3,2) RS sketched in the Blog is the easiest to implement since it can be done with simple parity (XOR) operations, but do you intend to have a generic (M,K) implementation?

Yes. The idea is to use the jerasure library, which provides Reed-Solomon codes and can be configured in various ways.
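For what it's worth, a minimal encoding sketch with the jerasure API as I understand it ( my own illustration, not Ceph code; note that jerasure itself calls the data chunk count "k" and the coding chunk count "m", the reverse of the naming in this thread ):

    #include <stdlib.h>
    #include "jerasure.h"
    #include "reed_sol.h"

    /* Encode M data chunks into K coding chunks with Reed-Solomon over
     * GF(2^w). data[i] and coding[j] point to pre-allocated buffers of
     * chunk_size bytes; jerasure wants chunk_size to be a multiple of
     * sizeof(long). Requires M + K <= 2^w ( e.g. w = 8 ). */
    static void encode_chunks(int m_data, int k_coding, int w,
                              char **data, char **coding, int chunk_size)
    {
        /* Vandermonde-derived generator matrix: k_coding rows, m_data columns. */
        int *matrix = reed_sol_vandermonde_coding_matrix(m_data, k_coding, w);
        jerasure_matrix_encode(m_data, k_coding, w, matrix,
                               data, coding, chunk_size);
        free(matrix);
    }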
 
> 4.) Would you split a 4M object into M x (4/M) objects? Would this not (even more) degrade single disk performance to random IO performance when many clients retrieve objects at random disk positions? Is 4M just a default or a hard-coded parameter of CEPHFS/S3?

It is just a default. I hope the updated https://github.com/dachary/ceph/blob/5efcac8fa6e08119f0deaaf1ae9919080e90cf0a/doc/dev/osd_internals/erasure-code.rst ( look for "Partials" ) answers the rest of the question.
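Roughly speaking ( my own numbers, assuming a plain M+K split of the default 4MB object ): with M=4 and K=2, a 4MB object becomes four 1MB data chunks plus two 1MB coding chunks, each stored on a different OSD, so the storage overhead is 1.5x instead of the 2x or 3x of replication.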

> 5.) Local Parity like in Xorbas makes sense for large M, but would a large M not hit scalability limits given by a single OSD in terms of object bookkeeping/scrubbing/synchronization, network packet limitations (at least in 1GBit networks) etc ... 1 TB = 250k objects => M=10 => 2.5 Mio objects ( a 100 TB disk server would have 250 Mio object fragments ?!?!)

We are looking at M+K < 2^8 at the moment which significantly reduces the problem you mention as well as the CPU consumption issues.

> 6.) Does a CEPH object know something like a parent object so it could understand if it is still a 'connected' object (like part of a block collection implementing a block, a file or container?)

At the level where erasure coding is implemented ( librados ) there is no notion of relationships between objects.

> *OSD SCALABILITY*

Please take my answers in this section with a grain of salt because there are many people with much more knowledge than I have :-)

> 1.) Are there some deployment numbers about the largest number of OSDs per placement group and the number of objects you can handle well in a placement group?

The acceptable range for the total number of placement groups seems to be ( number of OSDs ) * 100 up to ( number of OSDs ) * 1000.
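For example ( my own arithmetic, simply applying the rule of thumb above ): a cluster with 100 OSDs would aim for somewhere between 100 * 100 = 10,000 and 100 * 1000 = 100,000 placement groups in total, spread across its pools.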

> 2.) What is the largest number of OSDs people have ever tried out? Many presentations say 10-10k nodes, but probably it should be more OSDs?

The largest deployment I'm aware of is Dream{Object,Compute} but I don't know the actual numbers.

> 3.) In our CC we operate disk servers with up to 100 TB (25 disks), next year 200 TB (50 disks) and in the future even bigger.
> If I remember right the recommendation is to have 2GB of memory per OSD. 
> Can the memory footprint be lowered or is it a 'feature' of the OSD architecture?
> Is there in-memory information limiting scalability?

The OSD memory usage varies from a few hundred megabytes during normal operations to about 2GB when recovering, which can be a problem if you have a large number of OSDs running on the same hardware. You can control this by grouping the disks together. For instance, if your machine has 50 disks you could group them into 10 RAID0 arrays of 5 physical disks each and run 10 OSDs instead of 50. Of course it means that you will lose 5 disks at once if one of them fails, but when putting 50 disks in a single machine you have already made a decision that leans in this direction.
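To put rough numbers on it ( my own back-of-the-envelope figures, using the ~2GB worst case during recovery ): 50 OSDs on one machine could peak around 50 * 2GB = 100GB of RAM, while 10 OSDs on RAID0 groups would peak around 10 * 2GB = 20GB.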

> 4.) Today we run disk-only storage with 20k disks and 24 to 33 disks per node. There is a weekly cycle of repair & replacement and reboots.

I assume that's around 1,000 machines, right? How many disks / machines do you need to replace on a weekly basis?

> A typical scenario is that after a reboot the filesystem contents were not synced and information is lost. Does the CEPH OSD sync every block? If not, does it use a quorum on block contents when reading, or would it just return the block as is and rely on scrubbing to mark a block as corrupted?

I don't think Ceph can ever return a corrupted object as if it were intact. That would require either a manual intervention from the operator tampering with the file without notifying Ceph ( which would be the equivalent of shooting themselves in the foot ;-) or a bug in XFS ( or whatever underlying file system the objects are stored on ) that similarly corrupts the file. And all of this would have to happen before deep scrubbing discovers the problem.

> 5.) When rebalancing is needed, is there some time slice or scheduling mechanism which regulates the block relocation with respect to the 'normal' IO activity on the source and target OSD? Is there overload protection, in particular on the block target OSD?

There is a reservation mechanism to avoid creating too many communication paths during recovery ( see http://ceph.com/docs/master/dev/osd_internals/backfill_reservation/ for instance ) and throttling to regulate the bandwidth usage ( not 100% sure how that works though ). In addition, when operating a large cluster it is recommended to dedicate a network interface to internal communications ( check http://ceph.com/docs/master/rados/configuration/network-config-ref/ for more information ).
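For illustration only, a rough ceph.conf sketch of the knobs involved ( option names to be double-checked against the documentation above; values are placeholders, not recommendations ):

    [global]
        # dedicate a separate network to replication and recovery traffic
        public network  = 192.168.0.0/24
        cluster network = 10.0.0.0/24

    [osd]
        # throttle backfill / recovery concurrency per OSD
        osd max backfills       = 1
        osd recovery max active = 1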

Cheers

> 
> Thanks.
> 
> Andreas.

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
