Re: What's the best practice for Erasure Coding

Frank Schilder <frans@xxxxxx> · Thu, 11 Jul 2019 10:48:53 +0000

Oh dear. Every occurrence of stripe_* is wrong :)

It should be stripe_count (option --stripe-count in rbd create) everywhere in my text.

What choices are legal depends on the restrictions on stripe_count*stripe_unit (=stripe_size=stripe_width?) imposed by ceph. I believe all of this ends up being powers of 2.

Yes, the 6+2 is a bit surprising. I have no explanation for the observation. It just seems a good argument for "do not trust what you believe, gather facts". And to try things that seem non-obvious - just to be sure.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Lars Marowsky-Bree <lmb@xxxxxxxx>
Sent: 11 July 2019 12:17:37
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: What's the best practice for Erasure Coding

On 2019-07-11T09:46:47, Frank Schilder <frans@xxxxxx> wrote:

> Striping with stripe units other than 1 is something I also tested. I found that with EC pools non-trivial striping should be avoided. Firstly, EC is already a striped format and, secondly, striping on top of that with stripe_unit>1 will make every write an ec_overwrite, because now shards are rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small power of two.)

> The native striping in EC pools comes from k, data is striped over k disks. The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.
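
To illustrate the point about k and stripe_width, here is a minimal sketch. It assumes stripe_width = k * chunk size with a 4 KiB chunk; the chunk size and the helper name is_full_stripe_write are assumptions for illustration, not Ceph API. A write avoids the partial-stripe path only if it starts on a stripe boundary and covers whole stripes:

```python
CHUNK_SIZE = 4096  # assumed per-shard chunk size in bytes; check your pool's actual value


def is_full_stripe_write(offset: int, length: int, stripe_width: int) -> bool:
    """True if the write starts on a stripe boundary and covers only whole stripes."""
    return length > 0 and offset % stripe_width == 0 and length % stripe_width == 0


if __name__ == "__main__":
    write_offset, write_length = 0, 16 * 1024  # one aligned 16 KiB write
    for k in (2, 4, 8, 16):
        stripe_width = k * CHUNK_SIZE
        full = is_full_stripe_write(write_offset, write_length, stripe_width)
        print(f"k={k:2d} stripe_width={stripe_width:6d} full-stripe write: {full}")
```

Under these assumptions the same aligned 16 KiB write covers whole stripes for k=2 and k=4, but only part of a stripe for k=8 and k=16.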

> In my long list, this should actually be point
>
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. Object sizes in ceph are always powers of 2 and stripe sizes contain k as a factor. Hence, any prime factor other than 2 in k will imply a mismatch. How badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)
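
The prime-factor argument can be checked with a little arithmetic. This is a sketch, assuming stripe_width = k * chunk size with a power-of-two chunk of 4 KiB (an assumption; the helper names are hypothetical). A power-of-two object size is a whole number of stripes only if k itself is a power of two:

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and n & (n - 1) == 0


def leftover_bytes(object_size: int, k: int, chunk_size: int = 4096) -> int:
    """Bytes of the object that fall into a final, partially filled stripe."""
    return object_size % (k * chunk_size)


if __name__ == "__main__":
    object_size = 4 * 1024 * 1024  # a 4 MiB object, used here as an example
    for k in (4, 6, 8):
        print(f"k={k} power_of_two={is_power_of_two(k)} "
              f"leftover={leftover_bytes(object_size, k)} bytes")
```

With these numbers, k=4 and k=8 divide the 4 MiB object evenly, while k=6 leaves 16384 bytes in a partial stripe. How much that mismatch costs in practice still has to be measured.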

> Results with non-trivial striping (stripe_size>1) were so poor, I did not even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool is used for VMs (RBD images), where IOP/s are more important. It also offers a higher redundancy level. It's an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.

Regards,
    Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli Zbinden)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com