Re: What's the best practice for Erasure Coding

Striping with stripe units other than 1 is something I also tested, and I found that non-trivial striping should be avoided with EC pools. Firstly, EC is already a striped format; secondly, striping on top of it with stripe_unit>1 turns every write into an ec_overwrite, because the shards are then rarely, if ever, written as a whole.

The native striping in EC pools comes from k: data is striped over k disks. The higher k, the more throughput, at the expense of CPU and network.

In my long list, this should actually be point

6) Use stripe_unit=1 (default).

To get back to your question, this is another argument for k=power-of-two. Object sizes in Ceph are always powers of 2, while stripe widths contain k as a factor. Hence, any prime factor of k other than 2 implies a mismatch. How badly a mismatch affects performance should be tested.

Example: on our 6+2 EC pool the stripe_width is 24576, which has 3 as a factor. The 3 comes from k=6=3*2 and will always be there. This implies a misalignment, and some writes have to be split/padded in the middle. This does not happen too often per object, so 6+2 performance is good, but not as good as 8+2 performance.
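
If you want to check this for your own profile, here is a minimal sketch in Python (assuming the default EC stripe unit of 4096 bytes and the 8M object size used in the tests below):

#!/usr/bin/env python3
# Sketch only: check whether the object size is an integer multiple of the
# EC stripe_width. Assumes the default EC stripe unit of 4096 bytes.
STRIPE_UNIT = 4096            # osd_pool_erasure_code_stripe_unit default
OBJECT_SIZE = 8 * 1024 ** 2   # 8M objects, as in the tests below

for k, m in [(5, 2), (6, 2), (8, 2), (10, 4)]:
    stripe_width = k * STRIPE_UNIT
    full_stripes, remainder = divmod(OBJECT_SIZE, stripe_width)
    state = "aligned" if remainder == 0 else "misaligned, %d B partial stripe" % remainder
    print("%2d+%d: stripe_width=%5d  %s" % (k, m, stripe_width, state))

Only the power-of-two values of k come out aligned; with 6+2 every 8M object contains an 8K partial stripe, which is exactly the split/padding mentioned above.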

Some numbers:

1) rbd object size 8MB, 4 servers writing with 1 process each (=4 workers):
EC profile     4K random write      sequential write (8M write size)
               IOP/s aggregated     MB/s aggregated
 5+2            802.30              1156.05
 6+2           1188.26              1873.67
 8+2           1210.27              2510.78
10+4            421.80               681.22

2) rbd object size 8MB, 4 servers writing with 4 processes each (=16 workers):
EC profile     4K random write      sequential write (8M write size)
               IOP/s aggregated     MB/s aggregated
6+2            1384.43              3139.14
8+2            1343.34              4069.27

The EC profiles with a factor of 5 in k performed so badly that I didn't repeat the multi-process test (2) with them. I had limited time and went for a discard-early strategy to find suitable parameters.

The roughly 25% lower throughput of 6+2 compared with 8+2 in test (2) (3139 vs 4069 MB/s, a ratio of about 0.77) is probably due to the fact that data is striped over 6 instead of 8 disks (6/8=0.75). There might be some impact of the factor 3 somewhere as well, but it seems negligible in the scenario I tested.

Results with non-trivial striping (stripe_unit>1) were so poor that I did not even include them in my report.

We use the 8+2 pool for CephFS, where throughput is important. The 6+2 pool is used for VMs (RBD images), where IOP/s are more important; it also offers a higher level of redundancy. It's an acceptable compromise for us.
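
For reference, wiring up EC data pools like these looks roughly as follows. This is only a sketch with made-up pool/image/path names, not our actual configuration; both RBD and CephFS need ec_overwrites enabled on the EC data pool, and metadata stays on replicated pools:

#!/usr/bin/env python3
# Sketch with placeholder names: RBD data on a 6+2 EC pool, CephFS file
# data on an 8+2 EC pool. Metadata remains on replicated pools.
import subprocess

def run(*cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# RBD/CephFS on EC pools require overwrites on the EC data pools
run("ceph", "osd", "pool", "set", "ec62-data", "allow_ec_overwrites", "true")
run("ceph", "osd", "pool", "set", "ec82-data", "allow_ec_overwrites", "true")

# RBD image: header/metadata in a replicated pool, data on the 6+2 pool
run("rbd", "create", "--size", "100G", "--data-pool", "ec62-data", "rbd-meta/vm01")

# CephFS: add the 8+2 pool as a data pool and point a directory layout at it
run("ceph", "fs", "add_data_pool", "cephfs", "ec82-data")
run("setfattr", "-n", "ceph.dir.layout.pool", "-v", "ec82-data", "/mnt/cephfs/bulk")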

Note that the numbers will vary depending on hardware, OSD configuration, kernel parameters, etc. One needs to test what one has.
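
Workloads like the ones above can be approximated with rbd bench. Again only a sketch: the image name, --io-total and thread count below are placeholders, and the real tests ran from 4 client hosts in parallel, which a single invocation does not capture:

#!/usr/bin/env python3
# Sketch: approximate the 4K random and 8M sequential write workloads with
# rbd bench. Placeholder image/values; run from several clients for aggregates.
import subprocess

IMAGE = "rbd-meta/bench01"   # placeholder image backed by an EC data pool

def bench(*extra):
    cmd = ("rbd", "bench", "--io-type", "write") + extra + (IMAGE,)
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 4K random writes (IOP/s oriented)
bench("--io-size", "4K", "--io-pattern", "rand", "--io-threads", "16", "--io-total", "1G")

# 8M sequential writes (throughput oriented)
bench("--io-size", "8M", "--io-pattern", "seq", "--io-threads", "16", "--io-total", "8G")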

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Lars Marowsky-Bree <lmb@xxxxxxxx>
Sent: 11 July 2019 10:14:04
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  What's the best practice for Erasure Coding

On 2019-07-09T07:27:28, Frank Schilder <frans@xxxxxx> wrote:

> Small addition:
>
> This result holds for rbd bench. It seems to imply good performance for large-file IO on cephfs, since cephfs will split large files into many objects of size object_size. Small-file IO is a different story.
>
> The formula should be N*alloc_size=object_size/k, where N is some integer. In other words, object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?


--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli Zbinden)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com