Re: What's the best practice for Erasure Coding

Dear Alex,

I don't really have a reference for this setup. The Ceph documentation describes it as the simplest possible setup, and back then it was basically dictated by budget. Everything else was several months of experimentation and benchmarking. I had scripts running for several weeks just doing rbd bench over all sorts of parameter combinations. I was testing for aggregated large sequential writes (aggregated bandwidth) and aggregated small random writes (aggregated IOP/s).
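If it helps, the core of what I was running looked roughly like the sketch below (written out here in Python rather than posting my actual shell scripts; the image spec, sizes and io-total are placeholders you would adjust):

    #!/usr/bin/env python3
    # Rough sketch of an rbd bench parameter sweep: large sequential writes
    # for aggregated bandwidth, small random writes for aggregated IOP/s.
    # The image spec and sizes are placeholders, not my production values.
    import itertools
    import subprocess

    IMAGE = "bench-pool/bench-image"   # hypothetical test image

    cases = [
        ("seq",  ["1M", "4M"]),    # bandwidth-oriented
        ("rand", ["4K", "16K"]),   # IOP/s-oriented
    ]
    threads = [1, 16, 64]

    for pattern, sizes in cases:
        for io_size, t in itertools.product(sizes, threads):
            cmd = [
                "rbd", "bench", IMAGE,
                "--io-type", "write",
                "--io-pattern", pattern,
                "--io-size", io_size,
                "--io-threads", str(t),
                "--io-total", "10G",
            ]
            print(" ".join(cmd))
            subprocess.run(cmd, check=True)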

In my experience, write performance with bluestore OSDs and everything collocated is quite good. One of the design goals of bluestore was to provide more constant and predictable throughput, and as far as I can tell this works quite well. We have 150 spindles and I can get a sustained aggregated sequential write performance of 6GB/s. This is quite good for our purposes, and only very few users have managed to fill this bandwidth.

To be on the safe side, I promise no more than 30MB/s per disk. This is a pretty good lower bound, and if you put enough disks together, it adds up: with our 150 spindles that guarantee alone comes to 4.5GB/s aggregated, consistent with the 6GB/s we actually see.

Latency is a different story. There is an extreme skew between write and read. When I do a tar/untar test with an archive containing something like 100,000 small files, the tar is up to 20 times slower than the untar (on ceph fs, clean client, MDS and client caches flushed). I didn't test RBD read latency with rbd bench.
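For reference, the test itself is nothing fancy; a minimal version in Python (paths are placeholders, and in the real runs I flushed client and MDS caches between the two steps, which this sketch does not do):

    #!/usr/bin/env python3
    # Minimal tar/untar timing on a cephfs mount: creating the archive is
    # dominated by small-file read latency, extracting it by write latency.
    import tarfile
    import time

    SRC = "/mnt/cephfs/testdir"       # ~100,000 small files (placeholder path)
    ARCHIVE = "/mnt/cephfs/test.tar"
    DEST = "/mnt/cephfs/restore"

    t0 = time.time()
    with tarfile.open(ARCHIVE, "w") as tar:   # "tar": read-latency bound
        tar.add(SRC, arcname="testdir")
    t1 = time.time()

    with tarfile.open(ARCHIVE, "r") as tar:   # "untar": write-latency bound
        tar.extractall(DEST)
    t2 = time.time()

    print(f"tar:   {t1 - t0:.1f}s")
    print(f"untar: {t2 - t1:.1f}s")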

I don't have SSDs for WAL/DB, so some statements below this line are a bit speculative and based on snippets I picked up in other conversations.

The most noticeable improvement from using SSD for WAL/DB is probably a reduction in read latency due to faster DB lookups. The WAL actually has only limited influence: if I understand correctly, it only speeds things up while it is not running full. As soon as the write load is high enough, the backing disk becomes the effective bottleneck.
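For completeness, and keeping in mind that I don't run this layout myself, deploying an OSD with the DB (and implicitly the WAL) on a fast device looks roughly like this sketch; the device paths are made up:

    #!/usr/bin/env python3
    # Sketch only -- device paths are placeholders. With only --block.db given,
    # ceph-volume puts the WAL on the same fast device as the DB.
    import subprocess

    subprocess.run([
        "ceph-volume", "lvm", "create", "--bluestore",
        "--data", "/dev/sdb",             # spinning data device
        "--block.db", "/dev/nvme0n1p1",   # DB (+WAL) partition on the SSD/NVMe
    ], check=True)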

I have some anecdotal wisdom that WAL+DB on SSD improves performance by a factor of 2. "Performance" is undefined here; it's probably something like "general user experience". A factor of 2 is not very tempting given the architectural complication. I would rather double the size of my cluster.

For latency- and IOP/s-sensitive applications (KVMs on RBD) we actually went for all-flash running on cheap MICRON PRO SSDs, which are QLC and can drop in bandwidth down to 80MB/s. However, I have never seen more than 5MB/s per disk with this workload. KVMs on RBD are really IOP/s intensive and bandwidth is secondary. The MICRON PRO disks provide very good IOP/s per TB already with a single OSD per disk. My benchmarks show that running 2 OSDs per disk doubles that and running 4 OSDs per disk saturates the disk's spec performance. For the number of VMs we run per SSD, these disks are completely sufficient, so I can live with the single-OSD-per-disk deployment. This all-flash setup is simpler and probably also cheaper than a hybrid OSD setup. I think Kingston SSDs are a bit more expensive but equally suitable. Make sure you disable the volatile write cache.
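In case it is useful, this is roughly what the deployment side of that looks like; device names are placeholders, and note that we run only one OSD per disk in production, so the --osds-per-device part reflects what my benchmarks covered, not what we deploy:

    #!/usr/bin/env python3
    # Sketch: disable the volatile write cache and deploy 2 OSDs per SSD.
    # hdparm -W 0 works for SATA devices (SAS needs sdparm instead) and does
    # not survive a reboot by itself -- persist it via udev or a boot script.
    import subprocess

    DEVICES = ["/dev/sdc", "/dev/sdd"]   # placeholder SSDs

    for dev in DEVICES:
        subprocess.run(["hdparm", "-W", "0", dev], check=True)

    subprocess.run(
        ["ceph-volume", "lvm", "batch", "--bluestore",
         "--osds-per-device", "2"] + DEVICES,
        check=True,
    )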

In my experience, ceph fs with the metadata pool on SSD and the data pool on HDD only is fine. For RBD backing VMs, all-flash with SSDs that have high single-thread IOP/s works really well for us. You need to test this, though: many SSDs have surprisingly poor single-thread performance compared with their specs.
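The test I mean is the usual single-thread sync-write check; a sketch (the device path is a placeholder and the test is destructive, so point it at a scratch disk):

    #!/usr/bin/env python3
    # Single-thread, queue-depth-1 sync writes with fio -- the workload that
    # separates SSDs that look good on paper from ones that actually help an
    # IOP/s-bound RBD/VM cluster. Destroys data on the target device.
    import subprocess

    subprocess.run([
        "fio",
        "--name=single-thread-sync-write",
        "--filename=/dev/sdX",     # placeholder scratch device
        "--ioengine=libaio",
        "--direct=1",
        "--sync=1",
        "--rw=randwrite",
        "--bs=4k",
        "--iodepth=1",
        "--numjobs=1",
        "--runtime=60",
        "--time_based",
    ], check=True)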

For the future, I am considering LVM with dm-cache on SSD. This sounds a bit more flexible than the WAL+DB approach and should also reduce read latency.
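To make that concrete, the rough shape of what I would prototype (untested, and all VG/LV/device names are made up):

    #!/usr/bin/env python3
    # Sketch of putting dm-cache in front of an OSD data LV via LVM.
    # Not something we run today -- just the idea I would test first.
    import subprocess

    cmds = [
        # add the SSD to the volume group holding the OSD data LV
        ["vgextend", "vg_osd0", "/dev/nvme0n1"],
        # carve a cache pool out of the SSD
        ["lvcreate", "--type", "cache-pool", "-L", "100G",
         "-n", "osd0_cache", "vg_osd0", "/dev/nvme0n1"],
        # attach it to the data LV; hot blocks are then served from the SSD
        ["lvconvert", "--type", "cache", "--cachepool",
         "vg_osd0/osd0_cache", "vg_osd0/osd0_data"],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=True)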

Finally, yes, on our ceph fs we will accumulate a lot of cold data. It's going to be the dump yard. This means we will eventually get really good performance for the small amount of warm/hot data once the cluster grows enough.

Hope that answered your questions.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
Sent: 04 May 2020 04:21
To: Frank Schilder
Cc: David; ceph-users
Subject: Re: What's the best practice for Erasure Coding

Hi Frank,

Reviving this old thread to ask whether the performance on these raw NL-SAS drives is adequate. Is this a deep archive with almost no retrieval, and how many drives are used? In my experience with large parallel writes, WAL/DB devices with bluestore, or SSD journal drives with filestore, have always been needed to sustain a reasonably consistent transfer rate.
Very much appreciate any reference info as to your design.

Best regards,
Alex

On Mon, Jul 8, 2019 at 4:30 AM Frank Schilder <frans@xxxxxx> wrote:
Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals collocated on the same disk as the data. The disks are spinning NL-SAS. Our goal was to build storage at the lowest cost, therefore all data is on HDD only. I got a few SSDs that I'm using for FS and RBD metadata. All large pools are EC on spinning disk.

I spent at least one month running detailed benchmarks (rbd bench) depending on EC profile, object size, write size, etc. Results varied a lot. My advice would be to run benchmarks with your own hardware. If there were a single perfect choice, there wouldn't be so many options. For example, my tests will not be valid when using separate fast disks for WAL and DB.

There are some results, though, that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is probably the network limit rather than the disk limit. IOP/s get better with more disks, but are way lower than what replicated pools can provide. On a cephfs with an EC data pool, small-file IO will be comparatively slow and eat a lot of resources.

2) I observe massive network traffic amplification for small IO sizes, which is due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We have 10G infrastructure and use 2x10G for the client network and 4x10G for the OSD network. The OSD network should have at least 2x the bandwidth of the client network, better 4x or more.

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other choices were poor. The value of m seems irrelevant for performance. Larger k will require more failure domains (more hardware).
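For testing different profiles, the setup per profile was essentially this (profile/pool names and PG counts are placeholders; with k=8, m=2 you need at least 10 failure domains):

    #!/usr/bin/env python3
    # Sketch: create an 8+2 jerasure profile and an EC pool to benchmark.
    import subprocess

    cmds = [
        ["ceph", "osd", "erasure-code-profile", "set", "ec-8-2",
         "k=8", "m=2", "plugin=jerasure", "crush-failure-domain=host"],
        ["ceph", "osd", "pool", "create", "ec-8-2-data", "1024", "1024",
         "erasure", "ec-8-2"],
        # needed for RBD and cephfs data on an EC pool
        ["ceph", "osd", "pool", "set", "ec-8-2-data",
         "allow_ec_overwrites", "true"],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=True)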

4) object size matters

The best throughput (1M write size) I see with object sizes of 4MB or 8MB; IOP/s get somewhat better with smaller object sizes, but throughput drops fast. I use the default of 4MB in production. It works well for us.
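The object size is set per RBD image, so the benchmark images were created along these lines (pool and image names are placeholders; the EC pool is attached with --data-pool while the image header lives in a small replicated pool):

    #!/usr/bin/env python3
    # Sketch: create test images with different object sizes on the EC pool.
    import subprocess

    for obj_size in ["1M", "4M", "8M", "16M"]:
        subprocess.run([
            "rbd", "create",
            "--size", "100G",
            "--object-size", obj_size,
            "--data-pool", "ec-8-2-data",     # EC data pool
            f"rbd-meta/bench-{obj_size}",     # image in a replicated pool
        ], check=True)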

5) jerasure is quite good and seems most flexible

jerasure is quite CPU efficient and can handle smaller chunk sizes than other plugins, which is preferable for IOP/s. However, CPU usage can become a problem, and a plugin optimized for specific values of k and m might help here. Under usual circumstances I see very low load on all OSD hosts, even under rebalancing. However, I remember that once I needed to rebuild something on all OSDs (I don't remember what it was, sorry). In this situation, CPU load went up to 30-50% (meaning up to half the cores were at 100%), which is really high considering that each server has only 16 disks at the moment and is sized to handle up to 100. CPU power could become a bottleneck for us in the future.

These are some general observations and do not replace benchmarks for specific use cases. I was hunting for a specific performance pattern, which might not be what you want to optimize for. I would recommend running extensive benchmarks if you have to live with a configuration for a long time, since the EC profile of an existing pool cannot be changed.

We settled on 8+2 and 6+2 pools with jerasure and a 4M object size. We also use bluestore compression. All metadata pools are on SSD; only very little SSD space is required. This choice works well for the majority of our use cases, and we can still build small, expensive pools to accommodate special performance requests.
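Compression is just a per-pool setting; for reference (pool name, mode and algorithm are placeholders, pick what your data and CPUs can afford):

    #!/usr/bin/env python3
    # Sketch: enable bluestore compression on an EC data pool.
    import subprocess

    POOL = "ec-8-2-data"
    for key, value in [
        ("compression_mode", "aggressive"),    # compress whatever the OSD can
        ("compression_algorithm", "snappy"),   # cheap on CPU; lz4/zlib/zstd also exist
    ]:
        subprocess.run(["ceph", "osd", "pool", "set", POOL, key, value],
                       check=True)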

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of David <xiaomajia.st@xxxxxxxxx>
Sent: 07 July 2019 20:01:18
To: ceph-users@xxxxxxxxxxxxxx
Subject: What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently I have been trying to use Erasure Code pools.
My question is: what's the best practice for using EC pools?
More specifically, which plugin (jerasure, isa, lrc, shec or clay) should I adopt, and how should I choose the combination of (k,m) (e.g. (k=3,m=2) or (k=6,m=3))?

Can anyone share some experience?

Thanks for any help.

Regards,
David
