Re: slow performance: sanity check


 



On 04/06/2017 01:54 PM, Adam Carheden wrote:
60-80MB/s for what sort of setup? Is that 1GbE rather than 10GbE?

60-80MB/s per disk, assuming fairly standard 7200RPM disks, before any replication takes place, and assuming journals are on SSDs with fast O_DSYNC write performance. Any network limitations may decrease that further. The gist of it: start from a fairly standard ~140-150MB/s of sequential throughput per disk, then assume you only get about half of that once metadata writes, flushes, inode seeks, etc. are accounted for, which lands you right in that 60-80MB/s range.
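As a rough back-of-the-envelope version of that estimate (the numbers below are just the assumptions above, not measurements):

    # Per-disk filestore write estimate (illustrative only).
    raw_disk_mb_s = 145            # ~140-150MB/s sequential for a 7200RPM disk
    filestore_overhead = 0.5       # metadata writes, flushes, inode seeks, etc.
    per_osd_mb_s = raw_disk_mb_s * filestore_overhead
    print(per_osd_mb_s)            # ~72MB/s, i.e. the 60-80MB/s per-disk range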


I consistently get 80-90MB/s bandwidth as measured by `rados bench -p
rbd 10 write` run from a Ceph node on a cluster with:
* 3 nodes
* 4 OSD/node, 600GB 15kRPM SAS disks
* 1GB disk controller write cache shared by all disks in each node
* No SSDs
* 2x1GbE LACP bond for redundancy, no jumbo frames
* 512 PGs for a cluster of 12 OSDs
* All disks in one pool of size=3, min_size=2

IOzone run on a VM using an RBD as its disk confirms that setup maxes out
just under 100MB/s in best-case scenarios, so I assumed the 1GbE network
was the bottleneck.

The network is a good guess. With 3 1GbE nodes and 3x replication you aren't going to do any better than ~110MB/s. You are a little below that, but it's in the right ballpark.
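For reference, that ~110MB/s figure is just 1GbE line rate minus protocol overhead; a rough sanity check (the overhead factor is an assumption):

    # 1GbE wire speed vs. usable client throughput (illustrative numbers).
    line_rate_mb_s = 1000 / 8          # 1Gb/s = 125MB/s raw
    protocol_overhead = 0.10           # assumed TCP/IP + Ethernet framing cost
    usable_mb_s = line_rate_mb_s * (1 - protocol_overhead)
    print(usable_mb_s)                 # ~112MB/s, so ~110MB/s is the practical ceiling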


I'm in the process of planning a hardware purchase for a larger cluster:
more nodes, more drives, SSD journals and 10GbE. I'm assuming I'll get
better performance.

You should, but it can be tricky to balance everything out. Figure that 80MB/s per disk (with 7200RPM disks and SSD journals) is the typical upper limit of what to expect with filestore on XFS, and any additional bottlenecks may bring that down. Some folks have started playing with things like Intel's CAS software to potentially improve those numbers through SSD caching, but it's not a typical setup.
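A minimal sketch of how I'd ballpark a design before buying (the cluster parameters here are hypothetical placeholders, not a recommendation):

    # Hypothetical example cluster; every number below is an assumption.
    nodes = 6
    osds_per_node = 8
    per_osd_mb_s = 80                  # filestore-on-XFS ceiling per disk, as above
    nic_mb_s = 10 * 1000 / 8 * 0.9     # usable 10GbE per node, assuming ~10% overhead
    replication = 3

    # Each client byte lands on 'replication' OSDs, so divide aggregate disk and
    # network capacity by the replication factor (very rough; assumes replica
    # traffic shares the same NICs as client traffic).
    disk_limit = nodes * osds_per_node * per_osd_mb_s / replication
    net_limit = nodes * nic_mb_s / replication

    # A single 10GbE client can never see more than its own link, regardless.
    client_limit = nic_mb_s

    print(min(disk_limit, net_limit, client_limit))  # the smallest is the likely ceiling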


What's the upper bound on Ceph performance for large sequential writes
from a single client with all the recommended bells and whistles (SSD
journals, 10GbE)? I assume it depends on both the total number of OSDs
and possibly OSDs per node if one had enough to saturate the network,
correct?

Yep, and that's sort of tough to answer. The fastest single-client performance I've seen was a little over 4GB/s doing 4MB writes to an RBD volume on 16 NVMe OSDs using 40GbE (i.e. maxing it out on the client). If I had enough switch ports to bond the links I could probably have gotten closer to 8GB/s, since the cluster was capable of it.
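That result lines up with the client NIC being the limit; the same kind of rough arithmetic as above (illustrative only):

    # 40GbE line rate vs. the observed single-client throughput.
    line_rate_gb_s = 40 / 8                # 40GbE = 5GB/s raw
    observed_gb_s = 4.0                    # "a little over 4GB/s" with 4MB writes
    print(observed_gb_s / line_rate_gb_s)  # ~0.8: roughly saturating the client link
    print(2 * observed_gb_s)               # two bonded 40GbE links -> ~8GB/s, as noted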

Having said that, there are a *lot* of ways to hurt performance. Red Hat has a reference architecture team that tests various hardware; they might be able to give you a better idea of what works well these days.

Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


