Re: ceph cluster performance

Mike Dawson <mike.dawson@xxxxxxxxxxxx> · Wed, 06 Nov 2013 15:25:17 -0500

We just fixed a performance issue on our cluster related to spikes of 
high latency on some of our SSDs used for osd journals. In our case, the 
slow SSDs showed spikes of 100x higher latency than expected.

What SSDs were you using that were so slow?

Cheers,
Mike

On 11/6/2013 12:39 PM, Dinu Vlad wrote:
I'm using the latest 3.8.0 branch from raring. Is there a more recent/better kernel recommended?

Meanwhile, I think I might have identified the culprit: my SSD drives are extremely slow on sync writes, doing 500-600 IOPS max with a 4k block size. By comparison, an Intel 530 in another server (also installed behind a SAS expander) does ~8k IOPS in the same test. I guess I'm good for replacing them.
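For anyone wanting a quick sanity check before reaching for fio, the sync-write measurement described above can be roughly approximated in Python. This is only a sketch under assumptions: it writes to a file on a mounted filesystem rather than to a raw device, the block size and count are arbitrary defaults, and `sync_write_iops` is a made-up helper name. It is not the test that was actually run here, and the numbers will not be directly comparable to fio.

```python
import os
import tempfile
import time


def sync_write_iops(path, block_size=4096, count=200):
    """Write `count` blocks with synchronous I/O and return the achieved IOPS.

    Every os.write() returns only after the data is reported as stable,
    which is what makes slow sync-write devices stand out.
    """
    # O_DSYNC where available, otherwise fall back to O_SYNC (Unix only).
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", os.O_SYNC)
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # The temporary directory is a placeholder; point the path at a
    # filesystem that lives on the SSD under test.
    with tempfile.TemporaryDirectory() as tmp:
        iops = sync_write_iops(os.path.join(tmp, "probe.bin"))
        print(f"{iops:.0f} sync write IOPS")
```

A drive that manages only a few hundred IOPS here while a comparable model reaches several thousand is worth a proper fio run.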

Removing the SSD drives from the setup and re-testing with ceph => 595 MB/s throughput under the same conditions (only mechanical drives, journal on a separate partition on each one, 8 rados bench processes, 16 threads each).

On Nov 5, 2013, at 4:38 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

Ok, some more thoughts:

1) What kernel are you using?

2) Mixing SATA and SAS on an expander backplane can sometimes have bad effects.  We don't really know how bad this is and in what circumstances, but the Nexenta folks have seen problems with ZFS on Solaris and it's not impossible Linux may suffer too:

http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html

3) If you run the tests while watching disk throughput with something like "collectl -sD -oT", do the writes look balanced across the spinning disks?  Do any devices have unusually high service times or queue times?

4) Also, after the test is done, you can try:

find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; > foo

and then grep for "duration" in foo.  You'll get a list of the slowest operations over the last 10 minutes from every osd on the node.  Once you identify a slow duration, you can go back and in an editor search for the slow duration and look at where in the OSD it hung up.  That might tell us more about slow/latent operations.
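The "grep for duration" step can also be scripted so the slowest values are sorted for you. The sketch below assumes only that the dump contains lines of the form "duration": <number>, with or without quotes around the number; it does not rely on the rest of the output layout. The sample text is invented purely to exercise the function, and `slowest_durations` is a hypothetical helper name.

```python
import re

# Matches `"duration": "1.234"` or `"duration": 1.234`.
# The key must be exactly "duration", so other keys that merely
# start with that word are ignored.
DURATION_RE = re.compile(r'"duration":\s*"?([0-9]+(?:\.[0-9]+)?)"?')


def slowest_durations(text, top=5):
    """Return the `top` largest duration values (in seconds) found in `text`."""
    values = [float(m.group(1)) for m in DURATION_RE.finditer(text)]
    return sorted(values, reverse=True)[:top]


# Hypothetical excerpt; real dump_historic_ops output will look different.
sample = '''
  "description": "example op A",
  "duration": "0.412345",
  "description": "example op B",
  "duration": "2.718281",
  "duration": 0.05
'''

print(slowest_durations(sample, top=2))
```

Against the real file this would be called as slowest_durations(open("foo").read()), after which the largest value can be searched for in an editor as described above.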

5) Something interesting here is that I've heard from another party that in a 36 drive Supermicro SC847E16 chassis they had 30 7.2K RPM disks and 6 SSDs on a SAS9207-8i controller and were pushing significantly faster throughput than you are seeing (even given the greater number of drives).  So it's very interesting to me that you are pushing so much less.  The 36 drive supermicro chassis I have with no expanders and 30 drives with 6 SSDs can push about 2100MB/s with a bunch of 9207-8i controllers and XFS (no replication).

Mark

On 11/05/2013 05:15 AM, Dinu Vlad wrote:
Ok, so after tweaking the deadline scheduler and the filestore_wbthrottle* ceph settings I was able to get 440 MB/s from 8 rados bench instances, over a single osd node (pool pg_num = 1800, size = 1)

This still looks awfully slow to me - fio throughput across all disks reaches 2.8 GB/s!!

I'd appreciate any suggestions on where to look for the issue. Thanks!

On Oct 31, 2013, at 6:35 PM, Dinu Vlad <dinuvlad13@xxxxxxxxx> wrote:

I tested the osd performance from a single node. For this purpose I deployed a new cluster (using ceph-deploy, as before) and on fresh/repartitioned drives. I created a single pool, 1800 pgs. I ran the rados bench both on the osd server and on a remote one. Cluster configuration stayed "default", with the same additions about xfs mount & mkfs.xfs as before.

With a single host, the pgs were "stuck unclean" (active only, not active+clean):

# ceph -s
  cluster ffd16afa-6348-4877-b6bc-d7f9d82a4062
   health HEALTH_WARN 1800 pgs stuck unclean
   monmap e1: 3 mons at {cephmon1=10.4.0.250:6789/0,cephmon2=10.4.0.251:6789/0,cephmon3=10.4.0.252:6789/0}, election epoch 4, quorum 0,1,2 cephmon1,cephmon2,cephmon3
   osdmap e101: 18 osds: 18 up, 18 in
    pgmap v1055: 1800 pgs: 1800 active; 0 bytes data, 732 MB used, 16758 GB / 16759 GB avail
   mdsmap e1: 0/0/1 up

Test results:
Local test, 1 process, 16 threads: 241.7 MB/s
Local test, 8 processes, 128 threads: 374.8 MB/s
Remote test, 1 process, 16 threads: 231.8 MB/s
Remote test, 8 processes, 128 threads: 366.1 MB/s

Maybe it's just me, but it seems on the low side too.
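To put "on the low side" into numbers, the best local result can be divided across the 18 OSDs and compared with the per-disk fio figure quoted elsewhere in this thread (103 MB/s for the slowest HDD). This is a back-of-the-envelope sketch, assuming writes are spread evenly over all OSDs; the function name is invented for illustration.

```python
# Figures quoted in this thread.
RADOS_BENCH_MBPS = 374.8    # local test, 8 processes, 128 threads
NUM_OSDS = 18               # "18 osds: 18 up, 18 in"
HDD_RAW_WORST_MBPS = 103    # slowest HDD in the sequential fio run


def per_osd_share(total_mbps, num_osds, raw_mbps):
    """Return (MB/s per OSD, fraction of the raw per-disk throughput)."""
    per_osd = total_mbps / num_osds
    return per_osd, per_osd / raw_mbps


per_osd, fraction = per_osd_share(RADOS_BENCH_MBPS, NUM_OSDS, HDD_RAW_WORST_MBPS)
print(f"{per_osd:.1f} MB/s per OSD, {fraction:.0%} of raw disk speed")
```

Under that even-spread assumption each OSD is seeing roughly a fifth of what its disk delivered under fio.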

Thanks,
Dinu

On Oct 30, 2013, at 8:59 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

On 10/30/2013 01:51 PM, Dinu Vlad wrote:
Mark,

The SSDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/ssd/enterprise-sata-ssd/?sku=ST240FN0021 and the HDDs are http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/constellation/?sku=ST91000640SS.

The chassis is a "SiliconMechanics C602" - but I don't have the exact model. It's based on Supermicro, has 24 slots front and 2 in the back and a SAS expander.

I did a fio test (raw partitions, 4M blocksize, I/O queue depth maxed out according to what the driver reports in dmesg). Here are the results (filtered):

Sequential:
Run status group 0 (all jobs):
  WRITE: io=176952MB, aggrb=2879.0MB/s, minb=106306KB/s, maxb=191165KB/s, mint=60444msec, maxt=61463msec

Individually, the HDDs had best:worst 103:109 MB/s while the SSDs gave 153:189 MB/s
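These per-drive figures allow a rough estimate of what the journals can absorb. The sketch assumes each SSD is shared evenly by 3 journals (18 HDDs over 6 SSDs, as in the setup described in this thread) and that every write passes through the journal in full before reaching the data disk. It uses the sequential fio numbers only, so it says nothing about sync-write behaviour. The helper name is made up.

```python
NUM_HDDS = 18
NUM_SSDS = 6
HDD_MBPS = {"worst": 103, "best": 109}   # sequential fio, per HDD
SSD_MBPS = {"worst": 153, "best": 189}   # sequential fio, per SSD


def journal_share_mbps(ssd_mbps, journals_per_ssd):
    """Bandwidth each journal gets if the SSD is shared evenly."""
    return ssd_mbps / journals_per_ssd


journals_per_ssd = NUM_HDDS // NUM_SSDS  # 3 journals on each SSD

for case in ("worst", "best"):
    share = journal_share_mbps(SSD_MBPS[case], journals_per_ssd)
    ceiling = SSD_MBPS[case] * NUM_SSDS
    print(f"{case}: journal {share:.0f} MB/s per OSD "
          f"vs HDD {HDD_MBPS[case]} MB/s, "
          f"node journal ceiling {ceiling} MB/s")
```

With these inputs each journal gets 51 to 63 MB/s against 103 to 109 MB/s per HDD, so under these assumptions the SSDs, not the spinning disks, set the upper bound, at roughly 918 to 1134 MB/s for the node.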

Ok, that looks like what I'd expect to see given the controller being used.  SSDs are probably limited by total aggregate throughput.

Random:
Run status group 0 (all jobs):
  WRITE: io=106868MB, aggrb=1727.2MB/s, minb=67674KB/s, maxb=106493KB/s, mint=60404msec, maxt=61875msec

Individually (best:worst) HDD 71:73 MB/s, SSD 68:101 MB/s (with only one out of 6 doing 101)

This is on just one of the osd servers.

Were the ceph tests to one OSD server or across all servers?  It might be worth trying tests against a single server with no replication using multiple rados bench instances and just seeing what happens.

Thanks,
Dinu

On Oct 30, 2013, at 6:38 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

On 10/30/2013 09:05 AM, Dinu Vlad wrote:
Hello,

I've been doing some tests on a newly installed ceph cluster:

# ceph osd pool create bench1 2048 2048
# ceph osd pool create bench2 2048 2048
# rbd -p bench1 create test
# rbd -p bench1 bench-write test --io-pattern rand
elapsed:   483  ops:   396579  ops/sec:   820.23  bytes/sec: 2220781.36

# rados -p bench2 bench 300 write --show-time
# (run 1)
Total writes made:      20665
Write size:             4194304
Bandwidth (MB/sec):     274.923

Stddev Bandwidth:       96.3316
Max bandwidth (MB/sec): 748
Min bandwidth (MB/sec): 0
Average Latency:        0.23273
Stddev Latency:         0.262043
Max latency:            1.69475
Min latency:            0.057293
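A quick cross-check of these numbers: with a fixed number of writes in flight, bandwidth is roughly concurrency divided by average latency, times the write size (Little's law). The sketch assumes the run used 16 concurrent operations, which is the rados bench default when -t is not given; the function name is invented for illustration.

```python
CONCURRENT_OPS = 16          # assumption: rados bench default (no -t given)
AVG_LATENCY_S = 0.23273      # "Average Latency" reported above
WRITE_SIZE_BYTES = 4194304   # "Write size" reported above


def expected_bandwidth_mb(concurrency, latency_s, size_bytes):
    """Little's law: ops/sec = ops in flight / latency; scale by op size."""
    ops_per_sec = concurrency / latency_s
    return ops_per_sec * size_bytes / (1024 * 1024)


estimate = expected_bandwidth_mb(CONCURRENT_OPS, AVG_LATENCY_S, WRITE_SIZE_BYTES)
print(f"estimated {estimate:.1f} MB/s vs reported 274.923 MB/s")
```

The estimate lands within a fraction of a MB/s of the reported bandwidth, which suggests a single rados bench instance is bounded by per-operation latency at this concurrency rather than by raw device bandwidth.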

These results seem to be quite poor for the configuration:

MON: dual-cpu Xeon E5-2407 2.2 GHz, 48 GB RAM, 2xSSD for OS
OSD: dual-cpu Xeon E5-2620 2.0 GHz, 64 GB RAM, 2xSSD for OS (on-board controller), 18 HDD 1TB 7.2K rpm SAS for OSD drives and 6 SSDs (SATA) for journal, attached to a LSI 9207-8i controller.
All servers have dual 10GE network cards, connected to a pair of dedicated switches. Each SSD has 3 10 GB partitions for journals.

Agreed, you should see much higher throughput with that kind of storage setup.  What brand/model SSDs are these?  Also, what brand and model of chassis?  With 24 drives and 8 SSDs I could push 2GB/s (no replication though) with a couple of concurrent rados bench processes going on our SC847A chassis, so ~550MB/s aggregate throughput for 18 drives and 6 SSDs is definitely on the low side.

I'm actually not too familiar with what the RBD benchmarking commands are doing behind the scenes.  Typically I've tested fio on top of a filesystem on RBD.

Using ubuntu 13.04, ceph 0.67.4, XFS for backend storage. Cluster was installed using ceph-deploy. ceph.conf pretty much out of the box (diff from default follows)

osd_journal_size = 10240
osd mount options xfs = "rw,noatime,nobarrier,inode64"
osd mkfs options xfs = "-f -i size=2048"

[osd]
public network = 10.4.0.0/24
cluster network = 10.254.254.0/24

All tests were run from a server outside the cluster, connected to the storage network with 2x 10 GE nics.

I've done a few other tests of the individual components:
- network: avg. 7.6 Gbit/s (iperf, mtu=1500), 9.6 Gbit/s (mtu=9000)
- md raid0 write across all 18 HDDs - 1.4 GB/s sustained throughput
- fio SSD write (xfs, 4k blocks, directio): ~ 250 MB/s, ~55K IOPS

What you might want to try doing is 4M direct IO writes using libaio and a high iodepth to all drives (spinning disks and SSDs) concurrently and see what both the per-drive and aggregate throughput are.
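One way to set up such a run is to generate a fio job file with one section per drive, since fio runs all jobs in a file concurrently unless told otherwise. The sketch below is an illustration under assumptions: the device names are placeholders, iodepth=32 and a 60-second runtime are arbitrary choices, and `fio_job` is a made-up helper. Writing to raw devices destroys whatever is on them.

```python
import os


def fio_job(devices, bs="4M", iodepth=32, runtime=60):
    """Build a fio job file: one concurrent sequential-write job per device."""
    lines = [
        "[global]",
        "ioengine=libaio",     # asynchronous I/O engine
        "direct=1",            # bypass the page cache
        "rw=write",            # sequential writes
        f"bs={bs}",
        f"iodepth={iodepth}",
        f"runtime={runtime}",
        "time_based",
    ]
    for dev in devices:
        lines += ["", f"[{os.path.basename(dev)}]", f"filename={dev}"]
    return "\n".join(lines) + "\n"


# Placeholder device names; substitute the real HDDs and SSDs.
print(fio_job(["/dev/sdb", "/dev/sdc"]))
```

Saving the output to a file and passing that file to fio reports per-job throughput, from which the aggregate can be summed.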

With just SSDs, I've been able to push the 9207-8i up to around 3GB/s with Ceph writes (1.5GB/s if you don't count journal writes), but perhaps there is something interesting about the way the hardware is set up on your system.

I'd appreciate any suggestion that might help improve the performance or identify a bottleneck.

Thanks
Dinu

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
