On 07/04/2014 04:13 AM, Marco Allevato wrote:
> Hello Ceph-Community,
>
> I'm writing here because we have a bad write-performance on our
> Ceph-Cluster of about
>
> _As an overview the technical details of our Cluster:_
>
> 3 x monitoring-Servers; each with 2 x 1 Gbit/s NIC configured as Bond
> (Link Aggregation-Mode)
>
> 5 x datastore-Servers; each with 10 x 4 TB HDDs serving as OSDs; as
> Journal we use a 15 GB LVM on a 256 GB SSD-RAID1; 2 x 10 Gbit/s NIC
> configured as Bond (Link Aggregation-Mode)

What SSDs are you using? If this is just a single pair of typical
consumer-grade 256GB SSDs, you are likely oversubscribing them quite a
bit. A RAID1 of two of them won't achieve anywhere near the sequential
write performance the disks can do. Since SSD failure tends to be kind
of clumpy based on the number of writes, I'm not really fond of putting
SSD journal devices in a RAID1. I'd rather reduce write wear and just
use them individually, with half the journals on each SSD. Better
performance too.

> _ceph.conf_
>
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host = 172.30.30.8,172.30.30.9
> mon_initial_members = monitoring1, monitoring2, monitoring3
> fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
> public network = 172.30.30.0/24
>
> [mon.monitoring1]
> host = monitoring1
> addr = 172.30.30.8:6789
>
> [mon.monitoring2]
> host = monitoring2
> addr = 172.30.30.9:6789
>
> [mon.monitoring3]
> host = monitoring3
> addr = 172.30.30.10:6789
>
> [filestore]
> filestore max sync interval = 10

This is tough to get right, imho. Less frequent synchronization gives
you more time to potentially coalesce write ops, but it may also cause
spikier behaviour, which we like to avoid. It probably requires just
testing and seeing what happens in your setup.

> [osd]
> osd recovery max active = 1
> osd journal size = 15360
> osd op threads = 40
> osd disk threads = 40

Having so many disk threads and op threads probably isn't going to help
you. I'd reduce them unless you've actually tested that this is faster
than the default. In some cases increasing the op threads does improve
performance, but ultimately you probably don't want to go too high. I
often set it to 8, but I suspect it depends on the platform and on how
many CPU cores and disks you have. Basically you want the minimum
number of threads that lets you achieve saturation.

Some other things to consider:

1) Are you using kernel RBD or userland RBD? Have you considered using
RBD cache? This can help dramatically, especially with smaller
sequential writes. In your case you are testing 1MB random writes, so
it may not help as much, but it's still worth trying.

2) If you have few PGs in the pool and *lots* of objects (say many
large volumes, and/or small RBD object sizes), you may benefit from
increasing the filestore merge threshold and filestore split multiple
parameters. These govern how the filestore splits the directories that
objects are stored in under each PG on the OSD. It appears that
increasing the per-directory limits can reduce performance degradation
as the number of objects increases, but it may slow down lookup
operations. Currently the default works out to around 300 objects per
directory, which I believe may be too low. Here's what I like to use:

filestore merge threshold = 40
filestore split multiple = 8

3) You might find that disabling in-memory debugging may help, though
this typically helps more with small random IOs.
debug_lockdep = "0/0"
debug_context = "0/0"
debug_crush = "0/0"
debug_mds = "0/0"
debug_mds_balancer = "0/0"
debug_mds_locker = "0/0"
debug_mds_log = "0/0"
debug_mds_log_expire = "0/0"
debug_mds_migrator = "0/0"
debug_buffer = "0/0"
debug_timer = "0/0"
debug_filer = "0/0"
debug_objecter = "0/0"
debug_rados = "0/0"
debug_rbd = "0/0"
debug_journaler = "0/0"
debug_objectcacher = "0/0"
debug_client = "0/0"
debug_osd = "0/0"
debug_optracker = "0/0"
debug_objclass = "0/0"
debug_filestore = "0/0"
debug_journal = "0/0"
debug_ms = "0/0"
debug_mon = "0/0"
debug_monc = "0/0"
debug_paxos = "0/0"
debug_tp = "0/0"
debug_auth = "0/0"
debug_finisher = "0/0"
debug_heartbeatmap = "0/0"
debug_perfcounter = "0/0"
debug_rgw = "0/0"
debug_hadoop = "0/0"
debug_asok = "0/0"
debug_throttle = "0/0"

> [osd.0]
> host = datastore1
> [osd.1]
> host = datastore1
> [osd.2]
> host = datastore1
> [osd.3]
> host = datastore1
> [osd.4]
> host = datastore1
> [osd.5]
> host = datastore1
> [osd.6]
> host = datastore1
> [osd.7]
> host = datastore1
> [osd.8]
> host = datastore1
> [osd.9]
> host = datastore1
> [osd.10]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.12]
> host = datastore2
> [osd.13]
> host = datastore2
> [osd.14]
> host = datastore2
> [osd.15]
> host = datastore2
> [osd.16]
> host = datastore2
> [osd.17]
> host = datastore2
> [osd.18]
> host = datastore2
> [osd.19]
> host = datastore2
> [osd.20]
> host = datastore3
> [osd.21]
> host = datastore3
> [osd.22]
> host = datastore3
> [osd.23]
> host = datastore3
> [osd.24]
> host = datastore3
> [osd.25]
> host = datastore3
> [osd.26]
> host = datastore3
> [osd.27]
> host = datastore3
> [osd.28]
> host = datastore3
> [osd.29]
> host = datastore3
> [osd.30]
> host = datastore4
> [osd.31]
> host = datastore4
> [osd.32]
> host = datastore4
> [osd.33]
> host = datastore4
> [osd.34]
> host = datastore4
> [osd.35]
> host = datastore4
> [osd.36]
> host = datastore4
> [osd.37]
> host = datastore4
> [osd.38]
> host = datastore4
> [osd.39]
> host = datastore4
> [osd.0]
> host = datastore5
> [osd.40]
> host = datastore5
> [osd.41]
> host = datastore5
> [osd.42]
> host = datastore5
> [osd.43]
> host = datastore5
> [osd.44]
> host = datastore5
> [osd.45]
> host = datastore5
> [osd.46]
> host = datastore5
> [osd.47]
> host = datastore5
> [osd.48]
> host = datastore5

You appear to have osd.0 defined twice, on datastore1 and datastore5.
Not sure if that was a transcription mistake or is actually in the ceph
configuration?

> We have 3 pools:
>
> -> 2 x 1000 pgs with 2 Replicas distributing the data equally to two
> racks (Used for datastore 1-4)

1000 PGs isn't, imho, enough for good distribution over 40 OSDs. I'd up
this to around 4096.
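If you do bump the PG count, something like the following should do it.
This is just a sketch; I'm assuming the replicated pool really is
called "pool1" as the fio path further down suggests, so substitute
your actual pool names. pgp_num needs to be raised along with pg_num,
and keep in mind that you can increase these values but not decrease
them again:

ceph osd pool set pool1 pg_num 4096
ceph osd pool set pool1 pgp_num 4096

Expect some backfilling while the new PGs are created and the data
rebalances, so it's best to do this while the cluster is otherwise
quiet.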
> -> 1 x 100 pgs without replication; data only stored on datastore 5.
> This Pool is used to compare the performance on local disks without
> networking

100 PGs *really* isn't enough. For 10 OSDs I'd at least do 1024.
Remember the filestore split multiple parameter from above? The fewer
PGs you have, the more important this becomes, because you are shoving
more objects into each PG and causing directory splits to happen much
faster. Consider that a 32GB volume is 32768MB and we have 4MB RBD
objects; that's 8192 objects. Split that over 100 PGs and you end up
with around 80 objects per PG on average. Granted, this isn't a
particularly good distribution, so some PGs are going to have more
objects than others. Now in this case you probably haven't hit the
point where directories start splitting, but had you used replication,
bigger volumes, or multiple volumes on multiple clients, you very
easily could have started invoking directory splits on the OSDs. At
some point you have to be careful increasing the number of PGs because
it puts more load on the monitors, but typically this won't be a
problem until you get up closer to 100,000 PGs or more. I've
successfully tested a cluster with 1 million PGs.

> Here are the performance values, which I get using fio-Bench on a 32GB rbd:
>
> _On 1000 pgs-Pool with distribution_
>
> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
> --runtime=60 --name=/dev/rbd/pool1/bench1
>
> fio-2.0.13
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops]
> [eta 00m:00s]
> /dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul
> 4 11:03:52 2014
>   write: io=21071MB, bw=358989KB/s, iops=350, runt= 60104msec
>     slat (usec): min=127, max=8040, avg=511.49, stdev=216.27
>     clat (msec): min=5, max=4018, avg=90.74, stdev=215.83
>      lat (msec): min=6, max=4018, avg=91.25, stdev=215.83
>     clat percentiles (msec):
>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
>      | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
>      | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
>      | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
>      | 99.99th=[ 3556]
>     bw (KB/s)  : min=68210, max=479232, per=100.00%, avg=368399.55,
> stdev=84457.12
>     lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
>     lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
>   cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=21071MB, *aggrb=358989KB/s, minb=358989KB/s,
> maxb=358989KB/s, mint=60104msec, maxt=60104msec*
>
> _On 100 pgs-Pool without distribution:_
>
> WRITE: io=5884.0MB, *aggrb=297953KB/s, minb=297953KB/s,
> maxb=297953KB/s, mint=20222msec, maxt=20222msec*
>
> Do you have any suggestion on how to improve the performance?

Just a general note: do you expect 1MB random writes to be a typical
workload for your cluster?

Having said that, what I'm noticing here is that in the single-node
setup with no replication you get around 300MB/s. If you have a single
SSD RAID1 for your journals, 300MB/s sounds about right for many ~250GB
models. For the 40-OSD pool you are doing 2x replication, so you are
getting an aggregate of about 180MB/s per node, which is quite a bit
lower. On the other hand, the amount of concurrency you have isn't
really very high for that many disks, and you will lose some aggregate
performance when replication is enabled.

The first thing I would do is just try your test with more concurrency.
If you don't have more clients, just increase the number of fio jobs.
I'd also increase the number of PGs and possibly the split multiplier
when you do this!
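For example, something roughly like this (just a sketch: numjobs=4 is
an arbitrary starting point, and I've moved the device into --filename
and added group_reporting so you get one aggregate result; tune as you
like):

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --numjobs=4 --group_reporting \
    --filename=/dev/rbd/pool1/bench1 --name=bench

Even better would be running fio from several client machines against
separate RBD volumes at the same time, since a single client can easily
become the bottleneck itself.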
Next I'd try enabling RBD cache and making a couple of the ceph.conf
tunable tweaks mentioned above. Finally, I would seriously consider
switching the SSD journal configuration so that half of the journals
are on each SSD, placed directly on raw partitions.

> While reading on the internet, typical write-rates should be around
> 800-1000 Mb/sec if using 10 Gbit/s-Connection with a similar setup.

You won't get that unless you have very fast SSDs, or I'm
misunderstanding your SSD journal setup.

> Thanks in advance

Hope this helps!

> --
>
> Marco Allevato
> Projektteam
>
> Network Engineering GmbH
> Maximilianstrasse 93
> D-67346 Speyer
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com