On 07/04/2014 04:13 AM, Marco Allevato wrote:
> Hello Ceph-Community,
>
> I'm writing here because we have a bad write-performance on our
> Ceph-Cluster of about
>
> _As an overview the technical details of our Cluster:_
>
> 3 x monitoring-Servers; each with 2 x 1 Gbit/s NIC configured as Bond
> (Link Aggregation-Mode)
>
> 5 x datastore-Servers; each with 10 x 4 TB HDDs serving as OSDs; as
> Journal we use a 15 GB LVM on a 256 GB SSD-RAID1; 2 x 10 Gbit/s NIC
> configured as Bond (Link Aggregation-Mode)

What SSDs are you using? If this is just a single pair of typical
consumer-grade 256GB SSDs, you are likely oversubscribing them quite a
bit. A RAID1 of two of them won't achieve anywhere near the sequential
write performance the disks can do. Since SSD failure tends to be kind
of clumpy based on the number of writes, I'm not really fond of putting
SSD journal devices in a RAID1. I'd rather reduce write wear and just
use them individually, with half the journals on each SSD. Better
performance too.

> _ceph.conf_
>
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host = 172.30.30.8,172.30.30.9
> mon_initial_members = monitoring1, monitoring2, monitoring3
> fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
> public network = 172.30.30.0/24
>
> [mon.monitoring1]
> host = monitoring1
> addr = 172.30.30.8:6789
>
> [mon.monitoring2]
> host = monitoring2
> addr = 172.30.30.9:6789
>
> [mon.monitoring3]
> host = monitoring3
> addr = 172.30.30.10:6789
>
> [filestore]
> filestore max sync interval = 10

This is tough to get right, imho. Less frequent synchronization gives
you more time to potentially coalesce write ops, but it may also cause
spikier behaviour, which we like to avoid. It probably requires just
testing and seeing what happens in your setup.

> [osd]
> osd recovery max active = 1
> osd journal size = 15360
> osd op threads = 40
> osd disk threads = 40

Having so many disk threads and op threads probably isn't going to help
you. I'd reduce them unless you've actually tested that this is faster
than the default. In some cases increasing the op threads does improve
performance, but ultimately you probably don't want to go too high. I
often set it to 8, but I suspect it depends on the platform and on how
many CPU cores and disks you have. Basically you want the minimum
number of threads that lets you achieve saturation.

Some other things to consider:

1) Are you using kernel RBD or userland RBD? Have you considered using
RBD cache? This can help dramatically, especially with smaller
sequential writes. In your case you are testing 1MB random writes, so
it may not help as much, but it's still worth trying.

2) If you have few PGs in the pool and *lots* of objects (say many
large volumes, and/or small RBD object sizes), you may benefit from
increasing the filestore merge threshold and filestore split multiple
parameters. These govern how the filestore splits the directories that
objects are stored in under each PG on the OSD. It appears that
increasing the per-directory limits can reduce performance degradation
as the number of objects increases, but it may slow down lookup
operations. Currently the default works out to around 300 objects per
directory, which I believe may be too low. Here's what I like to use:

filestore merge threshold = 40
filestore split multiple = 8

3) You might find that disabling in-memory debugging may help, though
this typically helps more with small random IOs.
debug_lockdep = "0/0"
debug_context = "0/0"
debug_crush = "0/0"
debug_mds = "0/0"
debug_mds_balancer = "0/0"
debug_mds_locker = "0/0"
debug_mds_log = "0/0"
debug_mds_log_expire = "0/0"
debug_mds_migrator = "0/0"
debug_buffer = "0/0"
debug_timer = "0/0"
debug_filer = "0/0"
debug_objecter = "0/0"
debug_rados = "0/0"
debug_rbd = "0/0"
debug_journaler = "0/0"
debug_objectcacher = "0/0"
debug_client = "0/0"
debug_osd = "0/0"
debug_optracker = "0/0"
debug_objclass = "0/0"
debug_filestore = "0/0"
debug_journal = "0/0"
debug_ms = "0/0"
debug_mon = "0/0"
debug_monc = "0/0"
debug_paxos = "0/0"
debug_tp = "0/0"
debug_auth = "0/0"
debug_finisher = "0/0"
debug_heartbeatmap = "0/0"
debug_perfcounter = "0/0"
debug_rgw = "0/0"
debug_hadoop = "0/0"
debug_asok = "0/0"
debug_throttle = "0/0"

> [osd.0]
> host = datastore1
> [osd.1]
> host = datastore1
> [osd.2]
> host = datastore1
> [osd.3]
> host = datastore1
> [osd.4]
> host = datastore1
> [osd.5]
> host = datastore1
> [osd.6]
> host = datastore1
> [osd.7]
> host = datastore1
> [osd.8]
> host = datastore1
> [osd.9]
> host = datastore1
> [osd.10]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.12]
> host = datastore2
> [osd.13]
> host = datastore2
> [osd.14]
> host = datastore2
> [osd.15]
> host = datastore2
> [osd.16]
> host = datastore2
> [osd.17]
> host = datastore2
> [osd.18]
> host = datastore2
> [osd.19]
> host = datastore2
> [osd.20]
> host = datastore3
> [osd.21]
> host = datastore3
> [osd.22]
> host = datastore3
> [osd.23]
> host = datastore3
> [osd.24]
> host = datastore3
> [osd.25]
> host = datastore3
> [osd.26]
> host = datastore3
> [osd.27]
> host = datastore3
> [osd.28]
> host = datastore3
> [osd.29]
> host = datastore3
> [osd.30]
> host = datastore4
> [osd.31]
> host = datastore4
> [osd.32]
> host = datastore4
> [osd.33]
> host = datastore4
> [osd.34]
> host = datastore4
> [osd.35]
> host = datastore4
> [osd.36]
> host = datastore4
> [osd.37]
> host = datastore4
> [osd.38]
> host = datastore4
> [osd.39]
> host = datastore4
> [osd.0]
> host = datastore5
> [osd.40]
> host = datastore5
> [osd.41]
> host = datastore5
> [osd.42]
> host = datastore5
> [osd.43]
> host = datastore5
> [osd.44]
> host = datastore5
> [osd.45]
> host = datastore5
> [osd.46]
> host = datastore5
> [osd.47]
> host = datastore5
> [osd.48]
> host = datastore5

You appear to have osd.0 defined twice, on datastore1 and datastore5.
Not sure if that was a transcription mistake or is actually in the ceph
configuration?

> We have 3 pools:
>
> -> 2 x 1000 pgs with 2 Replicas distributing the data equally to two
> racks (Used for datastore 1-4)

1000 PGs isn't, imho, enough for good distribution over 40 OSDs. I'd up
this to around 4096.
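If you do bump the PG count, something like the following should do it.
This is just a sketch; I'm assuming the replicated pool really is
called "pool1" as the fio path further down suggests, so substitute
your actual pool names. pgp_num needs to be raised along with pg_num,
and keep in mind that you can increase these values but not decrease
them again:

ceph osd pool set pool1 pg_num 4096
ceph osd pool set pool1 pgp_num 4096

Expect some backfilling while the new PGs are created and the data
rebalances, so it's best to do this while the cluster is otherwise
quiet.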
> -> 1 x 100 pgs without replication; data only stored on datastore 5.
> This Pool is used to compare the performance on local disks without
> networking

100 PGs *really* isn't enough. For 10 OSDs I'd at least do 1024.
Remember the filestore split multiple parameter from above? The fewer
PGs you have, the more important this becomes, because you are shoving
more objects into each PG and causing directory splits to happen much
faster. Consider that a 32GB volume is 32768MB and we have 4MB RBD
objects; that's 8192 objects. Split that over 100 PGs and you end up
with around 80 objects per PG on average. Granted, this isn't a
particularly good distribution, so some PGs are going to have more
objects than others. Now in this case you probably haven't hit the
point where directories start splitting, but had you used replication,
bigger volumes, or multiple volumes on multiple clients, you very
easily could have started invoking directory splits on the OSDs. At
some point you have to be careful increasing the number of PGs because
it puts more load on the monitors, but typically this won't be a
problem until you get up closer to 100,000 PGs or more. I've
successfully tested a cluster with 1 million PGs.

> Here are the performance values, which I get using fio-Bench on a 32GB rbd:
>
> _On 1000 pgs-Pool with distribution_
>
> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
> --runtime=60 --name=/dev/rbd/pool1/bench1
>
> fio-2.0.13
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops]
> [eta 00m:00s]
> /dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul
> 4 11:03:52 2014
>   write: io=21071MB, bw=358989KB/s, iops=350, runt= 60104msec
>     slat (usec): min=127, max=8040, avg=511.49, stdev=216.27
>     clat (msec): min=5, max=4018, avg=90.74, stdev=215.83
>      lat (msec): min=6, max=4018, avg=91.25, stdev=215.83
>     clat percentiles (msec):
>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
>      | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
>      | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
>      | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
>      | 99.99th=[ 3556]
>     bw (KB/s)  : min=68210, max=479232, per=100.00%, avg=368399.55,
> stdev=84457.12
>     lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
>     lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
>   cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=21071MB, *aggrb=358989KB/s, minb=358989KB/s,
> maxb=358989KB/s, mint=60104msec, maxt=60104msec*
>
> _On 100 pgs-Pool without distribution:_
>
> WRITE: io=5884.0MB, *aggrb=297953KB/s, minb=297953KB/s,
> maxb=297953KB/s, mint=20222msec, maxt=20222msec*
>
> Do you have any suggestion on how to improve the performance?

Just a general note: do you expect 1MB random writes to be a typical
workload for your cluster?

Having said that, what I'm noticing here is that in the single-node
setup with no replication you get around 300MB/s. If you have a single
SSD RAID1 for your journals, 300MB/s sounds about right for many ~250GB
models. For the 40-OSD pool you are doing 2x replication, so you are
getting an aggregate of about 180MB/s per node, which is quite a bit
lower. On the other hand, the amount of concurrency you have isn't
really very high for that many disks, and you will lose some aggregate
performance when replication is enabled.

The first thing I would do is just try your test with more concurrency.
If you don't have more clients, just increase the number of fio jobs.
I'd also increase the number of PGs and possibly the split multiplier
when you do this!
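For example, something roughly like this (just a sketch: numjobs=4 is
an arbitrary starting point, and I've moved the device into --filename
and added group_reporting so you get one aggregate result; tune as you
like):

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --numjobs=4 --group_reporting \
    --filename=/dev/rbd/pool1/bench1 --name=bench

Even better would be running fio from several client machines against
separate RBD volumes at the same time, since a single client can easily
become the bottleneck itself.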
Next I'd try enabling RBD cache and making a couple of the ceph.conf
tunable tweaks mentioned above. Finally, I would seriously consider
switching the SSD journal configuration so that half of the journals
are on each SSD, placed directly on raw partitions.

> While reading on the internet, typical write-rates should be around
> 800-1000 Mb/sec if using 10 Gbit/s-Connection with a similar setup.

You won't get that unless you have very fast SSDs, or I'm
misunderstanding your SSD journal setup.

> Thanks in advance

Hope this helps!

> --
>
> Marco Allevato
> Projektteam
>
> Network Engineering GmbH
> Maximilianstrasse 93
> D-67346 Speyer
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com