Hi,

I think the problem is the rbd device: it's only ONE device.

> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

Try to create e.g. 20 (small) rbd devices, put them all into an LVM volume group and create a striped logical volume (raid0) with 20 stripes and a stripe size of e.g. 1 MB (better bandwidth) or 4 KB (better IO) - or use md-raid0, which is maybe 10% faster but not as flexible (a minimal mdadm sketch follows further below).

# create disks
for i in `seq -f "%02.f" 0 19` ; do rbd create --size 40000 vmware/vol6-$i.dsk ; done

# allow rbd devices in LVM (add to the devices { } section):
emacs -nw /etc/lvm/lvm.conf
    types = [ "rbd", 16 ]

# rbd map - map each image (the original mail only said "rbd map ..."; a loop like the others is assumed):
for i in `seq -f "%02.f" 0 19` ; do rbd map vmware/vol6-$i.dsk ; done

# pvcreate
for i in `seq -f "%02.f" 0 19` ; do pvcreate /dev/rbd/vmware/vol6-$i.dsk ; done

# vgcreate VG
vgcreate VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-00.dsk
for i in `seq -f "%02.f" 1 19` ; do vgextend VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-$i.dsk ; done

# lvcreate raid0
# -i, --stripes Stripes - equal to the number of physical volumes to scatter the logical volume over.
# -I, --stripesize StripeSize - the number of kilobytes for the granularity of the stripes, 2^n (n = 2 to 9) in LVM1 format; LVM2 allows larger powers of 2 such as 1024.
# 20 stripes and a 1 MB stripe size (-I1024)
lvcreate -i20 -I1024 -L700000m -n VmProd06 VG_RBD20x40_VOL6

Now try to run fio against /dev/mapper/VG_RBD20x40_VOL6-VmProd06. I think the performance will be about 10 Gbit/s.
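For reference, the run against the striped LV could look like the sketch below; the options match the original fio command, but I've moved the device path into fio's --filename parameter and used a plain job name ("lvstripe-bench" is my own placeholder):

# check the stripe layout first (the #Str and Stripe columns)
lvs --segments VG_RBD20x40_VOL6

# 1M random writes against the striped LV (note: this overwrites the LV)
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 \
    --name=lvstripe-bench --filename=/dev/mapper/VG_RBD20x40_VOL6-VmProd06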
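And if you go the md-raid0 route instead, a minimal sketch, assuming the same 20 mapped rbd devices; the /dev/md0 name and the 1 MB chunk are my choices, not tested here:

# raid0 across all 20 rbd devices; --chunk is in KiB, so 1024 = 1 MB
mdadm --create /dev/md0 --level=0 --raid-devices=20 --chunk=1024 \
    $(seq -f "/dev/rbd/vmware/vol6-%02.f.dsk" 0 19)

fio can then be pointed at /dev/md0 in the same way.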
regards
Danny

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Marco Allevato
Sent: Friday, July 04, 2014 11:13 AM
To: ceph-users at lists.ceph.com
Subject: Bad Write-Performance on Ceph/Possible bottlenecks?

Hello Ceph-Community,

I'm writing here because we have a bad write-performance on our Ceph-Cluster of about 350 MB/s (see the fio output below).

As an overview, the technical details of our cluster:

3 x monitoring servers; each with 2 x 1 Gbit/s NICs configured as a bond (link-aggregation mode)

5 x datastore servers; each with 10 x 4 TB HDDs serving as OSDs; as journal we use a 15 GB LVM volume on a 256 GB SSD RAID1; 2 x 10 Gbit/s NICs configured as a bond (link-aggregation mode)

ceph.conf

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.30.30.8,172.30.30.9
mon_initial_members = monitoring1, monitoring2, monitoring3
fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
public network = 172.30.30.0/24

[mon.monitoring1]
host = monitoring1
addr = 172.30.30.8:6789

[mon.monitoring2]
host = monitoring2
addr = 172.30.30.9:6789

[mon.monitoring3]
host = monitoring3
addr = 172.30.30.10:6789

[filestore]
filestore max sync interval = 10

[osd]
osd recovery max active = 1
osd journal size = 15360
osd op threads = 40
osd disk threads = 40

[osd.0]
host = datastore1
[osd.1]
host = datastore1
[osd.2]
host = datastore1
[osd.3]
host = datastore1
[osd.4]
host = datastore1
[osd.5]
host = datastore1
[osd.6]
host = datastore1
[osd.7]
host = datastore1
[osd.8]
host = datastore1
[osd.9]
host = datastore1

[osd.10]
host = datastore2
[osd.11]
host = datastore2
[osd.12]
host = datastore2
[osd.13]
host = datastore2
[osd.14]
host = datastore2
[osd.15]
host = datastore2
[osd.16]
host = datastore2
[osd.17]
host = datastore2
[osd.18]
host = datastore2
[osd.19]
host = datastore2

[osd.20]
host = datastore3
[osd.21]
host = datastore3
[osd.22]
host = datastore3
[osd.23]
host = datastore3
[osd.24]
host = datastore3
[osd.25]
host = datastore3
[osd.26]
host = datastore3
[osd.27]
host = datastore3
[osd.28]
host = datastore3
[osd.29]
host = datastore3

[osd.30]
host = datastore4
[osd.31]
host = datastore4
[osd.32]
host = datastore4
[osd.33]
host = datastore4
[osd.34]
host = datastore4
[osd.35]
host = datastore4
[osd.36]
host = datastore4
[osd.37]
host = datastore4
[osd.38]
host = datastore4
[osd.39]
host = datastore4

[osd.40]
host = datastore5
[osd.41]
host = datastore5
[osd.42]
host = datastore5
[osd.43]
host = datastore5
[osd.44]
host = datastore5
[osd.45]
host = datastore5
[osd.46]
host = datastore5
[osd.47]
host = datastore5
[osd.48]
host = datastore5
[osd.49]
host = datastore5

We have 3 pools:

-> 2 x 1000 pgs with 2 replicas, distributing the data equally across two racks (used for datastore 1-4)
-> 1 x 100 pgs without replication; data only stored on datastore5. This pool is used to compare the performance on local disks without networking.

Here are the performance values I get using a fio bench on a 32 GB rbd:

On the 1000-pg pool with distribution:

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops] [eta 00m:00s]
/dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul 4 11:03:52 2014
  write: io=21071MB, bw=358989KB/s, iops=350, runt= 60104msec
    slat (usec): min=127, max=8040, avg=511.49, stdev=216.27
    clat (msec): min=5, max=4018, avg=90.74, stdev=215.83
     lat (msec): min=6, max=4018, avg=91.25, stdev=215.83
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
     | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
     | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
     | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
     | 99.99th=[ 3556]
    bw (KB/s)  : min=68210, max=479232, per=100.00%, avg=368399.55, stdev=84457.12
    lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
    lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
  cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s, maxb=358989KB/s, mint=60104msec, maxt=60104msec

On the 100-pg pool without distribution:

  WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s, maxb=297953KB/s, mint=20222msec, maxt=20222msec

Do you have any suggestions on how to improve the performance? From what I have read on the internet, typical write rates should be around 800-1000 MB/s when using a 10 Gbit/s connection with a similar setup.

Thanks in advance

--
Marco Allevato
Projektteam Network Engineering GmbH
Maximilianstrasse 93
D-67346 Speyer