On 07/04/2014 11:33 AM, Daniel Schwager wrote:
> Hi,
>
> I think the problem is the rbd device. It's only ONE device.

I fully agree. Ceph excels at parallel performance. You should run
multiple fio instances in parallel on different RBD devices, and even
better on different clients. Then you will see a big difference.

Wido
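(For illustration only, the parallel approach could look roughly like the
sketch below. The pool name "pool1", the image names bench-00..bench-03,
the image count and the 40 GB size are assumptions, not taken from this
thread.)

# create and map a handful of test images
for i in $(seq -f "%02.f" 0 3); do
    rbd create --size 40000 pool1/bench-$i
    rbd map pool1/bench-$i
done

# one fio writer per device, all running in parallel;
# the aggregate bandwidth is the sum of all jobs
for i in $(seq -f "%02.f" 0 3); do
    fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=60 --name=bench-$i --filename=/dev/rbd/pool1/bench-$i &
done
wait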
>> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
>> --runtime=60 --name=/dev/rbd/pool1/bench1
>
> Try to create e.g. 20 (small) rbd devices, put them all into an LVM VG and
> create a striped (RAID-0) logical volume with 20 stripes and a stripe size
> of e.g. 1 MB (better bandwidth) or 4 KB (better IO) - or use md-raid0
> (it's maybe 10% faster, but not as flexible):
>
> # create disks
> for i in `seq -f "%02.f" 0 19` ; do rbd create --size 40000 vmware/vol6-$i.dsk ; done
>
> # allow LVM to accept rbd devices as physical volumes
> emacs -nw /etc/lvm/lvm.conf
>     types = [ "rbd", 16 ]
>
> # rbd map ....
>
> # pvcreate
> for i in `seq -f "%02.f" 0 19` ; do pvcreate /dev/rbd/vmware/vol6-$i.dsk ; done
>
> # vgcreate VG
> vgcreate VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-00.dsk
> for i in `seq -f "%02.f" 1 19` ; do vgextend VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-$i.dsk ; done
>
> # lvcreate raid0
> # -i, --stripes: equal to the number of physical volumes to scatter the
> #    logical volume across
> # -I, --stripesize: the number of kilobytes for the granularity of the
> #    stripes, 2^n (n = 2 to 9)
> # 20 stripes and 1 MB stripe size
> lvcreate -i 20 -I 1024 -L 700000m -n VmProd06 VG_RBD20x40_VOL6
>
> Now, try to run fio against /dev/mapper/VG_RBD20x40_VOL6-VmProd06
>
> I think the performance will be close to 10 Gbit/s.
>
> regards
>
> Danny
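(For reference, the md-raid0 variant mentioned above might look roughly
like the sketch below, reusing the 20 mapped rbd devices from the recipe.
The array name /dev/md/rbd_stripe and the 1024 KiB chunk size are
assumptions.)

# stripe the 20 mapped rbd devices with md instead of LVM
mdadm --create /dev/md/rbd_stripe --level=0 --raid-devices=20 \
      --chunk=1024 /dev/rbd/vmware/vol6-*.dsk

# run the same fio test against the striped device
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --name=md-bench --filename=/dev/md/rbd_stripe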
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of Marco Allevato
> Sent: Friday, July 04, 2014 11:13 AM
> To: ceph-users at lists.ceph.com
> Subject: [ceph-users] Bad Write-Performance on Ceph/Possible bottlenecks?
>
> Hello Ceph-Community,
>
> I'm writing here because we have a bad write-performance on our Ceph
> cluster of about
>
> As an overview, the technical details of our cluster:
>
> 3 x monitoring servers, each with 2 x 1 Gbit/s NIC configured as a bond
> (link-aggregation mode)
>
> 5 x datastore servers, each with 10 x 4 TB HDDs serving as OSDs; as
> journal we use a 15 GB LV on a 256 GB SSD RAID-1; 2 x 10 Gbit/s NIC
> configured as a bond (link-aggregation mode)
>
> ceph.conf:
>
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> mon_host = 172.30.30.8,172.30.30.9
> mon_initial_members = monitoring1, monitoring2, monitoring3
> fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
> public network = 172.30.30.0/24
>
> [mon.monitoring1]
> host = monitoring1
> addr = 172.30.30.8:6789
>
> [mon.monitoring2]
> host = monitoring2
> addr = 172.30.30.9:6789
>
> [mon.monitoring3]
> host = monitoring3
> addr = 172.30.30.10:6789
>
> [filestore]
> filestore max sync interval = 10
>
> [osd]
> osd recovery max active = 1
> osd journal size = 15360
> osd op threads = 40
> osd disk threads = 40
>
> [osd.0]
> host = datastore1
> [osd.1]
> host = datastore1
> [osd.2]
> host = datastore1
> [osd.3]
> host = datastore1
> [osd.4]
> host = datastore1
> [osd.5]
> host = datastore1
> [osd.6]
> host = datastore1
> [osd.7]
> host = datastore1
> [osd.8]
> host = datastore1
> [osd.9]
> host = datastore1
> [osd.10]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.11]
> host = datastore2
> [osd.12]
> host = datastore2
> [osd.13]
> host = datastore2
> [osd.14]
> host = datastore2
> [osd.15]
> host = datastore2
> [osd.16]
> host = datastore2
> [osd.17]
> host = datastore2
> [osd.18]
> host = datastore2
> [osd.19]
> host = datastore2
> [osd.20]
> host = datastore3
> [osd.21]
> host = datastore3
> [osd.22]
> host = datastore3
> [osd.23]
> host = datastore3
> [osd.24]
> host = datastore3
> [osd.25]
> host = datastore3
> [osd.26]
> host = datastore3
> [osd.27]
> host = datastore3
> [osd.28]
> host = datastore3
> [osd.29]
> host = datastore3
> [osd.30]
> host = datastore4
> [osd.31]
> host = datastore4
> [osd.32]
> host = datastore4
> [osd.33]
> host = datastore4
> [osd.34]
> host = datastore4
> [osd.35]
> host = datastore4
> [osd.36]
> host = datastore4
> [osd.37]
> host = datastore4
> [osd.38]
> host = datastore4
> [osd.39]
> host = datastore4
> [osd.0]
> host = datastore5
> [osd.40]
> host = datastore5
> [osd.41]
> host = datastore5
> [osd.42]
> host = datastore5
> [osd.43]
> host = datastore5
> [osd.44]
> host = datastore5
> [osd.45]
> host = datastore5
> [osd.46]
> host = datastore5
> [osd.47]
> host = datastore5
> [osd.48]
> host = datastore5
>
> We have 3 pools:
>
> -> 2 x 1000 pgs with 2 replicas, distributing the data equally across two
>    racks (used for datastore 1-4)
>
> -> 1 x 100 pgs without replication; data only stored on datastore 5.
>    This pool is used to compare the performance on local disks without
>    networking.
>
> Here are the performance values which I get using the fio bench on a 32 GB rbd:
>
> On the 1000 pgs pool with distribution:
>
> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32
> --runtime=60 --name=/dev/rbd/pool1/bench1
>
> fio-2.0.13
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0/312/0 iops] [eta 00m:00s]
> /dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul 4 11:03:52 2014
>   write: io=21071MB, bw=358989KB/s, iops=350, runt=60104msec
>     slat (usec): min=127, max=8040, avg=511.49, stdev=216.27
>     clat (msec): min=5, max=4018, avg=90.74, stdev=215.83
>      lat (msec): min=6, max=4018, avg=91.25, stdev=215.83
>     clat percentiles (msec):
>      |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
>      | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
>      | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
>      | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
>      | 99.99th=[ 3556]
>     bw (KB/s): min=68210, max=479232, per=100.00%, avg=368399.55, stdev=84457.12
>     lat (msec): 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
>     lat (msec): 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
>   cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
>   WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s, maxb=358989KB/s,
>          mint=60104msec, maxt=60104msec
>
> On the 100 pgs pool without distribution:
>
>   WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s, maxb=297953KB/s,
>          mint=20222msec, maxt=20222msec
>
> Do you have any suggestions on how to improve the performance?
>
> From what I have read, typical write rates should be around 800-1000 MB/s
> when using a 10 Gbit/s connection with a similar setup.
>
> Thanks in advance
>
> --
> Marco Allevato
> Projektteam
>
> Network Engineering GmbH
> Maximilianstrasse 93
> D-67346 Speyer
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on