Hi,

I think the problem is the rbd device: it's only ONE device.

> fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

Try to create e.g. 20 (small) rbd devices, put them all into an LVM volume group and create a striped logical volume (raid0) with 20 stripes and a stripe size of e.g. 1 MB (better bandwidth) or 4 KB (better IO) - or use md-raid0, which is maybe 10% faster but not as flexible (a minimal mdadm sketch follows further below).

# create disks
for i in `seq -f "%02.f" 0 19` ; do rbd create --size 40000 vmware/vol6-$i.dsk ; done

# allow rbd devices in LVM (add to the devices { } section):
emacs -nw /etc/lvm/lvm.conf
    types = [ "rbd", 16 ]

# rbd map - map each image (the original mail only said "rbd map ..."; a loop like the others is assumed):
for i in `seq -f "%02.f" 0 19` ; do rbd map vmware/vol6-$i.dsk ; done

# pvcreate
for i in `seq -f "%02.f" 0 19` ; do pvcreate /dev/rbd/vmware/vol6-$i.dsk ; done

# vgcreate VG
vgcreate VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-00.dsk
for i in `seq -f "%02.f" 1 19` ; do vgextend VG_RBD20x40_VOL6 /dev/rbd/vmware/vol6-$i.dsk ; done

# lvcreate raid0
# -i, --stripes Stripes - equal to the number of physical volumes to scatter the logical volume over.
# -I, --stripesize StripeSize - the number of kilobytes for the granularity of the stripes, 2^n (n = 2 to 9) in LVM1 format; LVM2 allows larger powers of 2 such as 1024.
# 20 stripes and a 1 MB stripe size (-I1024)
lvcreate -i20 -I1024 -L700000m -n VmProd06 VG_RBD20x40_VOL6

Now try to run fio against /dev/mapper/VG_RBD20x40_VOL6-VmProd06. I think the performance will be about 10 Gbit/s.
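For reference, the run against the striped LV could look like the sketch below; the options match the original fio command, but I've moved the device path into fio's --filename parameter and used a plain job name ("lvstripe-bench" is my own placeholder):

# check the stripe layout first (the #Str and Stripe columns)
lvs --segments VG_RBD20x40_VOL6

# 1M random writes against the striped LV (note: this overwrites the LV)
fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 \
    --name=lvstripe-bench --filename=/dev/mapper/VG_RBD20x40_VOL6-VmProd06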
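And if you go the md-raid0 route instead, a minimal sketch, assuming the same 20 mapped rbd devices; the /dev/md0 name and the 1 MB chunk are my choices, not tested here:

# raid0 across all 20 rbd devices; --chunk is in KiB, so 1024 = 1 MB
mdadm --create /dev/md0 --level=0 --raid-devices=20 --chunk=1024 \
    $(seq -f "/dev/rbd/vmware/vol6-%02.f.dsk" 0 19)

fio can then be pointed at /dev/md0 in the same way.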
regards
Danny

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Marco Allevato
Sent: Friday, July 04, 2014 11:13 AM
To: ceph-users at lists.ceph.com
Subject: Bad Write-Performance on Ceph/Possible bottlenecks?

Hello Ceph-Community,

I'm writing here because we have a bad write-performance on our Ceph-Cluster of about 350 MB/s (see the fio output below).

As an overview, the technical details of our cluster:

3 x monitoring servers; each with 2 x 1 Gbit/s NICs configured as a bond (link-aggregation mode)

5 x datastore servers; each with 10 x 4 TB HDDs serving as OSDs; as journal we use a 15 GB LVM volume on a 256 GB SSD RAID1; 2 x 10 Gbit/s NICs configured as a bond (link-aggregation mode)

ceph.conf

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.30.30.8,172.30.30.9
mon_initial_members = monitoring1, monitoring2, monitoring3
fsid = 5f22ab94-8d96-48c2-88d3-cff7bad443a9
public network = 172.30.30.0/24

[mon.monitoring1]
host = monitoring1
addr = 172.30.30.8:6789

[mon.monitoring2]
host = monitoring2
addr = 172.30.30.9:6789

[mon.monitoring3]
host = monitoring3
addr = 172.30.30.10:6789

[filestore]
filestore max sync interval = 10

[osd]
osd recovery max active = 1
osd journal size = 15360
osd op threads = 40
osd disk threads = 40

[osd.0]
host = datastore1
[osd.1]
host = datastore1
[osd.2]
host = datastore1
[osd.3]
host = datastore1
[osd.4]
host = datastore1
[osd.5]
host = datastore1
[osd.6]
host = datastore1
[osd.7]
host = datastore1
[osd.8]
host = datastore1
[osd.9]
host = datastore1

[osd.10]
host = datastore2
[osd.11]
host = datastore2
[osd.12]
host = datastore2
[osd.13]
host = datastore2
[osd.14]
host = datastore2
[osd.15]
host = datastore2
[osd.16]
host = datastore2
[osd.17]
host = datastore2
[osd.18]
host = datastore2
[osd.19]
host = datastore2

[osd.20]
host = datastore3
[osd.21]
host = datastore3
[osd.22]
host = datastore3
[osd.23]
host = datastore3
[osd.24]
host = datastore3
[osd.25]
host = datastore3
[osd.26]
host = datastore3
[osd.27]
host = datastore3
[osd.28]
host = datastore3
[osd.29]
host = datastore3

[osd.30]
host = datastore4
[osd.31]
host = datastore4
[osd.32]
host = datastore4
[osd.33]
host = datastore4
[osd.34]
host = datastore4
[osd.35]
host = datastore4
[osd.36]
host = datastore4
[osd.37]
host = datastore4
[osd.38]
host = datastore4
[osd.39]
host = datastore4

[osd.40]
host = datastore5
[osd.41]
host = datastore5
[osd.42]
host = datastore5
[osd.43]
host = datastore5
[osd.44]
host = datastore5
[osd.45]
host = datastore5
[osd.46]
host = datastore5
[osd.47]
host = datastore5
[osd.48]
host = datastore5
[osd.49]
host = datastore5

We have 3 pools:

-> 2 x 1000 pgs with 2 replicas, distributing the data equally across two racks (used for datastore 1-4)
-> 1 x 100 pgs without replication; data only stored on datastore5. This pool is used to compare the performance on local disks without networking.

Here are the performance values I get using a fio bench on a 32 GB rbd:

On the 1000-pg pool with distribution:

fio --bs=1M --rw=randwrite --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --name=/dev/rbd/pool1/bench1

fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/312.0M/0K /s] [0 /312 /0 iops] [eta 00m:00s]
/dev/rbd/pool1/bench1: (groupid=0, jobs=1): err= 0: pid=21675: Fri Jul 4 11:03:52 2014
  write: io=21071MB, bw=358989KB/s, iops=350, runt= 60104msec
    slat (usec): min=127, max=8040, avg=511.49, stdev=216.27
    clat (msec): min=5, max=4018, avg=90.74, stdev=215.83
     lat (msec): min=6, max=4018, avg=91.25, stdev=215.83
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[    9], 10.00th=[   11], 20.00th=[   15],
     | 30.00th=[   21], 40.00th=[   30], 50.00th=[   45], 60.00th=[   63],
     | 70.00th=[   83], 80.00th=[  105], 90.00th=[  129], 95.00th=[  190],
     | 99.00th=[ 1254], 99.50th=[ 1680], 99.90th=[ 2409], 99.95th=[ 2638],
     | 99.99th=[ 3556]
    bw (KB/s)  : min=68210, max=479232, per=100.00%, avg=368399.55, stdev=84457.12
    lat (msec) : 10=9.50%, 20=20.02%, 50=23.56%, 100=24.56%, 250=18.09%
    lat (msec) : 500=1.39%, 750=0.81%, 1000=0.65%, 2000=1.13%, >=2000=0.29%
  cpu          : usr=11.17%, sys=7.46%, ctx=17772, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=21071/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=21071MB, aggrb=358989KB/s, minb=358989KB/s, maxb=358989KB/s, mint=60104msec, maxt=60104msec

On the 100-pg pool without distribution:

  WRITE: io=5884.0MB, aggrb=297953KB/s, minb=297953KB/s, maxb=297953KB/s, mint=20222msec, maxt=20222msec

Do you have any suggestions on how to improve the performance? From what I have read on the internet, typical write rates should be around 800-1000 MB/s when using a 10 Gbit/s connection with a similar setup.

Thanks in advance

--
Marco Allevato
Projektteam Network Engineering GmbH
Maximilianstrasse 93
D-67346 Speyer