Hi,

We have a huge write IO problem in our pre-production Ceph cluster.

First, our hardware: 4 OSD nodes, each with:

  Supermicro X10 board
  32 GB DDR4 RAM
  2x Intel Xeon E5-2620
  LSI SAS 9300-8i host bus adapter
  Intel 82599EB 10-Gigabit NIC
  2x Intel SSDSA2CT040G3 in software RAID 1 for the system
  Disks: 2x Samsung EVO 840 1TB

So in total 8 SSDs as OSDs, formatted with btrfs (via ceph-disk, only nodiratime added).

Benchmarking one disk alone gives good values:

  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
  1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

fio, 8k, libaio, iodepth=32 (roughly the run sketched further below):

  write: io=488184KB, bw=52782KB/s, iops=5068, runt=9249msec

Here is our ceph.conf (pretty much standard):

  [global]
  fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
  mon initial members = cephasp41,ceph-monitor41
  mon host = 172.30.10.15,172.30.10.19
  public network = 172.30.10.0/24
  cluster network = 172.30.10.0/24
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  #Default is 1GB, which is fine for us
  #osd journal size = {n}
  #Only needed if ext4 comes into play
  #filestore xattr use omap = true
  osd pool default size = 3     # Write an object n times.
  osd pool default min size = 2 # Allow writing n copies in a degraded state.
  #Set individually per pool by a formula
  #osd pool default pg num = {n}
  #osd pool default pgp num = {n}
  #osd crush chooseleaf type = {n}

When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:

  elapsed: 18  ops: 262144  ops/sec: 14466.30  bytes/sec: 59253946.11

But if I bench with fio and its rbd engine, I get very poor results:

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio
  invalidate=0 # mandatory
  rw=randwrite
  bs=512k

  [rbd_iodepth32]
  iodepth=32

  RESULT: write: io=2048.0MB, bw=53896KB/s, iops=105, runt=38911msec

Also, if I map the rbd with the kernel client as rbd0, format it with ext4 and then run dd on it (roughly the sequence sketched below), it is not that good either:

  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
  1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s

I also tried presenting an rbd image via tgtd (rough sketch below), mounting it on a VMware ESXi host and testing inside a VM: there I got only around 50 IOPS at 4k, and sequential writes of about 25 MB/s. With NFS the sequential read values are good (400 MB/s), but writes again only reach about 25 MB/s.
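The single-disk fio numbers above only state 8k / libaio / iodepth=32, so here is a rough sketch of such a run; the test file path, size and runtime are just example values, not necessarily the exact ones used:

  # random-write test against a file on the SSD, bypassing the page cache
  fio --name=ssd-8k-randwrite \
      --filename=/mnt/ssd-test/fio-testfile \
      --size=1G --runtime=30 --time_based \
      --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=8k --iodepth=32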
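The kernel rbd test was, roughly, the following sequence (a sketch: pool "rbd" and image "fio" are the ones from the fio job above, the mount point is just an example):

  # map the image through the kernel rbd client; it shows up as /dev/rbd0
  rbd map rbd/fio --id admin
  # put ext4 on it and mount it
  mkfs.ext4 /dev/rbd0
  mkdir -p /mnt/rbd-test
  mount /dev/rbd0 /mnt/rbd-test
  # same dd as on the bare SSD
  cd /mnt/rbd-test
  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc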
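For the iSCSI test, a minimal tgt export of the mapped device looks roughly like this sketch; it assumes the kernel-mapped /dev/rbd0 is exported as the LUN (tgt can also talk to RBD directly if built with its rbd backing store, which is not what is shown here), and the IQN is made up for the example:

  # tgtd must already be running
  tgtadm --lld iscsi --op new --mode target --tid 1 \
      -T iqn.2015-02.local.ceph:rbd0
  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/rbd0
  # allow all initiators (the ESXi host) to connect
  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL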
What I have tried tweaking so far:

Intel NIC optimizations in /etc/sysctl.conf:

  # Increase system file descriptor limit
  fs.file-max = 65535
  # Increase system IP port range to allow for more concurrent connections
  net.ipv4.ip_local_port_range = 1024 65000
  # -- 10gbe tuning from Intel ixgb driver README -- #
  # turn off selective ACK and timestamps
  net.ipv4.tcp_sack = 0
  net.ipv4.tcp_timestamps = 0
  # memory allocation min/pressure/max.
  # read buffer, write buffer, and buffer space
  net.ipv4.tcp_rmem = 10000000 10000000 10000000
  net.ipv4.tcp_wmem = 10000000 10000000 10000000
  net.ipv4.tcp_mem = 10000000 10000000 10000000
  net.core.rmem_max = 524287
  net.core.wmem_max = 524287
  net.core.rmem_default = 524287
  net.core.wmem_default = 524287
  net.core.optmem_max = 524287
  net.core.netdev_max_backlog = 300000

and:

  setpci -v -d 8086:10fb e6.b=2e
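Nothing new in itself, but for completeness: the sysctl settings are reloaded and spot-checked like this, and the PCI register written by the setpci line can be read back the same way:

  # reload /etc/sysctl.conf and verify a few of the values
  sysctl -p /etc/sysctl.conf
  sysctl net.ipv4.tcp_rmem net.core.rmem_max
  # read back the register written above (Intel 82599, offset 0xe6)
  setpci -v -d 8086:10fb e6.b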
Setting the CRUSH tunables to firefly:

  ceph osd crush tunables firefly

Setting the disk scheduler to noop (sketched below): this basically stopped IO on the cluster, and I had to revert it and restart some of the OSDs that had stuck requests.

I also tried moving the monitor from a VM to the hardware where the OSDs run.
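The scheduler change mentioned above was per device via sysfs, roughly like this sketch (the device name is just an example, and what it gets reverted to depends on which scheduler was active before):

  # switch an OSD disk to noop
  echo noop > /sys/block/sdb/queue/scheduler
  # check which scheduler is active (the one in brackets)
  cat /sys/block/sdb/queue/scheduler
  # revert, e.g. back to deadline or cfq
  echo deadline > /sys/block/sdb/queue/scheduler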
Any suggestions where to look, or what could be causing this problem? (I can't believe we are losing that much performance through Ceph replication alone.)

Thanks in advance. If you need any further info, please tell me.

Mit freundlichen Grüßen / Kind regards

Jonas Rottmann