Hi,

>> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
>> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and some SSDs are pretty slow with dsync.)

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

$ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
$ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync

>> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:
>> elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11

These results seem strange: 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op?)

BTW, I never see big write ops/s with Ceph without a really big cluster and big CPUs.

About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 / sequential I/O, so here network latency makes the difference. (The Ceph team is also working to optimize that, with the async messenger for example.) That's why you'll get more IOPS with fio, with more jobs and a bigger iodepth; see the fio sketch below.

If you use a full-SSD setup, you should run at least Giant, because of the sharding feature; with Firefly, OSD daemons don't scale well across multiple cores. Also, from my tests, writes use a lot more CPU than reads (I can be CPU-bound on 3 nodes with 8-core Xeon E5 1.7 GHz, replication x3, at 10000 4k random writes).

Disabling cephx auth and debug logging also helps to get more IOPS.

If your workload is mainly sequential, enabling rbd_cache will help for writes by merging adjacent block requests into fewer (but bigger) ops, and therefore less CPU. A ceph.conf sketch for these last points follows below.
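To illustrate the jobs/iodepth point, a small-block random-write job could look roughly like the sketch below. This is only an illustration, not a tuned job: clientname/pool/rbdname are copied from the fio job file further down in your mail, and the block size, queue depth, job count and runtime are placeholder values.

# sketch of a 4k random-write job with more parallelism than dd
# (all numbers are illustrative; every job writes to the same image)
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
rw=randwrite
bs=4k
runtime=60
time_based
group_reporting

[rbd-4k-randwrite]
iodepth=32
numjobs=4

With several jobs and a deep queue the network round-trip is overlapped instead of being paid once per write, which is why a job like this usually shows far more IOPS than a single-threaded dd.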
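For the cephx/debug/rbd_cache points, the corresponding ceph.conf fragment would look roughly like this. Again only a sketch: which debug subsystems you silence is a matter of taste, disabling cephx is only reasonable on a trusted network, and daemons/clients have to be restarted to pick the changes up.

[global]
# disable authentication (trusted networks only)
auth cluster required = none
auth service required = none
auth client required = none
# silence the noisiest debug logging (not an exhaustive list)
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug auth = 0/0

[client]
# write-back cache in librbd, merges adjacent writes into bigger ops
rbd cache = true
# stays in writethrough until the guest sends its first flush
rbd cache writethrough until flush = true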
Alexandre

----- Original Message -----

From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxx>
Sent: Friday, 20 March 2015 15:13:19
Subject: Write IO Problem

Hi,

We have a huge write IO problem in our preproductive Ceph cluster.

First, our hardware:

4 OSD nodes, each with:
Supermicro X10 board
32GB DDR4 RAM
2x Intel Xeon E5-2620
LSI SAS 9300-8i host bus adapter
Intel Corporation 82599EB 10-Gigabit
2x Intel SSDSA2CT040G3 in software RAID 1 for the system
Disks: 2x Samsung EVO 840 1TB

So 8 SSDs as OSDs in total, formatted with btrfs (via ceph-disk, only nodiratime added).

Benchmarking one disk alone gives good values:

dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

Fio 8k libaio depth=32:
write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec

Here is our ceph.conf (pretty much standard):

[global]
fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
mon initial members = cephasp41,ceph-monitor41
mon host = 172.30.10.15,172.30.10.19
public network = 172.30.10.0/24
cluster network = 172.30.10.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
#Default is 1GB, which is fine for us
#osd journal size = {n}
#Only needed if ext4 comes into play
#filestore xattr use omap = true
osd pool default size = 3     # Write an object n times.
osd pool default min size = 2 # Allow writing n copies in a degraded state.
#Set individually per pool by a formula
#osd pool default pg num = {n}
#osd pool default pgp num = {n}
#osd crush chooseleaf type = {n}

When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:

elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11

If I benchmark with fio using the rbd engine, for example, I get very poor results:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0 # mandatory
rw=randwrite
bs=512k

[rbd_iodepth32]
iodepth=32

RESULTS:
write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec

Also, if I map the rbd with the kernel client as rbd0, format it with ext4 and then run dd on it, the result is not that good either:

dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
RESULT: 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s

I also tried presenting an rbd image with tgtd, mounting it on VMware ESXi and testing it in a VM; there I got only around 50 IOPS with 4k, and about 25 MB/s sequential writes. With NFS the sequential read values are good (400 MB/s), but writes are only 25 MB/s.

What I have tried tweaking so far:

Intel NIC optimizations in /etc/sysctl.conf:

# Increase system file descriptor limit
fs.file-max = 65535
# Increase system IP port range to allow for more concurrent connections
net.ipv4.ip_local_port_range = 1024 65000
# -- 10gbe tuning from Intel ixgb driver README -- #
# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
# memory allocation min/pressure/max.
# read buffer, write buffer, and buffer space
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000

and

setpci -v -d 8086:10fb e6.b=2e

Setting the tunables to firefly:
ceph osd crush tunables firefly

Setting the scheduler to noop: this basically stopped IO on the cluster, and I had to revert it and restart some of the OSDs that had stuck requests.

I also tried moving the monitor from a VM to the hardware where the OSDs run.

Any suggestions where to look, or what could cause this problem? (I can't believe we are losing that much performance through Ceph replication.)

Thanks in advance. If you need any info, please tell me.

Mit freundlichen Grüßen/Kind regards

Jonas Rottmann
Systems Engineer

FIS-ASP Application Service Providing und IT-Outsourcing GmbH
Röthleiner Weg 4
D-97506 Grafenrheinfeld

Phone: +49 (9723) 9188-568
Fax: +49 (9723) 9188-600
email: j.rottmann@xxxxxxxxxx
web: www.fis-asp.de

Geschäftsführer Robert Schuhmann
Registergericht Schweinfurt HRB 3865

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com