>>No, 262144 ops total in 18 seconds.

Oh, OK ;)

>>"rbd bench-write" is clearly doing something VERY differently from "rados
>>bench" (and given its output was also written by somebody else), maybe some
>>Ceph dev can enlighten us?

Maybe rbd_cache is merging the 4k blocks into 4M RADOS objects?
Does rbd_cache=false change the results?

How many IOPS do you see with "ceph -w"?

----- Original Message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxx>
Cc: "aderumier" <aderumier@xxxxxxxxx>
Sent: Tuesday, 24 March 2015 08:24:23
Subject: Re: Write IO Problem

On Tue, 24 Mar 2015 07:56:33 +0100 (CET) Alexandre DERUMIER wrote:

> Hi,
>
> >>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> >>
> >>1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
>
> How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and
> some SSDs are pretty slow with dsync.)
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> $ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
> $ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>
> >>When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> >>pretty good results: elapsed: 18 ops: 262144 ops/sec: 14466.30
> >>bytes/sec: 59253946.11
>
> These results seem strange.
> 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op????)

No, 262144 ops total in 18 seconds.

"rbd bench-write" is clearly doing something VERY differently from "rados
bench" (and given its output was also written by somebody else), maybe some
Ceph dev can enlighten us?
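As a quick sanity check (an editorial sketch using the numbers quoted above, not part of the original thread), the bench-write summary line is internally consistent once "ops" is read as a running total and "bytes/sec" as ops/sec times the 4k io_size (the default, which the ratio of the two reported figures confirms):

```python
# Numbers from the "rbd bench-write rbd/fio" summary quoted above.
elapsed = 18          # seconds
ops = 262144          # running total, not a rate
io_size = 4096        # bench-write's 4k writes

ops_per_sec = ops / elapsed           # ~14563, close to the reported 14466.30
bytes_per_sec = 14466.30 * io_size    # ~59.25 MB/s, essentially the reported 59253946.11

print(round(ops_per_sec), round(bytes_per_sec))
```

The small gap between 262144/18 and 14466.30 is just rounding in bench-write's own averaging; there is no "0.05 bytes per op" anomaly.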
On my production cluster:
---
# rbd bench-write rbd/fio
bench-write  io_size 4096  io_threads 16  bytes 1073741824  pattern seq
  SEC       OPS   OPS/SEC     BYTES/SEC
    1     33872  33844.52  147291343.43
    2     66580  33269.52  144790497.45
    3     99235  33078.06  143956816.36
    4    130755  32686.56  142252741.07
    5    162499  32498.23  141432978.13
    6    193987  32329.15  140696998.44
    7    226440  32343.08  140757971.93
elapsed:     7  ops:   246723  ops/sec:  32064.33  bytes/sec:  139544931.69
---

Doing the same with rados bench gives the expected ~1300 IOPS for this
cluster, which I can also see from inside a VM:
---
# rados -p rbd bench 8 write -t 16 -b 4096
 Maintaining 16 concurrent writes of 4096 bytes for up to 8 seconds or 0 objects
 Object prefix: benchmark_data_comp-01_6105
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
     0       0         0         0         0         0         -           0
     1      16      1926      1910   7.45896   7.46094  0.009811  0.00831868
     2      15      3944      3929   7.67216   7.88672  0.005375  0.00811645
     3      16      5346      5330   6.93867   5.47266  0.006246  0.00870079
     4      16      5800      5784   5.64732   1.77344  0.00561   0.00894914
     5      16      6649      6633   5.18109   3.31641  0.006165  0.0120499
     6      16      7382      7366   4.79472   2.86328  0.006601  0.0130159
     7      16      7980      7964   4.44342   2.33594  0.007786  0.0140532
     8      16      9308      9292   4.53638   5.1875   0.00839   0.0137575
 Total time run:       8.007909
 Total writes made:    9308
 Write size:           4096
 Bandwidth (MB/sec):   4.540
 Stddev Bandwidth:     2.64905
---

Christian

> BTW, I never see high write ops/s with Ceph without a big cluster and
> big CPUs.
>
> About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 /
> sequential, so here network latency makes the difference. (The Ceph team
> is also working to optimize that, with the async messenger for example.)
> That's why you'll get more IOPS with fio, with more jobs / a bigger
> iodepth.
>
> If you use a full-SSD setup, you should use at least Giant, because of
> the sharding feature. With Firefly, OSD daemons don't scale well across
> multiple cores.
>
> Also, from my tests, writes use a lot more CPU than reads
> (it can be CPU-bound on 3 nodes with 8-core Xeon E5s at 1.7 GHz,
> replication x3, with 10000 4k randwrites).
>
> Also, disabling cephx auth and debug logging helps to get more IOPS.
>
> If your workload is mainly sequential, enabling rbd_cache will help for
> writes by merging coalesced block requests into fewer (but bigger) ops,
> and thus less CPU.
>
> Alexandre
>
> ----- Original Message -----
> From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxx>
> Sent: Friday, 20 March 2015 15:13:19
> Subject: Write IO Problem
>
> Hi,
>
> We have a huge write IO problem in our preproduction Ceph cluster.
> First, our hardware:
>
> 4 OSD nodes, each with:
>
> Supermicro X10 board
> 32GB DDR4 RAM
> 2x Intel Xeon E5-2620
> LSI SAS 9300-8i host bus adapter
> Intel 82599EB 10-Gigabit NIC
> 2x Intel SSDSA2CT040G3 in software RAID 1 for the system
>
> Disks:
> 2x Samsung EVO 840 1TB
>
> So 8 SSDs as OSDs in total, formatted with btrfs (via ceph-disk, only
> nodiratime added).
>
> Benchmarking one disk alone gives good values:
>
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
>
> fio 8k libaio depth=32:
> write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec
>
> Here is our ceph.conf (pretty much standard):
>
> [global]
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> mon initial members = cephasp41,ceph-monitor41
> mon host = 172.30.10.15,172.30.10.19
> public network = 172.30.10.0/24
> cluster network = 172.30.10.0/24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
>
> #Default is 1GB, which is fine for us
> #osd journal size = {n}
>
> #Only needed if ext4 comes into play
> #filestore xattr use omap = true
>
> osd pool default size = 3 # Write an object n times.
> osd pool default min size = 2 # Allow writing n copies in a degraded
> state.
>
> #Set individually per pool by a formula
> #osd pool default pg num = {n}
> #osd pool default pgp num = {n}
> #osd crush chooseleaf type = {n}
>
> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty
> good results:
>
> elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11
>
> If I bench with fio using the rbd engine, however, I get very poor
> results:
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=fio
> invalidate=0 # mandatory
> rw=randwrite
> bs=512k
>
> [rbd_iodepth32]
> iodepth=32
>
> RESULTS:
> write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
>
> Also, if I map the rbd with the kernel client as rbd0, format it with
> ext4 and then do a dd on it, the result is not that good either:
>
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
>
> RESULT:
> 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
>
> I also tried presenting an rbd image with tgtd, mounting it on VMware
> ESXi, and testing it in a VM; there I got only about 50 IOPS at 4k and
> 25 MB/s sequential writes.
>
> With NFS the sequential read values are good (400 MB/s), but writes
> reach only 25 MB/s.
>
> What I have tried tweaking so far:
>
> Intel NIC optimizations:
> /etc/sysctl.conf
>
> # Increase system file descriptor limit
> fs.file-max = 65535
>
> # Increase system IP port range to allow for more concurrent connections
> net.ipv4.ip_local_port_range = 1024 65000
>
> # -- 10gbe tuning from Intel ixgb driver README -- #
>
> # turn off selective ACK and timestamps
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
>
> # memory allocation min/pressure/max.
> # read buffer, write buffer, and buffer space
> net.ipv4.tcp_rmem = 10000000 10000000 10000000
> net.ipv4.tcp_wmem = 10000000 10000000 10000000
> net.ipv4.tcp_mem = 10000000 10000000 10000000
>
> net.core.rmem_max = 524287
> net.core.wmem_max = 524287
> net.core.rmem_default = 524287
> net.core.wmem_default = 524287
> net.core.optmem_max = 524287
> net.core.netdev_max_backlog = 300000
>
> and
>
> setpci -v -d 8086:10fb e6.b=2e
>
> Setting the CRUSH tunables to firefly:
> ceph osd crush tunables firefly
>
> Setting the I/O scheduler to noop:
> this basically stopped IO on the cluster, and I had to revert it and
> restart some of the OSDs with stuck requests.
>
> I also tried moving the monitor from a VM to the hardware where the
> OSDs run.
>
> Any suggestions where to look, or what could cause this problem? (I
> can't believe we're losing that much performance through Ceph
> replication.)
>
> Thanks in advance.
>
> If you need any info, please tell me.
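Pulling the thread's numbers together (an editorial sketch, not part of the original exchange): Christian's rados bench figures are self-consistent three ways over, and they illustrate the bound being hit here. With a fixed number of in-flight writers, IOPS cannot exceed concurrency divided by per-op latency (Little's law), no matter how fast the SSDs are:

```python
# All figures from Christian's "rados -p rbd bench 8 write -t 16 -b 4096"
# output earlier in the thread.
def iops_ceiling(concurrency, avg_latency_s):
    """Little's law: in-flight ops = throughput x latency."""
    return concurrency / avg_latency_s

iops_from_totals = 9308 / 8.007909               # total writes / total runtime
iops_from_bw = 4.540 * 1024 / 4                  # 4.540 MB/s at 4 KiB per write
iops_from_latency = iops_ceiling(16, 0.0137575)  # 16 writers, final avg latency

# All three agree at roughly 1160 IOPS: the cluster is latency-bound, so a
# single dd stream (1 writer) or a shallow fio queue will see far less.
print(round(iops_from_totals), round(iops_from_bw), round(iops_from_latency))
```

This also suggests why "rbd bench-write" reports 14k+ ops/s while rados bench and the VM see ~1300: if rbd_cache is absorbing and merging the 4k writes, as Alexandre hypothesizes above, the client-side rate stops reflecting per-op replication latency at all.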
>
> Mit freundlichen Grüßen/Kind regards
>
> Jonas Rottmann
> Systems Engineer
>
> FIS-ASP Application Service Providing und
> IT-Outsourcing GmbH
> Röthleiner Weg 4
> D-97506 Grafenrheinfeld
> Phone: +49 (9723) 9188-568
> Fax: +49 (9723) 9188-600
>
> email: j.rottmann@xxxxxxxxxx  web: www.fis-asp.de
>
> Managing Director: Robert Schuhmann
> Commercial register: Schweinfurt HRB 3865
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/