Re: Write IO Problem

On Tue, 24 Mar 2015 08:36:40 +0100 (CET) Alexandre DERUMIER wrote:

> >>No, 262144 ops total in 18 seconds. 
> >>
> Oh ok ;)
> 
> >>"rbd bench-write" is clearly doing something VERY differently from
> >>"rados bench" (and given its output was also written by somebody
> >>else), maybe some Ceph dev can enlighten us? 
> 
> Maybe rbd_cache is merging 4k blocks into 4M RADOS objects?
> Does rbd_cache=false change the results?
> 
Indeed that was it. On a node w/o rbd_cache enabled it sloooooows down to
a crawl.
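(Editor's note: for anyone reproducing this, the write-back cache can be toggled on the client side. A minimal sketch of the relevant ceph.conf fragment; the option lives in the client section of standard Ceph client configuration:)

```ini
[client]
# librbd write-back cache; disable it to see raw per-op latency
# instead of coalesced writes when benchmarking
rbd cache = false
```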

> How many iops do you see with "#ceph -w" ?
>
Very few with the cache enabled, but exactly the bandwidth indicated by
the test.
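(Editor's note: the bench-write summary line is self-consistent once "ops" is read as a cumulative total rather than a rate. A quick shell sanity check, numbers taken from the quoted run and a 4k io_size assumed:)

```shell
ops=262144; secs=18; io_size=4096
# integer ops per second; close to the reported 14466.30
# (the tool divides by the fractional elapsed time)
echo "ops/sec:   $((ops / secs))"
# roughly 59.65 MB/s, in the same ballpark as the reported 59253946.11
echo "bytes/sec: $(( (ops / secs) * io_size ))"
```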

Another mystery solved. ^.^

And another data point for the OP on how to compare these results.
 
Christian

>  
> 
> 
> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxx>
> Cc: "aderumier" <aderumier@xxxxxxxxx>
> Sent: Tuesday, 24 March 2015 08:24:23
> Subject: Re:  Write IO Problem
> 
> 
> On Tue, 24 Mar 2015 07:56:33 +0100 (CET) Alexandre DERUMIER wrote: 
> 
> > Hi, 
> > 
> > >>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
> > >> 
> > >>1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s 
> > 
> > How much do you get with o_dsync? (ceph journal use o_dsync, and some 
> > ssd are pretty slow with dsync) 
> > 
> > http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ 
> > $ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync 
> > $ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
> > 
> > 
> > 
> > >>When I benchmark the cluster with “rbd bench-write rbd/fio” I get 
> > >>pretty good results: elapsed: 18 ops: 262144 ops/sec: 14466.30 
> > >>bytes/sec: 59253946.11 
> > 
> > these results seem strange. 
> > 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op????) 
> > 
> No, 262144 ops total in 18 seconds. 
> 
> "rbd bench-write" is clearly doing something VERY differently from
> "rados bench" (and given its output was also written by somebody else),
> maybe some Ceph dev can enlighten us? 
> 
> On my production cluster: 
> --- 
> # rbd bench-write rbd/fio 
> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq 
> SEC OPS OPS/SEC BYTES/SEC 
> 1 33872 33844.52 147291343.43 
> 2 66580 33269.52 144790497.45 
> 3 99235 33078.06 143956816.36 
> 4 130755 32686.56 142252741.07 
> 5 162499 32498.23 141432978.13 
> 6 193987 32329.15 140696998.44 
> 7 226440 32343.08 140757971.93 
> elapsed: 7 ops: 246723 ops/sec: 32064.33 bytes/sec: 139544931.69 
> --- 
> 
> Doing the same with rados bench gives us the expected ~1300 IOPS for
> this cluster that I can see from inside a VM as well: 
> --- 
> # rados -p rbd bench 8 write -t 16 -b 4096 
> Maintaining 16 concurrent writes of 4096 bytes for up to 8 seconds or 0 objects 
> Object prefix: benchmark_data_comp-01_6105 
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
> 0 0 0 0 0 0 - 0 
> 1 16 1926 1910 7.45896 7.46094 0.009811 0.00831868 
> 2 15 3944 3929 7.67216 7.88672 0.005375 0.00811645 
> 3 16 5346 5330 6.93867 5.47266 0.006246 0.00870079 
> 4 16 5800 5784 5.64732 1.77344 0.00561 0.00894914 
> 5 16 6649 6633 5.18109 3.31641 0.006165 0.0120499 
> 6 16 7382 7366 4.79472 2.86328 0.006601 0.0130159 
> 7 16 7980 7964 4.44342 2.33594 0.007786 0.0140532 
> 8 16 9308 9292 4.53638 5.1875 0.00839 0.0137575 
> Total time run: 8.007909 
> Total writes made: 9308 
> Write size: 4096 
> Bandwidth (MB/sec): 4.540 
> 
> Stddev Bandwidth: 2.64905 
> --- 
> 
> Christian 
> 
> > BTW, I never see big write ops/s with Ceph without a big cluster and 
> > big CPUs. 
> > 
> > 
> > 
> > About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 / 
> > sequential, so here network latency makes the difference. (But the 
> > Ceph team is also working to optimize that, with the async messenger 
> > for example.) That's why you'll get more iops with fio, with more 
> > jobs / a bigger iodepth. 
> > 
> > 
> > 
> > If you use a full SSD setup, you should use at least Giant, because of 
> > the sharding feature. With Firefly, OSD daemons don't scale well 
> > across multiple cores. 
> > 
> > Also, from my tests, writes use a lot more CPU than reads. (Can be CPU 
> > bound on 3 nodes with 8-core Xeon E5 1.7 GHz, replication x3, at 10000 
> > 4k randwrites.) 
> > 
> > 
> > 
> > Also, disabling cephx auth and debug logging helps to get more iops. 
> > 
> > 
> > If your workload is mainly sequential, enabling rbd_cache will help 
> > for writes by coalescing adjacent block requests, so fewer (but 
> > bigger) ops, and thus less CPU. 
> > 
> > 
> > Alexandre 
> > 
> > 
> > ----- Original Message ----- 
> > From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx> 
> > To: "ceph-users" <ceph-users@xxxxxxxx> 
> > Sent: Friday, 20 March 2015 15:13:19 
> > Subject:  Write IO Problem 
> > 
> > 
> > 
> > Hi, 
> > 
> > 
> > 
> > We have a huge write IO problem in our preproduction Ceph cluster. 
> > First, our hardware: 
> > 
> > 
> > 
> > 4 OSD nodes with: 
> > 
> > Supermicro X10 board 
> > 32GB DDR4 RAM 
> > 2x Intel Xeon E5-2620 
> > LSI SAS 9300-8i Host Bus Adapter 
> > Intel Corporation 82599EB 10-Gigabit 
> > 2x Intel SSDSA2CT040G3 in software RAID 1 for the system 
> > 
> > Disks: 
> > 2x Samsung EVO 840 1TB 
> > 
> > So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk; only 
> > added nodiratime) 
> > 
> > 
> > 
> > Benchmarking one disk alone gives good values: 
> > 
> > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
> > 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s 
> > 
> > 
> > 
> > Fio 8k libaio depth=32: 
> > write: io=488184KB, bw=52782KB/s, iops=5068 , runt= 9249msec 
> > 
> > 
> > 
> > Here our ceph.conf (pretty much standard): 
> > 
> > [global] 
> > fsid = 89191a54-740a-46c7-a325-0899ab32fd1d 
> > mon initial members = cephasp41,ceph-monitor41 
> > mon host = 172.30.10.15,172.30.10.19 
> > public network = 172.30.10.0/24 
> > cluster network = 172.30.10.0/24 
> > auth cluster required = cephx 
> > auth service required = cephx 
> > auth client required = cephx 
> > 
> > # Default is 1GB, which is fine for us 
> > #osd journal size = {n} 
> > 
> > # Only needed if ext4 comes into play 
> > #filestore xattr use omap = true 
> > 
> > osd pool default size = 3 # Write an object n times. 
> > osd pool default min size = 2 # Allow writing n copies in a degraded state. 
> > 
> > # Set individually per pool by a formula 
> > #osd pool default pg num = {n} 
> > #osd pool default pgp num = {n} 
> > #osd crush chooseleaf type = {n} 
> > 
> > 
> > 
> > 
> > 
> > When I benchmark the cluster with “rbd bench-write rbd/fio” I get
> > pretty good results: 
> > 
> > elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11 
> > 
> > 
> > 
> > If I instead bench with fio using the rbd engine, I get very poor 
> > results: 
> > 
> > 
> > 
> > [global] 
> > ioengine=rbd 
> > clientname=admin 
> > pool=rbd 
> > rbdname=fio 
> > invalidate=0 # mandatory 
> > rw=randwrite 
> > bs=512k 
> > 
> > [rbd_iodepth32] 
> > iodepth=32 
> > 
> > 
> > 
> > RESULTS: 
> > write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec 
> > 
> > 
> > 
> > Also if I mount the rbd with the kernel client as rbd0, format it 
> > with ext4 and then do a dd on it, it's not that good: 
> > 
> > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
> > 
> > RESULT: 
> > 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s 
> > 
> > 
> > 
> > I also tried exporting an rbd image with tgtd, mounting it on VMware 
> > ESXi and testing it in a VM; there I got only around 50 iops with 4k, 
> > and sequential writes of about 25 MB/s. 
> > 
> > With NFS, sequential read values are good (400 MB/s), but writes reach 
> > only 25 MB/s. 
> > 
> > 
> > 
> > What I tried tweaking so far: 
> > 
> > 
> > 
> > Intel NIC optimizations: 
> > 
> > /etc/sysctl.conf 
> > 
> > 
> > 
> > # Increase system file descriptor limit 
> > fs.file-max = 65535 
> > 
> > # Increase system IP port range to allow for more concurrent connections 
> > net.ipv4.ip_local_port_range = 1024 65000 
> > 
> > # -- 10gbe tuning from Intel ixgb driver README -- # 
> > 
> > # turn off selective ACK and timestamps 
> > net.ipv4.tcp_sack = 0 
> > net.ipv4.tcp_timestamps = 0 
> > 
> > # memory allocation min/pressure/max. 
> > # read buffer, write buffer, and buffer space 
> > net.ipv4.tcp_rmem = 10000000 10000000 10000000 
> > net.ipv4.tcp_wmem = 10000000 10000000 10000000 
> > net.ipv4.tcp_mem = 10000000 10000000 10000000 
> > 
> > net.core.rmem_max = 524287 
> > net.core.wmem_max = 524287 
> > net.core.rmem_default = 524287 
> > net.core.wmem_default = 524287 
> > net.core.optmem_max = 524287 
> > net.core.netdev_max_backlog = 300000 
> > 
> > 
> > 
> > AND 
> > 
> > 
> > 
> > setpci -v -d 8086:10fb e6.b=2e 
> > 
> > 
> > 
> > 
> > 
> > Setting tunables to firefly: 
> > 
> > ceph osd crush tunables firefly 
> > 
> > 
> > 
> > Setting the scheduler to noop: 
> > This basically stopped IO on the cluster, and I had to revert it and 
> > restart some of the OSDs with stuck requests. 
> > 
> > 
> > 
> > And I tried moving the monitor from a VM to the hardware where the 
> > OSDs run. 
> > 
> > 
> > 
> > 
> > 
> > Any suggestions where to look, or what could cause this problem? 
> > (Because I can't believe you lose that much performance through Ceph 
> > replication.) 
> > 
> > 
> > 
> > Thanks in advance. 
> > 
> > 
> > 
> > If you need any info please tell me. 
> > 
> > 
> > 
> > 
> > Mit freundlichen Grüßen/Kind regards 
> > 
> > 
> > Jonas Rottmann 
> > Systems Engineer 
> > 
> > FIS-ASP Application Service Providing und 
> > IT-Outsourcing GmbH 
> > Röthleiner Weg 4 
> > D-97506 Grafenrheinfeld 
> > Phone: +49 (9723) 9188-568 
> > Fax: +49 (9723) 9188-600 
> > 
> > email: j.rottmann@xxxxxxxxxx web: www.fis-asp.de 
> > 
> > Geschäftsführer Robert Schuhmann 
> > Registergericht Schweinfurt HRB 3865 
> > 
> > _______________________________________________ 
> > ceph-users mailing list 
> > ceph-users@xxxxxxxxxxxxxx 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/