Re: Write IO Problem

On Tue, 24 Mar 2015 07:56:33 +0100 (CET) Alexandre DERUMIER wrote:

> Hi,
> 
> >>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
> >>
> >>1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s 
> 
> How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and some
> SSDs are pretty slow with dsync.)
> 
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> $ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
> $ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
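(For reference, a quick way to run that journal-style test across a few
block sizes; just a sketch, and it assumes /dev/sdX is a scratch SSD with
nothing on it you care about:)
---
# WARNING: writes directly to the raw device
for bs in 4k 64k 1M; do
    echo "bs=$bs"
    dd if=randfile of=/dev/sdX bs=$bs count=10000 oflag=direct,dsync 2>&1 | tail -1
done
---
The 4k dsync line is the one that matters for a filestore journal; SSDs
that look fine with plain fdatasync can drop to a few hundred iops here,
which is exactly what the linked article is about.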
> 
> 
> 
> >>When I benchmark the cluster with “rbd bench-write rbd/fio” I get
> >>pretty good results: elapsed: 18 ops: 262144 ops/sec: 14466.30
> >>bytes/sec: 59253946.11 
> 
> these results seem strange.
> 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op????)
>
No, 262144 ops total in 18 seconds, i.e. about 14,500 ops/s, matching the
reported ops/sec; the 59253946.11 is bytes/sec (14466.30 x 4096, so 4KB writes).

"rbd bench-write" is clearly doing something VERY differently from "rados
bench" (and given its output was also written by somebody else), maybe some
Ceph dev can enlighten us?

On my production cluster:
---
# rbd bench-write rbd/fio
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     33872  33844.52  147291343.43
    2     66580  33269.52  144790497.45
    3     99235  33078.06  143956816.36
    4    130755  32686.56  142252741.07
    5    162499  32498.23  141432978.13
    6    193987  32329.15  140696998.44
    7    226440  32343.08  140757971.93
elapsed:     7  ops:   246723  ops/sec: 32064.33  bytes/sec: 139544931.69
---

Doing the same with rados bench gives us the expected ~1300 IOPS for this
cluster, which matches what I see from inside a VM as well:
---
# rados -p rbd bench 8 write -t 16 -b 4096
 Maintaining 16 concurrent writes of 4096 bytes for up to 8 seconds or 0 objects
 Object prefix: benchmark_data_comp-01_6105
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16      1926      1910   7.45896   7.46094  0.009811 0.00831868
     2      15      3944      3929   7.67216   7.88672  0.005375 0.00811645
     3      16      5346      5330   6.93867   5.47266  0.006246 0.00870079
     4      16      5800      5784   5.64732   1.77344   0.00561 0.00894914
     5      16      6649      6633   5.18109   3.31641  0.006165 0.0120499
     6      16      7382      7366   4.79472   2.86328  0.006601 0.0130159
     7      16      7980      7964   4.44342   2.33594  0.007786 0.0140532
     8      16      9308      9292   4.53638    5.1875   0.00839 0.0137575
 Total time run:         8.007909
Total writes made:      9308
Write size:             4096
Bandwidth (MB/sec):     4.540 

Stddev Bandwidth:       2.64905
---
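For a third data point, a 4k randwrite fio run with the rbd engine should
land much closer to the rados bench numbers than to what rbd bench-write
prints. A sketch (parameters are only an illustration and assume the
client.admin keyring is available):
---
fio --name=rbd_4k_check --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio --invalidate=0 --rw=randwrite --bs=4k --iodepth=16 \
    --runtime=30 --time_based
---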

Christian

> BTW, I never see big write ops/s with ceph without a big cluster and
> big CPUs.
> 
> 
> 
> About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 /
> sequential I/O, so here network latency makes the difference. (But the ceph
> team is also working to optimize that, with the async messenger for example.)
> That's why you'll get more iops with fio, with more jobs / bigger iodepth.
> 
> 
> 
> If you use a full SSD setup, you should use at least Giant, because of the
> sharding feature. With Firefly, OSD daemons don't scale well across
> multiple cores.
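For what it's worth, the sharded op queue that Giant introduces is also
tunable; a sketch of the relevant [osd] knobs (the values here are only
examples, not a recommendation):
---
[osd]
# sharded op work queue: number of shards and threads per shard
osd op num shards = 5
osd op num threads per shard = 2
---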
> 
> Also, from my tests, writes use a lot more CPU than reads. (It can be CPU
> bound on 3 nodes with 8-core Xeon E5 1.7GHz, replication x3, at 10000 4k
> randwrites/s.)
> 
> 
> 
> Also, disabling cephx auth and debug logging helps to get more iops.
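In ceph.conf terms that is roughly the following; a sketch only, and note
that disabling cephx removes authentication entirely, so only do it on a
trusted network:
---
[global]
auth cluster required = none
auth service required = none
auth client required = none
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
---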
> 
> 
> If your workload is mainly sequential, enabling rbd_cache will help for
> writes by coalescing adjacent block requests into fewer (but bigger) ops,
> so less CPU.
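A minimal client-side sketch for that, assuming librbd consumers such as
qemu or fio (the kernel rbd driver does not use these settings):
---
[client]
rbd cache = true
rbd cache writethrough until flush = true
# sizing below is only an example
rbd cache size = 67108864
rbd cache max dirty = 50331648
---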
> 
> 
> Alexandre
> 
> 
> ----- Original Message -----
> From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxx>
> Sent: Friday, 20 March 2015 15:13:19
> Subject: Write IO Problem
> 
> 
> 
> Hi, 
> 
> 
> 
> We have a huge write IO problem in our preproduction Ceph cluster. First,
> our hardware: 
> 
> 
> 
> 4 OSD Nodes with: 
> 
> 
> 
> Supermicro X10 Board 
> 
> 32GB DDR4 RAM 
> 
> 2x Intel Xeon E5-2620 
> 
> LSI SAS 9300-8i Host Bus Adapter 
> 
> Intel Corporation 82599EB 10-Gigabit 
> 
> 2x Intel SSDSA2CT040G3 in software raid 1 for system 
> 
> 
> 
> Disks: 
> 
> 2x Samsung EVO 840 1TB 
> 
> 
> 
> So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk, only
> nodiratime added) 
> 
> 
> 
> Benchmarking one disk alone gives good values: 
> 
> 
> 
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc 
> 
> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s 
> 
> 
> 
> Fio 8k libaio depth=32: 
> 
> write: io=488184KB, bw=52782KB/s, iops=5068 , runt= 9249msec 
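(The exact fio job isn't shown; presumably something along these lines was
run against the SSD, or a file on it, with the device name below being a
placeholder:)
---
fio --name=ssd_8k --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=8k --iodepth=32 --runtime=30 --time_based
---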
> 
> 
> 
> Here is our ceph.conf (pretty much standard): 
> 
> 
> 
> [global] 
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d 
> mon initial members = cephasp41,ceph-monitor41 
> mon host = 172.30.10.15,172.30.10.19 
> public network = 172.30.10.0/24 
> cluster network = 172.30.10.0/24 
> auth cluster required = cephx 
> auth service required = cephx 
> auth client required = cephx 
> 
> #Default is 1GB, which is fine for us 
> #osd journal size = {n} 
> 
> #Only needed if ext4 comes into play 
> #filestore xattr use omap = true 
> 
> osd pool default size = 3     # Write an object n times. 
> osd pool default min size = 2 # Allow writing n copies in a degraded state. 
> 
> #Set individually per pool by a formula 
> #osd pool default pg num = {n} 
> #osd pool default pgp num = {n} 
> #osd crush chooseleaf type = {n} 
> 
> 
> 
> 
> 
> When I benchmark the cluster with “rbd bench-write rbd/fio” I get pretty
> good results: 
> 
> elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11 
> 
> 
> 
> If I instead bench with fio using the rbd engine, for example, I get very
> poor results: 
> 
> 
> 
> [global] 
> ioengine=rbd 
> clientname=admin 
> pool=rbd 
> rbdname=fio 
> invalidate=0 # mandatory 
> rw=randwrite 
> bs=512k 
> 
> [rbd_iodepth32] 
> iodepth=32 
> 
> 
> 
> RESULTS: 
> 
> write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec 
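(Note that at bs=512k those numbers are internally consistent: 105 ops/s x
512 KB is about 53760 KB/s, i.e. the reported bw=53896KB/s. The low iops
figure is largely a function of the 512k block size, so it is not directly
comparable to the 4k-based rbd bench-write output above.)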
> 
> 
> 
> Also, if I map the rbd with the kernel client as rbd0, format it with ext4
> and then do a dd on it, it's not that good: 
> 
> “dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc” 
> 
> RESULT: 
> 
> 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s 
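(dd through ext4 on the mapped device is again a single-threaded,
queue-depth-1 test. To take the filesystem out of the picture one could
run fio against /dev/rbd0 directly, e.g. the sketch below; note that it
overwrites whatever is on the image:)
---
fio --name=krbd_4k --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=30 --time_based
---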
> 
> 
> 
> I also tried presenting an rbd image with tgtd, mounting it on VMware ESXi
> and testing it in a VM; there I got only around 50 iops at 4k, and
> sequential writes of about 25 MB/s. 
> 
> With NFS the sequential read values are good (400 MB/s), but writes are
> only 25 MB/s. 
> 
> 
> 
> What I tried tweaking so far: 
> 
> 
> 
> Intel NIC optimizations: 
> 
> /etc/sysctl.conf 
> 
> 
> 
> # Increase system file descriptor limit 
> fs.file-max = 65535 
> 
> # Increase system IP port range to allow for more concurrent connections 
> net.ipv4.ip_local_port_range = 1024 65000 
> 
> # -- 10gbe tuning from Intel ixgb driver README -- # 
> 
> # turn off selective ACK and timestamps 
> net.ipv4.tcp_sack = 0 
> net.ipv4.tcp_timestamps = 0 
> 
> # memory allocation min/pressure/max. 
> # read buffer, write buffer, and buffer space 
> net.ipv4.tcp_rmem = 10000000 10000000 10000000 
> net.ipv4.tcp_wmem = 10000000 10000000 10000000 
> net.ipv4.tcp_mem = 10000000 10000000 10000000 
> 
> net.core.rmem_max = 524287 
> net.core.wmem_max = 524287 
> net.core.rmem_default = 524287 
> net.core.wmem_default = 524287 
> net.core.optmem_max = 524287 
> net.core.netdev_max_backlog = 300000 
> 
> 
> 
> AND 
> 
> 
> 
> setpci -v -d 8086:10fb e6.b=2e 
> 
> 
> 
> 
> 
> Setting tunables to firefly: 
> 
> ceph osd crush tunables firefly 
> 
> 
> 
> Setting the scheduler to noop: 
> 
> This basically stopped IO on the cluster, and I had to revert it and
> restart some of the OSDs that had stuck requests. 
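(For reference, the scheduler switch and the revert are per-device sysfs
writes along these lines, with sdX standing in for each OSD disk:)
---
cat /sys/block/sdX/queue/scheduler               # show current/available
echo noop > /sys/block/sdX/queue/scheduler       # what was tried
echo deadline > /sys/block/sdX/queue/scheduler   # revert (deadline or cfq, whichever was in use)
---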
> 
> 
> 
> And I tried moving the monitor from a VM to the hardware where the OSDs
> run. 
> 
> 
> 
> 
> 
> Any suggestions on where to look, or what could cause this problem? 
> 
> (because I can't believe you're losing that much performance through Ceph
> replication) 
> 
> 
> 
> Thanks in advance. 
> 
> 
> 
> If you need any info please tell me. 
> 
> 
> 
> 
> Kind regards 
> 
> 
> Jonas Rottmann 
> Systems Engineer 
> 
> FIS-ASP Application Service Providing und 
> IT-Outsourcing GmbH 
> Röthleiner Weg 4 
> D-97506 Grafenrheinfeld 
> Phone: +49 (9723) 9188-568 
> Fax: +49 (9723) 9188-600 
> 
> email: j.rottmann@xxxxxxxxxx web: www.fis-asp.de 
> 
> Managing Director: Robert Schuhmann 
> Court of registration: Schweinfurt, HRB 3865 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




