>>No, 262144 ops total in 18 seconds.

Oh, OK ;)

>>"rbd bench-write" is clearly doing something VERY differently from "rados
>>bench" (and given its output was also written by somebody else), maybe some
>>Ceph dev can enlighten us?

Maybe rbd_cache is merging the 4k blocks into 4M RADOS objects?
Does rbd_cache=false change the results?

How many IOPS do you see with "ceph -w"?

----- Original Message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxx>
Cc: "aderumier" <aderumier@xxxxxxxxx>
Sent: Tuesday, 24 March 2015 08:24:23
Subject: Re: Write IO Problem

On Tue, 24 Mar 2015 07:56:33 +0100 (CET) Alexandre DERUMIER wrote:

> Hi,
>
> >>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> >>
> >>1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
>
> How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and
> some SSDs are pretty slow with dsync.)
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> $ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
> $ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>
> >>When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> >>pretty good results: elapsed: 18 ops: 262144 ops/sec: 14466.30
> >>bytes/sec: 59253946.11
>
> These results seem strange.
> 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op????)

No, 262144 ops total in 18 seconds.

"rbd bench-write" is clearly doing something VERY differently from "rados
bench" (and given its output was also written by somebody else), maybe some
Ceph dev can enlighten us?
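As a quick sanity check (an editorial sketch using the numbers quoted above, not part of the original thread), the bench-write summary line is internally consistent once "ops" is read as a running total and "bytes/sec" as ops/sec times the 4k io_size (the default, which the ratio of the two reported figures confirms):

```python
# Numbers from the "rbd bench-write rbd/fio" summary quoted above.
elapsed = 18          # seconds
ops = 262144          # running total, not a rate
io_size = 4096        # bench-write's 4k writes

ops_per_sec = ops / elapsed           # ~14563, close to the reported 14466.30
bytes_per_sec = 14466.30 * io_size    # ~59.25 MB/s, essentially the reported 59253946.11

print(round(ops_per_sec), round(bytes_per_sec))
```

The small gap between 262144/18 and 14466.30 is just rounding in bench-write's own averaging; there is no "0.05 bytes per op" anomaly.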
On my production cluster:
---
# rbd bench-write rbd/fio
bench-write  io_size 4096  io_threads 16  bytes 1073741824  pattern seq
  SEC       OPS   OPS/SEC     BYTES/SEC
    1     33872  33844.52  147291343.43
    2     66580  33269.52  144790497.45
    3     99235  33078.06  143956816.36
    4    130755  32686.56  142252741.07
    5    162499  32498.23  141432978.13
    6    193987  32329.15  140696998.44
    7    226440  32343.08  140757971.93
elapsed:     7  ops:   246723  ops/sec:  32064.33  bytes/sec:  139544931.69
---

Doing the same with rados bench gives the expected ~1300 IOPS for this
cluster, which I can also see from inside a VM:
---
# rados -p rbd bench 8 write -t 16 -b 4096
 Maintaining 16 concurrent writes of 4096 bytes for up to 8 seconds or 0 objects
 Object prefix: benchmark_data_comp-01_6105
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
     0       0         0         0         0         0         -           0
     1      16      1926      1910   7.45896   7.46094  0.009811  0.00831868
     2      15      3944      3929   7.67216   7.88672  0.005375  0.00811645
     3      16      5346      5330   6.93867   5.47266  0.006246  0.00870079
     4      16      5800      5784   5.64732   1.77344  0.00561   0.00894914
     5      16      6649      6633   5.18109   3.31641  0.006165  0.0120499
     6      16      7382      7366   4.79472   2.86328  0.006601  0.0130159
     7      16      7980      7964   4.44342   2.33594  0.007786  0.0140532
     8      16      9308      9292   4.53638   5.1875   0.00839   0.0137575
 Total time run:       8.007909
 Total writes made:    9308
 Write size:           4096
 Bandwidth (MB/sec):   4.540
 Stddev Bandwidth:     2.64905
---

Christian

> BTW, I never see high write ops/s with Ceph without a big cluster and
> big CPUs.
>
> About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 /
> sequential, so here network latency makes the difference. (The Ceph team
> is also working to optimize that, with the async messenger for example.)
> That's why you'll get more IOPS with fio, with more jobs / a bigger
> iodepth.
>
> If you use a full-SSD setup, you should use at least Giant, because of
> the sharding feature. With Firefly, OSD daemons don't scale well across
> multiple cores.
>
> Also, from my tests, writes use a lot more CPU than reads
> (it can be CPU-bound on 3 nodes with 8-core Xeon E5s at 1.7 GHz,
> replication x3, with 10000 4k randwrites).
>
> Also, disabling cephx auth and debug logging helps to get more IOPS.
>
> If your workload is mainly sequential, enabling rbd_cache will help for
> writes by merging coalesced block requests into fewer (but bigger) ops,
> and thus less CPU.
>
> Alexandre
>
> ----- Original Message -----
> From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxx>
> Sent: Friday, 20 March 2015 15:13:19
> Subject: Write IO Problem
>
> Hi,
>
> We have a huge write IO problem in our preproduction Ceph cluster.
> First, our hardware:
>
> 4 OSD nodes, each with:
>
> Supermicro X10 board
> 32GB DDR4 RAM
> 2x Intel Xeon E5-2620
> LSI SAS 9300-8i host bus adapter
> Intel 82599EB 10-Gigabit NIC
> 2x Intel SSDSA2CT040G3 in software RAID 1 for the system
>
> Disks:
> 2x Samsung EVO 840 1TB
>
> So 8 SSDs as OSDs in total, formatted with btrfs (via ceph-disk, only
> nodiratime added).
>
> Benchmarking one disk alone gives good values:
>
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
>
> fio 8k libaio depth=32:
> write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec
>
> Here is our ceph.conf (pretty much standard):
>
> [global]
> fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> mon initial members = cephasp41,ceph-monitor41
> mon host = 172.30.10.15,172.30.10.19
> public network = 172.30.10.0/24
> cluster network = 172.30.10.0/24
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
>
> #Default is 1GB, which is fine for us
> #osd journal size = {n}
>
> #Only needed if ext4 comes into play
> #filestore xattr use omap = true
>
> osd pool default size = 3 # Write an object n times.
> osd pool default min size = 2 # Allow writing n copies in a degraded
> state.
>
> #Set individually per pool by a formula
> #osd pool default pg num = {n}
> #osd pool default pgp num = {n}
> #osd crush chooseleaf type = {n}
>
> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty
> good results:
>
> elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11
>
> If I bench with fio using the rbd engine, however, I get very poor
> results:
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=fio
> invalidate=0 # mandatory
> rw=randwrite
> bs=512k
>
> [rbd_iodepth32]
> iodepth=32
>
> RESULTS:
> write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
>
> Also, if I map the rbd with the kernel client as rbd0, format it with
> ext4 and then do a dd on it, the result is not that good either:
>
> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
>
> RESULT:
> 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
>
> I also tried presenting an rbd image with tgtd, mounting it on VMware
> ESXi, and testing it in a VM; there I got only about 50 IOPS at 4k and
> 25 MB/s sequential writes.
>
> With NFS the sequential read values are good (400 MB/s), but writes
> reach only 25 MB/s.
>
> What I have tried tweaking so far:
>
> Intel NIC optimizations:
> /etc/sysctl.conf
>
> # Increase system file descriptor limit
> fs.file-max = 65535
>
> # Increase system IP port range to allow for more concurrent connections
> net.ipv4.ip_local_port_range = 1024 65000
>
> # -- 10gbe tuning from Intel ixgb driver README -- #
>
> # turn off selective ACK and timestamps
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
>
> # memory allocation min/pressure/max.
> # read buffer, write buffer, and buffer space
> net.ipv4.tcp_rmem = 10000000 10000000 10000000
> net.ipv4.tcp_wmem = 10000000 10000000 10000000
> net.ipv4.tcp_mem = 10000000 10000000 10000000
>
> net.core.rmem_max = 524287
> net.core.wmem_max = 524287
> net.core.rmem_default = 524287
> net.core.wmem_default = 524287
> net.core.optmem_max = 524287
> net.core.netdev_max_backlog = 300000
>
> and
>
> setpci -v -d 8086:10fb e6.b=2e
>
> Setting the CRUSH tunables to firefly:
> ceph osd crush tunables firefly
>
> Setting the I/O scheduler to noop:
> this basically stopped IO on the cluster, and I had to revert it and
> restart some of the OSDs with stuck requests.
>
> I also tried moving the monitor from a VM to the hardware where the
> OSDs run.
>
> Any suggestions where to look, or what could cause this problem? (I
> can't believe we're losing that much performance through Ceph
> replication.)
>
> Thanks in advance.
>
> If you need any info, please tell me.
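Pulling the thread's numbers together (an editorial sketch, not part of the original exchange): Christian's rados bench figures are self-consistent three ways over, and they illustrate the bound being hit here. With a fixed number of in-flight writers, IOPS cannot exceed concurrency divided by per-op latency (Little's law), no matter how fast the SSDs are:

```python
# All figures from Christian's "rados -p rbd bench 8 write -t 16 -b 4096"
# output earlier in the thread.
def iops_ceiling(concurrency, avg_latency_s):
    """Little's law: in-flight ops = throughput x latency."""
    return concurrency / avg_latency_s

iops_from_totals = 9308 / 8.007909               # total writes / total runtime
iops_from_bw = 4.540 * 1024 / 4                  # 4.540 MB/s at 4 KiB per write
iops_from_latency = iops_ceiling(16, 0.0137575)  # 16 writers, final avg latency

# All three agree at roughly 1160 IOPS: the cluster is latency-bound, so a
# single dd stream (1 writer) or a shallow fio queue will see far less.
print(round(iops_from_totals), round(iops_from_bw), round(iops_from_latency))
```

This also suggests why "rbd bench-write" reports 14k+ ops/s while rados bench and the VM see ~1300: if rbd_cache is absorbing and merging the 4k writes, as Alexandre hypothesizes above, the client-side rate stops reflecting per-op replication latency at all.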
>
> Mit freundlichen Grüßen/Kind regards
>
> Jonas Rottmann
> Systems Engineer
>
> FIS-ASP Application Service Providing und
> IT-Outsourcing GmbH
> Röthleiner Weg 4
> D-97506 Grafenrheinfeld
> Phone: +49 (9723) 9188-568
> Fax: +49 (9723) 9188-600
>
> email: j.rottmann@xxxxxxxxxx  web: www.fis-asp.de
>
> Managing Director: Robert Schuhmann
> Commercial register: Schweinfurt HRB 3865
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/