On Tue, 24 Mar 2015 08:36:40 +0100 (CET) Alexandre DERUMIER wrote:

> >>No, 262144 ops total in 18 seconds.
> >>
> Oh ok ;)
>
> >>"rbd bench-write" is clearly doing something VERY differently from
> >>"rados bench" (and given its output was also written by somebody
> >>else), maybe some Ceph dev can enlighten us?
>
> Maybe rbd_cache is merging 4k blocks into 4M rados objects?
> Does rbd_cache=false change the results?
>
Indeed, that was it. On a node w/o rbd_cache enabled it slows down to a
crawl.

> How many iops do you see with "#ceph -w"?
>
Very few with the cache enabled, but exactly the bandwidth indicated by
the test.

Another mystery solved. ^.^

And another data point for the OP on how to compare these results.

Christian

> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxx>
> Cc: "aderumier" <aderumier@xxxxxxxxx>
> Sent: Tuesday, 24 March 2015 08:24:23
> Subject: Re: Write IO Problem
>
> On Tue, 24 Mar 2015 07:56:33 +0100 (CET) Alexandre DERUMIER wrote:
>
> > Hi,
> >
> > >>dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> > >>
> > >>1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> >
> > How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and
> > some SSDs are pretty slow with dsync.)
> >
> > http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > $ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
> > $ sudo dd if=randfile of=/dev/sda bs=4k count=100000
> > oflag=direct,dsync
> >
> >
> > >>When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> > >>pretty good results: elapsed: 18 ops: 262144 ops/sec: 14466.30
> > >>bytes/sec: 59253946.11
> >
> > These results seem strange:
> > 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op????)
> >
> No, 262144 ops total in 18 seconds.
>
> "rbd bench-write" is clearly doing something VERY differently from
> "rados bench" (and given its output was also written by somebody else),
> maybe some Ceph dev can enlighten us?
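To reproduce the cache comparison discussed above, one option is to turn the librbd cache off for the benchmark client and re-run the test while watching the cluster op rate. This is a minimal sketch, assuming the benchmark client reads its options from /etc/ceph/ceph.conf:

---
# /etc/ceph/ceph.conf on the benchmark client (assumed path)
[client]
rbd cache = false
---

With the cache disabled, "rbd bench-write rbd/fio" should land much closer to the "rados bench" numbers, and "ceph -w" on a monitor should show the per-second op rate that actually reaches the OSDs.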
> On my production cluster:
> ---
> # rbd bench-write rbd/fio
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>   SEC       OPS   OPS/SEC   BYTES/SEC
>     1     33872  33844.52  147291343.43
>     2     66580  33269.52  144790497.45
>     3     99235  33078.06  143956816.36
>     4    130755  32686.56  142252741.07
>     5    162499  32498.23  141432978.13
>     6    193987  32329.15  140696998.44
>     7    226440  32343.08  140757971.93
> elapsed:     7  ops:   246723  ops/sec: 32064.33  bytes/sec: 139544931.69
> ---
>
> Doing the same with rados bench gives us the expected ~1300 IOPS for
> this cluster, which I can also see from inside a VM:
> ---
> # rados -p rbd bench 8 write -t 16 -b 4096
> Maintaining 16 concurrent writes of 4096 bytes for up to 8 seconds or 0 objects
> Object prefix: benchmark_data_comp-01_6105
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat     avg lat
>     0       0         0         0         0         0         -           0
>     1      16      1926      1910   7.45896   7.46094  0.009811  0.00831868
>     2      15      3944      3929   7.67216   7.88672  0.005375  0.00811645
>     3      16      5346      5330   6.93867   5.47266  0.006246  0.00870079
>     4      16      5800      5784   5.64732   1.77344  0.00561   0.00894914
>     5      16      6649      6633   5.18109   3.31641  0.006165  0.0120499
>     6      16      7382      7366   4.79472   2.86328  0.006601  0.0130159
>     7      16      7980      7964   4.44342   2.33594  0.007786  0.0140532
>     8      16      9308      9292   4.53638   5.1875   0.00839   0.0137575
> Total time run:         8.007909
> Total writes made:      9308
> Write size:             4096
> Bandwidth (MB/sec):     4.540
>
> Stddev Bandwidth:       2.64905
> ---
>
> Christian
>
> > BTW, I never see high write ops/s with Ceph without a big cluster and
> > big CPUs.
> >
> > About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 /
> > sequential I/O, so here network latencies make the difference. (The
> > Ceph team is also working to optimize that, with the async messenger
> > for example.) That's why you'll get more iops with fio, with more jobs
> > and a bigger iodepth.
> >
> > If you use a full-SSD setup, you should use at least Giant, because of
> > the sharding feature. With Firefly, OSD daemons don't scale well across
> > multiple cores.
> >
> > Also, from my tests, writes use a lot more CPU than reads. (I can be
> > CPU-bound on 3 nodes with 8-core Xeon E5 1.7GHz, replication x3, with
> > 10000 4k randwrites.)
> >
> > Disabling cephx auth and debug logging also helps to get more iops.
> >
> > If your workload is mainly sequential, enabling rbd_cache will help
> > for writes by merging coalesced block requests, so fewer (but bigger)
> > ops, and thus less CPU.
> >
> > Alexandre
> >
> > ----- Original Message -----
> > From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
> > To: "ceph-users" <ceph-users@xxxxxxxx>
> > Sent: Friday, 20 March 2015 15:13:19
> > Subject: Write IO Problem
> >
> > Hi,
> >
> > We have a huge write IO problem in our preproductive Ceph cluster.
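As a concrete illustration of the cephx/debug suggestion quoted above, a minimal, test-cluster-only sketch of what that could look like in ceph.conf follows; the exact debug subsystems and values shown here are an assumption, not something specified in the thread:

---
[global]
auth cluster required = none
auth service required = none
auth client required = none
# silence some of the chattier debug subsystems (memory/log levels)
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
---

All daemons and clients must agree on the auth settings, so this is only practical on a cluster whose daemons can be restarted freely.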
> > First, our hardware:
> >
> > 4 OSD nodes, each with:
> >
> > Supermicro X10 board
> > 32GB DDR4 RAM
> > 2x Intel Xeon E5-2620
> > LSI SAS 9300-8i host bus adapter
> > Intel Corporation 82599EB 10-Gigabit
> > 2x Intel SSDSA2CT040G3 in software RAID 1 for the system
> >
> > Disks:
> > 2x Samsung EVO 840 1TB
> >
> > So 8 SSDs in total as OSDs, formatted with btrfs (via ceph-disk, only
> > nodiratime added).
> >
> > Benchmarking one disk alone gives good values:
> >
> > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> > 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s
> >
> > fio 8k libaio depth=32:
> > write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec
> >
> > Here is our ceph.conf (pretty much standard):
> >
> > [global]
> > fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
> > mon initial members = cephasp41,ceph-monitor41
> > mon host = 172.30.10.15,172.30.10.19
> > public network = 172.30.10.0/24
> > cluster network = 172.30.10.0/24
> > auth cluster required = cephx
> > auth service required = cephx
> > auth client required = cephx
> >
> > #Default is 1GB, which is fine for us
> > #osd journal size = {n}
> >
> > #Only needed if ext4 comes into play
> > #filestore xattr use omap = true
> >
> > osd pool default size = 3      # Write an object n times.
> > osd pool default min size = 2  # Allow writing n copies in a degraded state.
> >
> > #Set individually per pool by a formula
> > #osd pool default pg num = {n}
> > #osd pool default pgp num = {n}
> > #osd crush chooseleaf type = {n}
> >
> > When I benchmark the cluster with "rbd bench-write rbd/fio" I get
> > pretty good results:
> >
> > elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11
> >
> > If I instead bench with fio and the rbd engine, I get very poor
> > results:
> >
> > [global]
> > ioengine=rbd
> > clientname=admin
> > pool=rbd
> > rbdname=fio
> > invalidate=0    # mandatory
> > rw=randwrite
> > bs=512k
> >
> > [rbd_iodepth32]
> > iodepth=32
> >
> > RESULTS:
> > write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec
> >
> > Also, if I map the rbd with the kernel client as rbd0, format it with
> > ext4 and then do a dd on it, the result is not that good either:
> >
> > dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
> >
> > RESULT:
> > 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s
> >
> > I also tried presenting an rbd image with tgtd, mounting it on VMware
> > ESXi and testing it in a VM; there I got only about 50 iops with 4k,
> > and about 25 MByte/s sequential writes.
> > With NFS the sequential read values are good (400 MByte/s), but writes
> > reach only 25 MByte/s.
> >
> > What I have tried tweaking so far:
> >
> > Intel NIC optimizations:
> > /etc/sysctl.conf
> >
> > # Increase system file descriptor limit
> > fs.file-max = 65535
> >
> > # Increase system IP port range to allow for more concurrent connections
> > net.ipv4.ip_local_port_range = 1024 65000
> >
> > # -- 10gbe tuning from Intel ixgb driver README -- #
> >
> > # turn off selective ACK and timestamps
> > net.ipv4.tcp_sack = 0
> > net.ipv4.tcp_timestamps = 0
> >
> > # memory allocation min/pressure/max.
> > # read buffer, write buffer, and buffer space
> > net.ipv4.tcp_rmem = 10000000 10000000 10000000
> > net.ipv4.tcp_wmem = 10000000 10000000 10000000
> > net.ipv4.tcp_mem = 10000000 10000000 10000000
> >
> > net.core.rmem_max = 524287
> > net.core.wmem_max = 524287
> > net.core.rmem_default = 524287
> > net.core.wmem_default = 524287
> > net.core.optmem_max = 524287
> > net.core.netdev_max_backlog = 300000
> >
> > AND
> >
> > setpci -v -d 8086:10fb e6.b=2e
> >
> > Setting the CRUSH tunables to firefly:
> > ceph osd crush tunables firefly
> >
> > Setting the I/O scheduler to noop:
> > This basically stopped I/O on the cluster, and I had to revert it and
> > restart some of the OSDs with stuck requests.
> >
> > I also tried moving the monitor from a VM to the hardware where the
> > OSDs run.
> >
> > Any suggestions where to look, or what could cause this problem?
> > (I can't believe that much performance is lost just through Ceph
> > replication.)
> >
> > Thanks in advance.
> >
> > If you need any info, please tell me.
> >
> > Mit freundlichen Grüßen/Kind regards
> >
> > Jonas Rottmann
> > Systems Engineer
> >
> > FIS-ASP Application Service Providing und
> > IT-Outsourcing GmbH
> > Röthleiner Weg 4
> > D-97506 Grafenrheinfeld
> > Phone: +49 (9723) 9188-568
> > Fax: +49 (9723) 9188-600
> >
> > email: j.rottmann@xxxxxxxxxx  web: www.fis-asp.de
> >
> > Geschäftsführer Robert Schuhmann
> > Registergericht Schweinfurt HRB 3865

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
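For an apples-to-apples comparison with the 4k "rados bench ... -t 16 -b 4096" run quoted earlier in this thread, a fio job using the rbd engine with the same 4 KiB writes and 16 in-flight I/Os might look like the sketch below. It is derived from the job file already posted above; the pool, image, and client names are the ones used there, and the 4k/16 values are chosen only to mirror the rados bench invocation:

---
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0        # mandatory for the rbd engine, as noted in the original job file
rw=randwrite
bs=4k

[rbd_iodepth16]
iodepth=16
---

Note that the job posted earlier used bs=512k, so its 105 iops correspond to roughly 53 MB/s; that is a bandwidth figure rather than a small-block iops figure, and is not directly comparable to a 4k test.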