Hi,

>> dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
>> 1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

How much do you get with O_DSYNC? (The Ceph journal uses O_DSYNC, and some SSDs are pretty slow with dsync.)

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

$ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
$ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync

>> When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:
>> elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11

These results seem strange: 14466.30 bytes/sec for 262144 ops/sec? (0.05 bytes per op?)

BTW, I never see big write ops/s with Ceph without a really big cluster and big CPUs.

About the dd benchmark: the problem is that dd uses 1 job / iodepth=1 / sequential I/O, so here network latency makes the difference. (The Ceph team is also working to optimize that, with the async messenger for example.) That's why you'll get more IOPS with fio, with more jobs and a bigger iodepth; see the fio sketch below.

If you use a full-SSD setup, you should run at least Giant, because of the sharding feature; with Firefly, OSD daemons don't scale well across multiple cores. Also, from my tests, writes use a lot more CPU than reads (I can be CPU-bound on 3 nodes with 8-core Xeon E5 1.7 GHz, replication x3, at 10000 4k random writes).

Disabling cephx auth and debug logging also helps to get more IOPS.

If your workload is mainly sequential, enabling rbd_cache will help for writes by merging adjacent block requests into fewer (but bigger) ops, and therefore less CPU. A ceph.conf sketch for these last points follows below.
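To illustrate the jobs/iodepth point, a small-block random-write job could look roughly like the sketch below. This is only an illustration, not a tuned job: clientname/pool/rbdname are copied from the fio job file further down in your mail, and the block size, queue depth, job count and runtime are placeholder values.

# sketch of a 4k random-write job with more parallelism than dd
# (all numbers are illustrative; every job writes to the same image)
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
rw=randwrite
bs=4k
runtime=60
time_based
group_reporting

[rbd-4k-randwrite]
iodepth=32
numjobs=4

With several jobs and a deep queue the network round-trip is overlapped instead of being paid once per write, which is why a job like this usually shows far more IOPS than a single-threaded dd.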
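For the cephx/debug/rbd_cache points, the corresponding ceph.conf fragment would look roughly like this. Again only a sketch: which debug subsystems you silence is a matter of taste, disabling cephx is only reasonable on a trusted network, and daemons/clients have to be restarted to pick the changes up.

[global]
# disable authentication (trusted networks only)
auth cluster required = none
auth service required = none
auth client required = none
# silence the noisiest debug logging (not an exhaustive list)
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug auth = 0/0

[client]
# write-back cache in librbd, merges adjacent writes into bigger ops
rbd cache = true
# stays in writethrough until the guest sends its first flush
rbd cache writethrough until flush = true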
Alexandre

----- Original Message -----

From: "Rottmann Jonas" <j.rottmann@xxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxx>
Sent: Friday, 20 March 2015 15:13:19
Subject: Write IO Problem

Hi,

We have a huge write IO problem in our preproductive Ceph cluster.

First, our hardware:

4 OSD nodes, each with:
Supermicro X10 board
32GB DDR4 RAM
2x Intel Xeon E5-2620
LSI SAS 9300-8i host bus adapter
Intel Corporation 82599EB 10-Gigabit
2x Intel SSDSA2CT040G3 in software RAID 1 for the system
Disks: 2x Samsung EVO 840 1TB

So 8 SSDs as OSDs in total, formatted with btrfs (via ceph-disk, only nodiratime added).

Benchmarking one disk alone gives good values:

dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

Fio 8k libaio depth=32:
write: io=488184KB, bw=52782KB/s, iops=5068, runt= 9249msec

Here is our ceph.conf (pretty much standard):

[global]
fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
mon initial members = cephasp41,ceph-monitor41
mon host = 172.30.10.15,172.30.10.19
public network = 172.30.10.0/24
cluster network = 172.30.10.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
#Default is 1GB, which is fine for us
#osd journal size = {n}
#Only needed if ext4 comes into play
#filestore xattr use omap = true
osd pool default size = 3     # Write an object n times.
osd pool default min size = 2 # Allow writing n copies in a degraded state.
#Set individually per pool by a formula
#osd pool default pg num = {n}
#osd pool default pgp num = {n}
#osd crush chooseleaf type = {n}

When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:

elapsed: 18 ops: 262144 ops/sec: 14466.30 bytes/sec: 59253946.11

If I benchmark with fio using the rbd engine, for example, I get very poor results:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio
invalidate=0 # mandatory
rw=randwrite
bs=512k

[rbd_iodepth32]
iodepth=32

RESULTS:
write: io=2048.0MB, bw=53896KB/s, iops=105, runt= 38911msec

Also, if I map the rbd with the kernel client as rbd0, format it with ext4 and then run dd on it, the result is not that good either:

dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
RESULT: 1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s

I also tried presenting an rbd image with tgtd, mounting it on VMware ESXi and testing it in a VM; there I got only around 50 IOPS with 4k, and about 25 MB/s sequential writes. With NFS the sequential read values are good (400 MB/s), but writes are only 25 MB/s.

What I have tried tweaking so far:

Intel NIC optimizations in /etc/sysctl.conf:

# Increase system file descriptor limit
fs.file-max = 65535
# Increase system IP port range to allow for more concurrent connections
net.ipv4.ip_local_port_range = 1024 65000
# -- 10gbe tuning from Intel ixgb driver README -- #
# turn off selective ACK and timestamps
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
# memory allocation min/pressure/max.
# read buffer, write buffer, and buffer space
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000

and

setpci -v -d 8086:10fb e6.b=2e

Setting the tunables to firefly:
ceph osd crush tunables firefly

Setting the scheduler to noop: this basically stopped IO on the cluster, and I had to revert it and restart some of the OSDs that had stuck requests.

I also tried moving the monitor from a VM to the hardware where the OSDs run.

Any suggestions where to look, or what could cause this problem? (I can't believe we are losing that much performance through Ceph replication.)

Thanks in advance. If you need any info, please tell me.

Mit freundlichen Grüßen/Kind regards

Jonas Rottmann
Systems Engineer

FIS-ASP Application Service Providing und IT-Outsourcing GmbH
Röthleiner Weg 4
D-97506 Grafenrheinfeld

Phone: +49 (9723) 9188-568
Fax: +49 (9723) 9188-600
email: j.rottmann@xxxxxxxxxx
web: www.fis-asp.de

Geschäftsführer Robert Schuhmann
Registergericht Schweinfurt HRB 3865

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com