Hi,

We have a huge write IO problem in our pre-production Ceph cluster.

First, our hardware: 4 OSD nodes, each with:

  Supermicro X10 board
  32 GB DDR4 RAM
  2x Intel Xeon E5-2620
  LSI SAS 9300-8i host bus adapter
  Intel 82599EB 10-Gigabit NIC
  2x Intel SSDSA2CT040G3 in software RAID 1 for the system
  Disks: 2x Samsung EVO 840 1TB

So in total 8 SSDs as OSDs, formatted with btrfs (via ceph-disk, only nodiratime added).

Benchmarking one disk alone gives good values:

  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
  1073741824 bytes (1.1 GB) copied, 2.53986 s, 423 MB/s

fio, 8k, libaio, iodepth=32 (roughly the run sketched further below):

  write: io=488184KB, bw=52782KB/s, iops=5068, runt=9249msec

Here is our ceph.conf (pretty much standard):

  [global]
  fsid = 89191a54-740a-46c7-a325-0899ab32fd1d
  mon initial members = cephasp41,ceph-monitor41
  mon host = 172.30.10.15,172.30.10.19
  public network = 172.30.10.0/24
  cluster network = 172.30.10.0/24
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  #Default is 1GB, which is fine for us
  #osd journal size = {n}
  #Only needed if ext4 comes into play
  #filestore xattr use omap = true
  osd pool default size = 3     # Write an object n times.
  osd pool default min size = 2 # Allow writing n copies in a degraded state.
  #Set individually per pool by a formula
  #osd pool default pg num = {n}
  #osd pool default pgp num = {n}
  #osd crush chooseleaf type = {n}

When I benchmark the cluster with "rbd bench-write rbd/fio" I get pretty good results:

  elapsed: 18  ops: 262144  ops/sec: 14466.30  bytes/sec: 59253946.11

But if I bench with fio and its rbd engine, I get very poor results:

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio
  invalidate=0 # mandatory
  rw=randwrite
  bs=512k

  [rbd_iodepth32]
  iodepth=32

  RESULT: write: io=2048.0MB, bw=53896KB/s, iops=105, runt=38911msec

Also, if I map the rbd with the kernel client as rbd0, format it with ext4 and then run dd on it (roughly the sequence sketched below), it is not that good either:

  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
  1073741824 bytes (1.1 GB) copied, 12.6152 s, 85.1 MB/s

I also tried presenting an rbd image via tgtd (rough sketch below), mounting it on a VMware ESXi host and testing inside a VM: there I got only around 50 IOPS at 4k, and sequential writes of about 25 MB/s. With NFS the sequential read values are good (400 MB/s), but writes again only reach about 25 MB/s.
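The single-disk fio numbers above only state 8k / libaio / iodepth=32, so here is a rough sketch of such a run; the test file path, size and runtime are just example values, not necessarily the exact ones used:

  # random-write test against a file on the SSD, bypassing the page cache
  fio --name=ssd-8k-randwrite \
      --filename=/mnt/ssd-test/fio-testfile \
      --size=1G --runtime=30 --time_based \
      --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=8k --iodepth=32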
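The kernel rbd test was, roughly, the following sequence (a sketch: pool "rbd" and image "fio" are the ones from the fio job above, the mount point is just an example):

  # map the image through the kernel rbd client; it shows up as /dev/rbd0
  rbd map rbd/fio --id admin
  # put ext4 on it and mount it
  mkfs.ext4 /dev/rbd0
  mkdir -p /mnt/rbd-test
  mount /dev/rbd0 /mnt/rbd-test
  # same dd as on the bare SSD
  cd /mnt/rbd-test
  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc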
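For the iSCSI test, a minimal tgt export of the mapped device looks roughly like this sketch; it assumes the kernel-mapped /dev/rbd0 is exported as the LUN (tgt can also talk to RBD directly if built with its rbd backing store, which is not what is shown here), and the IQN is made up for the example:

  # tgtd must already be running
  tgtadm --lld iscsi --op new --mode target --tid 1 \
      -T iqn.2015-02.local.ceph:rbd0
  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/rbd0
  # allow all initiators (the ESXi host) to connect
  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL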
What I have tried tweaking so far:

Intel NIC optimizations in /etc/sysctl.conf:

  # Increase system file descriptor limit
  fs.file-max = 65535
  # Increase system IP port range to allow for more concurrent connections
  net.ipv4.ip_local_port_range = 1024 65000
  # -- 10gbe tuning from Intel ixgb driver README -- #
  # turn off selective ACK and timestamps
  net.ipv4.tcp_sack = 0
  net.ipv4.tcp_timestamps = 0
  # memory allocation min/pressure/max.
  # read buffer, write buffer, and buffer space
  net.ipv4.tcp_rmem = 10000000 10000000 10000000
  net.ipv4.tcp_wmem = 10000000 10000000 10000000
  net.ipv4.tcp_mem = 10000000 10000000 10000000
  net.core.rmem_max = 524287
  net.core.wmem_max = 524287
  net.core.rmem_default = 524287
  net.core.wmem_default = 524287
  net.core.optmem_max = 524287
  net.core.netdev_max_backlog = 300000

and:

  setpci -v -d 8086:10fb e6.b=2e
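Nothing new in itself, but for completeness: the sysctl settings are reloaded and spot-checked like this, and the PCI register written by the setpci line can be read back the same way:

  # reload /etc/sysctl.conf and verify a few of the values
  sysctl -p /etc/sysctl.conf
  sysctl net.ipv4.tcp_rmem net.core.rmem_max
  # read back the register written above (Intel 82599, offset 0xe6)
  setpci -v -d 8086:10fb e6.b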
Setting the CRUSH tunables to firefly:

  ceph osd crush tunables firefly

Setting the disk scheduler to noop (sketched below): this basically stopped IO on the cluster, and I had to revert it and restart some of the OSDs that had stuck requests.

I also tried moving the monitor from a VM to the hardware where the OSDs run.
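The scheduler change mentioned above was per device via sysfs, roughly like this sketch (the device name is just an example, and what it gets reverted to depends on which scheduler was active before):

  # switch an OSD disk to noop
  echo noop > /sys/block/sdb/queue/scheduler
  # check which scheduler is active (the one in brackets)
  cat /sys/block/sdb/queue/scheduler
  # revert, e.g. back to deadline or cfq
  echo deadline > /sys/block/sdb/queue/scheduler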
Any suggestions where to look, or what could be causing this problem? (I can't believe we are losing that much performance through Ceph replication alone.)

Thanks in advance. If you need any further info, please tell me.

Mit freundlichen Grüßen / Kind regards

Jonas Rottmann