Hi Gregory,

Thanks for your replies. Let's take the 2-host setup (3 MONs + 3 idle MDSes on the same hosts):

- 2 Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8 cores + hyperthreading), 16GB RAM
- 2 or 1 x 10Gbit/s Ethernet (same results with and without the private 10Gbit network)
- PERC H700 + 12 x 2TB SAS disks, and PERC H800
  + 11 x 2TB SAS disks (one unused SSD...)
- The EC pool is defined with k=4, m=1; I set the failure domain to OSD for the test
- The OSDs are set up with XFS and a 10GB journal on each disk's first partition (the single Dell SSD would have been a bottleneck for 23 disks…)
- All disks are currently configured as single-disk RAID0 volumes, because the H700/H800 do not support JBOD

I have 5 clients (CentOS 7.1), each on 10Gbit/s Ethernet, all running this command:

  rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 --run-name "bench_`hostname -s`" --no-cleanup

I aggregate the average bandwidth reported at the end of the tests, and I monitor the Ceph servers live with this dstat command: dstat -N p2p1,p2p2,total
The network MTU is 9000 on all nodes.

With this, the average client throughput is around 130MiB/s, i.e. 650MiB/s aggregated over the 5 clients for the whole 2-node Ceph cluster.
I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get either 550MiB/s or 590MiB/s of aggregated client bandwidth.
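For completeness, the pool was created with something along these lines (a sketch only: the profile name and PG counts below are illustrative placeholders, and on firefly/hammer the option is ruleset-failure-domain, while newer releases call it crush-failure-domain):

  # illustrative only - profile name and PG counts are placeholders
  ceph osd erasure-code-profile set ecprofile41 k=4 m=1 ruleset-failure-domain=osd
  ceph osd pool create testec 512 512 erasure ecprofile41

As a rough sanity check on these numbers (assuming the on-disk journals double every write): 650MiB/s of client traffic becomes 650 x 5/4 (EC 4+1) x 2 (journal) / 2 nodes ≈ 810MiB/s of raw disk writes per node, which is roughly the ballpark dstat shows below.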
Not much less, considering I removed half the disks! I'm therefore starting to wonder whether I am CPU or memory-bandwidth limited...?
That's not, however, what I'm tempted to conclude (for the CPU at least) when I look at the dstat output, which says the CPUs still sit idle or in IO wait:

----total-cpu-usage---- -dsk/total- -net/p2p1- -net/p2p2- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   out | int   csw
  1   1  97   0   0   0| 586k 1870k|   0     0 :   0     0 :   0     0 | 49B  455B|8167   15k
 29  17  24  27   0   3| 128k  734M| 367M  870k:   0     0 : 367M  870k|   0     0 | 61k   61k
 30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0     0 | 65k   68k
 25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0     0 | 56k   64k
 19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0     0 | 45k   55k
 15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0     0 | 35k   41k
 25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0     0 | 54k   53k

Could it be the interrupts or system context switches that cause this relatively poor performance per node? PCI-E interactions with the PERC cards?
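To (in)validate the interrupt theory, I plan to watch how IRQs, softirqs and context switches spread across the cores while a rados bench runs, with something like the commands below (needs the sysstat package; bnx2x and megaraid/megasas are the driver names I expect for these Broadcom NICs and PERC cards, to be confirmed on the actual hosts):

  # per-core usr/sys/softirq breakdown during the bench
  mpstat -P ALL 2
  # do the NIC and RAID controller interrupts pile up on a few cores?
  watch -n1 -d "grep -E 'bnx2x|megaraid|megasas' /proc/interrupts"
  # voluntary/involuntary context switches of the ceph-osd daemons
  pidstat -w -p $(pgrep -d, ceph-osd) 2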
I know I can get way more disk throughput with dd (command below):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   1  97   0   0   0| 595k 2059k|   0     0 | 634B 2886B|7971   15k
  1  93   0   3   0   3|   0  1722M|  49k   78k|   0     0 | 40k   47k
  1  93   0   3   0   3|   0  1836M|  40k   69k|   0     0 | 45k   57k
  1  95   0   2   0   2|   0  1805M|  40k   69k|   0     0 | 38k   34k
  1  94   0   3   0   2|   0  1864M|  37k   38k|   0     0 | 35k   24k
(…)

The dd command:

  # use at your own risk
  FS_THR=64 ; FILE_MB=8 ; N_FS=`mount | grep ceph | wc -l`
  time (
    for i in `mount | grep ceph | awk '{print $3}'` ; do
      echo "writing $FS_THR threads x $FILE_MB MB each on $i..."
      for j in `seq 1 $FS_THR` ; do
        dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] &
      done
    done
    wait
  )
  echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads"
  rm -f /var/lib/ceph/osd/*/test.zero*
on $i..." ; for j in `seq 1 $FS_THR` ; do dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] & done ; done ; wait) ; echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads" ; rm -f /var/lib/ceph/osd/*/test.zero* Hope I gave you more insights on what I’m trying to achieve, and where I’m failing ? Regards -----Message d'origine----- We might also be able to help you improve or better understand your results if you can tell us exactly what tests you're conducting that are giving you these numbers. -Greg On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL <fmonthel@xxxxxxxxxxxxx> wrote: > Hi Frederic, > > When you have Ceph cluster with 1 node you don’t experienced network and > communication overhead due to distributed model > With 2 nodes and EC 4+1 you will have communication between 2 nodes but you > will keep internal communication (2 chunks on first node and 3 chunks on > second node) > On your configuration EC pool is setup with 4+1 so you will have for each > write overhead due to write spreading on 5 nodes (for 1 customer IO, you > will experience 5 Ceph IO due to EC 4+1) > It’s the reason for that I think you’re reaching performance stability with > 5 nodes and more in your cluster > > > On Jul 20, 2015, at 10:35 AM, SCHAER Frederic <frederic.schaer@xxxxxx> > wrote: > > Hi, > > As I explained in various previous threads, I’m having a hard time getting > the most out of my test ceph cluster. > I’m benching things with rados bench. > All Ceph hosts are on the same 10GB switch. > > Basically, I know I can get about 1GB/s of disk write performance per host, > when I bench things with dd (hundreds of dd threads) +iperf 10gbit > inbound+iperf 10gbit outbound. > I also can get 2GB/s or even more if I don’t bench the network at the same > time, so yes, there is a bottleneck between disks and network, but I can’t > identify which one, and it’s not relevant for what follows anyway > (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about > this strange bottleneck though…) > > My hosts each are connected though a single 10Gbits/s link for now. > > My problem is the following. Please note I see the same kind of poor > performance with replicated pools... > When testing EC pools, I ended putting a 4+1 pool on a single node in order > to track down the ceph bottleneck. > On that node, I can get approximately 420MB/s write performance using rados > bench, but that’s fair enough since the dstat output shows that real data > throughput on disks is about 800+MB/s (that’s the ceph journal effect, I > presume). > > I tested Ceph on my other standalone nodes : I can also get around 420MB/s, > since they’re identical. > I’m testing things with 5 10Gbits/s clients, each running rados bench. > > But what I really don’t get is the following : > > - With 1 host : throughput is 420MB/s > - With 2 hosts : I get 640MB/s. That’s surely not 2x420MB/s. > - With 5 hosts : I get around 1375MB/s . That’s far from the > expected 2GB/s. > > The network never is maxed out, nor are the disks or CPUs. > The hosts throughput I see with rados bench seems to match the dstat > throughput. > That’s as if each additional host was only capable of adding 220MB/s of > throughput. Compare this to the 1GB/s they are capable of (420MB/s with > journals)… > > I’m therefore wondering what could possibly be so wrong with my setup ?? > Why would it impact so much the performance to add hosts ? > > On the hardware side, I have Broadcam BCM57711 10-Gigabit PCIe cards. 
> I know, not perfect, but not THAT bad either…?
>
> Any hint would be greatly appreciated!
>
> Thanks
> Frederic Schaer
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com