Re: Ceph 0.94 (and lower) performance on >1 hosts ??


 



Hi Gregory,

 

Thanks for your replies.

Let's take the 2-host setup (3 MON + 3 idle MDS on the same hosts).

 

2 Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8 cores + hyperthreading), 16GB RAM, 1 or 2 x 10Gbit/s Ethernet (same results with and without the private 10Gbit network), PERC H700 + 12 x 2TB SAS disks, and PERC H800 + 11 x 2TB SAS disks (one unused SSD...)

The EC pool is defined with k=4, m=1

I set the failure domain to OSD for the test
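
For reference, the profile and pool were created with something along these lines (the profile name and PG counts here are illustrative, not necessarily the exact values I used):

ceph osd erasure-code-profile set ec41 k=4 m=1 ruleset-failure-domain=osd
ceph osd pool create testec 1024 1024 erasure ec41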

The OSDs are set up with XFS and a 10GB journal on the first partition of each disk (the single doomed Dell SSD was a bottleneck for 23 disks…)
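
The relevant ceph.conf bits look roughly like this (a sketch; the journal symlink in each OSD data dir is assumed to point at that 10GB first partition):

[osd]
osd mkfs type = xfs
osd journal size = 10240      # MB, matches the 10GB journal partition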

All disks are presently configured as single-disk RAID0 volumes because the H700/H800 do not support JBOD.
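
The single-disk RAID0 virtual disks were created with something like the following MegaCLI one-liner (syntax from memory, so treat it as an assumption and double-check against your MegaCLI version):

MegaCli64 -CfgEachDskRaid0 WB RA Direct CachedBadBBU -aALL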

 

I have 5 clients (CentOS 7.1), 10Gbit/s Ethernet, all running this command:

rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 --run-name "bench_`hostname -s`" --no-cleanup

I'm aggregating the average bandwidth at the end of the tests.
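
The aggregation itself is nothing fancy; roughly this, assuming each client writes its rados bench output to /tmp/rados_bench.log (a made-up path for the example):

for h in client1 client2 client3 client4 client5 ; do
  ssh $h "grep 'Bandwidth (MB/sec)' /tmp/rados_bench.log"
done | awk '{ sum += $NF } END { print sum " MB/s aggregate" }'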

I'm monitoring the Ceph servers stats live with this dstat command: dstat -N p2p1,p2p2,total

The network MTU is 9000 on all nodes.
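
(Jumbo frames can be verified end to end with a do-not-fragment ping, e.g. "ping -M do -s 8972 -c 3 <other node>", where 8972 is 9000 minus the 28 bytes of IP+ICMP headers.)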

 

With this, the average client throughput is around 130MiB/s, i.e. about 650 MiB/s for the whole 2-node Ceph cluster across the 5 clients.
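
As a back-of-the-envelope sanity check (my own rough sketch, assuming the filestore journal on the same disks doubles every write):

CLIENT_MB_S=650 ; K=4 ; M=1 ; NODES=2
EC_MB_S=$(( CLIENT_MB_S * (K + M) / K ))      # ~812 MB/s of EC chunks actually stored
DISK_MB_S=$(( EC_MB_S * 2 ))                  # ~1624 MB/s hitting the disks (journal + data)
echo "expected per node: $(( DISK_MB_S / NODES )) MB/s"   # ~812 MB/s per host

That is in the same ballpark as the 600-750MB/s of disk writes dstat reports per node below, so the disks do seem to absorb roughly what the clients push.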

I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get either 550 MiB/s or 590 MiB/s of aggregated client bandwidth. Not much less, considering I removed half the disks!
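
The removal itself was done with something like this (the OSD IDs below are illustrative, not my actual ones):

for id in $(seq 12 22) ; do
  ceph osd crush reweight osd.$id 0
  ceph osd out $id
done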

I'm therefore starting to think I am CPU or memory-bandwidth limited...?

 

That's not, however, what I am tempted to conclude (for the CPU at least) when I look at the dstat output, which shows the CPUs still sitting mostly idle or IO-waiting:

 

----total-cpu-usage---- -dsk/total- --net/p2p1----net/p2p2---net/total- ---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   out | int   csw

  1   1  97   0   0   0| 586k 1870k|   0     0 :   0     0 :   0     0 |  49B  455B|8167    15k

29  17  24  27   0   3| 128k  734M| 367M  870k:   0     0 : 367M  870k|   0     0 |  61k   61k

30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0     0 |  65k   68k

25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0     0 |  56k   64k

19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0     0 |  45k   55k

15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0     0 |  35k   41k

25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0     0 |  54k   53k

 

Could it be the interrupts or system context switches that cause this relatively poor per-node performance?
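
To see where interrupts and softirqs actually land while the bench runs, something like this should tell (the 'p2p1'/'megasas' patterns are guesses at how the NIC and PERC interrupts show up on these boxes):

grep -E 'p2p1|megasas' /proc/interrupts
mpstat -P ALL 2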

PCI-E interactions with the PERC cards?

I know I can get way more disk throughput with dd (command below):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw

  1   1  97   0   0   0| 595k 2059k|   0     0 | 634B 2886B|7971    15k

  1  93   0   3   0   3|   0  1722M|  49k   78k|   0     0 |  40k   47k

  1  93   0   3   0   3|   0  1836M|  40k   69k|   0     0 |  45k   57k

  1  95   0   2   0   2|   0  1805M|  40k   69k|   0     0 |  38k   34k

  1  94   0   3   0   2|   0  1864M|  37k   38k|   0     0 |  35k   24k

(…)

 

Dd command (use at your own risk):

FS_THR=64 ; FILE_MB=8
N_FS=$(mount | grep ceph | wc -l)
time (
  for i in $(mount | grep ceph | awk '{print $3}') ; do
    echo "writing $FS_THR threads x $FILE_MB MB on $i..."
    # each thread writes FILE_MB MB in 4MB blocks, fsync'ed before dd exits
    for j in $(seq 1 $FS_THR) ; do
      dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$(( FILE_MB / 4 )) &
    done
  done
  wait
)
echo "wrote $(( N_FS * FILE_MB * FS_THR )) MB on $N_FS FS with $FS_THR threads"
rm -f /var/lib/ceph/osd/*/test.zero*

 

 

I hope this gives you more insight into what I'm trying to achieve, and where I'm failing.

 

Regards

 

 

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
Sent: Wednesday, July 22, 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

 

We might also be able to help you improve or better understand your

results if you can tell us exactly what tests you're conducting that

are giving you these numbers.

-Greg

 

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL <fmonthel@xxxxxxxxxxxxx> wrote:

> Hi Frederic,

> 

> When you have a Ceph cluster with 1 node you don't experience the network and

> communication overhead of the distributed model.

> With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you

> will keep some internal communication (2 chunks on the first node and 3 chunks on

> the second node).

> In your configuration the EC pool is set up with 4+1, so each write carries

> overhead because it is spread over 5 OSDs (for 1 client IO, you

> will experience 5 Ceph IOs due to EC 4+1).

> That's why I think you're reaching performance stability with

> 5 nodes and more in your cluster.

> 

> 

> On Jul 20, 2015, at 10:35 AM, SCHAER Frederic <frederic.schaer@xxxxxx>

> wrote:

> 

> Hi,

> 

> As I explained in various previous threads, I’m having a hard time getting

> the most out of my test ceph cluster.

> I’m benching things with rados bench.

> All Ceph hosts are on the same 10GB switch.

> 

> Basically, I know I can get about 1GB/s of disk write performance per host,

> when I bench things with dd (hundreds of dd threads) + iperf 10Gbit

> inbound + iperf 10Gbit outbound.

> I also can get 2GB/s or even more if I don’t bench the network at the same

> time, so yes, there is a bottleneck between disks and network, but I can’t

> identify which one, and it’s not relevant for what follows anyway

> (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about

> this strange bottleneck though…)

> 

> My hosts are each connected through a single 10Gbit/s link for now.

> 

> My problem is the following. Please note I see the same kind of poor

> performance with replicated pools...

> When testing EC pools, I ended up putting a 4+1 pool on a single node in order

> to track down the ceph bottleneck.

> On that node, I can get approximately 420MB/s write performance using rados

> bench, but that’s fair enough since the dstat output shows that real data

> throughput on disks is about 800+MB/s (that’s the ceph journal effect, I

> presume).

> 

> I tested Ceph on my other standalone nodes: I can also get around 420MB/s,

> since they’re identical.

> I’m testing things with 5 10Gbits/s clients, each running rados bench.

> 

> But what I really don't get is the following:

> 

> - With 1 host: throughput is 420MB/s

> - With 2 hosts: I get 640MB/s. That's surely not 2x420MB/s.

> - With 5 hosts: I get around 1375MB/s. That's far from the

> expected 2GB/s.

> 

> The network is never maxed out, nor are the disks or CPUs.

> The hosts throughput I see with rados bench seems to match the dstat

> throughput.

> That’s as if each additional host was only capable of adding 220MB/s of

> throughput. Compare this to the 1GB/s they are capable of (420MB/s with

> journals)…

> 

> I'm therefore wondering what could possibly be so wrong with my setup??

> Why would adding hosts impact performance so much?

> 

> On the hardware side, I have Broadcom BCM57711 10-Gigabit PCIe cards.

> I know, not perfect, but not THAT bad either…?

> 

> Any hint would be greatly appreciated !

> 

> Thanks

> Frederic Schaer


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
