The speed is divided because it's fair :) You've reached the limit your hardware (I guess the SSDs) can deliver. For 2 clients each doing 1200 MB/s you'll basically have to double the number of OSDs (rough math below, and a sizing sketch at the very bottom of this mail).
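To make that concrete, a quick back-of-envelope. The per-device speeds are my assumptions, not measurements from your cluster; the counts and the pool size are from your mails below:

# Back-of-envelope: aggregate client write ceiling for a cluster of
# 3 OSD nodes, each with 10 HDD-OSDs + 2 journal SSDs, pool size=2.
# ASSUMED sustained sequential write speeds - plug in your own numbers:
SSD_MBPS = 450.0   # per journal SSD (assumption)
HDD_MBPS = 100.0   # per HDD (assumption)
N_SSD = 3 * 2      # journal SSDs in the cluster
N_HDD = 3 * 10     # HDD-OSDs in the cluster
SIZE = 2           # pool replica count

# With size=2 every client byte is written twice (two replicas), and
# each replica hits a journal SSD first and the data HDD later. So the
# SSDs as a group absorb SIZE * client_bw, and so do the HDDs.
limit_ssd = N_SSD * SSD_MBPS / SIZE   # 6 * 450 / 2  = 1350 MB/s
limit_hdd = N_HDD * HDD_MBPS / SIZE   # 30 * 100 / 2 = 1500 MB/s

print(f"cluster write ceiling ~ {min(limit_ssd, limit_hdd):.0f} MB/s")

With those guesses the journal SSDs cap aggregate client writes at roughly 1350 MB/s, which is about what you measured (aggrb=1219.8MB/s from one client, ~1367 MB/s from two) - and that ceiling is shared by however many clients are writing, which is why each of two clients sees half.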
greetings

Johannes

> On 28.07.2015, at 11:56, Shneur Zalman Mattern <shzama@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> But my question is: why is the speed divided between clients?
> And how many OSD nodes, OSD daemons, and PGs do I have to add to (or remove from) Ceph
> so that each CephFS client can write at its full network speed (10 Gbit/s ~ 1.2 GB/s)?
>
>
> ________________________________________
> From: Johannes Formann <mlmail@xxxxxxxxxx>
> Sent: Tuesday, July 28, 2015 12:46 PM
> To: Shneur Zalman Mattern
> Subject: Re: Did maximum performance reached?
>
> Hi,
>
> size=3 would decrease your performance. But with size=2 your results are not bad either:
> Math:
> size=2 means each write is written 4 times (2 copies, each going first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth, that means:
>
> 2 (size) * 1300 MB/s / 6 (SSD) => 433 MB/s per SSD
> 2 (size) * 1300 MB/s / 30 (HDD) => 87 MB/s per HDD
>
>
> greetings
>
> Johannes
>
>> On 28.07.2015, at 11:41, Shneur Zalman Mattern <shzama@xxxxxxxxxxxx> wrote:
>>
>> Hi, Johannes (that's my grandpa's name)
>>
>> The size is 2. Do you really think that the number of replicas can increase performance?
>> On http://ceph.com/docs/master/architecture/ it is written:
>> "Note: Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically."
>>
>> OK, I'll check it.
>> Regards, Shneur
>> ________________________________________
>> From: Johannes Formann <mlmail@xxxxxxxxxx>
>> Sent: Tuesday, July 28, 2015 12:09 PM
>> To: Shneur Zalman Mattern
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Did maximum performance reached?
>>
>> Hello,
>>
>> what is the "size" parameter of your pool?
>>
>> Some math to show the impact:
>> size=3 means each write is written 6 times (3 copies, each going first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth, that means:
>>
>> 3 (size) * 1300 MB/s / 6 (SSD) => 650 MB/s per SSD
>> 3 (size) * 1300 MB/s / 30 (HDD) => 130 MB/s per HDD
>>
>> If you use size=3, the results are as good as one can expect. (Even with size=2 the results won't be bad.)
>>
>> greetings
>>
>> Johannes
>>
>>> On 28.07.2015, at 10:53, Shneur Zalman Mattern <shzama@xxxxxxxxxxxx> wrote:
>>>
>>> We've built a Ceph cluster:
>>> 3 mon nodes (one of them combined with the mds)
>>> 3 osd nodes (each one has 10 OSDs + 2 SSDs for journaling)
>>> 24-port 10G switch
>>> 10 gigabit - for the public network
>>> 20 gigabit (bonded) - between the OSDs
>>> Ubuntu 12.04.5
>>> Ceph 0.87.2
>>> -----------------------------------------------------
>>> Clients have:
>>> 10 gigabit for the Ceph connection
>>> CentOS 6.6 with kernel 3.19.8, with the CephFS kernel module
>>>
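>>> The run boils down to this job file (the directory is a placeholder for
>>> our CephFS mount point):
>>>
>>> [global]
>>> ; placeholder path - the CephFS kernel mount on the client
>>> directory=/mnt/cephfs
>>> rw=write
>>> bs=1M
>>> size=10G
>>> numjobs=16
>>>
>>> ; the job section is named after the client host (grid01 / gridsrv)
>>> [trivial-readwrite-grid01]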
>>>
>>> ====== fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===========
>>> Single client:
>>> ++++++++++++++++++++++++++++
>>>
>>> Starting 16 processes
>>>
>>> ....below is just one job's info....
>>> trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 13:26:24 2015
>>>   write: io=10240MB, bw=78656KB/s, iops=76, runt=133312msec
>>>     slat (msec): min=1, max=117, avg=13.01, stdev=12.57
>>>     clat (usec): min=1, max=68, avg=3.61, stdev=1.99
>>>      lat (msec): min=1, max=117, avg=13.01, stdev=12.57
>>>     clat percentiles (usec):
>>>      |  1.00th=[    1],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
>>>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
>>>      | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
>>>      | 99.00th=[    9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
>>>      | 99.99th=[   62]
>>>     bw (KB/s) : min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
>>>     lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
>>>     lat (usec) : 100=0.03%
>>>   cpu          : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued    : total=r=0/w=10240/d=0, short=r=0/w=0/d=0
>>>
>>> ...the above repeats 16 times...
>>>
>>> Run status group 0 (all jobs):
>>>   WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, mint=133312msec, maxt=134329msec
>>>
>>> +++++++++++++++++++++++++++++++++
>>> Two clients:
>>> +++++++++++++++++++++++++++++++++
>>> ....below is just one job's info....
>>> trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 14:05:59 2015
>>>   write: io=10240MB, bw=43154KB/s, iops=42, runt=242984msec
>>>     slat (usec): min=991, max=285653, avg=23716.12, stdev=23960.60
>>>     clat (usec): min=1, max=65, avg=3.67, stdev=2.02
>>>      lat (usec): min=994, max=285664, avg=23723.39, stdev=23962.22
>>>     clat percentiles (usec):
>>>      |  1.00th=[    2],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
>>>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
>>>      | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
>>>      | 99.00th=[    8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
>>>      | 99.99th=[   56]
>>>     bw (KB/s) : min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
>>>     lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
>>>     lat (usec) : 100=0.03%
>>>   cpu          : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
>>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued    : total=r=0/w=10240/d=0, short=r=0/w=0/d=0
>>>
>>> ...the above repeats 16 times...
>>>
>>> Run status group 0 (all jobs):
>>>   WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, mint=242331msec, maxt=243869msec
>>>
>>> --------- And almost the same(?!) aggregated result from the second client: ---------
>>>
>>> Run status group 0 (all jobs):
>>>   WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, mint=244697msec, maxt=246941msec
>>>
>>> ----------------- To summarize: ---------------------
>>> aggrb1 + aggrb2 = 687960 KB/s + 679401 KB/s = 1367 MB/s
>>>
>>> That is about the same bandwidth one client gets alone (aggrb=1219.8MB/s) - it just got divided between the two. Why?
>>> Question: if I connect 12 client nodes, will each one only be able to write at ~100 MB/s?
>>> Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes, so that it can serve 2 clients at 1.3 GB/s each (the bandwidth of a 10 GbE NIC), or not?
>>>
>>> ============================================================================
>>>
>>>      health HEALTH_OK
>>>      monmap e1: 3 mons at {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0}, election epoch 140, quorum 0,1,2 mon1,mon2,mon3
>>>      mdsmap e12: 1/1/1 up {0=mon3=up:active}
>>>      osdmap e832: 31 osds: 30 up, 30 in
>>>      pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
>>>             4624 GB used, 104 TB / 109 TB avail
>>>                 6144 active+clean
>>>
>>> Perhaps I don't understand something in the Ceph architecture? I thought that:
>>>
>>> each spindle disk can write ~100 MB/s, and we have 10 SAS disks in each node, so the aggregated write speed is ~900 MB/s per node (because of striping etc.);
>>> and we have 3 OSD nodes, with objects also striped across all 30 OSDs - I thought that aggregates as well and we'd get something around 2.5 GB/s, but no...
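
To put a number on the "how many OSD nodes?" question in the quoted mail: the inverse of the calculation above, with the same assumed device speeds (450 MB/s per journal SSD, 100 MB/s per HDD - both guesses), size=2, and the current 10 HDD + 2 SSD layout per node:

import math

SSD_MBPS, HDD_MBPS = 450.0, 100.0   # assumed sustained write speeds
SSDS_PER_NODE, HDDS_PER_NODE = 2, 10
SIZE = 2                            # pool replica count

def nodes_needed(clients, mbps_per_client=1200.0):
    """Smallest node count whose journal SSDs and data HDDs can both
    absorb the target client bandwidth (each replica is written twice:
    journal first, data disk later)."""
    target = clients * mbps_per_client * SIZE
    by_ssd = target / (SSDS_PER_NODE * SSD_MBPS)
    by_hdd = target / (HDDS_PER_NODE * HDD_MBPS)
    return math.ceil(max(by_ssd, by_hdd))

for n in (1, 2, 4, 12):
    print(f"{n:2d} clients at 1200 MB/s -> {nodes_needed(n)} OSD nodes")
# ->  1 client ~3 nodes, 2 clients ~6, 4 clients ~11, 12 clients ~32:
#     close to linear, hence "double the clients, double the OSDs".

So under these assumptions, serving two clients at full 10GbE line rate takes roughly twice the current three nodes - the ~2.5 GB/s estimate in the quoted mail misses the replication factor and the journal double-write.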