On Fri, 2 Oct 2015 08:57:44 +0000 Javier C.A. wrote:

> Christian
> thank you so much for your answer.
> You're right, when I say Performance, I actually mean the "classic FIO
> test"..... Regarding the CPU, you meant 2GHz per OSD and per CPU core,
> right?

Yes.
Given a mixed, typical load your CPU will be sufficient, but at 100% small
IOPS it will become a bottleneck.

> One last question: with a total number of 18 OSDs (2TB/OSD) and a
> replica factor of 2, is it really risky? This won't be a critical
> cluster, but neither is it a lab/test cluster, you know.... Thanks
> again. J
>
See the next mail.

> > Date: Fri, 2 Oct 2015 17:16:21 +0900
> > From: chibi@xxxxxxx
> > To: ceph-users@xxxxxxxxxxxxxx
> > CC: magicboiz@xxxxxxxxxxx
> > Subject: Re: Predict performance
> >
> > Hello,
> >
> > More line breaks, formatting.
> > A wall of text makes people less likely to read things.
> >
> > On Fri, 2 Oct 2015 07:08:29 +0000 Javier C.A. wrote:
> >
> > > Hello
> > > Before posting this message, I've been reading older posts in the
> > > mailing list, but I didn't get any clear answer.....
> >
> > Define performance.
> > Many people seem to be fascinated by the speed of sequential (more or
> > less) writes and reads, while their use case would actually be better
> > served by increased small-IOPS performance.
> >
> > > I happen to have three servers available to test Ceph, and I would
> > > like to know if there is any kind of "performance prediction
> > > formula".
> >
> > If there is such a thing (that actually works with less than a 10%
> > error margin), I'm sure RedHat would like to charge you for it. ^_-
> >
> > > My OSD servers are:
> > > - 1 x Intel E5-2603 v3 1.6GHz (6 cores)
> >
> > Slightly underpowered, especially when it comes to small write IOPS.
> > My personal formula is at least 2GHz per OSD with SSD journal.
> >
> > > - 32GB DDR4 RAM
> >
> > OK, more is better (for read performance, see below).
> >
> > > - 10Gb Ethernet network, jumbo frames enabled
> >
> > Slight overkill given the rest of your setup; I guess you saw all the
> > fun people keep having with jumbo frames in the ML archives.
> >
> > > - OS: 2 x 500GB RAID 1
> > > - OSD (6 OSDs): 2TB 7200rpm SATA 6Gbps
> > > - 1 x Intel DC S3700 200GB SSD for journaling of all 6 OSDs.
> >
> > This means that the most throughput you'll ever be able to write to
> > those nodes is the speed of that SSD, 365MB/s; let's make that 350MB/s.
> > Thus the slight overkill comment earlier.
> > OTOH the HDDs get to use most of the IOPS (after discounting FS
> > journals, overhead, the OSD leveldb, etc).
> > So let's say slightly less than 100 IOPS per OSD.
> >
> > > Replication factor = 2.
> >
> > See below.
> >
> > > - XFS
> >
> > I find Ext4 faster, but that's me.
> >
> > > - MON nodes will be running on other servers.
> > > With this OSD setup, how could I predict the Ceph cluster
> > > performance (IOPS, R/W BW, latency...)?
> >
> > Of these, latency is the trickiest one, as so many things factor into
> > it aside from the network.
> > A test case where you're hitting basically just one OSD will look a lot
> > worse than an evenly spread out (more threads over a sufficiently
> > large data set) test would.
> >
> > Userspace (librbd) results can/will vastly differ from kernel RBD
> > clients.
> >
> > IOPS is a totally worthless data point w/o clearly defining what you're
> > measuring and how.
> > Let's assume the "standard" of 4KB blocks and 32 threads, random writes.
> > Also let's assume a replication factor of 3, see below.
> >
> > Sustained sync'ed (direct=1 option in fio) IOPS with your setup will
> > be in the 500 to 600 range (given a quiescent cluster).
> > This of course can change dramatically with non-direct writes and
> > caching (kernel page cache and/or RBD client caches).
> >
> > The same is true for reads: if your data set fits into the page caches
> > of your storage nodes, it will be fast; if everything needs to be read
> > from the HDDs, you're back to what these devices can do (~100 IOPS per
> > HDD).
> >
> > To give you a concrete example, on my test cluster I have 5 nodes, 4
> > HDDs/OSDs each, and no journal SSDs.
> > So that's in theory 100 IOPS per HDD, divided by 2 for the on-disk
> > journal, divided by 3 for replication:
> > 20*100/2/3 = 333
> > Which, amazingly, is what I get with rados bench and 4K blocks; fio
> > from a kernel client with direct I/O is around 200.
> >
> > BW, as in throughput, is easier: about 350MB/s max for sustained
> > sequential writes (the limit of the journal SSD) and let's say 750MB/s
> > for sustained reads.
> > Again, if you're reading just 8GB in your tests and that fits nicely in
> > the page caches of the OSDs, it will be wire speed.
> >
> > > Should I configure a replica factor of 3?
> >
> > If you value your data, which you will on a production server, then
> > yes. This will of course cost you 1/3 of your performance compared to
> > replica 2.
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
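
The back-of-envelope arithmetic in the reply is easy to rerun for other
layouts. Below is a minimal Python sketch (not part of the original thread)
that reproduces the same estimate; the function names and structure are
purely illustrative, while the constants (~100 IOPS per 7200rpm HDD,
~350MB/s per journal SSD, the 2x on-disk journal penalty when there is no
SSD journal) are taken from Christian's numbers above.

# Back-of-the-envelope Ceph (filestore-era) estimates, following the
# arithmetic in the reply, e.g. 20 * 100 / 2 / 3 = 333 for Christian's
# 5-node test cluster. All constants are rough rules of thumb from the
# thread, not measurements; function names are made up for illustration.

def sustained_write_iops(num_osds, hdd_iops=100, replication=3,
                         ssd_journal=True):
    """Rough sustained small-random-write IOPS (4KB-style fio workload).

    Without an SSD journal every write hits the HDD twice (journal + data),
    halving per-OSD IOPS; replication then divides the cluster total again.
    """
    journal_penalty = 1 if ssd_journal else 2
    return num_osds * hdd_iops / journal_penalty / replication

def sustained_write_bw(num_nodes, journal_ssd_mbps=350, replication=3):
    """Rough sustained sequential write throughput in MB/s.

    With one journal SSD per node, each node can absorb at most that SSD's
    sequential write speed; replication multiplies the data written.
    """
    return num_nodes * journal_ssd_mbps / replication

if __name__ == "__main__":
    # Christian's test cluster: 5 nodes x 4 OSDs, no journal SSD, replica 3.
    print(sustained_write_iops(20, ssd_journal=False, replication=3))  # ~333

    # The proposed cluster: 3 nodes x 6 OSDs, one 200GB DC S3700 journal
    # SSD per node, compared at replica 2 and replica 3.
    for rep in (2, 3):
        iops = sustained_write_iops(18, ssd_journal=True, replication=rep)
        bw = sustained_write_bw(3, journal_ssd_mbps=350, replication=rep)
        print("replica %d: ~%.0f write IOPS, ~%.0f MB/s" % (rep, iops, bw))

For the proposed 3-node, 18-OSD setup this gives roughly 900 sustained
write IOPS at replica 2 and 600 at replica 3, and about 350MB/s of client
write throughput at replica 3, which lines up with the 500-600 IOPS and
350MB/s figures quoted in the reply.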