Re: Predict performance

"Simon Hallam" <sha@xxxxxxxxx> · Fri, 2 Oct 2015 10:01:49 +0000

The way I look at it is:

Would you normally put 18*2TB disks in a single RAID5 volume? If the answer is no, then a replication factor of 2 is not enough.

Cheers,

Simon

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf Of Javier C.A.

Sent: 02 October 2015 09:58

To: ceph-users@xxxxxxxxxxxxxx

Subject: Re:  Predict performance

Christian

thank you so much for your answer.

You're right: when I say performance, I actually mean the "classic fio test".

Regarding the CPU, you meant 2GHz per OSD and per CPU core, didn't you?

One last question: with a total of 18 OSDs (2TB per OSD) and a replication factor of 2, is it really that risky? This won't be a critical cluster, but it isn't a lab/test cluster either.

Thanks again.

J

> Date: Fri, 2 Oct 2015 17:16:21 +0900

> From: chibi@xxxxxxx

> To: ceph-users@xxxxxxxxxxxxxx

> CC: magicboiz@xxxxxxxxxxx

> Subject: Re:  Predict performance

> 

> 

> Hello,

> 

> More line breaks, formatting. 

> A wall of text makes people less likely to read things.

> 

> On Fri, 2 Oct 2015 07:08:29 +0000 Javier C.A. wrote:

> 

> > Hello

> > Before posting this message, I've been reading older posts in the

> > mailing list, but I didn't get any clear answer..... 

> 

> Define performance. 

> Many people seem to be fascinated by the speed of sequential (more or less)

> writes and reads, while their use case would actually be better served by

> an increased small IOPS performance.

> 

> >I happen to have

> > three servers available to test Ceph, and I would like to know if there

> > is any kind of "performance prediction formula". 

> 

> If there is such a thing (that actually works with less than a 10% error

> margin), I'm sure RedHat would like to charge you for it. ^_-

> 

> >-My OSD servers are:

> > - 1 x Intel E5-2603v3 1.6Ghz (6 cores) 

> Slightly underpowered, especially when it comes to small write IOPS.

> My personal formula is at least 2GHz per OSD with SSD journal.

> 

> >- 32GB DDR4 RAM

> OK, more is better (for read performance, see below).

> 

> >- 10Gb ethernet network, jumbo frames enabled - 

> 

> Slight overkill given the rest of your setup. I guess you saw all the fun

> people keep having with jumbo frames in the ML archives.

> 

> >OS: 2 x 500GB RAID 1 

> >- OSD (6 OSD): - 2TB 7200 SATA4 6Gbps 

> >- 1 x SSD Intel DC S3700 200GB for

> > journaling of all 6 OSDs. - 

> This means that the most throughput you'll ever be able to write to those

> nodes is the speed of that SSD, 365MB/s, let's make that 350MB/s.

> Thus the slight overkill comment earlier.

> OTOH the HDDs get to use most of the IOPS (after discounting FS journals,

> overhead, the OSD leveldb, etc). 

> So let's say slightly less than 100 IOPS per OSD.

> 

> >Replication factor = 2. 

> see below. 

> 

> >- XFS 

> I find Ext4 faster, but that's me.

> 

> >-MON nodes

> > will be running in other servers. With this OSD setup, how could I

> > predict the Ceph cluster performance (IOPS, R/W BW, latency...)? 

> 

> Of these, latency is the trickiest one, as so many things factor into it

> aside from the network.

> A test case where you're hitting basically just one OSD will look a lot

> worse than what an evenly spread out (more threads over a sufficiently

> large data set) test would.

> 

> Userspace (librbd) results can/will vastly differ from kernel RBD clients.

> 

> IOPS is a totally worthless data point w/o clearly defining what you're

> measuring and how. 

> Let's assume the "standard" of 4KB blocks and 32 threads, random writes.

> Also let's assume a replication factor of 3, see below.

> 

> Sustained sync'ed (direct=1 option in fio) IOPS with your setup will be in

> the 500 to 600 range (given a quiescent cluster). 

> This of course can change dramatically with non-direct writes and caching

> (kernel page cache and/or RBD client caches).

> 

> The same is true for reads: if your data set fits into the page caches of

> your storage nodes, it will be fast; if everything needs to be read from

> the HDDs, you're back to what these devices can do (~100 IOPS per HDD).

> 

> To give you a concrete example, on my test cluster I have 5 nodes, 4

> HDDs/OSDs each and no journal SSDs. 

> So that's in theory 100 IOPS per HDD, divided by 2 for the on-disk journal,

> divided by 3 for replication:

> 20*100/2/3=333 

> Which amazingly is what I get with rados bench and 4K blocks; fio from a

> kernel client with direct I/O is around 200.
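
The rule of thumb in the example above can be written down as a small function. This is only a sketch of that back-of-the-envelope arithmetic, not a real prediction formula; the function name and its parameters are made up for illustration, and the 100 IOPS per HDD figure is the rough estimate used in the message.

```python
def estimate_write_iops(num_osds, iops_per_hdd=100, replicas=3, journal_on_hdd=True):
    """Rule-of-thumb sustained small-write IOPS for a whole cluster.

    Hypothetical helper: it only mirrors the arithmetic in the message.
    """
    # With the journal on the same HDD, every write hits the disk twice.
    journal_penalty = 2 if journal_on_hdd else 1
    # Each client write is stored once per replica.
    return num_osds * iops_per_hdd / journal_penalty / replicas


# Test cluster from the message: 5 nodes x 4 HDDs, no journal SSDs, 3 replicas.
print(round(estimate_write_iops(num_osds=5 * 4)))  # 333

# 18 HDD OSDs with journals on SSD (no on-disk journal penalty), 3 replicas.
print(estimate_write_iops(num_osds=18, journal_on_hdd=False))  # 600.0
```

The second call lands at the top of the 500 to 600 range quoted earlier for the 18-OSD setup. It assumes a full 100 IOPS per HDD, while the message expects slightly less than that per OSD.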

> 

> BW, as in throughput, is easier: about 350MB/s max for sustained sequential 

> writes (the limit of the journal SSD) and let's say 750MB/s for sustained

> reads.

> Again, if you're reading just 8GB in your tests and that fits nicely in

> the page caches of the OSDs, it will be wire speed.
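
One way to read the 350MB/s write figure is sketched below. It assumes one journal SSD per node, writes spread evenly over all nodes, and no other bottleneck (network, HDDs); the function name is hypothetical and the replica 2 result is derived from those assumptions, not a number from the thread.

```python
def cluster_write_mb_s(nodes, journal_mb_s, replicas):
    """Rough ceiling for sustained sequential client writes, in MB/s."""
    # Every client byte is journaled once per replica, so the aggregate
    # journal bandwidth of the cluster is shared between the replicas.
    return nodes * journal_mb_s / replicas


# Three nodes, one ~350MB/s journal SSD each.
print(cluster_write_mb_s(nodes=3, journal_mb_s=350, replicas=3))  # 350.0
print(cluster_write_mb_s(nodes=3, journal_mb_s=350, replicas=2))  # 525.0
```

With three nodes and three replicas, every write lands on every node, so the cluster as a whole cannot write faster than a single journal SSD.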

> 

> >Should I configure a replica factor of 3?

> > 

> If you value your data, which you will on a production server, then yes.

> This will of course cost you 1/3 of your performance compared to replica 2.
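
The "1/3" follows from each client write costing one backend write per replica. A minimal check of that arithmetic, covering write capacity only and using a made-up helper name:

```python
from fractions import Fraction


def relative_write_capacity(replicas):
    # Each client write costs `replicas` backend writes, so the usable
    # share of the raw write capacity is 1/replicas.
    return Fraction(1, replicas)


# Fraction of write capacity lost when going from 2 replicas to 3.
loss = 1 - relative_write_capacity(3) / relative_write_capacity(2)
print(loss)  # 1/3
```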

> 

> Regards,

> 

> Christian

> -- 

> Christian Balzer Network/Systems Engineer 

> chibi@xxxxxxx Global OnLine Japan/Fusion Communications

> http://www.gol.com/

