Re: Predict performance

Hello,
On Fri, 2 Oct 2015 15:31:11 +0200 Javier C.A. wrote:

Firstly, this has been discussed countless times here.
For one of the latest recurrences, check the archive for:

"calculating maximum number of disk and node failure that can
be handled by cluster with out data loss"


> A classic RAID5 system takes a long time to rebuild the array, so I
> would say no, but how long does it take for Ceph to rebuild a
> placement group?
>
A placement group resides on an OSD. 
Until the LAST PG on a failed OSD has been recovered, you are prone to
data loss.
And a single lost PG might affect ALL your images...

So while your OSDs are mostly empty, recovery will be faster than a RAID5.

Once the cluster gets fuller AND you realize that rebuilding OSDs SEVERELY
impacts your cluster performance (at least in a smallish setup like yours),
you are likely to tune down the recovery and backfill parameters to a level
where recovery takes LONGER than a typical RAID controller rebuild.
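
For reference, the knobs people typically turn down are these (the values
are illustrative, not a recommendation):

```ini
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```

They can also be changed at runtime, e.g.
"ceph tell osd.* injectargs '--osd-max-backfills 1'".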

So yes, for all intents and purposes a replication factor of 2 is just as
risky as RAID5 (unless your OSDs are RAIDs themselves or very reliable SSDs).
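
To put rough numbers on it, here is a back-of-the-envelope sketch. This is
NOT a proper reliability model: the failure rate and recovery window are
assumptions, and it treats disk failures as independent.

```python
# Toy model: chance that another replica is lost while a failed OSD
# is still being re-replicated. Illustrative numbers only.
AFR = 0.05             # ASSUMED annual failure rate per HDD
recovery_hours = 8     # ASSUMED time to re-replicate a failed OSD
osds = 18

# Per-disk chance of failing during the recovery window.
p_fail = AFR * recovery_hours / (365 * 24)

# Replica 2: with CRUSH spreading PGs, losing almost any second OSD
# during recovery can lose data, so take all remaining OSDs as exposed.
p_loss_r2 = 1 - (1 - p_fail) ** (osds - 1)

# Replica 3: a crude upper bound -- a THIRD failure is also needed
# in the same window.
p_loss_r3 = p_loss_r2 * p_fail

print(f"replica 2: {p_loss_r2:.2e}  replica 3: {p_loss_r3:.2e}")
```

The absolute numbers are guesses; the point is the orders-of-magnitude gap
between the two replication factors.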
 
Christian
> J
> 
> > El 2 oct 2015, a las 12:01, Simon Hallam <sha@xxxxxxxxx> escribió:
> > 
> > The way I look at it is:
> >  
> > Would you normally put 18*2TB disks in a single RAID5 volume? If the
> > answer is no, then a replication factor of 2 is not enough. 
> > Cheers,
> >  
> > Simon
> >  
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Javier C.A.
> > Sent: 02 October 2015 09:58
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  Predict performance
> >  
> > Christian
> >  
> > thank you so much for your answer.
> >  
> > You're right, when I say performance, I actually mean the "classic FIO
> > test"...
> > Regarding the CPU, you meant 2GHz per OSD and per CPU core, didn't you?
> >  
> > One last question: with a total of 18 OSDs (2TB/OSD) and a
> > replication factor of 2, is it really risky? This won't be a critical
> > cluster, but it isn't a lab/test cluster either, you know... Thanks
> > again.
> > J
> > 
> > > Date: Fri, 2 Oct 2015 17:16:21 +0900
> > > From: chibi@xxxxxxx
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > CC: magicboiz@xxxxxxxxxxx
> > > Subject: Re:  Predict performance
> > > 
> > > 
> > > Hello,
> > > 
> > > More line breaks, formatting. 
> > > A wall of text makes people less likely to read things.
> > > 
> > > On Fri, 2 Oct 2015 07:08:29 +0000 Javier C.A. wrote:
> > > 
> > > > Hello
> > > > Before posting this message, I've been reading older posts in the
> > > > mailing list, but I didn't get a clear answer...
> > > 
> > > Define performance. 
> > > Many people seem to be fascinated by the speed of sequential (more
> > > or less) writes and reads, while their use case would actually be
> > > better served by an increased small IOPS performance.
> > > 
> > > > I happen to have
> > > > three servers available to test Ceph, and I would like to know if
> > > > there is any kind of "performance prediction formula".
> > > 
> > > If there is such a thing (that actually works with less than a 10%
> > > error margin), I'm sure RedHat would like to charge you for it. ^_-
> > > 
> > > > My OSD servers are:
> > > > - 1 x Intel E5-2603v3 1.6GHz (6 cores)
> > > Slightly underpowered, especially when it comes to small write IOPS.
> > > My personal formula is at least 2GHz per OSD with SSD journal.
> > > 
> > > > - 32GB DDR4 RAM
> > > OK, more is better (for read performance, see below).
> > > 
> > > > - 10Gb Ethernet network, jumbo frames enabled
> > > 
> > > Slight overkill given the rest of your setup, I guess you saw all
> > > the fun people keep having with jumbo frames in the ML archives.
> > > 
> > > > OS: 2 x 500GB RAID1
> > > > - OSDs (6): 2TB 7200rpm SATA 6Gbps
> > > > - 1 x SSD Intel DC S3700 200GB for
> > > > journaling of all 6 OSDs.
> > > This means that the most throughput you'll ever be able to write to
> > > those nodes is the speed of that SSD, 365MB/s, let's make that
> > > 350MB/s. Thus the slight overkill comment earlier.
> > > OTOH the HDDs get to use most of the IOPS (after discounting FS
> > > journals, overhead, the OSD leveldb, etc). 
> > > So let's say slightly less than 100 IOPS per OSD.
> > > 
> > > > Replication factor = 2.
> > > see below. 
> > > 
> > > > - XFS
> > > I find Ext4 faster, but that's me.
> > > 
> > > > MON nodes
> > > > will be running on other servers. With this OSD setup, how could I
> > > > predict the Ceph cluster performance (IOPS, R/W BW, latency...)?
> > > 
> > > Of these, latency is the trickiest one, as so many things factor
> > > into it aside from the network.
> > > A test case where you're hitting basically just one OSD will look a
> > > lot worse than what an evenly spread out (more threads over a
> > > sufficiently large data set) test would.
> > > 
> > > Userspace (librbd) results can/will vastly differ from kernel RBD
> > > clients.
> > > 
> > > IOPS is a totally worthless data point w/o clearly defining what
> > > you're measuring and how.
> > > Let's assume the "standard" of 4KB blocks and 32 threads, random
> > > writes. Also let's assume a replication factor of 3, see below.
> > > 
> > > Sustained sync'ed (direct=1 option in fio) IOPS with your setup will
> > > be in the 500 to 600 range (given a quiescent cluster). 
> > > This of course can change dramatically with non-direct writes and
> > > caching (kernel page cache and/or RBD client caches).
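
For reference, the kind of fio job described above might look like this;
the filename is a placeholder for whatever RBD device you are testing:

```ini
[global]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=32
numjobs=1
runtime=60
time_based=1

[rbd-4k-randwrite]
filename=/dev/rbd0
```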
> > > 
> > > The same is true for reads, if your data set fits into the page
> > > caches of your storage nodes, it will be fast, if everything needs
> > > to be read from the HDDs, you're back to what these devices can do
> > > (~100 IOPS per HDD).
> > > 
> > > To give you a concrete example, on my test cluster I have 5 nodes, 4
> > > HDDs/OSDs each and no journal SSDs. 
> > > So that's in theory 100 IOPS per HDD, divided by 2 for the on-disk
> > > journal, divided by 3 for replication:
> > > 20*100/2/3 = 333
> > > Which, amazingly, is what I get with rados bench and 4K blocks; fio
> > > from a kernel client with direct I/O is around 200.
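
Spelled out (same numbers as in the example above; this is just the
arithmetic, nothing Ceph-specific):

```python
osds = 5 * 4            # 5 nodes x 4 HDDs/OSDs each
iops_per_hdd = 100      # nominal random IOPS of a 7200rpm HDD
journal_penalty = 2     # on-disk journal: every write hits the disk twice
replication = 3         # each client write becomes 3 OSD writes

cluster_write_iops = osds * iops_per_hdd / journal_penalty / replication
print(round(cluster_write_iops))  # -> 333
```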
> > > 
> > > BW, as in throughput, is easier: about 350MB/s max for sustained
> > > sequential writes (the limit of the journal SSD) and let's say
> > > 750MB/s for sustained reads.
> > > Again, if you're reading just 8GB in your tests and that fits nicely
> > > in the page caches of the OSDs, it will be wire speed.
> > > 
> > > >Should I configure a replica factor of 3?
> > > > 
> > > If you value your data, which you will on a production server, then
> > > yes. This will of course cost you 1/3 of your performance compared
> > > to replica 2.
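
The pool-level change itself is a one-liner per pool ("rbd" here is just
an example pool name):

```shell
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
```

min_size 2 keeps the pool from accepting writes once only a single copy
is left.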
> > > 
> > > Regards,
> > > 
> > > Christian
> > > -- 
> > > Christian Balzer Network/Systems Engineer 
> > > chibi@xxxxxxx Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



