Re: OSD - choose the right controller card, HBA/IT mode explanation

On Fri, 03 Oct 2014 11:56:42 +0200 Massimiliano Cuttini wrote:

> 
> On 02/10/2014 17:24, Christian Balzer wrote:
> > On Thu, 02 Oct 2014 12:20:06 +0200 Massimiliano Cuttini wrote:
> >> On 02/10/2014 03:18, Christian Balzer wrote:
> >>> On Wed, 01 Oct 2014 20:12:03 +0200 Massimiliano Cuttini wrote:
> >>>> Hello Christian,
> >>>>
> >>>> On 01/10/2014 19:20, Christian Balzer wrote:
> >>>>> Hello,
> >>>>>
> >>>>> On Wed, 01 Oct 2014 18:26:53 +0200 Massimiliano Cuttini wrote:
> >>>>>
> >>>>>> Dear all,
> >>>>>>
> >>>>>> I need a few tips about the best controller solution for Ceph.
> >>>>>> I'm getting confused about IT mode, RAID and JBoD.
> >>>>>> I read many posts saying don't go for RAID but use a JBoD
> >>>>>> configuration instead.
> >>>>>>
> >>>>>> I have 2 storage alternatives right now in my mind:
> >>>>>>
> >>>>>>        *SuperStorage Server 2027R-E1CR24L*
> >>>>>>        which uses SAS3 via an LSI 3008 AOC; IT Mode/Pass-through
> >>>>>>        http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24L.cfm
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>>        *SuperStorage Server 2027R-E1CR24N*
> >>>>>>        which uses SAS3 via an LSI 3108 SAS3 AOC (in RAID mode?)
> >>>>>>        http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24N.cfm
> >>>>>>
> >>>>> Firstly, both of these use an expander backplane.
> >>>>> So if you're planning on putting SSDs in there (even if just like 6
> >>>>> for journals) you may be hampered by that.
> >>>>> The Supermicro homepage is vague as usual and the manual doesn't
> >>>>> actually have a section for that backplane. I guess it will be a
> >>>>> 4-link connection, so 4x12Gb/s aka 4.8 GB/s.
> >>>>> If the disks are all going to be HDDs you're OK, but keep that bit
> >>>>> in mind.
> >>>> OK, I was thinking about connecting 24 SSDs via SATA3 (6Gbps).
> >>>> This is why I chose an 8-port SAS3 LSI card that uses a double
> >>>> PCIe 3.0 connection and even supports 12Gbps.
> >>>> This should allow me to use the full speed of the SSDs (I guess).
> >>>>
> >>> Given the SSD speeds you cite below, SAS2 aka SATA3 would do, too.
> >>> And of course be cheaper.
> >>>
> >>> Also what SSDs are you planning to deploy?
> >> I would go with a bulk of cheap consumer SSDs.
> >> I just need them to perform better than HDDs, and that's all.
> >> Anything better is just fine.
> > Bad idea.
> > Read the current "SSD MTBF" thread.
> > If your cluster is even remotely busy "cheap" consumer SSDs will cost
> > you more than top end Enterprise ones in a short time (TBW/$).
> > And they are so unpredictable and likely to fail that a replication of
> > 2 is going to be a very risky proposition, thus increasing your cost by
> > 1/3rd anyway if you really care about reliability.
> I read the SSD MTBF post and I don't agree with the point that cheap
> SSDs are bad (as I wrote).
> The problem is not related to cheapness, but to the size of the disk.
> Having 50GB of data written every day on a 100GB SSD or on a 1TB SSD is
> completely different.
Nobody is arguing that, and I wrote as much in that thread, where I also
pointed out an article mentioning this:
http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size

> The 1st solution will last just half a year, the second will last 5 
> years (of course they are both cheap).
And while this is true in theory, you need to factor in all the numbers
before you can assume that your SSD will in fact survive 5 years.
I will stick with the 850 Pros for my examples, as they are well
documented and still fresh in my head.

According to the article we can expect 550TBW on a 1TB drive in their
worst case scenario, and one always does well to start with the worst case.
Next, your journals are on the same SSDs, which is an additional 2x write
amplification, plus another 2x from replication.
So for each GB written to a PG, you're in fact writing 4GB to the cluster
(ignoring FS overhead).

Now remember that the SSDs in the SSD MTBF thread failed when they
supposedly still had 40% of endurance left.

All of a sudden there is very little margin for error left.
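
To put rough numbers on it (a back-of-the-envelope sketch; I'm taking your
50GB/day as the client data attributable to one OSD, everything else is an
assumption to adjust to your own workload):

  tbw_worst_case  = 550.0   # TB, worst-case endurance of a 1TB 850 Pro (article above)
  usable_fraction = 0.6     # drives in the "SSD MTBF" thread died with ~40% "left"
  client_gb_day   = 50.0    # GB/day of client data attributable to one OSD (assumed)
  amplification   = 2 * 2   # journal on the same SSD x replication of 2

  tb_written_per_day = client_gb_day * amplification / 1000.0
  years = tbw_worst_case * usable_fraction / tb_written_per_day / 365.0
  print("about %.1f years under these assumptions" % years)   # ~4.5 years

So the "5 years" is already down to about 4.5 before FS overhead, and it
roughly halves again for a 500GB drive.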

Lastly, your SSD will likely last longer than the warranty (150TBW for
the 850 Pros), but once you exceed that number and your SSD fails, it is a
total loss in the financial sense.

At the very least (if the numbers make sense) I would pick a DC S3500 800GB
over a 1TB 850 Pro because Intel will give me a warranty for 450TBW at a
very similar price level AND I'd get the steady performance that comes with
these SSDs.

> SSDs have no unpredictable failures, they are not mechanical, they just
> end their life cycle after a predetermined number of writes.
That is vastly, even dangerously, oversimplifying things.
See above with the "death at 40%" for starters.
Also, manufacturers test/analyze their production results and sort them
accordingly; the better parts go into more expensive products.
There is a non-trivial variance between seemingly identical NANDs, and all
the makers do is give what they feel is a safe minimum number.
You're also ignoring the fact that there is more on an SSD than the NAND
that can fail.

I have identical SSDs that have had the same write activity (RAID1) and
still show different wear levels and, most tellingly, different numbers of
reallocated blocks.
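
If you want to check that on your own drives, something like this pulls
the interesting counters out of smartctl (needs smartmontools and root;
the attribute names vary by vendor, so treat it as a sketch):

  import subprocess

  ATTRS = ("Wear_Leveling_Count", "Media_Wearout_Indicator",
           "Reallocated_Sector_Ct", "Total_LBAs_Written")

  for dev in ("/dev/sda", "/dev/sdb"):        # the two halves of the RAID1
      print(dev)
      out = subprocess.check_output(["smartctl", "-A", dev]).decode()
      for line in out.splitlines():
          if any(attr in line for attr in ATTRS):
              print("  " + line.strip())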

> Just take more space and you get more writes.
> Taking a 100GB SSD, whether consumer or enterprise, is just silly IMHO.
> 
> > If you can't afford a cluster made entirely of SSDs, a typical mix of
> > HDDs with SSDs for journals is probably going to be fast enough.
> >
> > Ceph at this point in time can't utilize the potential of a pure SSD
> > cluster anyway, see the:
> > "[Single OSD performance on SSD] Can't go over 3,2K IOPS"
> > thread.
> Ok... this is a good point: "why spend a lot if you will not get
> performance anyway?"
> I definitely have to take this recommendation into account.

You can expect Ceph to get better with time, but whether that happens as
soon as the next release I can't say.

> >>>> I made this analysis:
> >>>> - Total output: 8x12 = 96Gbps full speed available on the PCI3.0
> >>> That's the speed/capacity of the controller.
> >>>
> >>> I'm talking about the actual backplane, where drives plug in.
> >>> And that is connected either by one cable (and thus 48Gb/s) or two
> >>> (and thus the 96Gb/s you're expecting); the documentation is unclear
> >>> on the homepage and missing from the manual of that server. Digging around
> >>> I found http://www.supermicro.com.tw/manuals/other/BPN-SAS3-216EL.pdf
> >>> which suggests two ports, so your basic assumptions are correct.
> >> This is what is written for the backplane: "One SATA backplane
> >> (BPN-SAS3-216EL1), SAS3 2.5" drive slots and 4x mini-SAS3 HD
> >> connectors for SAS3 uplink/downlink. It supports a 4x port mini-SAS3
> >> HD connector."
> >> That is because somebody might buy an AOC LSI card to further speed
> >> up the backplane.
> >> I understood that it supports 1 or 2 expander daughter cards, each
> >> one with 4x mini-SAS3 cables - 2 daughter cards to have failover on
> >> the backplane (however this storage server comes with just 1 port).
> >> Then it should be 4x 12Gb/s? I'm getting confused.
> >>
> > No, read that PDF closely.
> > The single expander card of that server backplane has 2 uplink ports.
> > Each port usually (and in this case pretty much certainly) has 4 lanes
> > at 12Gb/s each.
> Definitely, thank you! I'm not a hardware guru and I couldn't have
> understood that on my own. Thanks, you heartened me! :)
> 
> >>> But verify that with your Supermicro vendor and read up about
> >>> SAS/SATA expanders.
> >>>
> >>> If you want/need full speed, the only option with Supermicro seems to
> >>> be
> >>> http://www.supermicro.com.tw/products/chassis/2U/216/SC216BAC-R920LP.cfm
> >>> at this time for SAS3.
> >> That backplane (BPN-SAS3-216A) goes for $300 while the one in that
> >> storage server is worth $600 (BPN-SAS3-216EL1).
> >> I think that they are both great, however I cannot choose the
> >> backplane for that model.
> > Build it yourself or have your vendor do a BTO, Build To Order.
> As I wrote, I'm not a HW guru... I'm afraid of building a non-working
> config. I need at least a half-finished solution. I looked through the
> storage solutions proposed by Supermicro, and I think they are fine.

Supermicro themselves or any competent vendor should be able to make
appropriate HW suggestions based on your needs. And the vendor can build
it from parts for you.
Our vendor certainly can/could.

> >>> Of course a direct connect backplane chassis with SAS2/SATA3 will do
> >>> fine as I wrote above, like this one.
> >>> http://www.supermicro.com.tw/products/chassis/2U/216/SC216BA-R1K28LP.cfm
> >>>
> >>> In either case get the fastest motherboard/CPUs (Ceph will need those
> >>> for SSDs) and the appropriate controller(s). If you're unwilling to
> >>> build them yourself, I'm sure some vendor will do BTO. ^^
> >> I cannot change the motherboard (but it seems really good!).
> > Why?
> > Not being able to purchase the optimum solution (especially when it is
> > CHEAPER!) strikes me as odd...
> You are right!  -_-
> But I'm not sure how to compare two motherboards or what the key
> factors to take into account are.

The motherboard is probably fine all things considered. 

> ... if you have some suggestions, they are more than welcome! :)
> >> About CPUs, I decided to go for dual E5-2620s.
> >> http://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
> >> Not so fast... I went for quantity instead of quality (12 cores will
> >> be enough, no?).
> >> Do you think I need to change it for something better?
> > If all your OSDs are going to be SSDs, yes.
> Ouch... é_è
> ...what is the issue with that CPU? ...too slow or too few cores?
> http://ark.intel.com/search/advanced?FamilyText=Intel%C2%AE%20Xeon%C2%AE%20Processor%20E5%20v2%20Family

Both really, but too slow mostly.
Look at that SSD OSD performance thread: they managed to saturate 7.2GHz
worth of CPU with one OSD, and who knows how much it really wanted/needed.
The configuration guide is based on pure HDD OSDs (no SSD journal either)
and suggests 1GHz per OSD.
I can definitely say that I can saturate 8 3.2GHz cores with a machine
consisting of 4 journal SSDs and 8 OSD HDDs with the right tests (small
writes mostly).
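
A quick sanity check for the box you're proposing (the per-OSD figure for
SSD-backed OSDs is a guess on my part, and likely a generous one):

  available_ghz = 2 * 6 * 2.0   # dual E5-2620: 2 x 6 cores x 2.0GHz, no turbo/HT
  ghz_per_osd   = 3.0           # assumed; the SSD thread saturated 7.2GHz with ONE OSD
  needed_ghz    = 24 * ghz_per_osd
  print("%.0f GHz available vs ~%.0f GHz wanted" % (available_ghz, needed_ghz))
  # 24 GHz available vs ~72 GHz wanted - not even close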

In short, your 24-bay server full of SSDs is too dense for anything that
exists. You might be better off with something less dense, like:
http://www.supermicro.com.tw/products/system/2U/2027/SYS-2027TR-D70RF_.cfm


> Help me!
> >> RAM is 4x 8GB = 32GB
> > Barely enough. If you go for a mixed HDD/SSD setup, add as much
> > RAM as you can afford; it will speed things up, reads in particular.
> >
> > Have a look at:
> > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> This document seems to be well done

It is a good start, but getting a bit dated, especially when it comes to
the CPU suggestions, which are based on HDD-only OSDs.

> Thanks, you really do help me! :)
> >>>> - Then I should have for each disk a maximum speed of
> >>>> 96Gbps/24 disks, which is 4Gbps per disk
> >>>> - The disks are SATA3 6Gbps, so I have here a little
> >>>> bottleneck that limits me to 4Gbps.
> >>>> - However a common SSD never hits the interface speed; they tend to
> >>>> be at around 450MB/s.
> >>>>
> >>>> Average speed of an SSD (MB/s):
> >>>>         Min   Avg   Max
> >>>> Read    369   485   522
> >>>> Write   162   428   504
> >>>> Mixed   223   449   512
> >>>>
> >>>>
> >>>> Then having a bottleneck at 4Gbps (which means 400MB/s) should be
> >>>> fine (if I'm not wrong).
> >>>> Is what I thought right?
> >>>>
> >>> Also expanders introduce some level of overhead, so you're probably
> >>> going to wind up with less than 400MB/s per drive.
> >> Is 400MB/s per drive good?
> >> I don't think that a SAS HDD would even reach this speed.
> >>
> > An HDD wouldn't, but SSDs certainly could, depending on the model. I
> > thought you were going to deploy all SSDs?
> Only SSDs.
> >>>> I think that the only bottleneck here is the 4x1Gb ethernet
> >>>> connection.
> >>> With a firebreathing storage server like that, you definitely do NOT
> >>> want to limit yourself to 1Gb/s links. The latency of these links,
> >>> never mind the bandwidth, will render all your investment in the storage
> >>> nodes rather moot.
> >>>
> >>> Even if your clients would not be on something faster, for
> >>> replication at least use 10Gb/s Ethernet or my favorite (price and
> >>> performance wise), Infiniband.
> >> I read something about Infiniband but I really don't know much.
> >> If you have some useful links I will take a further look.
> > Search the ML archives for Infiniband for starters, read up on it on
> > Wikipedia, compare prices.
> But will I need to change all the switches to Infiniband to support it?
> Do I need to connect every compute node with Infiniband too?
>
If you start fresh, with a clean slate (like I did recently), that would
be the best path, as it keeps costs and complexity down.
If you have existing 10GbE infrastructure and/or compute nodes with such
interfaces, use that network environment.
The same is true if you have to mix 1GbE and 10GbE machines. 

> >>>>>> Is it so, or am I completely wasting my time on useless specs?
> >>>>> It might be a good idea to tell us what your actual plans are.
> >>>>> As in, how many nodes (these are quite dense ones with 24 drives!),
> >>>>> how much storage in total, what kind of use pattern, clients.
> >>>> Right now we are just testing and experimenting.
> >>>> We would start with a non-production environment with 2 nodes, learn
> >>>> Ceph in depth and then replicate tests & findings on another 2 nodes,
> >>>> upgrade to 10Gb Ethernet and go live.
> >>> Given that you're aiming for all SSDs, definitely consider Infiniband
> >>> for the backend (replication network) at least.
> >>> It's cheaper/faster and also will have more native support (thus even
> >>> faster) in upcoming Ceph releases.
> >>> Failing that, definitely dedicated client and replication networks,
> >>> each with 2x10Gb/s bonded links to get somewhere close to your
> >>> storage abilities/bandwidth.
> >> I have 3 options:
> >>
> >>    * add another 4x1Gb card (cheap but costs many ports on the switch -
> >>      2x4 ports for 1 storage node + 1 for management)
> >>    * add a 2x10Gb card (expensive but probably necessary)
> >>    * investigate Infiniband further
> >>
> > Again, what is the point of having a super fast storage node all based
> > on SSDs when your network is slow (latency, thus cutting into your
> > IOPS) and can't use even 10% of the bandwidth the storage system could
> > deliver?
> I know, but I can upgrade the network later... no?
A _single_ SSD can saturate your 4x 1GbE links, never mind that they're
unlikely to be bonded in a way that actually gives you that performance.
Building an SSD cluster and then interconnecting it with 1GbE links is
really pointless.
Run the numbers!
With 4 storage nodes, the best speed you could possibly hope for is
800MB/s, 2 SSDs' worth...
And even if bandwidth is not your concern (it isn't for most people), that
small and SLOW network pipe is going to affect your IOPS as well.
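
Spelled out (the ~20% bonding/protocol overhead is just an assumption):

  raw_mb_s  = 4 * 4 * 1000 / 8.0    # 4 nodes x 4 links x 1Gb/s = 2000 MB/s wire speed
  after_rep = raw_mb_s / 2          # replication of 2 writes everything twice
  usable    = after_rep * 0.8       # assume ~20% lost to bonding/protocol overhead
  print("%.0f MB/s for the whole cluster" % usable)   # ~800 MB/s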

> (How can I measure network performance issues while I'm growing?)
That goes way beyond Ceph; various monitoring tools come to mind, and for
a live/momentary analysis atop and ethstats will do for starters.
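
If you don't want to install anything, even a trivial script around
/proc/net/dev gives you an ethstats-like view (a rough sketch, Linux only):

  import time

  def counters():
      c = {}
      with open("/proc/net/dev") as f:
          for line in f.readlines()[2:]:            # skip the two header lines
              iface, data = line.split(":", 1)
              fields = data.split()
              c[iface.strip()] = (int(fields[0]), int(fields[8]))  # rx, tx bytes
      return c

  old, interval = counters(), 5.0
  while True:
      time.sleep(interval)
      new = counters()
      for iface, (rx, tx) in sorted(new.items()):
          orx, otx = old.get(iface, (rx, tx))
          print("%-10s rx %7.1f MB/s  tx %7.1f MB/s" % (
              iface, (rx - orx) / interval / 1e6, (tx - otx) / interval / 1e6))
      old = new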

> >>> Next consider the HA aspects of your cluster. Aside from the obvious
> >>> like having redundant power feeds and network links/switches, what
> >>> happens if a storage node fails?
> >>> If you're starting with 2 nodes, that's risky in and by itself (also
> >>> deploy at least 3 mons).
> >>>
> >>> If you start with 4 nodes, if one goes down the default behavior of
> >>> Ceph would be to redistribute the data on the 3 remaining nodes to
> >>> maintain the replication level (a level of 2 is probably acceptable
> >>> with the right kind of SSDs).
> >>> Now what that means is a LOT of traffic for the replication, potentially
> >>> impacting your performance depending on the configuration options and
> >>> actual hardware used. It also means your "near full" settings should
> >>> be at 70% or lower, because otherwise a node failure could result in
> >>> full OSDs and thus a blocked cluster. And of course after the data is
> >>> rebalanced the lack of one node means that your cluster is about 25%
> >>> slower than before.
> >> These settings are fine by me. I don't expect node failure to be the
> >> norm.
> > Nobody expects the Spanish inquisition. Or Mr. Murphy.
> > Being aware of what happens in case it does happen goes a long way.
> > And with your 2 node test cluster you can't even test this!
> I don't want to test the failure workload. I need to learn how to
> install Ceph.
> I cannot buy 8x storage servers without ever having deployed Ceph once.
> This test is for starters. 4x will be for a small production environment.
> I count on doubling it within a year, but we need to take small steps.
> 
You will have to do the failure mode testing and other things before you
start production on the 4-node cluster then.
A 2-node cluster is a very special, unnatural thing in Ceph and not
representative.

> >
> >> To me, running 25% slower is nothing compared to not running at
> >> all.
> > If the recovery traffic is too much for your cluster (network, CPUs,
> > disks), it will be pretty much the same thing.
> > And if your cluster gets full because it was over 70% capacity when
> > that node failed, it IS the same thing.
> I don't plan to fill up all the bays right away.
> I will start with half-filled bays and grow by adding more OSDs
> instead of disks.
> However, 24 bays give me room to grow just by adding a disk to a bay
> from time to time.
You're talking about adding OSDs over time; I'm talking about the fact
that you need to keep a 4-node cluster below 70% space usage all the time.
You likely won't have the time (with SSDs and 10Gb/s links) to add more
OSDs before the re-balancing of data fills your cluster and makes it stop
working.
And with a replication of 2 (and consumer SSDs), disabling re-balancing on
node failure would be a very risky proposal.
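
The ~70% comes straight from the arithmetic (0.95 is the default full
ratio at which Ceph blocks writes):

  nodes, full_ratio = 4, 0.95
  limit = full_ratio * (nodes - 1) / nodes   # the dead node's data lands on the rest
  print("keep usage below %.0f%%" % (100 * limit))   # ~71%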

Christian

> >>> The most common and from a pure HA perspective sensible suggestion is
> >>> to start with enough nodes that a failure won't have too much impact,
> >>> but that of course is also the most expensive option. ^^
> >> Yes! But "Expensive" is not an option in this epoch :)
> >> I need to be effective
> > You can only be effective once you know all the components (HW and SW)
> > as well as the environment (client I/O mostly).
> This is the point.
> I need to take one step for HW and one step for SW.
> This is just the starting point... but we have a long way ahead of us to
> learn the SW.
> Let's see.
> 
> 
> Max


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




