Optimal OSD Configuration for 45 drives?

Re-added ML.


On Mon, 28 Jul 2014 20:38:37 +1000 Matt Harlum wrote:
> 
> On 27 Jul 2014, at 1:45 am, Christian Balzer <chibi at gol.com> wrote:
> 
> > On Sat, 26 Jul 2014 20:49:46 +1000 Matt Harlum wrote:
> > 
> >> 
> >> On 25 Jul 2014, at 5:54 pm, Christian Balzer <chibi at gol.com> wrote:
> >> 
> >>> On Fri, 25 Jul 2014 13:31:34 +1000 Matt Harlum wrote:
> >>> 
> >>>> Hi,
> >>>> 
> >>>> I've purchased a couple of 45Drives enclosures and would like to
> >>>> figure out the best way to configure these for Ceph?
> >>>> 
> >>> That's the second time within a month somebody mentions these 45
> >>> drive chassis. 
> >>> Would you mind elaborating which enclosures these are precisely?
> >>> 
> >>> I'm wondering especially about the backplane, as 45 is such an odd
> >>> number.
> >>> 
> >> 
> >> The chassis is from 45drives.com. It has 3 rows of 15 direct-wire SAS
> >> connectors connected to two HighPoint Rocket 750s using 12 SFF-8087
> >> connectors. I'm considering replacing the HighPoints with 3x LSI
> >> 9201-16i cards. The chassis are loaded up with 45 Seagate 4TB drives,
> >> and separate from the 45 large drives are the two boot drives in RAID 1.
> >> 
> > Oh, Backblaze inspired!
> > I stared at the originals a couple of years ago. ^.^
> > And yeah, replacing the Highpoint controllers sounds like a VERY good
> > idea. ^o^
> > 
> > You might want to get 2 (large and thus fast) Intel DC S3700 SSDs for
> > the OS drives and put the journals on those (OS on MD RAID1, journals
> > on individual partitions). 
> 
> The fact that I have a failure domain containing 180TB of data terrifies
> me, to be honest. If the whole host dies I'll be pretty boned, I'm
> guessing, and I'm going to lose sleep worrying about it. But I will
> eventually have 10Gbit for the replication network, just waiting on the
> switches.
> 
Well, if you go for 4) it won't be quite as big, at most 160TB. ^o^

See the other current threads on the ML on how to avoid unwanted
(untimely) recovery events.
Using any of these methods you will have time to bring your host (or OSD)
back online, and if that shouldn't be possible, at least control WHEN the
recovery kicks in.
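
For example, a minimal sketch (these flags and options exist as of
Firefly; the interval value is just a placeholder):
---
# Before planned maintenance, keep the cluster from marking down OSDs
# "out" (and thus from kicking off a recovery):
ceph osd set noout
# ... fix/reboot the host, bring the OSDs back up ...
ceph osd unset noout

# Or raise the automatic mark-out delay (default 300s) in ceph.conf:
# [mon]
# mon osd down out interval = 1800
---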

> I'm glad you mentioned OS + journal partitions! I didn't install any
> SSDs initially because I didn't want to deal with losing a bunch of OSDs
> at once due to journal failure, since even at 4x RAID6 OSDs I've got
> 36TB per OSD to replicate in case of an issue. Combining the journals
> into partitions on a RAIDed set is a great idea! Not sure I can get the
> boss to spring for some S3700s but I'll see :)
> 
Actually my suggestion was just to RAID the OS and not the journals, for
the obvious performance reasons. 
Also see above: recoveries can be controlled, and those 36TB would be the
worst case, of course.
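
Something like this is what I had in mind (a sketch with hypothetical
device names, assuming 2 SSDs and 4 journal partitions):
---
# OS on a small MD RAID1 across both SSDs:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# Journals on raw, individual partitions, NOT on the RAID:
#   /dev/sda3, /dev/sda4 -> journals for osd.0 and osd.1
#   /dev/sdb3, /dev/sdb4 -> journals for osd.2 and osd.3
# and in ceph.conf, per OSD:
#   [osd.0]
#   osd journal = /dev/sda3
---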

Remember that you won't be able to write faster than the journal speed to
your OSDs.
So if you were to get 2 400GB DC S3700 SSDs that would be just shy of
1GB/s, which is definitely less than what your HDDs can scribble away.
But it will deal with bursty IO much more nicely.
It boils down to what that storage is used for; since you said backups,
we're looking more at sequential writes and reads than anything else (of
course if those backups come in in parallel...).
The DC S3700 400GB SSD is rated for 4TB/day for 5 years, so if you're
thinking of writing more than that per day, a RAID controller with lots of
cache is probably the better choice. 
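
Back-of-envelope, assuming the ~460MB/s sequential write spec of the 400GB
DC S3700 and ~150MB/s per HDD:
---
# journal throughput:  2 SSDs x ~460MB/s = ~920MB/s (just shy of 1GB/s)
# HDD aggregate (seq): 45 x ~150MB/s     = ~6.7GB/s (the journal is the cap)
# SSD endurance:       400GB x 10 drive writes/day = 4TB/day, for 5 years
---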


> > 
> >>> Also if you don't mind, specify "a couple" and what your net storage
> >>> requirements are.
> >>> 
> >> 
> >> Total is 3 of these 45drives.com enclosures for 3 replicas of our
> >> data, 
> >> 
> > If you're going to use RAID6, a replica of 2 will be fine.
> 
> Awesome, Should give me a bunch of extra space then :)
>
And higher speed, too.
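
Rough numbers, assuming 4x 11-drive RAID6 sets per host (9 data drives
each) as in 4):
---
# per host:  4 x 9 x 4TB = 144TB usable after RAID6
# 3 hosts:   3 x 144TB   = 432TB
# replica 3: 432TB / 3   = 144TB net
# replica 2: 432TB / 2   = 216TB net  <- the extra space
---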
 
> > 
> >>> In fact, read this before continuing:
> >>> ---
> >>> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> >>> ---
> >>> 
> >>>> Mainly I was wondering if it was better to set up multiple RAID
> >>>> groups and then put an OSD on each, rather than an OSD for each of
> >>>> the 45 drives in the chassis? 
> >>>> 
> >>> Steve already toed the conservative Ceph party line here; let me
> >>> give you some alternative views and options on top of that, and
> >>> recap what I wrote in the thread above.
> >>> 
> >>> In addition to his links, read this:
> >>> ---
> >>> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> >>> ---
> >>> 
> >>> Let's go from cheap and cheerful to "comes with racing stripes".
> >>> 
> >>> 1) All spinning rust, all the time. Plunk in 45 drives, as JBOD
> >>> behind the cheapest (and densest) controllers you can get. Having
> >>> the journal on the disks will halve their performance, but you just
> >>> wanted the space and are not that pressed for IOPS. 
> >>> The best you can expect per node with this setup is something around
> >>> 2300 IOPS with normal (7200RPM) disks (45 x ~100 IOPS, halved by the
> >>> journal).
> >>> 
> >>> 2) Same as 1), but use controllers with a large HW cache (4GB Areca
> >>> comes to mind) in JBOD (or 45 times RAID0) mode. 
> >>> This will alleviate some of the thrashing problems, particularly if
> >>> you're expecting high IOPS to come in short bursts.
> >>> 
> >>> 3) Ceph Classic, basically what Steve wrote. 
> >>> 32 HDDs, 8 SSDs for journals (you do NOT want an uneven spread of
> >>> journals). This will give you a sustainable 3200 IOPS, but of course
> >>> the journals on SSDs not only avoid all that thrashing about on the
> >>> disk but also allow for coalescing of writes, so this is going to be
> >>> the fastest solution so far. Of course you will need 3 of these at
> >>> minimum for acceptable redundancy, unlike 4) which just needs a
> >>> replication level of 2.
> >>> 
> >>> 4) The anti-cephalopod. See my reply from a month ago in the link
> >>> above. All the arguments apply; it very much depends upon your use
> >>> case and budget. In my case the higher density, lower cost and ease
> >>> of maintaining the cluster were well worth the lower IOPS.
> >>> 
> >>> 5) We can improve upon 3) by using HW cached controllers of course.
> >>> And hey, you did need to connect those drive bays somehow anyway.
> >>> ^o^ Maybe even squeeze some more out of it by having the SSD
> >>> controller separate from the HDD one(s).
> >>> This is as fast (IOPS) as it comes w/o going to full SSD.
> >>> 
> >>> 
> >> 
> >> Thanks, "All Spinning Rust" will probably be fine; we're looking to
> >> just store full server backups for a long time, so there's no high IO
> >> expected or anything like that. The servers came with some pretty
> >> underpowered specs re: CPU/RAM; they support a max of 32GB each and a
> >> single socket, but at some point I plan to upgrade the motherboard to
> >> allow much, much more RAM to be fitted.
> >> 
> >> Mainly the reason I ask whether it's a good idea to set up RAID groups
> >> for the OSDs is that I can't put 96GB of RAM in these and can't put
> >> enough CPU power into them. I'm imagining it'll all start to fall to
> >> pieces if I try to operate these with Ceph due to the small amount of
> >> RAM and CPU?
> >> 
> > Yeah, you would probably be in some tight spots with the default mobo
> > and 45 individual OSDs. 
> > For your use case and this HW, RAIDed OSDs look like a good alternative
> > to 1); heck, even MD RAID might do the trick if the CPU is beefy enough.
> > 
> > If you can replace the mobo/CPUs/RAM with something more adequate
> > before deployment, go for 1).
> > 
> > 
> 
> It's probably going to have to be MD RAID, but I don't envision that'll
> be an issue considering I'll probably upgrade the CPUs to Xeons. I've
> currently got one of my pods temporarily running just NFS and RAID60
> until I get Ceph deployed, and there are no issues there. RAM is cheap;
> not sure how expensive the motherboards are, as I'll need something
> with IPMI etc.
> 
I'm happy with Supermicro in general; we deploy a lot of their gear. 
Make sure to have a long burn-in/stress-test cycle. I've twice had mobos
fail, with sporadic reboots getting worse over time, once just after the
machine was put into production and fully loaded with data.
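
What exactly to run is a matter of taste; a rough sketch with common tools
(/dev/sdX is a placeholder, and the fio run destroys whatever is on it):
---
# RAM, 10 passes:
memtester 24G 10
# Hammer the disks/controller for a day (DESTROYS data on sdX):
fio --name=burnin --filename=/dev/sdX --rw=randwrite --bs=4k \
    --iodepth=32 --ioengine=libaio --runtime=86400 --time_based
# Afterwards, check for reallocated/pending sectors:
smartctl -a /dev/sdX
---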

> I'm sure it would be a lot easier on me if someone hadn't had the
> bright idea to use Backblaze enclosures. Maybe I'm just worrying too
> much!
> 
They are a pretty good bang for the buck and density counts in many
situations. I wouldn't use them for a generic Ceph cluster unless I had the
need to fill a few racks with them, as larger numbers make individual
failures less and less significant. 

For your use case they are pretty decent, once you beef them up.
An option (other than duct tape ^o^) for mounting four 2.5" OS (and
journal) SSDs would be nice, as the lack of one is a limiting factor when
trying to coax the full speed out of these chassis.

Christian


> Thanks for being so helpful Christian!
> 
> 
> > Christian 
> >>> Networking:
> >>> Either of the setups above will saturate a single 10Gb/s aka 1GB/s
> >>> link, as Steve noted. 
> >>> In fact 3) to 5) will be able to write up to 4GB/s in theory based on
> >>> the HDDs' sequential performance, but that is unlikely to be seen in
> >>> real life. And of course your maximum write speed is based on the
> >>> speed of the SSDs. So for example with 3) you would want those 8 SSDs
> >>> to have write speeds of about 250MB/s, giving you 2GB/s max write.
> >>> Which in turn means 2 10Gb/s links at least, up to 4 if you want
> >>> redundancy and/or a separation of public and cluster networks.
> >>> 
> >>> RAM:
> >>> The more, the merrier. 
> >>> It's relatively cheap, and avoiding having to actually read from the
> >>> disks will make your write IOPS so much happier.
> >>> 
> >>> CPU:
> >>> You'll want something like Steve recommended for 3); I'd go with 2
> >>> 8-core CPUs actually, so you have some oomph to spare for the OS, IRQ
> >>> handling, etc. With 4) and its actual 4 OSDs, about half of that will
> >>> be fine, with the expectation of Ceph code improvements. 
> >>> 
> >>> Mobo:
> >>> You're fine for overall PCIe bandwidth, even w/o going to PCIe v3. 
> >>> But you might have up to 3 HBAs/RAID cards and 2 network cards, so
> >>> make sure you can get this all into appropriate slots.
> >>> 
> >>> Regards,
> >>> 
> >>> Christian
> >>> -- 
> >>> Christian Balzer        Network/Systems Engineer                
> >>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/
> >> 
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi at gol.com   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 
> --Matt Harlum
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

