Optimal OSD Configuration for 45 drives?

On Sat, 26 Jul 2014 20:49:46 +1000 Matt Harlum wrote:

> 
> On 25 Jul 2014, at 5:54 pm, Christian Balzer <chibi at gol.com> wrote:
> 
> > On Fri, 25 Jul 2014 13:31:34 +1000 Matt Harlum wrote:
> > 
> >> Hi,
> >> 
> >> I've purchased a couple of 45Drives enclosures and would like to
> >> figure out the best way to configure these for Ceph?
> >> 
> > That's the second time within a month somebody mentions these 45 drive
> > chassis. 
> > Would you mind elaborating which enclosures these are precisely?
> > 
> > I'm wondering especially about the backplane, as 45 is such an odd
> > number.
> > 
> 
> The chassis is from 45drives.com. It has 3 rows of 15 direct-wire SAS
> connectors connected to two HighPoint Rocket 750s using 12 SFF-8087
> connectors. I'm considering replacing the HighPoints with 3x LSI
> 9201-16i cards. The chassis are loaded up with 45 Seagate 4TB drives,
> and separate from the 45 large drives are the two boot drives in RAID 1.
> 
Oh, Backblaze inspired!
I stared at the originals a couple of years ago. ^.^
And yeah, replacing the Highpoint controllers sounds like a VERY good
idea. ^o^

You might want to get 2 (large and thus fast) Intel DC S3700 SSDs for the
OS drives and put the journals on those (OS on MD RAID1, journals on
individual partitions).
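
If you go that route, the journal partitions are easy to sanity-check
with the usual sizing rule of thumb (journal >= 2 x expected throughput
x filestore max sync interval). A rough sketch in Python; the throughput
and sync interval below are assumptions, plug in your own numbers:

# Rough journal partition sizing, using the common rule of thumb:
#   journal size >= 2 * expected throughput * filestore max sync interval
# All figures below are assumptions, adjust them to your hardware.
hdd_write_mb_s   = 150   # assumed sequential write of a 7200RPM 4TB disk
sync_interval_s  = 5     # assumed filestore max sync interval
journals_per_ssd = 23    # ~45 journals spread over 2 SSDs

journal_mb = 2 * hdd_write_mb_s * sync_interval_s
per_ssd_mb = journals_per_ssd * journal_mb

print("per-journal partition: %d MB" % journal_mb)   # 1500 MB
print("space needed per SSD:  %d MB" % per_ssd_mb)   # 34500 MB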

> > Also if you don't mind, specify "a couple" and what your net storage
> > requirements are.
> > 
> 
> Total is 3 of these 45drives.com enclosures for 3 replicas of our data, 
> 
If you're going to use RAID6, a replication level of 2 will be fine.
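
Rough capacity math to illustrate the trade-off (the 3x 15-disk RAID6
sets per chassis below are just an assumed layout for the example, not
a recommendation):

# Back-of-envelope usable capacity for 3 chassis x 45 x 4TB drives.
# The 3x 15-disk RAID6 sets per chassis are an assumed layout, and
# neither figure includes the free space headroom you want to keep.
raw_tb = 3 * 45 * 4.0                          # 540 TB raw

jbod_replica3 = raw_tb / 3                     # plain OSDs, size=3

per_chassis_raid6_tb = 3 * (15 - 2) * 4.0      # 156 TB after parity
raid6_replica2 = 3 * per_chassis_raid6_tb / 2  # RAID6 OSDs, size=2

print("JBOD OSDs,  3 replicas: %.0f TB" % jbod_replica3)    # 180 TB
print("RAID6 OSDs, 2 replicas: %.0f TB" % raid6_replica2)   # 234 TB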

> > In fact, read this before continuing:
> > ---
> > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> > ---
> > 
> >> Mainly I was wondering if it was better to set up multiple RAID groups
> >> and then put an OSD on each, rather than an OSD for each of the 45
> >> drives in the chassis?
> >> 
> > Steve already toed the conservative Ceph party line here; let me give
> > you some alternative views and options on top of that, and recap
> > what I wrote in the thread above.
> > 
> > In addition to his links, read this:
> > ---
> > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> > ---
> > 
> > Let's go from cheap and cheerful to "comes with racing stripes".
> > 
> > 1) All spinning rust, all the time. Plunk in 45 drives, as JBOD behind
> > the cheapest (and densest) controllers you can get. Having the journal
> > on the disks will halve their performance, but you just wanted the
> > space and are not that pressed for IOPS. 
> > The best you can expect per node with this setup is something around
> > 2300 IOPS with normal (7200RPM) disks.
> > 
> > 2) Same as 1), but use controllers with a large HW cache (4GB Areca
> > comes to mind) in JBOD (or 45 times RAID0) mode. 
> > This will alleviate some of the thrashing problems, particularly if
> > you're expecting high IOPS to come in short bursts.
> > 
> > 3) Ceph Classic, basically what Steve wrote.
> > 32 HDDs, 8 SSDs for journals (you do NOT want an uneven spread of
> > journals). This will give you a sustainable 3200 IOPS, but of course
> > the journals on SSDs not only avoid all that thrashing about on the
> > disk but also allow for coalescing of writes, so this is going to be
> > the fastest solution so far. Of course you will need 3 of these at
> > minimum for acceptable redundancy, unlike 4) which just needs a
> > replication level of 2.
> > 
> > 4) The anti-cephalopod. See my reply from a month ago in the link
> > above. All the arguments apply; it very much depends upon your use
> > case and budget. In my case the higher density, lower cost and ease
> > of maintaining the cluster were well worth the lower IOPS.
> > 
> > 5) We can improve upon 3) by using HW cached controllers of course. And
> > hey, you did need to connect those drive bays somehow anyway. ^o^ 
> > Maybe even squeeze some more out of it by having the SSD controller
> > separate from the HDD one(s).
> > This is as fast (IOPS) as it comes w/o going to full SSD.
> > 
> > 
> 
> Thanks, "All Spinning Rust" will probably be fine; we're looking to just
> store full server backups for a long time, so there's not expected to be
> high IO or anything like that. The servers came with some pretty
> underpowered specs re: CPU/RAM and they support a max of 32GB each and a
> single socket, but at some point I plan to upgrade the motherboard to
> allow much, much more RAM to be fitted.
> 
> Mainly the reason why I ask if it's a good idea to set up RAID groups
> for the OSDs is that I can't put 96GB of RAM in these and can't put
> enough CPU power into them. I'm imagining it'll all start to fall to
> pieces if I try to operate these with Ceph due to the small amount of
> RAM and CPU?
> 
Yeah, you would probably be in some tight spots with the default mobo and
45 individual OSDs. 
For your use case and this HW, RAIDed OSDs look like a good alternative to
1); heck, even MD RAID might do the trick if the CPU is beefy enough.
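
As a rough sanity check, with the often-quoted rules of thumb of about
1GB of RAM and roughly 1GHz of a core per OSD daemon (the total CPU
figure below is an assumption for the stock single socket):

# Rough per-node fit check: ~1 GB RAM and ~1 GHz of CPU per OSD daemon,
# keeping some headroom for the OS; the CPU total is an assumed value.
ram_gb      = 32
cpu_ghz     = 4 * 2.4          # assumed: 4 cores at 2.4 GHz
os_headroom = 4                # GB kept free for the OS and page cache

for osds in (45, 9, 3):        # individual disks vs. a few RAIDed OSDs
    ram_ok = osds * 1.0 <= ram_gb - os_headroom
    cpu_ok = osds * 1.0 <= cpu_ghz
    print("%2d OSDs -> RAM ok: %s, CPU ok: %s" % (osds, ram_ok, cpu_ok))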

If you can replace the mobo/CPUs/RAM with something more adequate before
deployment, go for 1).
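
For reference, the IOPS figures in options 1) and 3) above are just
simple arithmetic; the ~100 write IOPS per 7200RPM disk is an assumed
number:

# Back-of-envelope numbers behind options 1) and 3) above.
# ~100 write IOPS per 7200RPM disk is an assumption.
per_disk_iops = 100

option1 = 45 * per_disk_iops / 2.0   # journal on the same disk halves it
option3 = 32 * per_disk_iops         # journals offloaded to the 8 SSDs

print("option 1: ~%d IOPS per node" % option1)   # ~2250, i.e. "around 2300"
print("option 3: ~%d IOPS per node" % option3)   # 3200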


Christian 
> > Networking:
> > Any of the setups above will saturate a single 10Gb/s (aka 1GB/s) link,
> > as Steve noted.
> > In fact 3) to 5) will be able to write up to 4GB/s in theory based on
> > the HDDs' sequential performance, but that is unlikely to be seen in
> > real life. And of course your maximum write speed is based on the
> > speed of the SSDs. So for example with 3) you would want those 8 SSDs
> > to have write speeds of about 250MB/s, giving you 2GB/s max write.
> > Which in turn means 2 10Gb/s links at least, up to 4 if you want
> > redundancy and/or a separation of public and cluster network.
> > 
> > RAM:
> > The more, the merrier. 
> > It's relatively cheap, and avoiding having to actually read from the
> > disks will make your write IOPS so much happier.
> > 
> > CPU:
> > You'll want something like Steve recommended for 3); I'd go with 2
> > 8-core CPUs actually, so you have some oomph to spare for the OS, IRQ
> > handling, etc. With 4) and just 4 actual OSDs, about half of that will
> > be fine, with the expectation of Ceph code improvements.
> > 
> > Mobo:
> > You're fine for overall PCIe bandwidth, even w/o going to PCIe v3. 
> > But you might have up to 3 HBAs/RAID cards and 2 network cards, so make
> > sure you can get all of this into appropriate slots.
> > 
> > Regards,
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi at gol.com   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

