On 25 Jul 2014, at 5:54 pm, Christian Balzer <chibi at gol.com> wrote:

> On Fri, 25 Jul 2014 13:31:34 +1000 Matt Harlum wrote:
>
>> Hi,
>>
>> I've purchased a couple of 45Drives enclosures and would like to figure
>> out the best way to configure these for ceph?
>>
> That's the second time within a month somebody mentions these 45 drive
> chassis.
> Would you mind elaborating which enclosures these are precisely?
>
> I'm wondering especially about the backplane, as 45 is such an odd number.
>

The chassis is from 45drives.com. It has 3 rows of 15 direct-wired SAS connectors, connected to two HighPoint Rocket 750s using 12 SFF-8087 connectors. I'm considering replacing the HighPoints with 3x LSI 9201-16i cards.

The chassis are loaded up with 45 Seagate 4TB drives, and separate from the 45 large drives are the two boot drives in RAID 1.

> Also if you don't mind, specify "a couple" and what your net storage
> requirements are.
>

Total is 3 of these 45drives.com enclosures, for 3 replicas of our data.

> In fact, read this before continuing:
> ---
> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg11011.html
> ---
>
>> Mainly I was wondering if it was better to set up multiple raid groups
>> and then put an OSD on each rather than an OSD for each of the 45 drives
>> in the chassis?
>>
> Steve already toed the conservative Ceph party line here, let me give you
> some alternative views and options on top of that and recap what I
> wrote in the thread above.
>
> In addition to his links, read this:
> ---
> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> ---
>
> Let's go from cheap and cheerful to "comes with racing stripes".
>
> 1) All spinning rust, all the time. Plunk in 45 drives, as JBOD behind the
> cheapest (and densest) controllers you can get. Having the journal on the
> disks will halve their performance, but you just wanted the space and are
> not that pressed for IOPS.
> The best you can expect per node with this setup is something around 2300
> IOPS with normal (7200RPM) disks.
>
> 2) Same as 1), but use controllers with a large HW cache (4GB Areca comes
> to mind) in JBOD (or 45 times RAID0) mode.
> This will alleviate some of the thrashing problems, particularly if you're
> expecting high IOPS to come in short bursts.
>
> 3) Ceph Classic, basically what Steve wrote.
> 32 HDDs, 8 SSDs for journals (you do NOT want an uneven spread of journals).
> This will give you a sustainable 3200 IOPS, but of course the journals on
> SSDs not only avoid all that thrashing about on the disk but also allow for
> coalescing of writes, so this is going to be the fastest solution so far.
> Of course you will need 3 of these at minimum for acceptable redundancy,
> unlike 4) which just needs a replication level of 2.
>
> 4) The anti-cephalopod. See my reply from a month ago in the link above.
> All the arguments apply, it very much depends upon your use case and
> budget. In my case the higher density, lower cost and ease of maintaining
> the cluster were well worth the lower IOPS.
>
> 5) We can improve upon 3) by using HW cached controllers of course. And
> hey, you did need to connect those drive bays somehow anyway. ^o^
> Maybe even squeeze some more out of it by having the SSD controller
> separate from the HDD one(s).
> This is as fast (IOPS) as it comes w/o going to full SSD.
>

Thanks, "All Spinning Rust" will probably be fine; we're looking to just store full server backups for a long time, so there's no high IO expected or anything like that.
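If I go that route, I'm assuming a fairly plain ceph.conf would do, something like the sketch below: 3 replicas to match the 3 enclosures, one OSD per drive, and the journal co-located on each data disk as in option 1 (the 10GB journal size and the "defaults are enough" assumption are my guesses, not anything Christian specified). His ~2300 IOPS per node figure presumably falls out of 45 spindles at roughly 100 IOPS each, halved by the on-disk journal.

    [global]
    # one copy per 45-drive enclosure
    osd pool default size = 3
    # keep serving I/O as long as two copies remain
    osd pool default min size = 2

    [osd]
    # journal lives on the same spinner as the data ("all spinning rust"),
    # which roughly halves write IOPS as noted above; 10GB is just a guess
    osd journal size = 10240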
The servers came with some pretty underpowered specs re: CPU/RAM; they support a max of 32GB each and are single socket, but at some point I plan to upgrade the motherboards to allow much more RAM to be fitted. The main reason I ask whether it's a good idea to set up RAID groups for the OSDs is that I can't put 96GB of RAM in these and can't put enough CPU power into them. I'm imagining it'll all start to fall to pieces if I try to operate these with Ceph due to the small amount of RAM and CPU?

> Networking:
> Either of the setups above will saturate a single 10Gb/s aka 1GB/s as
> Steve noted.
> In fact 3) to 5) will be able to write up to 4GB/s in theory based on the
> HDDs' sequential performance, but that is unlikely to be seen in real life.
> And of course your maximum write speed is based on the speed of the SSDs.
> So for example with 3) you would want those 8 SSDs to have write speeds of
> about 250MB/s, giving you 2GB/s max write.
> Which in turn means 2 10Gb/s links at least, up to 4 if you want
> redundancy and/or a separation of public and cluster network.
>
> RAM:
> The more, the merrier.
> It's relatively cheap, and avoiding having to actually read from the disks
> will make your write IOPS so much happier.
>
> CPU:
> You'll want something like Steve recommended for 3), I'd go with 2 8-core
> CPUs actually, so you have some oomph to spare for the OS, IRQ handling,
> etc. With 4) and its actual 4 OSDs, about half of that will be fine, with
> the expectation of Ceph code improvements.
>
> Mobo:
> You're fine for overall PCIe bandwidth, even w/o going to PCIe v3.
> But you might have up to 3 HBAs/RAID cards and 2 network cards, so make
> sure you can get this all into appropriate slots.
>
> Regards,
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
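Following the note above about separating public and cluster networks, this is roughly what I'd pencil into ceph.conf once a second 10Gb/s link is in place (the subnets are made-up placeholders for our actual ranges):

    [global]
    # client-facing traffic
    public network = 192.168.10.0/24
    # replication / recovery traffic between OSD nodes
    cluster network = 192.168.20.0/24

With those two options set, OSD-to-OSD replication and recovery traffic stays on the second link, so incoming backup writes don't have to compete with the 3x replication traffic they generate.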