Optimal OSD Configuration for 45 drives?

On 07/25/2014 12:04 PM, Christian Balzer wrote:
> On Fri, 25 Jul 2014 07:24:26 -0500 Mark Nelson wrote:
>
>> On 07/25/2014 02:54 AM, Christian Balzer wrote:
>>> On Fri, 25 Jul 2014 13:31:34 +1000 Matt Harlum wrote:
>>>
>>>> Hi,
>>>>
>>>> I've purchased a couple of 45Drives enclosures and would like to
>>>> figure out the best way to configure these for ceph?
>>>>
>>> That's the second time within a month somebody mentions these 45 drive
>>> chassis.
>>> Would you mind elaborating which enclosures these are precisely?
>>
>> I'm guessing the supermicro SC847E26:
>>
>> http://www.supermicro.com/products/chassis/4U/847/SC847E26-RJBOD1.cfm
>>
> Le Ouch!
>
> Supermicro really must be getting desperate for high-density chassis
> that are not top-loading.
>
> Well, if I read that link and the actual manual correctly, the most one
> can hope to get from this is 48Gb/s (2 mini-SAS ports with 4 lanes each),
> which is short of what 45 regular HDDs can dish out (or take in).
> And that's ignoring the inherent deficiencies of dealing with port
> expanders.
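
A quick back-of-the-envelope check of that (a minimal sketch; the per-lane
speed and per-drive throughput below are my assumptions, not figures from
the chassis spec):

    # Can 2x mini-SAS uplinks (4 lanes each) keep up with 45 spinning disks?
    # Assumed: 6 Gb/s per SAS2 lane, ~150 MB/s sequential per 7200RPM drive.
    lanes = 2 * 4
    lane_gbps = 6                               # assumed SAS2 lane rate
    uplink_mbs = lanes * lane_gbps * 1000 / 8   # 48 Gb/s ~= 6000 MB/s, best case

    drives = 45
    drive_mbs = 150                             # assumed per-drive sequential rate
    aggregate_mbs = drives * drive_mbs          # ~6750 MB/s from the spindles

    print(f"uplink ~{uplink_mbs:.0f} MB/s vs drives ~{aggregate_mbs} MB/s")

So the expander uplinks come up a little short of what the spindles can
stream, before even counting expander overhead.
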
>
> Either way, a head for this kind of enclosure would need pretty much all
> the things mentioned before: a low-density (8-lane) but high-performance,
> large-cache controller, and definitely SSDs for journals.
>
> There must be some actual threshold, but my gut feeling tells me that
> something slightly less dense where you don't have to get another case for
> the head might turn out cheaper.
> Especially if a 1U head (RAID/HBA and network cards) and space for
> journal SSDs doesn't cut it.

Personally I'm a much bigger fan of the SC847A.  No expanders in the
backplane, 36 3.5" bays with the MB integrated.  It's a bit old at this
point and the FatTwin nodes can go denser (both in terms of nodes and
drives), but I've been pretty happy with it as a performance test
platform.  It's really nice having the drives directly connected to the
controllers.  Having 4-5 controllers in 1 box is a bit tricky though.
The FatTwin Hadoop nodes are a bit nicer in that regard.

Mark

>
> Christian
>
>>>
>>> I'm wondering especially about the backplane, as 45 is such an odd
>>> number.
>>>
>>> Also if you don't mind, specify "a couple" and what your net storage
>>> requirements are.
>>>
>>> In fact, read this before continuing:
>>> ---
>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
>>> ---
>>>
>>>> Mainly I was wondering if it was better to set up multiple raid groups
>>>> and then put an OSD on each rather than an OSD for each of the 45
>>>> drives in the chassis?
>>>>
>>> Steve already toed the conservative Ceph party line here; let me give
>>> you some alternative views and options on top of that and to recap
>>> what I wrote in the thread above.
>>>
>>> In addition to his links, read this:
>>> ---
>>> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
>>> ---
>>>
>>> Let's go from cheap and cheerful to "comes with racing stripes".
>>>
>>> 1) All spinning rust, all the time. Plunk in 45 drives, as JBOD behind
>>> the cheapest (and densest) controllers you can get. Having the journal
>>> on the disks will halve their performance, but you just wanted the
>>> space and are not that pressed for IOPS.
>>> The best you can expect per node with this setup is something around
>>> 2300 IOPS with normal (7200RPM) disks (rough math in the sketch after
>>> option 5 below).
>>>
>>> 2) Same as 1), but use controllers with a large HW cache (4GB Areca
>>> comes to mind) in JBOD (or 45 times RAID0) mode.
>>> This will alleviate some of the thrashing problems, particularly if
>>> you're expecting the high IOPS to come in short bursts.
>>>
>>> 3) Ceph Classic, basically what Steve wrote.
>>> 32 HDDs, 8 SSDs for journals (you do NOT want an uneven spread of
>>> journals). This will give you a sustainable 3200 IOPS (again, rough math
>>> in the sketch after option 5 below), and of course the journals on SSDs
>>> not only avoid all that thrashing about on the disk but also allow for
>>> coalescing of writes, so this is going to be the fastest solution so
>>> far. Of course you will need 3 of these at minimum for acceptable
>>> redundancy, unlike 4), which just needs a replication level of 2.
>>>
>>> 4) The anti-cephalopod. See my reply from a month ago in the link
>>> above. All the arguments apply, it very much depends upon your use
>>> case and budget. In my case the higher density, lower cost, and ease of
>>> maintaining the cluster were well worth the lower IOPS.
>>>
>>> 5) We can improve upon 3) by using HW cached controllers of course. And
>>> hey, you did need to connect those drive bays somehow anyway. ^o^
>>> Maybe even squeeze some more out of it by having the SSD controller
>>> separate from the HDD one(s).
>>> This is as fast (IOPS) as it comes w/o going to full SSD.
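
To sanity-check the IOPS figures in 1) and 3), here is a rough sketch; the
~100 IOPS per 7200RPM spindle is an assumption on my part, not a number
from this thread:

    # Per-node IOPS estimates for options 1) and 3).
    # Assumed: ~100 IOPS per 7200RPM spindle.
    iops_per_hdd = 100

    # Option 1: 45 data disks with the journal co-located on each disk,
    # which roughly halves effective write IOPS.
    option1_iops = 45 * iops_per_hdd / 2   # ~2250, i.e. "around 2300" above

    # Option 3: 32 data disks, 8 journal SSDs, so no halving and an even
    # spread of journals.
    option3_iops = 32 * iops_per_hdd       # 3200, matching the figure above
    journals_per_ssd = 32 // 8             # 4 journals per SSD

    print(option1_iops, option3_iops, journals_per_ssd)
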
>>>
>>>
>>> Networking:
>>> Any of the setups above will saturate a single 10Gb/s (aka 1GB/s) link,
>>> as Steve noted.
>>> In fact 3) to 5) will be able to write up to 4GB/s in theory, based on
>>> the HDDs' sequential performance, but that is unlikely to be seen in
>>> real life. And of course your maximum write speed is bounded by the
>>> speed of the journal SSDs. So for example with 3) you would want those
>>> 8 SSDs to have write speeds of about 250MB/s, giving you 2GB/s max
>>> write. Which in turn means 2 10Gb/s links at least, up to 4 if you want
>>> redundancy and/or a separation of public and cluster network.
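
The link-count arithmetic above, as a minimal sketch (the ~1GB/s of usable
throughput per 10GbE link is an assumption):

    # How many 10GbE links does option 3) need to carry its journal writes?
    ssds = 8
    ssd_write_mbs = 250                    # per-SSD write speed, as above
    max_write_mbs = ssds * ssd_write_mbs   # 2000 MB/s = 2 GB/s max write
    link_mbs = 1000                        # assumed usable per 10GbE link
    links = -(-max_write_mbs // link_mbs)  # ceiling division -> 2 links
    print(links)                           # double that for redundancy and/or
                                           # a split public/cluster network
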
>>>
>>> RAM:
>>> The more, the merrier.
>>> It's relatively cheap, and avoiding having to actually read from the
>>> disks will make your write IOPS so much happier.
>>>
>>> CPU:
>>> You'll want something like Steve recommended for 3); I'd actually go
>>> with 2 8-core CPUs, so you have some oomph to spare for the OS, IRQ
>>> handling, etc. With 4) and its actual 4 OSDs, about half of that will be
>>> fine, with the expectation of Ceph code improvements.
>>>
>>> Mobo:
>>> You're fine for overall PCIe bandwidth, even w/o going to PCIe v3.
>>> But you might have up to 3 HBAs/RAID cards and 2 network cards, so make
>>> sure you can get this all into appropriate slots.
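
A rough illustration of that headroom claim (the slot layout and the
~500MB/s usable per PCIe 2.0 lane are assumptions):

    # PCIe 2.0 slot bandwidth vs. worst-case sequential load from the disks.
    lane_mbs = 500                 # assumed usable per PCIe 2.0 lane
    hba_lanes = 3 * 8              # up to three x8 HBA/RAID cards
    nic_lanes = 2 * 8              # two x8 10GbE cards
    slot_mbs = (hba_lanes + nic_lanes) * lane_mbs   # 20000 MB/s of slots
    disk_mbs = 45 * 150            # 45 spindles at an assumed ~150 MB/s each
    print(slot_mbs, disk_mbs)      # plenty of headroom even without PCIe v3
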
>>>
>>> Regards,
>>>
>>> Christian
>>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


