Re: Cost- and Power-efficient OSD-Nodes

> FWIW, I tried using some 256G MX100s with ceph and had horrible performance
> issues within a month or two.  I was seeing 100% utilization with high
> latency but only 20 MB/s writes.  I had a number of S3500s in the same pool
> that were dramatically better.  Which is to say that they were actually
> faster than the hard disk pool they were fronting, rather than slower.
> 
> If you do go with MX200s, I'd recommend only using at most 80% of the
> drive; most cheap SSDs perform *much* better at sustained writes if you
> give them more overprovisioning space to work with.

I had planned to use at most 80GB of the available 250GB.
1 x 16GB OS
4 x 8, 12 or 16GB partitions for osd-journals.

For a total SSD usage of 19.2%, 25.6% or 32%
and over-provisioning of 80.8%, 74.4% or 68%.
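
The arithmetic behind those numbers, as a quick Python sketch (the 250GB is
the nominal drive size, so this glosses over the GB/GiB difference):

    DRIVE_GB = 250
    OS_GB = 16
    for journal_gb in (8, 12, 16):
        used = OS_GB + 4 * journal_gb  # OS partition + 4 osd-journal partitions
        print("journal %2dGB: %2dGB used = %4.1f%% of drive, %4.1f%% over-provisioned"
              % (journal_gb, used, 100.0 * used / DRIVE_GB,
                 100.0 * (DRIVE_GB - used) / DRIVE_GB))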

I am relatively certain that those SSDs would last ages with THAT
much over-provisioning.

But it is still a consumer-grade SSD, so it looks like those will be
replaced with Samsung 845DC Pro 400GB SSDs, provided there are no known
issues with those.

The added cost means a reduction in nodes for the initial setup
(6 nodes with enterprise HDDs and SSDs instead of 8 with consumer HDDs
and SSDs). I would have liked more nodes, but 6 is still a good number
to start with.
Exactly twice the lowest reasonable minimum. :)

> Scott
> 
> On Tue, Apr 28, 2015, 4:30 PM Dominik Hannen <hannen@xxxxxxxxx> wrote:
> 
>> > It's all about the total latency per operation. Most IO sizes over 10Gb
>> > don't make much difference to the round-trip time, but comparatively even
>> > 128KB IOs over 1Gb take quite a while. For example, ping a host with a
>> > payload of 64k over 1Gb and 10Gb networks and look at the difference in
>> > times. Now double this for Ceph (Client->Prim OSD->Sec OSD).
>> >
>> > When you are using SSD journals you normally end up with a write latency of
>> > 3-4ms over 10Gb; 1Gb networking will probably increase this by another
>> > 2-4ms. IOPs = 1000/latency.
>> >
>> > I guess it all really depends on how important performance is
>>
>> I reckon we are talking about single-threaded IOPs? It looks like 10ms
>> latency is in the worst-case region... 100 IOPs will do fine.
>>
>> At least in my understanding, a heavily multi-threaded load should be able
>> to get higher IOPs regardless of latency?
>>
>> Some presentation material suggested that the adverse effects of the higher
>> latency due to 1Gbit begin above IO sizes of 2k, so maybe there is room to
>> tune IOPs-hungry applications/VMs accordingly.
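
(To put rough numbers on the "IOPs = 1000/latency" point: a back-of-the-envelope
Python sketch. The 3-4ms journal latency, 128KB IO size and double network hop
are from the figures above; the queue depth of 16 and the assumption of perfect
overlap of in-flight writes are mine.)

    def iops(latency_ms, queue_depth=1):
        # 1000 / latency in ms, times how many writes are in flight at once,
        # assuming the OSDs can actually overlap that many requests
        return queue_depth * 1000.0 / latency_ms

    # wire time for a 128KB write, sent twice (Client->Prim OSD->Sec OSD)
    for gbit in (1, 10):
        wire_ms = 2 * (128 * 1024 * 8) / (gbit * 1e9) * 1000
        total_ms = 3.5 + wire_ms  # ~3-4ms SSD-journal write latency + wire time
        print("%2d Gbit: ~%.1f ms -> ~%d IOPs single-threaded, ~%d at queue depth 16"
              % (gbit, total_ms, iops(total_ms), iops(total_ms, 16)))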
>>
>> > Just had a look, and the Seagate Surveillance disks spin at 7200RPM (missed
>> > that you put that there), whereas the WD ones that I am familiar with spin
>> > at 5400rpm, so not as bad as I thought.
>> >
>> > So probably OK to use, but I don't see many people using them for Ceph/
>> > generic NAS, so I can't be sure there are no hidden gotchas.
>>
>> I am not sure how trustworthy newegg reviews are, but somehow I have some
>> doubts about them now.
>> I guess it does not matter that much, at least as long as no more than a disk
>> a month is failing? The 3-year warranty gives some hope...
>>
>> Are there some cost-efficient HDDs that someone can suggest? (Most likely
>> 3TB drives; that seems to be the sweet spot at the moment.)
>>
>> > Sorry, nothing in detail. I did actually build a Ceph cluster on the same
>> > 8-core CPU as you have listed. I didn't have any performance problems, but
>> > I do remember that with SSD journals, when doing high-queue-depth writes, I
>> > could get the CPU quite high. It's like what I said before about the 1 vs
>> > 10Gb networking: how important is performance? If using this CPU gives you
>> > an extra 1ms of latency per OSD, is that acceptable?
>> >
>> > Agree, 12 cores (guessing 2.5GHz each) will be overkill for just 12 OSDs. I
>> > have a very similar spec and see exactly the same as you, but will change
>> > the nodes to 1 CPU each when I expand and use the spare CPUs for the new
>> > nodes.
>> >
>> > I'm using this:-
>> >
>> > http://www.supermicro.nl/products/system/4U/F617/SYS-F617H6-FTPTL_.cfm
>> >
>> > Mainly because of rack density, which I know doesn't apply to you. But the
>> > fact they share PSUs/rails/chassis helps reduce power a bit and drives down
>> > cost.
>> >
>> > I can get 14 disks in each and they have 10Gb on board. The SAS controller
>> > is flashable to JBOD mode.
>> >
>> > Maybe one of the other Twin solutions might be suitable?
>>
>> I did consider that exact model (it was mentioned on the list some time ago).
>> I could get about the same effective storage capacity with it, but
>> 10G networking is just too expensive on the switch side.
>>
>> Also, those nodes and 10G switches consume a lot more power.
>>
>> By my estimates and the numbers I found, the Avoton nodes should run at about
>> 55W each. The switches (EX3300) would, according to the tech specs, need at
>> most 76W each.
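
For reference, the power math I am going by (a rough Python sketch; assuming
two EX3300 switches and that the 6 enterprise-disk nodes stay at the ~55W
Avoton estimate, both of which are my own assumptions):

    nodes, watts_per_node = 6, 55        # Avoton nodes, estimated draw each
    switches, watts_per_switch = 2, 76   # EX3300, datasheet max (assuming two)
    total_w = nodes * watts_per_node + switches * watts_per_switch
    print("~%dW total, ~%.0f kWh per month" % (total_w, total_w * 24 * 30 / 1000.0))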

___
Dominik
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



