full osd ssd cluster advise : replication 2x or 3x ?

>>What is your main goal for that cluster, high IOPS, high sequential writes
>>or reads?

High IOPS, mostly random. (It's an RBD cluster with qemu-kvm guests, around 1000 VMs, each doing small I/Os.)

80% read / 20% write.

I don't care about sequential workloads or bandwidth.


>>Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect
>>more than 800 write IOPS and 4000 read IOPS per OSD (replication 2).

Yes, that's enough for me! I can't use spinning disks, because they're far too slow.
I need around 30,000 IOPS for around 20TB of storage.

I could even go with cheaper consumer SSDs (like the Crucial M550); I think I could reach 2000-4000 IOPS from each.
But I'm worried about durability and stability.
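
For a quick sanity check, a back-of-envelope estimate (a rough sketch only, assuming the ~4000 read / ~800 write IOPS per OSD figures from your thread and the 80/20 mix above):

# Rough OSD-count estimate for 30,000 IOPS at 80% read / 20% write,
# using the ~4000 read / ~800 write IOPS per OSD (replication 2) quoted below.
target_iops = 30000
read_ratio, write_ratio = 0.8, 0.2
read_iops_per_osd, write_iops_per_osd = 4000, 800

osds_for_reads = target_iops * read_ratio / read_iops_per_osd     # 6.0
osds_for_writes = target_iops * write_ratio / write_iops_per_osd  # 7.5
print(max(osds_for_reads, osds_for_writes))                       # writes are the limit

With 6 nodes x 10 OSDs = 60 OSDs planned, there's plenty of headroom on paper; per-request latency under load is the harder question.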

----- Original Message ----- 

From: "Christian Balzer" <chibi at gol.com> 
To: ceph-users at lists.ceph.com 
Sent: Friday, 23 May 2014 04:57:51 
Subject: Re: full osd ssd cluster advise : replication 2x or 3x ? 


Hello, 

On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote: 

> Hi, 
> 
> I'm looking to build a full osd ssd cluster, with this config: 
> 
What is your main goal for that cluster, high IOPS, high sequential writes 
or reads? 

Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect 
more than 800 write IOPS and 4000 read IOPS per OSD (replication 2). 

> 6 nodes, 
> 
> each node with 10 OSD/SSD drives (dual 10Gbit network), 1 journal + data 
> on each OSD 
> 
Halving the write speed of the SSD, leaving you with about 2GB/s max write 
speed per node. 

If you're after good write speeds with a replication factor of 2, I 
would split the network into public and cluster ones. 
If, however, you're after top read speeds, bond the 2 links into 
the public network; half of your SSDs per node are able to saturate that. 
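
For what it's worth, a minimal ceph.conf sketch of that split (the subnets are placeholders; "public network" and "cluster network" are the relevant options):

[global]
# client / VM-facing traffic (placeholder subnet)
public network = 192.168.0.0/24
# replication, recovery and backfill traffic (placeholder subnet)
cluster network = 192.168.1.0/24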

> SSD drives will be enterprise grade, 
> 
> maybe Intel DC S3500 800GB (a well-known SSD) 
> 
How much write activity do you expect per OSD (remember that in your 
case writes are doubled)? Those drives have a total write endurance of 
about 450TB (within 5 years). 
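
To put that rating in daily terms (a rough sketch, assuming the ~450TB figure above and journal + data on the same SSD, so each client write hits the drive twice):

# ~450 TB of rated writes spread over 5 years, with each client write
# hitting the SSD twice (journal + data on the same device).
endurance_tb = 450
nand_gb_per_day = endurance_tb * 1000.0 / (5 * 365)  # ~247 GB/day at the drive
client_gb_per_day = nand_gb_per_day / 2              # ~123 GB/day of client writes
print(round(nand_gb_per_day), round(client_gb_per_day))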

> or the new Samsung SSD PM853T 960GB (I don't have much info about it for 
> the moment, but the price seems a little lower than Intel's) 
> 

Looking at the specs it seems to have better endurance (I used 
500GB/day, a value that seemed realistic given the 2 numbers they gave), 
at least double that of the Intel. 
Alas they only give a 3-year warranty, which makes me wonder. 
Also, the latencies are significantly higher than the 3500's. 

> 
> I would like to have some advise on replication level, 
> 
> 
> Maybe somebody has experience with the Intel DC S3500 failure rate? 

I doubt many people have managed to wear out SSDs of that vintage in 
normal usage yet. And so far none of my dozens of Intel SSDs (including 
some ancient X25-M ones) have died. 

> What are the chances of having 2 disks fail on 2 different nodes at the 
> same time (Murphy's law ;)? 
> 
Indeed. 

From my experience and looking at the technology I would postulate that: 
1. SSD failures are very rare during their guaranteed endurance 
period/data volume. 
2. Once the endurance level is exceeded the probability of SSDs failing 
within short periods of each other becomes pretty high. 

So if you're monitoring the SSDs (SMART) religiously and take measures to 
avoid clustered failures (for example by replacing SSDs early or adding 
new nodes gradually, like 1 every 6 months or so), you are probably OK. 
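
A minimal sketch of that kind of monitoring, assuming smartctl is available and the drive exposes a wear attribute (Media_Wearout_Indicator on Intel drives; the attribute name varies by vendor, so treat this as illustrative):

import subprocess
import sys

def wear_value(dev):
    """Normalized wear attribute (100 = new, counts down), or None if absent."""
    out = subprocess.check_output(["smartctl", "-A", dev], text=True)
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "Media_Wearout_Indicator":
            return int(fields[3])
    return None

# e.g.  python3 check_ssd_wear.py /dev/sda /dev/sdb
for dev in sys.argv[1:]:
    value = wear_value(dev)
    if value is not None and value <= 10:
        print("%s: wear indicator at %d, plan replacement" % (dev, value))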

Keep in mind however that the larger this cluster grows, the more likely a 
double failure scenario becomes. 
Statistics and Murphy are out to get you. 

With normal disks I would use a Ceph replication of 3, or, when using RAID6, 
nothing larger than 12 disks per set. 
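
For reference, the replication level is a per-pool setting; assuming the usual "rbd" pool:

ceph osd pool set rbd size 3       # number of replicas
ceph osd pool set rbd min_size 2   # keep accepting I/O with one copy missing

(With size 2 you would typically run min_size 1, which is exactly the window where a second failure during recovery hurts.)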

> 
> I think in case of a disk failure, PGs should replicate quickly over 10Gbit 
> links. 
> 
That also depends very much on your cluster load and replication settings. 
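
The relevant knobs are the recovery/backfill throttles; for example, in ceph.conf (values illustrative only, they trade recovery speed against client I/O):

[osd]
# limit concurrent backfills per OSD so recovery doesn't starve client I/O
osd max backfills = 1
# limit in-flight recovery ops per OSD
osd recovery max active = 1
# deprioritize recovery ops relative to client ops
osd recovery op priority = 1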

Regards, 

Christian 

> 
> So the question is: 
> 
> 2x or 3x ? 
> 
> 
> Regards, 
> 
> Alexandre 


-- 
Christian Balzer Network/Systems Engineer 
chibi at gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 

