full osd ssd cluster advise : replication 2x or 3x ?

On Fri, 23 May 2014 07:02:15 +0200 (CEST) Alexandre DERUMIER wrote:

> >>What is your main goal for that cluster, high IOPS, high sequential
> >>writes or reads?
> 
> high IOPS, mostly random. (It's an RBD cluster with qemu-kvm guests,
> around 1000 VMs, each doing small I/Os.)
> 
> 80% read / 20% write
> 
> I don't care about sequential workloads or bandwidth. 
> 
> 
> >>Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect
> >>more than 800 write IOPS and 4000 read IOPS per OSD (replication 2).
> 
> Yes, that's enough for me! I can't use spinning disks, because they're
> really too slow. I need around 30000 IOPS for around 20TB of storage.
> 
> I could even go with cheaper consumer SSDs (like the Crucial M550); I
> think I could reach 2000-4000 IOPS from them. But I'm worried about
> durability/stability.
> 
That's not the only thing you should worry about.
Aside from the higher risk there's also total cost of ownership, or cost per
terabyte written ($/TBW).
While the DC S3700 800GB is about $1800 and the same-sized DC S3500 about
$850, the S3700 is rated to reliably take 7300TB of writes while the S3500
is only rated for 450TB. 
You do the math. ^.^
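
As a rough sketch (same figures as above, actual street prices will of
course vary):

  # cost per terabyte written, using the numbers quoted above
  drives = {
      "DC S3700 800GB": {"price_usd": 1800, "endurance_tbw": 7300},
      "DC S3500 800GB": {"price_usd": 850,  "endurance_tbw": 450},
  }
  for name, d in drives.items():
      print(name, round(d["price_usd"] / d["endurance_tbw"], 2), "$/TB written")
  # -> DC S3700 800GB: ~0.25 $/TB written
  # -> DC S3500 800GB: ~1.89 $/TB written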

Christian
> ----- Original message ----- 
> 
> De: "Christian Balzer" <chibi at gol.com> 
> ?: ceph-users at lists.ceph.com 
> Envoy?: Vendredi 23 Mai 2014 04:57:51 
> Objet: Re: full osd ssd cluster advise : replication 2x or
> 3x ? 
> 
> 
> Hello, 
> 
> On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote: 
> 
> > Hi, 
> > 
> > I'm looking to build a full osd ssd cluster, with this config: 
> > 
> What is your main goal for that cluster, high IOPS, high sequential
> writes or reads? 
> 
> Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect 
> more than 800 write IOPS and 4000 read IOPS per OSD (replication 2). 
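> 
> As a rough upper bound for your 6 nodes x 10 OSDs at those per-OSD rates: 
> 
>   osds = 6 * 10
>   print(osds * 4000)   # ~240000 random read IOPS cluster-wide
>   print(osds * 800)    # ~48000 random write IOPS cluster-wide (replication 2)
> 
> Treat these as upper bounds; real guest workloads will land below that. 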
> 
> > 6 nodes, 
> > 
> > each node with 10 OSD/SSD drives (dual 10Gbit network), journal + data 
> > on each OSD 
> > 
> That halves the write speed of each SSD, leaving you with about 2GB/s max
> write speed per node. 
> 
> If you're after good write speeds with a replication factor of 2, I would 
> split the network into public and cluster ones. 
> If you're after top read speeds instead, bond the 2 links into the public 
> network; half of your SSDs per node are able to saturate that. 
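> 
> Roughly like this in ceph.conf (the subnets here are only placeholders): 
> 
>   [global]
>   public network  = 192.168.0.0/24   # client / VM traffic
>   cluster network = 192.168.1.0/24   # replication and recovery traffic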
> 
> > The SSD drives will be enterprise grade, 
> > 
> > maybe the Intel DC S3500 800GB (a well-known SSD) 
> > 
> How much write activity do you expect per OSD (remember that in your case 
> writes are doubled)? Those drives have a total write capacity of about 
> 450TB (within 5 years). 
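> 
> As a rough sketch, that endurance budget per drive works out to: 
> 
>   gb_per_day = 450 * 1000 / (5 * 365)   # ~246 GB/day of device writes
>   client_gb  = gb_per_day / 2           # ~123 GB/day of client writes,
>                                         # since the journal doubles them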
> 
> > or the new Samsung SSD PM853T 960GB (I don't have much info about it at 
> > the moment, but the price seems a little lower than the Intel) 
> > 
> 
> Looking at the specs it seems to have better endurance (I used 
> 500GB/day, a value that seemed realistic given the 2 numbers they gave), 
> at least double that of the Intel. 
> Alas they only give a 3 year warranty, which makes me wonder. 
> Also the latencies are significantly higher than the 3500. 
> 
> > 
> > I would like some advice on the replication level, 
> > 
> > 
> > Maybe somebody has experience with the Intel DC S3500 failure rate? 
> 
> I doubt many people have managed to wear out SSDs of that vintage in 
> normal usage yet. And so far none of my dozens of Intel SSDs (including 
> some ancient X25-M ones) have died. 
> 
> > What are the chances of having 2 disks fail on 2 different nodes at the 
> > same time (Murphy's law ;)? 
> > 
> Indeed. 
> 
> From my experience and looking at the technology I would postulate that: 
> 1. SSD failures are very rare during their guaranteed endurance 
> period/data volume. 
> 2. Once the endurance level is exceeded the probability of SSDs failing 
> within short periods of each other becomes pretty high. 
> 
> So if you're monitoring the SSDs (SMART) religiously and take measures to 
> avoid clustered failures (for example by replacing SSDs early or adding 
> new nodes gradually, like 1 every 6 months or so), you are probably OK. 
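> 
> For example, the Intel drives expose their wear via SMART attributes 
> (attribute names differ per vendor): 
> 
>   smartctl -A /dev/sdX | egrep 'Media_Wearout_Indicator|Total_LBAs_Written'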
> 
> Keep in mind however that the larger this cluster grows, the more likely
> a double failure scenario becomes. 
> Statistics and Murphy are out to get you. 
> 
> With normal disks I would use a Ceph replication of 3, or when using 
> RAID6, nothing larger than 12 disks per set. 
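> 
> In Ceph that's simply the pool's size (and min_size), e.g. something like: 
> 
>   ceph osd pool set <poolname> size 3
>   ceph osd pool set <poolname> min_size 2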
> 
> > 
> > I think in case of a disk failure, PGs should re-replicate quickly over 
> > the 10Gbit links. 
> > 
> That very much also depends on your cluster load and replication
> settings. 
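> 
> The knobs to look at there are the recovery/backfill throttles in 
> ceph.conf (raise them for faster recovery, lower them to protect 
> client I/O): 
> 
>   osd max backfills 
>   osd recovery max active 
>   osd recovery op priority 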
> 
> Regards, 
> 
> Christian 
> 
> > 
> > So the question is: 
> > 
> > 2x or 3x ? 
> > 
> > 
> > Regards, 
> > 
> > Alexandre 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

