full osd ssd cluster advice : replication 2x or 3x ?


 



BTW,
the new Samsung PM853T SSD is announced at 665 TBW for 4K random writes:
http://www.tomsitpro.com/articles/samsung-3-bit-nand-enterprise-ssd,1-1922.html

and the price is lower than the Intel S3500's (around €450 ex VAT).

(The cluster will be built next year, so I have some time to choose the right SSD.)


My main concern is to know whether 3x replication is really needed (mainly because of cost).
But I can wait for lower SSD prices next year, and go to 3x if necessary.



----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Christian Balzer" <chibi at gol.com>
Cc: ceph-users at lists.ceph.com
Sent: Friday, 23 May 2014 07:59:58
Subject: Re: full osd ssd cluster advice : replication 2x or 3x ?

>>That's not the only thing you should worry about.
>>Aside from the higher risk, there's total cost of ownership, or cost per
>>terabyte written ($/TBW).
>>So while the 800GB DC S3700 is about $1800 and the same-sized DC S3500
>>about $850, the 3700 can reliably store 7300TB while the 3500 is only
>>rated for 450TB.
>>You do the math. ^.^

Yes, I know, I have already done the math. But I'm far from reaching that amount of writes.

The workload is (really) random: 20% writes out of 30,000 IOPS = 6,000 write IOPS; at 4K blocks that is ~25 MB/s of writes, about 2 TB each day.
With 3x replication, that is 6 TB of writes each day.
60 x 450 TBW = 27,000 TBW / 6 TB per day = 4,500 days ≈ 12.3 years ;)

With journal writes it is of course less (the journal doubles every write), but I think it should be enough for 5 years.
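
For the record, the back-of-the-envelope calculation in code (a minimal sketch of the figures above; the 450 TBW rating, 3x replication, and the journal factor of 2 are the assumptions from this thread):

    # Endurance estimate for 6 nodes x 10 OSDs of Intel DC S3500 (450 TBW each).
    total_iops = 30000       # client IOPS target
    write_ratio = 0.20       # 80% read / 20% write mix
    block_kb = 4
    drives = 60
    tbw_per_drive = 450      # TB-written rating per drive
    replication = 3
    journal_factor = 2       # journal + data on the same SSD doubles writes

    write_mb_s = total_iops * write_ratio * block_kb / 1024.0    # ~23 MB/s
    client_tb_day = write_mb_s * 86400 / 1e6                     # ~2 TB/day
    raw_tb_day = client_tb_day * replication * journal_factor    # ~12 TB/day

    days = drives * tbw_per_drive / raw_tb_day
    print("%.1f TB/day raw, worn out in %.0f days (%.1f years)"
          % (raw_tb_day, days, days / 365))

With the journal factor included, this lands at roughly 6 years, which is where the "enough for 5 years" estimate comes from.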


I'll also test the key-value store backend: with no journal, there are fewer writes.
(Not sure it works well with RBD for the moment.)

----- Original Message -----

From: "Christian Balzer" <chibi at gol.com>
To: ceph-users at lists.ceph.com
Sent: Friday, 23 May 2014 07:29:52
Subject: Re: full osd ssd cluster advice : replication 2x or 3x ?


On Fri, 23 May 2014 07:02:15 +0200 (CEST) Alexandre DERUMIER wrote: 

> >>What is your main goal for that cluster, high IOPS, high sequential 
> >>writes or reads? 
> 
> high IOPS, mostly random. (It's an RBD cluster with qemu-kvm guests,
> around 1,000 VMs, each doing small IOs.)
> 
> 80% read / 20% write
> 
> I don't care about sequential workloads or bandwidth.
> 
> 
> >>Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect 
> >>more than 800 write IOPS and 4000 read IOPS per OSD (replication 2). 
> 
> Yes, that's enough for me! I can't use spinning disks, because they are
> really too slow. I need around 30,000 IOPS for around 20 TB of storage.
> 
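> A quick sanity check of that target against the per-OSD numbers above (a
> minimal sketch; the replication-2 figures are the ones quoted from the
> "Slow IOPS on RBD" thread, the rest is this cluster's layout):
> 
>     osds = 60                 # 6 nodes x 10 OSDs
>     write_iops_per_osd = 800  # per-OSD estimate, replication 2
>     read_iops_per_osd = 4000
>     replication = 2
> 
>     # assuming each client write is performed on `replication` OSDs
>     cluster_write_iops = osds * write_iops_per_osd / replication  # 24,000
>     cluster_read_iops = osds * read_iops_per_osd                  # 240,000
> 
>     need_write = 30000 * 0.20   # 6,000
>     need_read = 30000 * 0.80    # 24,000
>     print(cluster_write_iops >= need_write,
>           cluster_read_iops >= need_read)  # -> True True
> 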
> I could even go with cheaper consumer SSDs (like the Crucial M550); I
> think I could reach 2,000-4,000 IOPS from them. But I'm worried about
> durability/stability.
> 
That's not the only thing you should worry about.
Aside from the higher risk, there's total cost of ownership, or cost per
terabyte written ($/TBW).
So while the 800GB DC S3700 is about $1800 and the same-sized DC S3500
about $850, the 3700 can reliably store 7300TB while the 3500 is only
rated for 450TB.
You do the math. ^.^

Christian 
> ----- Original Message -----
> 
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Sent: Friday, 23 May 2014 04:57:51
> Subject: Re: full osd ssd cluster advice : replication 2x or 3x ?
> 
> 
> Hello, 
> 
> On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote: 
> 
> > Hi, 
> > 
> > I'm looking to build a full osd ssd cluster, with this config: 
> > 
> What is your main goal for that cluster, high IOPS, high sequential 
> writes or reads? 
> 
> Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect 
> more than 800 write IOPS and 4000 read IOPS per OSD (replication 2). 
> 
> > 6 nodes, 
> > 
> > each node with 10 OSD/SSD drives (dual 10Gbit network), with journal +
> > data on each OSD.
> > 
> Halving the write speed of the SSD, leaving you with about 2GB/s max 
> write speed per node. 
> 
> If you're after good write speeds with a replication factor of 2, I
> would split the network into public and cluster ones.
> If you're after top read speeds, however, use bonding for the 2 links
> into the public network; half of your SSDs per node are able to saturate
> that.
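> 
> The split itself is just two settings in ceph.conf (a minimal sketch; the
> subnets are placeholders, not recommendations):
> 
>     [global]
>     public network = 192.168.0.0/24    # client and monitor traffic
>     cluster network = 192.168.1.0/24   # replication/recovery traffic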
> 
> > The SSD drives will be enterprise grade,
> > 
> > maybe the Intel DC S3500 800GB (a well-known SSD),
> > 
> How much write activity do you expect per OSD (remember that in your
> case writes are doubled)? Those drives have a total write capacity of
> about 450TB (within 5 years).
> 
> > or the new Samsung PM853T 960GB (I don't have much info about it for
> > the moment, but the price seems a little lower than Intel's)
> > 
> 
> Looking at the specs it seems to have better endurance (I used
> 500GB/day, a value that seemed realistic given the 2 numbers they gave),
> at least double that of the Intel.
> Alas, they only give a 3-year warranty, which makes me wonder.
> Also, the latencies are significantly higher than the 3500's.
> 
> > 
> > I would like to have some advice on the replication level,
> > 
> > 
> > Maybe somebody has experience with the Intel DC S3500 failure rate?
> 
> I doubt many people have managed to wear out SSDs of that vintage in 
> normal usage yet. And so far none of my dozens of Intel SSDs (including 
> some ancient X25-M ones) have died. 
> 
> > What are the chances of having 2 disks fail on 2 different nodes at
> > the same time (Murphy's law ;)?
> > 
> Indeed. 
> 
> From my experience and looking at the technology I would postulate that: 
> 1. SSD failures are very rare during their guaranteed endurance 
> period/data volume. 
> 2. Once the endurance level is exceeded the probability of SSDs failing 
> within short periods of each other becomes pretty high. 
> 
> So if you're monitoring the SSDs (SMART) religiously and take measures to
> avoid clustered failures (for example by replacing SSDs early or adding
> new nodes gradually, like 1 every 6 months or so), you are probably OK.
> 
> Keep in mind however that the larger this cluster grows, the more likely 
> a double failure scenario becomes. 
> Statistics and Murphy are out to get you. 
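> 
> To put a rough number on that scaling (a minimal sketch; the 1% annual
> failure rate and the 1-hour re-replication window are assumptions, not
> measurements):
> 
>     afr = 0.01            # assumed annual failure rate per SSD
>     recovery_hours = 1.0  # assumed time to re-replicate a failed OSD
>     p_window = afr * recovery_hours / (365 * 24)
> 
>     for osds in (20, 60, 200):
>         # chance another OSD dies inside one recovery window
>         p_overlap = 1 - (1 - p_window) ** (osds - 1)
>         # failures per year times that chance: grows roughly with N^2
>         doubles_per_year = osds * afr * p_overlap
>         print("%3d OSDs: %.1e double failures/year" % (osds, doubles_per_year))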
> 
> With normal disks I would use a Ceph replication of 3, or, when using
> RAID6, nothing larger than 12 disks per set.
> 
> > 
> > I think in case of a disk failure, PGs should replicate quickly over
> > 10Gbit links.
> > 
> That very much also depends on your cluster load and replication 
> settings. 
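> 
> The relevant throttles live in ceph.conf (a minimal sketch; these option
> names exist in Ceph, but check the defaults for your release before
> copying any values):
> 
>     [osd]
>     osd max backfills = 1         # concurrent backfills per OSD
>     osd recovery max active = 3   # concurrent recovery ops per OSD
>     osd recovery op priority = 1  # favor client IO over recovery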
> 
> Regards, 
> 
> Christian 
> 
> > 
> > So the question is: 
> > 
> > 2x or 3x ? 
> > 
> > 
> > Regards, 
> > 
> > Alexandre 
> 
> 


-- 
Christian Balzer Network/Systems Engineer 
chibi at gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 
_______________________________________________ 
ceph-users mailing list 
ceph-users at lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



