On Fri, 23 May 2014 07:02:15 +0200 (CEST) Alexandre DERUMIER wrote:

> >>What is your main goal for that cluster, high IOPS, high sequential
> >>writes or reads?
>
> High IOPS, mostly random. (It's an RBD cluster with qemu-kvm guests,
> around 1000 VMs, each doing small I/Os.)
>
> 80% read | 20% write
>
> I don't care about sequential workload or bandwidth.
>
>
> >>Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect
> >>more than 800 write IOPS and 4000 read IOPS per OSD (replication 2).
>
> Yes, that's enough for me! I can't use spinning disks, because they're
> really too slow. I need around 30000 IOPS for around 20TB of storage.
>
> I could even go to a cheaper consumer SSD (like the Crucial M550), I
> think I could reach 2000-4000 IOPS from it. But I'm afraid of
> durability|stability.
>
That's not the only thing you should worry about.
Aside from the higher risk there's total cost of ownership, or cost per
terabyte written ($/TBW).
So while the DC S3700 800GB is about $1800 and the same sized DC S3500 is
about $850, the 3700 can reliably store 7300TB while the 3500 is only
rated for 450TB.
You do the math. ^.^

Christian

> ----- Original message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Sent: Friday, 23 May 2014 04:57:51
> Subject: Re: full osd ssd cluster advise : replication 2x or 3x ?
>
>
> Hello,
>
> On Thu, 22 May 2014 18:00:56 +0200 (CEST) Alexandre DERUMIER wrote:
>
> > Hi,
> >
> > I'm looking to build a full osd ssd cluster, with this config:
> >
> What is your main goal for that cluster, high IOPS, high sequential
> writes or reads?
>
> Remember my "Slow IOPS on RBD..." thread, you probably shouldn't expect
> more than 800 write IOPS and 4000 read IOPS per OSD (replication 2).
>
> > 6 nodes,
> >
> > each node 10 osd/ ssd drives (dual 10gbit network). (1journal + datas
> > on each osd)
> >
> Halving the write speed of the SSD, leaving you with about 2GB/s max
> write speed per node.
>
> If you're after good write speeds and with a replication factor of 2 I
> would split the network into public and cluster ones.
> If you're however after top read speeds, use bonding for the 2 links
> into the public network, half of your SSDs per node are able to saturate
> that.
>
> > ssd drive will be enterprise grade,
> >
> > maybe intel sc3500 800GB (well known ssd)
> >
> How much write activity do you expect per OSD (remember that in your
> case writes are doubled)? Those drives have a total write capacity of
> about 450TB (within 5 years).
>
> > or new Samsung SSD PM853T 960GB (don't have too much info about it for
> > the moment, but price seems a little bit lower than intel)
> >
>
> Looking at the specs it seems to have a better endurance (I used
> 500GB/day, a value that seemed realistic given the 2 numbers they gave),
> at least double that of the Intel.
> Alas they only give a 3 year warranty, which makes me wonder.
> Also the latencies are significantly higher than the 3500.
>
> >
> > I would like to have some advice on replication level,
> >
> >
> > Maybe somebody has experience with intel sc3500 failure rate ?
>
> I doubt many people have managed to wear out SSDs of that vintage in
> normal usage yet. And so far none of my dozens of Intel SSDs (including
> some ancient X25-M ones) have died.
>
> > How many chances to have 2 failing disks on 2 different nodes at the
> > same time (murphy's law ;).
> >
> Indeed.
>
> From my experience and looking at the technology I would postulate that:
> 1. SSD failures are very rare during their guaranteed endurance
>    period/data volume.
> 2. Once the endurance level is exceeded the probability of SSDs failing
>    within short periods of each other becomes pretty high.
>
> So if you're monitoring the SSDs (SMART) religiously and take measures to
> avoid clustered failures (for example by replacing SSDs early or adding
> new nodes gradually, like 1 every 6 months or so) you probably are OK.
>
> Keep in mind however that the larger this cluster grows, the more likely
> a double failure scenario becomes.
> Statistics and Murphy are out to get you.
>
> With normal disks I would use a Ceph replication of 3 or when using
> RAID6 nothing larger than 12 disks per set.
>
> >
> > I think in case of disk failure, pgs should replicate fast with
> > 10gbits links.
> >
> That very much also depends on your cluster load and replication
> settings.
>
> Regards,
>
> Christian
>
> >
> > So the question is:
> >
> > 2x or 3x ?
> >
> >
> > Regards,
> >
> > Alexandre
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/
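
For anyone who wants the "do the math" step from the reply spelled out, here is a minimal sketch. It uses only the prices and endurance figures quoted in this thread (2014 numbers, not checked against vendor datasheets); the dictionary layout and the function name cost_per_tb_written are illustrative assumptions, not part of any tool.

```python
# Prices (USD) and rated endurance (TB written) exactly as quoted in the
# thread; treat them as the poster's figures, not verified specifications.
drives = {
    "DC S3700 800GB": {"price_usd": 1800, "endurance_tb": 7300},
    "DC S3500 800GB": {"price_usd": 850, "endurance_tb": 450},
}


def cost_per_tb_written(price_usd, endurance_tb):
    """Return purchase price divided by rated write endurance ($/TBW)."""
    return price_usd / endurance_tb


# Compute $/TBW for each drive.
costs = {name: cost_per_tb_written(**spec) for name, spec in drives.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per TB written")

# How many times more expensive the S3500 is per TB written.
ratio = costs["DC S3500 800GB"] / costs["DC S3700 800GB"]
print(f"S3500 vs S3700 cost per TB written: about {ratio:.1f}x")
```

With these figures the S3700 comes out at roughly $0.25 per TB written and the S3500 at roughly $1.89, so the drive that is cheaper to buy is several times more expensive per terabyte actually written. Whether that matters depends on how much of the rated endurance the cluster would consume over its lifetime.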