IOPS are weird things with SSDs. In theory you'd see 25% of the combined write IOPS when writing to a 4-way RAID5 device, since every write hits all 4 devices in parallel. Except that's not actually true: unlike HDDs, where an IOP is an IOP, an SSD's IOPS limit is really just a function of request size. Because each operation would be ~1/3rd the size, you should see a net of about 3x the performance of a single drive, or 75% of the sum of the drives. CPU use will be higher, but that may or may not be a significant hit for your use case.

Journals are basically write-only, and 200G S3700s are supposed to sustain around 360 MB/s, so on paper RAID 5 would give you somewhere around 1 GB/s of write throughput (rough numbers are sketched at the end of this message). Depending on your access patterns, that may or may not be a win vs. individual SSDs; at the very least it should give you slightly lower latency for uncongested writes. It's probably worth benchmarking if you have the time.

OTOH, S3700s seem to be pretty reliable, and if your cluster is big enough to handle the loss of 5 OSDs without a big hit, then the reduced complexity may be a bigger win all on its own.

Scott

On Sat Sep 06 2014 at 9:28:32 AM Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:

> RAID5... Hadn't considered it due to the IOPS penalty (it would get 1/4th of the IOPS of separate journal devices, according to some online RAID calculators). Compared to RAID10, I guess we'd get 50% more capacity, but lower performance.
>
> After the anecdotes that the DC S3700 very rarely fails, and without a stable bcache to build upon, I'm leaning toward the usual 5 journal partitions per SSD. But that will leave at least 100GB free per drive, so I might try running an OSD there.
>
> Cheers, Dan
>
> On Sep 6, 2014 6:07 PM, Scott Laird <scott at sigkill.org> wrote:
> Backing up slightly, have you considered RAID 5 over your SSDs? Practically speaking, there's no performance downside to RAID 5 when your devices aren't IOPS-bound.
>
> On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer <chibi at gol.com> wrote:
>
>> On Sat, 6 Sep 2014 14:50:20 +0000 Dan van der Ster wrote:
>>
>> > September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote:
>> > > On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
>> > >
>> > >> Hi Christian,
>> > >>
>> > >> Let's keep debating until a dev corrects us ;)
>> > >
>> > > For the time being, I give you the recent:
>> > >
>> > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
>> > >
>> > > And the not so recent:
>> > > http://www.spinics.net/lists/ceph-users/msg04152.html
>> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
>> > >
>> > > And I'm not going to use BTRFS for mainly RBD-backed VM images (fragmentation city), never mind the other stability issues that crop up here ever so often.
>> >
>> > Thanks for the links... So until I learn otherwise, I'd better assume the OSD is lost when the journal fails, even though I haven't understood exactly why :( I'm going to UTSL to understand the consistency handling better. An op state diagram would help, but I haven't found one yet.
>> >
>> Using the source as an option of last resort is always nice; having to actually do so for something like this feels a bit lacking in the documentation department (that, or my google foo is weak). ^o^
>>
>> > BTW, do you happen to know: _if_ we re-use an OSD after the journal has failed, will any object inconsistencies be found by a scrub/deep-scrub?
>> >
>> No idea.
>> And really a scenario I hope never to encounter. ^^;;
>>
>> > >> We have 4 servers in a 3U rack, and each of those servers is connected to one of these enclosures with a single SAS cable.
>> > >>
>> > >>>> With the current config, when I dd to all drives in parallel I can write at 24*74MB/s = 1776MB/s.
>> > >>>
>> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 lanes, so as far as that bus goes, it can do 4GB/s. And given your storage pod I assume it is connected with 2 mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s of SATA bandwidth.
>> > >>
>> > >> From above, we are only using 4 lanes -- so around 2GB/s is expected.
>> > >
>> > > Alright, that explains that then. Any reason for not using both ports?
>> > >
>> > Probably to minimize costs, and since the single 10Gig-E is a bottleneck anyway. The whole thing is suboptimal anyway, since this hardware was not purchased for Ceph to begin with. Hence retrofitting SSDs, etc...
>> >
>> The single 10Gb/s link is the bottleneck for sustained transfers, but when looking at spikes...
>> Oh well, I guess if you ever connect that 2nd 10GbE card, that 2nd port might also get some loving. ^o^
>>
>> The cluster I'm currently building is based on storage nodes with 4 SSDs (100GB DC S3700s, so 800MB/s would be the absolute write speed limit) and 8 HDDs, connected with 40Gb/s Infiniband. Dual port, dual switch for redundancy, not speed. ^^
>>
>> > >>> Impressive, even given your huge cluster with 1128 OSDs. However, that's not really answering my question: how much data is on an average OSD, and thus gets backfilled in that hour?
>> > >>
>> > >> That's true -- our drives have around 300GB on them. So I guess it will take longer - 3x longer - when the drives are 1TB full.
>> > >
>> > > On your slides, when the crazy user filled the cluster with 250 million objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
>> > >
>> > Yeah, that was fun :) It was 250 million (mostly) 4k objects, so not close to 1PB. The point was that to fill the cluster with RBD, we'd need 250 million (4MB) objects. So, object-count-wise this was a full cluster, but the real volume was more like 70TB IIRC (there were some other larger objects too).
>> >
>> Ah, I see. ^^
>>
>> > In that case, the backfilling was CPU-bound, or perhaps wbthrottle-bound, I don't remember... It was just that there were many tiny, tiny objects to synchronize.
>> >
>> Indeed. This is something I and others have seen as well: backfilling being much slower than the underlying HW would permit, and being CPU-intensive.
>>
>> > > Anyway, I guess the lesson to take away from this is that size and parallelism do indeed help, but even in a cluster like yours, recovering from a 2TB loss would likely be in the 10 hour range...
>> >
>> > Bigger clusters probably backfill faster simply because there are more OSDs involved in the backfilling. In our cluster we initially get 30-40 backfills in parallel after 1 OSD fails, even with max backfills = 1. The backfilling sorta follows an 80/20 rule -- 80% of the time is spent backfilling the last 20% of the PGs, just because some OSDs randomly get more new PGs than others.
>> >
>> You still being on dumpling probably doesn't help that uneven distribution bit.
>> Definitely another data point to go into a realistic recovery/reliability model, though.
>>
>> Christian
>>
>> > > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
>> >
>> > Marked that one to read, again.
>> >
>> > Cheers, dan
>> >
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com         Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
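
For reference, here is the back-of-the-envelope arithmetic behind the RAID 5 estimate at the top of this message, as a small Python sketch. The 4-drive array, the ~360 MB/s sustained write figure for a 200G DC S3700, and the "requests shrink to ~1/3rd size" argument are taken from the discussion above; the variable names and the assumption that SSD write IOPS scale roughly with request size are illustrative only, so treat the output as paper numbers rather than a benchmark.

    # Paper estimate for a 4-way RAID5 of SSD journal devices.
    DRIVES = 4                # 4-way RAID5, per the discussion above
    PER_DRIVE_MBPS = 360.0    # claimed sustained write speed of a 200G DC S3700

    # Sequential writes: one drive's worth of every stripe is parity,
    # so only (DRIVES - 1) drives carry data.
    raid5_write_mbps = (DRIVES - 1) * PER_DRIVE_MBPS
    print("RAID5 write throughput (paper): ~%.0f MB/s" % raid5_write_mbps)              # ~1080

    # Naive HDD-style view: every logical write touches all DRIVES devices,
    # so the array would only get 1/DRIVES of the summed IOPS.
    naive_fraction = 1.0 / DRIVES
    print("HDD-style IOPS fraction of the drive sum: %.0f%%" % (100 * naive_fraction))  # 25%

    # SSD view (assumed): the IOPS limit tracks request size, and each data drive
    # only sees ~1/(DRIVES - 1) of each request, so the array nets roughly
    # (DRIVES - 1) drives' worth of performance.
    ssd_fraction = (DRIVES - 1) / float(DRIVES)
    print("SSD-style fraction of the drive sum: %.0f%%" % (100 * ssd_fraction))         # 75%

Whether a real array gets anywhere near that 75% depends on the RAID implementation and the actual write sizes, which is exactly what the benchmarking suggested above would settle.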
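
Similarly, a sketch of the HBA/SAS bandwidth arithmetic from the quoted part of the thread. The lane counts, the PCIe 2.0 x8 figure for the LSI 2008, and the 24 x 74 MB/s parallel-dd measurement come from the discussion; the 8b/10b encoding factor and the variable names are my own assumptions, added only to show why "around 2GB/s" per mini-SAS cable is the expected ceiling.

    # Rough usable-bandwidth ceilings for the storage pod discussed above.
    SAS_LANE_GBPS = 6.0       # SAS/SATA line rate per lane
    LANES_PER_CABLE = 4       # one mini-SAS cable
    ENCODING = 8.0 / 10.0     # 8b/10b: ~80% of the line rate is payload (assumption)

    one_cable_gbs = SAS_LANE_GBPS * LANES_PER_CABLE * ENCODING / 8.0   # bits -> bytes
    two_cables_gbs = 2 * one_cable_gbs

    pcie2_lane_mbs = 500.0    # PCIe 2.0: ~500 MB/s usable per lane
    hba_pcie_gbs = 8 * pcie2_lane_mbs / 1000.0

    measured_gbs = 24 * 74 / 1000.0   # the parallel-dd figure from the thread

    print("One mini-SAS cable:   ~%.1f GB/s usable" % one_cable_gbs)    # ~2.4
    print("Two mini-SAS cables:  ~%.1f GB/s usable" % two_cables_gbs)   # ~4.8
    print("LSI 2008 PCIe 2.0 x8: ~%.1f GB/s" % hba_pcie_gbs)            # 4.0
    print("Measured with dd:     ~%.1f GB/s" % measured_gbs)            # ~1.8

The measured ~1.8 GB/s sits just under the single-cable ceiling, which matches the point made in the thread: the cabling, not the HBA's PCIe bus, is the limit until the second port is connected, and the single 10GbE uplink is the real bottleneck for sustained traffic anyway.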