Backing up slightly, have you considered RAID 5 over your SSDs?
Practically speaking, there's no performance downside to RAID 5 when your
devices aren't IOPS-bound.

On Sat Sep 06 2014 at 8:37:56 AM Christian Balzer <chibi at gol.com> wrote:

> On Sat, 6 Sep 2014 14:50:20 +0000 Dan van der Ster wrote:
>
> > September 6 2014 4:01 PM, "Christian Balzer" <chibi at gol.com> wrote:
> > > On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> > >
> > >> Hi Christian,
> > >>
> > >> Let's keep debating until a dev corrects us ;)
> > >
> > > For the time being, I'll go by the recent:
> > >
> > > https://www.mail-archive.com/ceph-users at lists.ceph.com/msg12203.html
> > >
> > > And not so recent:
> > > http://www.spinics.net/lists/ceph-users/msg04152.html
> > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> > >
> > > And I'm not going to use BTRFS for mainly RBD backed VM images
> > > (fragmentation city), never mind the other stability issues that crop
> > > up here ever so often.
> >
> > Thanks for the links... So until I learn otherwise, I'd better assume
> > the OSD is lost when the journal fails, even though I haven't
> > understood exactly why :( I'm going to UTSL to understand the
> > consistency model better. An op state diagram would help, but I
> > haven't found one yet.
> >
> Using the source as an option of last resort is always nice; having to
> actually do so for something like this feels a bit lacking in the
> documentation department (that, or my google-fu is weak). ^o^
>
> > BTW, do you happen to know, _if_ we re-use an OSD after the journal
> > has failed, are any object inconsistencies going to be found by a
> > scrub/deep-scrub?
> >
> No idea.
> And really a scenario I hope to never encounter. ^^;;
>
> > >>
> > >> We have 4 servers in a 3U rack, then each of those servers is
> > >> connected to one of these enclosures with a single SAS cable.
> > >>
> > >>>> With the current config, when I dd to all drives in parallel I
> > >>>> can write at 24*74MB/s = 1776MB/s.
> > >>>
> > >>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe
> > >>> 2.0 lanes, so as far as that bus goes, it can do 4GB/s.
> > >>> And given your storage pod I assume it is connected with 2
> > >>> mini-SAS cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s
> > >>> SATA bandwidth.
> > >>
> > >> From above, we are only using 4 lanes -- so around 2GB/s is
> > >> expected.
> > >
> > > Alright, that explains that then. Any reason for not using both
> > > ports?
> > >
> >
> > Probably to minimize costs, and since the single 10GigE link is a
> > bottleneck anyway. The whole thing is suboptimal, since this hardware
> > was not purchased for Ceph to begin with. Hence retrofitting SSDs,
> > etc...
> >
> The single 10Gb/s link is the bottleneck for sustained stuff, but when
> looking at spikes...
> Oh well, I guess if you ever connect that 2nd 10GbE card that 2nd port
> might also get some loving. ^o^
>
> The cluster I'm currently building is based on storage nodes with 4
> SSDs (100GB DC S3700s, so 800MB/s would be the absolute write speed
> limit) and 8 HDDs, connected with 40Gb/s InfiniBand. Dual port, dual
> switch for redundancy, not speed. ^^
>
> > >>> Impressive, even given your huge cluster with 1128 OSDs.
> > >>> However that's not really answering my question: how much data is
> > >>> on an average OSD and thus gets backfilled in that hour?
> > >>
> > >> That's true -- our drives have around 300GB on them. So I guess it
> > >> will take longer - 3x longer - when the drives are 1TB full.
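For what it's worth, here's that scaling as a quick sketch, assuming it
stays roughly linear in the data per OSD (the only input is the
~300GB-in-about-an-hour figure above; the aggregate rate is backed out of
that, not measured):

    # Rough linear scaling of the backfill time discussed above.
    # The aggregate rate is simply backed out of "~300 GB per OSD,
    # backfilled in about an hour"; nothing here is measured, and real
    # recoveries won't scale perfectly linearly.

    def backfill_hours(data_gb, aggregate_mb_s):
        """Hypothetical helper: hours to re-replicate data_gb of lost data."""
        return data_gb * 1000.0 / aggregate_mb_s / 3600.0

    rate = 300 * 1000.0 / 3600.0        # ~83 MB/s aggregate, implied by 300 GB in ~1 h

    print(backfill_hours(300, rate))    # ~1 h, the observed case
    print(backfill_hours(1000, rate))   # ~3.3 h once the drives are 1 TB full

Crude, but it at least makes the "3x longer" expectation explicit.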
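And while on the subject of back-of-the-envelope numbers, the bus
ceilings from the 2008 / mini-SAS discussion further up, made explicit
(all figures are the ones quoted in the thread; this doesn't query any
real hardware):

    # Bus ceilings from the SAS2008 / mini-SAS discussion above.
    # All figures are the quoted ones; nothing here touches real hardware.

    pcie2_mb_per_lane = 500      # PCIe 2.0: 5 GT/s with 8b/10b -> ~500 MB/s per lane
    hba_pcie_lanes    = 8        # the 2008 HBA is a PCIe 2.0 x8 device
    sas_gbit_per_lane = 6        # 6 Gb/s per SATA/SAS lane (raw line rate)
    lanes_per_cable   = 4        # one mini-SAS cable = 4 lanes

    pcie_ceiling_mb = hba_pcie_lanes * pcie2_mb_per_lane    # ~4000 MB/s
    one_cable_gbit  = lanes_per_cable * sas_gbit_per_lane   # 24 Gb/s (x2 cables = 48)
    one_cable_mb    = one_cable_gbit * 1000 / 8 * 0.8       # 8b/10b -> ~2400 MB/s

    measured_mb = 24 * 74                                   # dd across 24 drives

    print(pcie_ceiling_mb)   # 4000 -> the "4GB/s" PCIe figure
    print(one_cable_mb)      # ~2400 -> "around 2GB/s" once protocol overhead is added
    print(measured_mb)       # 1776 -> the observed 24 x 74 MB/s

So it's the single-cable SAS link, not the HBA's PCIe slot, that the
parallel dd run is brushing up against.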
> > >
> > > On your slides, when the crazy user filled the cluster with 250
> > > million objects and thus 1PB of data, I recall seeing a 7 hour
> > > backfill time?
> > >
> >
> > Yeah, that was fun :) It was 250 million (mostly) 4k objects, so
> > nowhere close to 1PB. The point was that to fill the cluster with
> > RBD, we'd need 250 million (4MB) objects. So, object-count-wise this
> > was a full cluster, but the real volume was more like 70TB IIRC
> > (there were some other larger objects too).
> >
> Ah, I see. ^^
>
> > In that case, the backfilling was CPU-bound, or perhaps
> > wbthrottle-bound, I don't remember... It was just that there were
> > many tiny, tiny objects to synchronize.
> >
> Indeed. This is something I and others have seen as well: backfilling
> being much slower than the underlying HW would permit, and CPU
> intensive at that.
>
> > > Anyway, I guess the lesson to take away from this is that size and
> > > parallelism do indeed help, but even in a cluster like yours,
> > > recovering from a 2TB loss would likely be in the 10 hour range...
> >
> > Bigger clusters probably backfill faster simply because there are
> > more OSDs involved in the backfilling. In our cluster we initially
> > get 30-40 backfills in parallel after 1 OSD fails. That's even with
> > max backfills = 1. The backfilling sorta follows an 80/20 rule --
> > 80% of the time is spent backfilling the last 20% of the PGs, just
> > because some OSDs randomly get more new PGs than the others.
> >
> You still being on dumpling probably doesn't help with that uneven
> distribution.
> Definitely another data point to go into a realistic
> recovery/reliability model, though.
>
> Christian
>
> > > Again, see the "Best practice K/M-parameters EC pool" thread. ^.^
> >
> > Marked that one to read, again.
> >
> > Cheers, dan
> >
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com          Global OnLine Japan/Fusion Communications
> http://www.gol.com/
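P.S. For anyone double-checking the object-count arithmetic above, a
quick sketch (all figures are the ones quoted in the thread; nothing
here was measured):

    # Object-count vs. volume arithmetic from the discussion above.
    objects = 250e6

    small_objects_tb = objects * 4 * 1024 / 1e12       # 250M x 4 KB ~= 1 TB of payload
    rbd_full_pb      = objects * 4 * 1024**2 / 1e15    # 250M x 4 MB ~= 1 PB ("full via RBD")

    print(round(small_objects_tb, 1))   # ~1.0 TB
    print(round(rbd_full_pb, 2))        # ~1.05 PB
    # The cluster actually held ~70 TB because some objects were larger;
    # the point is that it was "full" by object count, not by bytes.

In other words, that backfill was dominated by per-object overhead rather
than raw bytes, which lines up with the CPU/wbthrottle-bound observation
above.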