Hi,

Christian Balzer wrote:

> For starters, make that 5 MONs.
> It won't really help you with your problem of keeping a quorum when
> losing a DC, but being able to lose more than 1 monitor will come in
> handy.
> Note that MONs don't really need to be dedicated nodes, if you know what
> you're doing and have enough resources (most importantly fast I/O, aka
> SSD for the leveldb) on another machine.

Ok, I'll keep that in mind.

>> In DC1: 2 "OSD" nodes, each with 6 OSD daemons, one per disk.
>> Journals on SSDs; there are 2 SSDs, so 3 journals per SSD.
>> In DC2: the same config.
>>
> Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> (relatively low) density per node?

Sorry, I have no idea because it was just an example, to be concrete. So
I took an (imaginary) server with 8 disks and 2 SSDs (among the 8 disks,
2 for the OS in software RAID1). Currently, I can't be precise about the
hardware because we are absolutely not settled on the budget (if we get
it!); there are lots of uncertainties.

> 4 nodes make a pretty small cluster; if you lose an SSD or a whole node
> your cluster will get rather busy and may run out of space if you filled
> it more than 50%.

Yes indeed, that's a relevant remark. If the cluster is ~50% full and a
node crashes in one DC, the other node in the same DC will be 100% full
and the cluster will be blocked. Indeed, the cluster is probably too
small.

> Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy
> to "bless" you with a double disk failure. A very distinct probability
> with 24 HDDs.

The probability of a *simultaneous* disk failure in DC1 and in DC2 seems
relatively low to me. For instance, if a disk fails in DC1 and the
rebalancing of the data takes ~1 or 2 hours, that seems acceptable to me.
But maybe I'm too optimistic... ;)

> With OSDs backed by plain HDDs you really want a replica size of 3.

But the "2-DCs" topology isn't really suitable for a replica size of 3,
is it? Is a replica size of 2 so risky?

> Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> fails (mon_osd_down_out_subtree_limit)

I didn't know about this option. In the online doc, the explanations are
not clear enough for me and I'm not sure I understand its meaning. If I
set:

    mon_osd_down_out_subtree_limit = datacenter

what are the consequences?

- If all the OSDs in DC2 are unreachable, these OSDs will not be marked
  out;
- but if only some OSDs in DC2 are unreachable, not all of them, these
  OSDs will be marked out.

Am I correct?
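(To make sure I understand, I suppose in ceph.conf it would be written
like the sketch below. The [mon] section is my assumption from the "mon_"
prefix of the option; I have not tested it.)

    # Sketch of my understanding, not tested: don't automatically mark
    # OSDs "out" when the failed unit is a whole datacenter (or larger)
    # in the CRUSH hierarchy.
    [mon]
    mon osd down out subtree limit = datacenter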
> but in the case of a prolonged DC
> outage you'll want to restore redundancy and set those OSDs out.
> Which means you will need 3 times the actual data capacity on your
> surviving 2 nodes.
> In other words, if your 24 OSDs are 2TB each you can "safely" only store
> 8TB in your cluster (48 TB / 3 (replicas) / 2 (DCs)).

I see, but my idea was to handle a disaster in DC1 long enough that I
must restart the cluster in degraded mode in DC2, but not long enough
that I must restore full redundancy in DC2. Personally I hadn't
considered this case and, unfortunately, I think we will never have the
budget to restore full redundancy in just one datacenter. I'm afraid
that's financially out of reach for us.

> Fiber isn't magical FTL (faster than light) communications and the
> latency depends (mostly) on the length (which you may or may not
> control) and the protocol used.
> A 2m long GbE link has a much worse latency than the same length in
> Infiniband.

In our case, if we can implement this infrastructure (if we get the
budget, etc.), the connection would probably be 2 dark fibers with 10 km
between DC1 and DC2, and we'll use Ethernet switches with SFP
transceivers (if you have good references for switches, I'm interested).
I suppose it should be possible to have low latencies in this case, no?

> You will of course need "enough" bandwidth, but what is going to kill
> (making it rather slow) your cluster will be the latency between those
> DCs.
>
> Each write will have to be acknowledged and this is where every ms less
> of latency will make a huge difference.

Yes indeed, I understand.

>> For instance, I suppose the OSD disks in DC1 (and in DC2) have
>> a throughput equal to 150 MB/s, so with 12 OSD disks in each DC,
>> I have:
>>
>> 12 x 150 = 1800 MB/s, i.e. 1.8 GB/s, i.e. 14.4 Mbps
>>
>> So, in the fiber, I need to have 14.4 Mbps. Is it correct?
>
> How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?

Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)

> You need to multiply, not divide.
> And assuming 10 bits (not 8) for a Byte when serialized never hurts.
> So that's 18 Gb/s.

Yes, indeed. So the "naive" estimation gives 18 Gb/s (OK for 10 bits
instead of 8).

>> Maybe my reasoning is too naive?
>
> Very much so. Your disks (even with SSD journals) will not write
> 150MB/s, because Ceph doesn't do long sequential writes (though 4MB
> blobs are better than nothing) and, more importantly, these writes
> happen concurrently.
> So while one client is writing to an object at one end of your HDD,
> another one may write to a very different, distant location. Seek
> delays.
> With more than one client, you'd be lucky to see 50-70MB/s per HDD.

Ok, but if I follow your explanations, the throughput obtained with the
"naive" estimation is too big. In fact, I could just have:

12 x 70 = 840 MB/s, i.e. 0.840 GB/s => 8.4 Gb/s

Correct?

>> Furthermore, I have not taken the SSDs into account. How can I
>> evaluate the needed throughput more precisely?
>>
> You need to consider the speed of the devices, their local bus
> (hopefully fast enough) and the network.
>
> All things considered you probably want a redundant link (but with
> bandwidth aggregation if both links are up).
> 10Gb/s per link would do, but 40Gb/s links (or your storage network on
> something other than Ethernet) will have less latency on top of the
> capacity for future expansion.

Ok, thanks for your help Christian.

-- 
François Lafont
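PS: to double-check the arithmetic of this thread, here is a throwaway
Python sketch. The ~5 µs/km figure for light in fiber and the 70 MB/s
per-HDD estimate are assumptions from the discussion, not measurements:

    # Back-of-the-envelope sizing for the inter-DC link discussed above.
    # All inputs are assumptions from this thread, not measurements.
    fiber_km = 10            # dark fiber length between DC1 and DC2
    us_per_km = 5.0          # ~5 microseconds per km for light in fiber
    osds_per_dc = 12
    mb_per_s_per_hdd = 70    # pessimistic per-HDD write throughput
    bits_per_byte = 10       # 10 bits per serialized byte, as suggested

    one_way_us = fiber_km * us_per_km                    # ~50 us
    rtt_us = 2 * one_way_us                              # ~100 us = 0.1 ms
    aggregate_mb_s = osds_per_dc * mb_per_s_per_hdd      # 840 MB/s
    needed_gb_s = aggregate_mb_s * bits_per_byte / 1000  # 8.4 Gb/s

    print("one-way: %.0f us, RTT: %.0f us" % (one_way_us, rtt_us))
    print("aggregate writes: %d MB/s -> link: %.1f Gb/s"
          % (aggregate_mb_s, needed_gb_s))

So the 10 km of fiber itself only costs ~0.1 ms of RTT; the switches and
the protocol stack will dominate, which matches what Christian said about
latency.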