Hello,

On Sun, 19 Apr 2015 06:22:44 +0200 Francois Lafont wrote:

> Hi,
>
> We are thinking about a ceph infrastructure and I have questions.
> Here is the conceived (but not yet implemented) infrastructure:
> (please, be careful to read the schema with a monospace font ;))
>
>
>                  +---------+
>                  |  users  |
>                  |(browser)|
>                  +----+----+
>                       |
>                       |
>                  +----+----+
>                  |         |
>       +----------+   WAN   +------------+
>       |          |         |            |
>       |          +---------+            |
>       |                                 |
>       |                                 |
> +-----+-----+                     +-----+-----+
> |           |                     |           |
> | monitor-1 |                     | monitor-3 |
> | monitor-2 |                     |           |
> |           |  Fiber connection   |           |
> |           +---------------------+           |
> |   OSD-1   |                     |  OSD-13   |
> |   OSD-2   |                     |  OSD-14   |
> |    ...    |                     |    ...    |
> |  OSD-12   |                     |  OSD-24   |
> |           |                     |           |
> | client-a1 |                     | client-a2 |
> | client-b1 |                     | client-b2 |
> |           |                     |           |
> +-----------+                     +-----------+
>  Datacenter1                       Datacenter2
>     (DC1)                             (DC2)
>
For starters, make that 5 MONs. It won't really help you with your
problem of keeping a quorum when losing a DC, but being able to lose
more than 1 monitor will come in handy.
Note that MONs don't really need to be dedicated nodes, if you know
what you're doing and have enough resources (most importantly fast I/O
aka SSD for the leveldb) on another machine.

> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
> Journals in SSD, there are 2 SSD so 3 journals per SSD.
> In DC2: the same config.
>
Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
(relatively low) density per node?

4 nodes make a pretty small cluster; if you lose a SSD or a whole node
your cluster will get rather busy and may run out of space if you
filled it more than 50%.

> You can imagine for instance that:
> - client-a1 and client-a2 are radosgw
> - client-b1 and client-b2 are web servers which use the Cephfs of the
> cluster.
>
> And of course, the principle is to have data dispatched in DC1 and
> DC2 (size == 2, one copy of the object in DC1, the other in DC2).
>
Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy
to "bless" you with a double disk failure. A very distinct probability
with 24 HDDs.
With OSDs backed by plain HDDs you really want a replica size of 3.

Normally you'd configure Ceph to NOT set OSDs out automatically if a
DC fails (mon_osd_down_out_subtree_limit), but in the case of a
prolonged DC outage you'll want to restore redundancy and set those
OSDs out.
Which means you will need 3 times the actual data capacity on your
surviving 2 nodes.
In other words, if your 24 OSDs are 2TB each you can "safely" only
store 8TB in your cluster (48TB / 3 (replica) / 2 (DCs)).
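Spelled out as a trivial Python sketch, just so the numbers are
concrete (the 2TB drive size is only an assumed example):

    # 24 OSDs of 2 TB each -- adjust to your real drive size
    raw_tb = 24 * 2                    # 48 TB of raw space
    replicas = 3                       # what you want with plain HDDs
    data_tb = raw_tb / replicas        # 16 TB of actual data, all OSDs up
    # after losing a DC, the surviving half must still hold 3 copies
    safe_tb = (raw_tb / 2) / replicas  # 8 TB you can "safely" store
    print(data_tb, safe_tb)            # 16.0 8.0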
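And for reference, a minimal ceph.conf sketch of the settings I'm
talking about; treat the values as assumptions to check against your
own setup, in particular your CRUSH map needs to actually define
"datacenter" buckets for DC1 and DC2 for the subtree limit to mean
anything:

    [global]
        # 3 copies of each object, keep serving I/O with 2 still up
        osd pool default size = 3
        osd pool default min size = 2

    [mon]
        # don't automatically mark OSDs "out" when an entire CRUSH
        # subtree of type "datacenter" (or larger) goes down,
        # i.e. during a DC-wide outage
        mon osd down out subtree limit = datacenter

If the outage turns out to be a long one, you'd then mark the dead
DC's OSDs out by hand to get your redundancy back.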
> 1. If I suppose that the latency between DC1 and DC2 (via the fiber
> connection) is ok, I would like to know which throughput do I need to
> avoid network bottleneck? Is there a rule to compute the needed
> throughput? I suppose it depends on the disk throughputs?
>
Fiber isn't magical FTL (faster than light) communications and the
latency depends (mostly) on the length (which you may or may not
control) and the protocol used. A 2m long GbE link has a much worse
latency than the same length in Infiniband.

You will of course need "enough" bandwidth, but what is going to kill
(making it rather slow) your cluster will be the latency between those
DCs. Each write will have to be acknowledged and this is where every
ms less of latency will make a huge difference.

> For instance, I suppose the OSD disks in DC1 (and in DC2) has
> a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
> I have:
>
> 12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
>
> So, in the fiber, I need to have 14.4 Mbs. Is it correct?
>
How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?
You need to multiply, not divide. And assuming 10 bits (not 8) for a
Byte when serialized never hurts.
So that's 18 Gb/s.

> Maybe is it too naive reasoning?
>
Very much so.
Your disks (even with SSD journals) will not write 150MB/s, because
Ceph doesn't do long sequential writes (though 4MB blobs are better
than nothing) and, more importantly, these writes happen concurrently.
So while one client is writing to an object at one end of your HDD,
another one may write to a very different, distant location. Seeking
delays.
With more than one client, you'd be lucky to see 50-70MB/s per HDD.

> Furthermore I have not taken into account the SSD. How evaluate the
> needed throughput more precisely?
>
You need to consider the speed of the devices, their local bus
(hopefully fast enough) and the network.
All things considered you probably want a redundant link (but with
bandwidth aggregation if both links are up). 10Gb/s per link would do,
but 40Gb/s links (or your storage network on something other than
Ethernet) will have less latency on top of the capacity for future
expansion.

I'll leave the failover bits to others.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com