Re: Questions about an example of ceph infrastructure

Hello,

On Mon, 20 Apr 2015 04:16:01 +0200 Francois Lafont wrote:

> Hi,
> 
> Christian Balzer wrote:
> 
> > For starters, make that 5 MONs. 
> > It won't really help you with your problem of keeping a quorum when
> > losing a DC, but being able to lose more than 1 monitor will come in
> > handy.
> > Note that MONs don't really need to be dedicated nodes, if you know
> > what you're doing and have enough resources (most importantly fast I/O
> > aka SSD for the leveldb) on another machine.
> 
> Ok, I keep that in my head.
> 
> >> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
> >>         Journals in SSD, there are 2 SSD so 3 journals per SSD.
> >> In DC2: the same config.
> >>
> > Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> > (relatively low) density per node?
> 
> Sorry, I have no idea because, in fact, it was just an example to be
> concrete. So I took an (imaginary) server with 8 disks and 2 SSDs
> (among the 8 disks, 2 for the OS in software RAID1). Currently, I can't
> be precise about the hardware because we are absolutely not fixed on the
> budget (if we get it!); there are a lot of uncertainties.
>
I'd recommend (with 3.5" HDDs) a 2U case with 12 bays. 
Depending on your needs and budget, either with 8 HDDs (OSDs) and 4 SSDs
for journals and OS, or something denser, see below. 
Having your OS on 4 SSD RAID10 (MD) and the journals on the same SSDs is
perfectly fine.

If you want/need more density and can take the loss of 6 OSDs at the
same time (not likely with 4 nodes), then a similar chassis with 2 SSD
bays in the back for journals and OS will do the trick.
 
> > 4 nodes make a pretty small cluster, if you lose an SSD or a whole node
> > your cluster will get rather busy and may run out of space if you
> > filled it more than 50%.
> 
> Yes indeed, it's a relevant remark. If the cluster is ~50% filled and if
> a node crashes in a DC, the other node in the same DC will be 100%
> filled and the cluster will be blocked. Indeed, the cluster is probably
> too small.
> 
You can set the limit for automatic rebuilds at the host level, too.
It's something I do, since I figure that actually fixing a node is likely
to be faster than the resulting data redistribution.
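In my case that's:

    mon_osd_down_out_subtree_limit = host

(the same option mentioned further down), so a dead node isn't marked out
automatically, while individual OSD failures still rebalance as usual.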

> > Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy
> > to "bless" you with a double disk failure. A very distinct probability
> > with 24 HDDs. 
> 
> The probability of a *simultaneous* disk failure in DC1 and in DC2 seems
> to me relatively low. For instance, if a disk fails in DC1 and if the
> rebalancing of data takes ~ 1 or 2 hours, it seems to me acceptable. But
> maybe I'm too optimistic... ;)
> 
Very much so.
I've had two 8-disk RAID5s fail with double disk failures, resulting in
data loss. 
1-2 hours is also pretty optimistic, depending on how much data is on that
OSD and how busy your cluster is at that time (and how high you can set
certain values before it becomes unusable during recovery). 
If you assume a 70MB/s recovery rate and 2TB of data, that's over 7 hours.
And if you plunk those numbers into this handy tool (a replica 2 Ceph
cluster is akin to a RAID5):
https://www.memset.com/tools/raid-calculator/
you get a 1:35 chance of data loss per year. 
Those would be great odds if you were gambling...
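A quick sanity check of that recovery time in Python (same assumptions as
above, 2TB on the failed OSD and 70MB/s of recovery throughput):

    # Back-of-the-envelope: time to re-replicate one failed OSD's data.
    data_bytes = 2e12     # 2 TB of data on the failed OSD
    rate = 70e6           # 70 MB/s aggregate recovery throughput
    print("%.1f hours" % (data_bytes / rate / 3600))   # -> ~7.9 hours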

> > With OSDs backed by plain HDDs you really want a replica size of 3.
> 
> But the "2-DCs" topology isn't really suitable for a replica size of 3,
> no? Is the replica size of 2 so risky?
> 
A replica of 3 will do the most important part: prevent data loss from HDD
failures.
A correct CRUSH map will ensure that at least one replica is in a
different DC; there have been threads about a scenario like this in the
past, search the archives.
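Roughly (an untested sketch, assuming you have "datacenter" buckets in
your CRUSH hierarchy), such a rule could look like:

    rule replicated_2dc {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type datacenter
            step chooseleaf firstn 2 type host
            step emit
    }

With size 3 that picks hosts in both DCs, so each DC always holds at
least one copy.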

> > Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> > fails (mon_osd_down_out_subtree_limit)
> 
> I didn't know about this option. In the online doc, the explanations are
> not clear enough for me and I'm not sure I understand its meaning. If I
> set:
> 
>     mon_osd_down_out_subtree_limit = datacenter
> 
> what are the consequences?
> 
>     - If all OSDs in DC2 are unreachable, these OSDs will not be marked
>       out,
>     - and if only several OSDs in DC2 are unreachable, but not all of
>       them, these OSDs will be marked out.
> 
Individual OSD failures will be handled as usual (automatically), yes.
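In ceph.conf terms, that would be something like:

    [mon]
    mon_osd_down_out_subtree_limit = datacenter

on your monitor hosts.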

> Am I correct?
> 
> > but in the case of a prolonged DC
> > outage you'll want to restore redundancy and set those OSDs out. 
> > Which means you will need 3 times the actual data capacity on your
> > surviving 2 nodes.
> > In other words, if your 24 OSDs are 2TB each you can "safely" only
> > store 8TB in your cluster (48TB/3(replica)/2(DCs).
> 
> I see, but my idea was just to handle a disaster in DC1 long enough
> that I must restart the cluster in degraded mode in DC2, but not so
> long that I must restore total redundancy in DC2. Personally, I hadn't
> considered this case and, unfortunately, I think we will never have the
> budget to be able to restore total redundancy in just one datacenter.
> I'm afraid that is an unattainable luxury for us.
> 
Just make sure that you and whoever pays for this understands the
limitations. 
Over here (Japan) a DC loss would likely mean a big quake, and thus the
surviving DC may have HDD losses caused by the vibrations/shaking as well.

> > Fiber isn't magical FTL (faster than light) communications and the
> > latency depends (mostly) on the length (which you may or may not
> > control) and the protocol used. 
> > A 2m long GbE link has a much worse latency than the same length in
> > Infiniband.
> 
> In our case, if we can implement this infrastructure (if we have the
> budget etc.), the connection would probably be 2 dark fibers with 10km
> between DC1 and DC2. And we'll use Ethernet switches with SFP
> transceivers (if you have good references for switches, I'm interested).
> I suppose it could be possible to have low latencies in this case, no?
> 
I'm not primarily a network guy; other people here deal with the hardware.
You say Ethernet, I assume 10GbE?

It will of course depend a lot on the equipment and the _number_ of
switches/routers it has to go through, besides the distance.
The fewer, the better. At 10km you're at the edge of what can usually be
accomplished without repeaters or other gear.

There are a number of calculators for the distance part, like this one:
http://www.numion.com/calculators/Distance.html

In short, at 10km you'll probably see about HALF the bandwidth of a local
link due to the increased latency (based on 9000 byte packets).
Individual transactions (RTT) I'd expect to take 4-6 times as long as
local ones.
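The propagation delay itself is easy to estimate (a sketch, assuming
light travels at roughly 2/3 c in fiber and ignoring any gear in the
path):

    # One-way delay over 10km of fiber, and the resulting RTT.
    c_fiber = 200000.0                  # km/s, ~2/3 the speed of light
    distance = 10.0                     # km of dark fiber
    rtt_us = 2 * distance / c_fiber * 1e6
    print("RTT: %.0f us" % rtt_us)      # -> ~100us before any switches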

> > You will of course need "enough" bandwidth, but what is going to kill
> > (making it rather slow) your cluster will be the latency between those
> > DCs.
> > 
> > Each write will have to be acknowledged and this is where every ms
> > less of latency will make a huge difference.
> 
> Yes indeed, I understand.
>  
> >> For instance, I suppose the OSD disks in DC1 (and in DC2) have
> >> a throughput equal to 150 MB/s, so with 12 OSD disks in each DC,
> >> I have:
> >>
> >>     12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
> >>
> >> So, in the fiber, I need to have 14.4 Mbs. Is it correct? 
> > 
> > How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?
> 
> Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)
> 
> > You need to multiply, not divide. 
> > And assuming 10 bits (not 8) for a Byte when serialized never hurts. 
> > So that's 18 Gb/s.
> 
> Yes, indeed. So the "naive" estimation gives 18 Gb/s (Ok for 10 bits
> instead of 8).
> 
> >> Maybe it is too naive reasoning?
> >
> > Very much so. Your disks (even with SSD journals) will not write
> > 150MB/s, because Ceph doesn't do long sequential writes (though 4MB
> > blobs are better than nothing) and more importantly these writes
> > happen concurrently. So while one client is writing to an object at
> > one end of your HDD, another one may write to a very different,
> > distant location. That means seek delays. With more than one client,
> > you'd be lucky to see 50-70MB/s per HDD.
> 
> Ok, but if I follow your explanations, the throughput obtained with the
> "naive" estimation is too big. In fact, I could just have:
> 
>    12 x 70 = 840 MB/s ie 0.840 GB/s => 8.4 Gb/s
> 
That's about right, but you'll still want redundancy, so 2x 10GbE it is.
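For reference, the same estimate in Python (12 OSDs, 70MB/s each, 10 bits
per byte on the wire):

    # Aggregate write/recovery bandwidth one DC can realistically push.
    osds, per_osd_mb, wire_bits = 12, 70, 10
    print("%.1f Gb/s" % (osds * per_osd_mb * wire_bits / 1000.0))  # -> 8.4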

Christian

> Correct?
> 
> >> Furthermore I have not taken into account the SSDs. How do I evaluate
> >> the needed throughput more precisely?
> >>
> > You need to consider the speed of the devices, their local bus
> > (hopefully fast enough) and the network.
> > 
> > All things considered you probably want a redundant link (but with
> > bandwidth aggregation if both links are up). 
> > 10Gb/s per link would do, but 40Gb/s links (or your storage network on
> > something other than Ethernet) will have less latency on top of the
> > capacity for future expansion.
> 
> Ok, thanks for your help Christian.
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



