Re: Minimum failure domain

On Mon, Oct 19, 2015 at 7:09 PM, John Wilkins <jowilkin@xxxxxxxxxx> wrote:
> The classic case is when you are just trying Ceph out on a laptop (e.g.,
> using file directories for OSDs, setting the replica size to 2, and setting
> osd_crush_chooseleaf_type to 0).

Sure, but the text isn’t really applicable in that situation, is it?
It specifically calls out the SSD as a single point of failure when
it’s used to journal multiple OSDs, as if that were an important
consideration in determining the minimum failure domain. For
single-node testing, that ship has pretty much sailed already, and on
any deployment with more than one node, testing or otherwise, a node
is realistically already the minimum failure domain. (And isn’t it
the default anyway?)
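
For anyone following along, the single-node test setup John describes
is roughly the following in ceph.conf (just a sketch; the option names
are the standard ones, but check the docs for your release):

    [global]
    # keep both replicas on the one test box
    osd pool default size = 2
    osd pool default min size = 1
    # place replicas across OSDs rather than across hosts,
    # since there is only one host
    osd crush chooseleaf type = 0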

Likewise, if you’re doing a single-node test with a bunch of OSDs on
one drive, that drive is already a shared failure component, whether
or not journalling is being done to a separate SSD.

> The statement is a guideline. You could, in fact, create a CRUSH hierarchy
> consisting of OSD/journal groups within a host too. However, capturing the
> host as a failure domain is preferred if you need to power down the host to
> change a drive (assuming it’s not hot-swappable).

The particular example given is of a single SSD for the entire node.
Inside a given host/node, there are all sorts of single points of
failure.
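
Just to illustrate: grouping OSDs by journal SSD inside a host would
look something like the following in the decompiled CRUSH map. This is
a purely hypothetical sketch; it assumes a custom bucket type (here
called "journalgroup") has been added to the map's type list, and all
names, ids, and weights are made up:

    journalgroup node1-ssd0 {
            id -11                  # made-up bucket id
            alg straw
            hash 0                  # rjenkins1
            item osd.0 weight 1.000
            item osd.1 weight 1.000
            item osd.2 weight 1.000
    }

    host node1 {
            id -2
            alg straw
            hash 0                  # rjenkins1
            item node1-ssd0 weight 3.000
            item node1-ssd1 weight 3.000
    }

    # and the replicated rule would then separate replicas across
    # journal groups instead of hosts:
    #         step chooseleaf firstn 0 type journalgroup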

> There are cases with high density systems where you have multiple nodes in
> the same chassis. So you might opt for a higher minimum failure domain in a
> case like that.

Sure, my question was a bit unclear in that regard. There are plenty
of cases where the minimum failure domain might be *larger* than a
node (and you identified several good ones). Mainly I meant to ask
under what circumstances the minimum failure domain might be *smaller*
than a node. The only valid answer to that appears to be “testing.”
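
In CRUSH rule terms it comes down to which bucket type the chooseleaf
step uses; for example (sketch only):

    # default: one replica per host
    step chooseleaf firstn 0 type host

    # larger failure domain, e.g. for multi-node chassis:
    step chooseleaf firstn 0 type chassis

    # smaller than a node, really only for single-node testing:
    step chooseleaf firstn 0 type osd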

In light of that, perhaps the text as written puts unnecessary
emphasis on the minimum failure domain, given that the concern only
applies to testing, and only to a very specific hardware configuration
that (probably) isn’t very common in testing. (And, when it is, the
realities of the test environment where it comes up essentially
require going against the advice given anyway.)

Perhaps the text would be of more benefit to a larger group of readers
if that callout instead reflected the other practical considerations
of packing multiple journals onto one SSD: namely, that your cluster
must be designed to withstand the simultaneous failure of all the OSDs
that journal to that device, both in terms of excess capacity and
rebalancing throughput.
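
As a back-of-envelope illustration of the kind of thing I mean (all of
the numbers below are made up, it’s just a sketch):

    # Rough cost of losing a single journal SSD and, with it,
    # every OSD that journals to it.  All figures are hypothetical.
    osds_per_journal_ssd = 4
    osd_size_tb = 4.0            # raw capacity per OSD
    avg_fill = 0.70              # how full the cluster runs

    # data that must be re-replicated when the SSD dies
    data_to_recover_tb = osds_per_journal_ssd * osd_size_tb * avg_fill

    # time back to full redundancy at some aggregate recovery rate
    recovery_rate_mb_s = 500.0
    hours = data_to_recover_tb * 1e6 / recovery_rate_mb_s / 3600

    print("%.1f TB to recover, about %.1f hours at %.0f MB/s"
          % (data_to_recover_tb, hours, recovery_rate_mb_s))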

Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



