I've been trying to wrap my head around CRUSH rules, and I need some help/advice. I'm thinking of using erasure coding instead of replication, and trying to understand the possibilities for planning for failure cases.

For a simplified example, consider a 2-level topology: OSDs live on hosts, and hosts live in racks. I'd like to set up a rule for a 6+3 erasure code that would put at most 1 of the 9 chunks on a host, and no more than 3 chunks in a rack (so in case a rack is lost, we still have a way to recover). Some racks may not have 3 hosts in them, so they might only be able to accept 1 or 2 chunks. How can something like this be implemented as a CRUSH rule? Or, if not exactly this, something in this spirit? I don't want to say that all chunks need to live in separate racks, because that is too restrictive (some racks may be much bigger than others, or there might not even be 9 racks).
Unfortunately what you describe here is a little too
detailed in ways CRUSH can't easily specify. You should
think of a CRUSH rule as a sequence of steps that start out
at a root (the "take" step), and incrementally specify more
detail about which piece of the CRUSH hierarchy they run on,
but run the *same* rule on every piece they select.
So the simplest thing that comes close to what you
suggest is:
(forgive me if my syntax is slightly off, I'm doing this
from memory)
step take default
step chooseleaf indep 0 type rack
step emit
    That would start at the default root, select one rack per chunk
("indep 0" means "as many as the pool size", so 9 in your case), and
then for each rack find an OSD within it. (chooseleaf is special and
more flexible than most of the CRUSH language; it's nice because if it
can't find an OSD in one of the selected racks, it will pick another
rack.)
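For completeness (again from memory, so check against your own decompiled map): a bare step sequence like that lives inside a named rule in the CRUSH map. It would look roughly like the following, where the rule name and id are made up, "indep" is the placement mode used for erasure-coded pools, and the set_chooseleaf_tries step is the retry tuning erasure rules commonly carry:

```
rule ecpool-by-rack {
	id 1                  # "ruleset 1" on older Ceph versions
	type erasure
	step set_chooseleaf_tries 5
	step take default
	step chooseleaf indep 0 type rack
	step emit
}
```

You'd normally get at this by dumping and decompiling the map (`ceph osd getcrushmap -o map.bin`, then `crushtool -d map.bin -o map.txt`), editing the text, recompiling with `crushtool -c`, and injecting it back with `ceph osd setcrushmap -i`.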
But a rule that's more illustrative of how things work
is:
step take default
step choose indep 3 type rack
step chooseleaf indep 3 type host
step emit
That one selects three racks, then selects three OSDs
within different hosts *in each rack*. (You'll note that it
doesn't necessarily work out so well if you don't want 9
OSDs!) If one of the racks it selected doesn't have 3
separate hosts...well, tough, it tried to do what you told
it. :/
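To make the mechanics concrete, here's a toy model of what that choose/chooseleaf pair produces: 3 racks, 3 distinct hosts per rack, one OSD per host. This is my own sketch, not CRUSH itself (real CRUSH walks weighted straw2 buckets); it just ranks items by a deterministic hash so each PG gets a stable placement, and all the topology names are invented.

```python
import hashlib

# Toy illustration of the two-step rule above:
#   step choose indep 3 type rack      -> pick 3 racks
#   step chooseleaf indep 3 type host  -> pick 3 hosts per rack, 1 OSD each
# Not the real CRUSH algorithm: we rank candidates by a hash of
# (pg, item) so placements are deterministic per PG, nothing more.

topology = {  # rack -> host -> OSDs (made-up names)
    "rack1": {"hostA": ["osd.0"], "hostB": ["osd.1"], "hostC": ["osd.2"]},
    "rack2": {"hostD": ["osd.3"], "hostE": ["osd.4"], "hostF": ["osd.5"]},
    "rack3": {"hostG": ["osd.6"], "hostH": ["osd.7"], "hostI": ["osd.8"]},
    "rack4": {"hostJ": ["osd.9"], "hostK": ["osd.10"], "hostL": ["osd.11"]},
}

def pick(items, pg, n):
    """Deterministically pick n distinct items for placement group pg."""
    ranked = sorted(items,
                    key=lambda x: hashlib.md5(f"{pg}:{x}".encode()).hexdigest())
    return ranked[:n]

def place(pg):
    chunks = []
    for rack in pick(topology, pg, 3):        # "choose indep 3 type rack"
        hosts = topology[rack]
        for host in pick(hosts, pg, 3):       # "chooseleaf indep 3 type host"
            chunks.append(pick(hosts[host], pg, 1)[0])
    return chunks

print(place("2.1a"))  # 9 distinct OSDs: 3 racks, 3 hosts each, 1 chunk per host
```

Note that, just like the real rule, this model has nowhere to put a chunk if a selected rack has fewer than 3 hosts; the rack-size problem is structural, not a syntax issue.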
    If you were dedicated, you could split up your racks into
equivalently-sized units, let's say rows. Then you could do
step take default
step choose indep 3 type row
step chooseleaf indep 3 type host
step emit
    Assuming you have 3+ rows of good size, that'll get you 9
OSDs, all on different hosts.
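Spelled out as a complete rule (once more from memory, with an invented name and id):

```
rule ecpool-by-row {
	id 2
	type erasure
	step set_chooseleaf_tries 5
	step take default
	step choose indep 3 type row
	step chooseleaf indep 3 type host
	step emit
}
```

Before pushing anything like this to a live cluster, you can sanity-check it offline: compile the edited map with `crushtool -c map.txt -o map.bin`, then run `crushtool -i map.bin --test --rule 2 --num-rep 9 --show-mappings` to see which OSDs each input would land on and whether any mappings come up short.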
-Greg
Thanks,
Andras
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com