Temporary degradation when adding OSDs

Hi,

If you add an OSD to an existing cluster, ceph will move some existing
data around so that the new OSD takes on its share of the data right away.
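For context, after bringing the new OSD up and in, I watch the
rebalancing with the usual status commands (osd.4 below is just a
hypothetical example id, not my actual setup):

    ceph osd tree    # confirm the new OSD is up and in
    ceph -s          # overall cluster and PG state
    ceph -w          # follow the recovery/backfill progress live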

Now I noticed that during this moving around, ceph reports the relevant
PGs as degraded. I can more or less understand the logic here: if a
piece of data is supposed to be in a certain place (the new OSD), but it
is not yet there, it's degraded.
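(For what it's worth, I'm seeing this through nothing more exotic than:

    ceph health detail              # lists the degraded PGs and why
    ceph pg dump | grep degraded    # per-PG states during the move
)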

However, I would hope that the movement of data is executed in such a
way that a new copy is first made on the new OSD, and only after that
has succeeded is one of the existing copies removed. If so, the PG is
never actually "degraded" at any point.

More to the point: suppose I have a PG replicated over three OSDs, 1, 2
and 3. Now I add an OSD 4, and ceph decides to move the copy on OSD 3 to
the new OSD 4. If it then turns out that ceph can't read the copies on
OSDs 1 and 2 due to some disk error, I would assume that ceph would
still use the copy that exists on OSD 3 to populate the copy on OSD 4.
Is that indeed the case?
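(To make the question concrete: if I understand the tooling correctly,
the placement before and during the move can be inspected with
something like

    ceph pg map 1.2f

where 1.2f is just a hypothetical PG id. That prints the "up" and
"acting" OSD sets for the PG, and my question is essentially whether
the old acting member, OSD 3 above, is still used as a recovery source
for as long as it remains in the acting set.)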


I have a very similar question about removing an OSD. You can tell ceph
to mark an OSD as "out" before physically removing it. The OSD is still
"up", but ceph will no longer assign PGs to it, and will create new
copies of the PGs that are on this OSD on other OSDs.
Now again ceph reports degradation, even though the "out" OSD is still
"up", so the existing copies are not actually lost. Does ceph use the
OSD that is marked "out" as a source for making the new copies on other
OSDs?
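(For concreteness, the removal procedure I'm following, again with
osd.4 as a hypothetical example:

    ceph osd out 4      # stop assigning PGs to it; data starts migrating
    ceph -w             # wait until the cluster reports HEALTH_OK again
    # only then stop the daemon and remove the OSD for good:
    ceph osd crush remove osd.4
    ceph auth del osd.4
    ceph osd rm 4

The degraded state shows up between the "out" and the cluster returning
to HEALTH_OK, i.e. while osd.4 is still up and holding its copies.)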

Thanks,

Erik.

