Re: Degraded objects after ceph osd in $osd

On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
On Sun, 25 Nov 2018 at 22:10, Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added extra disks to the cluster,
> while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> we did that, a couple of hundred objects would become degraded. During
> that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no.
>
> Anyone got an explanation for this?
>
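
For context, a minimal sketch of the expansion sequence described above. The OSD id, CRUSH weight, and host name are placeholders, not values from the original cluster:

  ceph osd set norebalance                              # hold off data movement while adding disks
  ceph osd crush add osd.42 1.0 host=storage-node-01    # place the new OSD in the CRUSH map
  ceph osd in osd.42                                    # mark it "in" -- the degraded objects showed up at this step
  ceph -s                                               # degraded/misplaced counts appear in the status output
  ceph osd unset norebalance                            # allow backfill once all new OSDs are added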

Yes, when you add a drive (or ten), some PGs decide they should have one or more
replicas on the new drives. A new, empty PG replica is created there, and it is
_that_ replica which puts the PG into the "degraded" state: if it had three fine
active+clean replicas before, it now has two active+clean and one that needs
backfill to get into shape.

It is a slight mistake to report this the same way as an error, even if to the
cluster it looks just as if something were in error and needed fixing. It gives
new Ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to add a fourth replica
to a repl=3 PG, fill the new empty one from the replica on its way out, and so
keep three working replicas the whole time; instead it first discards one replica
and then backfills into the empty one, leading to this kind of "error" report.
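
A quick way to see the distinction described here, assuming a reasonably recent release (these are standard PG state queries, not commands taken from this thread):

  ceph pg ls remapped        # up/acting set changed, but all replicas still exist (misplaced only)
  ceph pg ls degraded        # PGs currently reporting fewer complete replicas than the pool size
  ceph pg ls backfill_wait   # PGs queued to backfill onto the new, empty OSDs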

See, that's the thing: Ceph is designed *not* to reduce data reliability this way; it shouldn't do that, and as far as I've been able to establish, it doesn't actually do that. Which makes these degraded object reports a bit perplexing.

What we have worked out is that objects can sometimes show as degraded because log-based recovery takes a while after the primary juggles the PG set membership around, and I suspect that's what is turning up here. The exact cause still eludes me a bit, but I assume it's a consequence of the backfill and recovery throttling we've added over the years.
If a whole PG were missing, you'd expect to see very large degraded object counts (as opposed to the two that Marco reported).

-Greg
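
One way to confirm that only a small, transient number of objects goes degraded while marking an OSD "in" is to sample the cluster-wide counters. A sketch, assuming the JSON field names used by recent releases (these fields only appear while objects are actually degraded or misplaced):

  # Poll every 2 seconds; jq prints null once the counters drop back to zero/absent.
  watch -n 2 'ceph -s -f json | jq ".pgmap.degraded_objects, .pgmap.misplaced_objects"'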
 

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
