Re: Degraded objects after ceph osd in $osd

On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
On Sun, 25 Nov 2018 at 22:10, Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added extra disks to the cluster,
> while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> we did that, a couple of hundred objects would become degraded. During
> that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no.
>
> Anyone got an explanation for this?
>
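
For context, a minimal sketch of the expansion sequence described above. The OSD id, CRUSH weight, and host name are placeholders, not values from the original cluster:

  ceph osd set norebalance                              # hold off data movement while adding disks
  ceph osd crush add osd.42 1.0 host=storage-node-01    # place the new OSD in the CRUSH map
  ceph osd in osd.42                                    # mark it "in" -- the degraded objects showed up at this step
  ceph -s                                               # degraded/misplaced counts appear in the status output
  ceph osd unset norebalance                            # allow backfill once all new OSDs are added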

Yes, when you add a drive (or ten), some PGs decide they should have one or more
replicas on the new drives. A new, empty PG replica is created there, and it is
_that_ replica which puts the PG into the "degraded" state: if it had three fine
active+clean replicas before, it now has two active+clean and one that needs
backfill to get into shape.

It is a slight mistake to report this the same way as an error, even if to the
cluster it looks just as if something were in error and needed fixing. It gives
new Ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to add a fourth replica
to a repl=3 PG, fill the new empty one from the replica on its way out, and so
keep three working replicas the whole time; instead it first discards one replica
and then backfills into the empty one, leading to this kind of "error" report.
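
A quick way to see the distinction described here, assuming a reasonably recent release (these are standard PG state queries, not commands taken from this thread):

  ceph pg ls remapped        # up/acting set changed, but all replicas still exist (misplaced only)
  ceph pg ls degraded        # PGs currently reporting fewer complete replicas than the pool size
  ceph pg ls backfill_wait   # PGs queued to backfill onto the new, empty OSDs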

See, that's the thing: Ceph is designed *not* to reduce data reliability this way; it shouldn't do that, and as far as I've been able to establish, it doesn't actually do that. Which makes these degraded object reports a bit perplexing.

What we have worked out is that objects can sometimes show as degraded because log-based recovery takes a while after the primary juggles the PG set membership around, and I suspect that's what is turning up here. The exact cause still eludes me a bit, but I assume it's a consequence of the backfill and recovery throttling we've added over the years.
If a whole PG were missing, you'd expect to see very large degraded object counts (as opposed to the two that Marco reported).

-Greg
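
One way to confirm that only a small, transient number of objects goes degraded while marking an OSD "in" is to sample the cluster-wide counters. A sketch, assuming the JSON field names used by recent releases (these fields only appear while objects are actually degraded or misplaced):

  # Poll every 2 seconds; jq prints null once the counters drop back to zero/absent.
  watch -n 2 'ceph -s -f json | jq ".pgmap.degraded_objects, .pgmap.misplaced_objects"'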
 

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
