On Sun, Nov 25, 2018 at 2:41 PM Stefan Kooman <stefan@xxxxxx> wrote:
Hi list,
During cluster expansion (adding extra disks to existing hosts), some
OSDs failed with 'FAILED assert(0 == "unexpected error")',
_txc_add_transaction error (39) Directory not empty not handled on
operation 21 (op 1, counting from 0). Full details:
https://8n1.org/14078/c534. We had
"norebalance", "nobackfill", and "norecover" flags set. After we unset
nobackfill and norecover (to let Ceph fix the degraded PGs) it would
recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
were supposed to have a copy of them, and they were already "probed". A
day later (~24 hours) it would still not have recovered the degraded
objects. After we unset the "norebalance" flag it would start
rebalancing, backfilling and recovering. The 12 degraded objects were
recovered.
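For reference, the flag changes above correspond to commands along
these lines (a sketch; the PG ID 2.5 is only a placeholder for one of
our degraded PGs):

  $ ceph osd unset nobackfill    # allow backfill again
  $ ceph osd unset norecover     # allow recovery again
  $ ceph pg 2.5 query            # the already-"probed" OSDs show up here
  $ ceph osd unset norebalance   # only after this did recovery proceed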
Is this expected behaviour? I would expect Ceph to always try to fix
degraded things first and foremost. Even "pg force-recover" and "pg
force-backfill" could not force recovery.
I haven't dug into how the norebalance flag works, but I think this is expected: it presumably prevents OSDs from creating new copies of PGs, which is what needed to happen here.
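If you want to double-check which cluster-wide flags are set before and
after, something like this works (a generic sketch, nothing
cluster-specific assumed):

  $ ceph osd dump | grep flags   # lists flags such as norebalance
  $ ceph osd unset norebalance   # lets data move to the new OSDs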
-Greg
Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx