No recovery when "norebalance" flag set

Stefan Kooman <stefan@xxxxxx> · Sun, 25 Nov 2018 20:41:47 +0100

Hi list,

During a cluster expansion (adding extra disks to existing hosts), some
OSDs failed with FAILED assert(0 == "unexpected error"), along with the
message "_txc_add_transaction error (39) Directory not empty not handled
on operation 21 (op 1, counting from 0)". Full details:
https://8n1.org/14078/c534

At that point we had the "norebalance", "nobackfill", and "norecover"
flags set. After we unset nobackfill and norecover (to let Ceph fix the
degraded PGs), it recovered all but 12 objects (2 PGs). We queried the
PGs and the OSDs that were supposed to have a copy of them, and they
were already "probed". A day later (~24 hours) it had still not
recovered the degraded objects. Only after we unset the "norebalance"
flag did it start rebalancing, backfilling and recovering, and the 12
degraded objects were recovered.
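
To make the sequence above easier to reproduce, here is a minimal sketch
of the order in which the flags were cleared. It only builds and prints
the "ceph osd unset" command lines; the helper names (unset_commands,
run) and the dry_run switch are my own for illustration and are not part
of Ceph.

```python
import shlex
import subprocess

# Flags in the order they were cleared in the report above:
# nobackfill and norecover first, norebalance a day later.
UNSET_ORDER = ["nobackfill", "norecover", "norebalance"]


def unset_commands(flags):
    """Return one `ceph osd unset <flag>` argv list per flag, in order."""
    return [["ceph", "osd", "unset", flag] for flag in flags]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    lines = []
    for argv in commands:
        line = shlex.join(argv)
        lines.append(line)
        print(line)
        if not dry_run:
            # Requires the ceph CLI and admin credentials on this host.
            subprocess.run(argv, check=True)
    return lines


if __name__ == "__main__":
    run(unset_commands(UNSET_ORDER))
```

With the default dry_run=True nothing is executed, so the output can be
reviewed before touching a live cluster.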

Is this expected behaviour? I would expect Ceph to always repair
degraded objects first, regardless of "norebalance". Even "ceph pg
force-recovery" and "ceph pg force-backfill" did not trigger recovery.
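
For reference, the per-PG commands mentioned here take roughly the
following form (the CLI spells them "ceph pg <pgid> query", "ceph pg
force-recovery" and "ceph pg force-backfill"). This is only a sketch
that prints the command lines; the PG ids are placeholders, not the ones
from our cluster, and pg_commands is a hypothetical helper name.

```python
import shlex


def pg_commands(pgids):
    """Build query and force commands for the given PG ids."""
    if not pgids:
        raise ValueError("at least one PG id is required")
    # One query per PG, then one force-recovery and one force-backfill
    # call that each take all PG ids at once.
    cmds = [["ceph", "pg", pgid, "query"] for pgid in pgids]
    cmds.append(["ceph", "pg", "force-recovery", *pgids])
    cmds.append(["ceph", "pg", "force-backfill", *pgids])
    return cmds


if __name__ == "__main__":
    # Hypothetical PG ids, for illustration only.
    for argv in pg_commands(["1.2f", "3.a1"]):
        print(shlex.join(argv))
```

In our case these did not help while "norebalance" was still set.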

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx