On Sun, Nov 25, 2018 at 2:41 PM Stefan Kooman <stefan@xxxxxx> wrote:
Hi list,
During cluster expansion (adding extra disks to existing hosts), some
OSDs failed with 'FAILED assert(0 == "unexpected error")',
_txc_add_transaction error (39) Directory not empty not handled on
operation 21 (op 1, counting from 0). Full details:
https://8n1.org/14078/c534. We had
"norebalance", "nobackfill", and "norecover" flags set. After we unset
nobackfill and norecover (to let Ceph fix the degraded PGs) it would
recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
were supposed to have a copy of them, and they were already "probed". A
day later (~24 hours) it would still not have recovered the degraded
objects. After we unset the "norebalance" flag it would start
rebalancing, backfilling and recovering. The 12 degraded objects were
recovered.
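For reference, the flag changes above correspond to commands along
these lines (a sketch; the PG ID 2.5 is only a placeholder for one of
our degraded PGs):

  $ ceph osd unset nobackfill    # allow backfill again
  $ ceph osd unset norecover     # allow recovery again
  $ ceph pg 2.5 query            # the already-"probed" OSDs show up here
  $ ceph osd unset norebalance   # only after this did recovery proceed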
Is this expected behaviour? I would expect Ceph to always try to fix
degraded things first and foremost. Even "pg force-recover" and "pg
force-backfill" could not force recovery.
I haven't dug into how the norebalance flag works, but I think this is expected: it presumably prevents OSDs from creating new copies of PGs, which is what needed to happen here.
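If you want to double-check which cluster-wide flags are set before and
after, something like this works (a generic sketch, nothing
cluster-specific assumed):

  $ ceph osd dump | grep flags   # lists flags such as norebalance
  $ ceph osd unset norebalance   # lets data move to the new OSDs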
-Greg
Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx