Hi all,
We are facing a strange symptom here.
We're testing our recovery procedures. Short description of our environment:
1. 10 OSD host nodes, each with 13 disks + 2 NVMes
2. 3 monitor nodes
3. 1 management node
4. 2 RGWs
5. 1 Client
Ceph version: Nautilus 14.2.4
=> We are testing how to "nicely" eliminate 1 OSD host.
As a first step, we marked the OSDs on that host out by running "ceph osd out osd.<id>".
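For reference, this is roughly what we ran for the OSDs on that host (the OSD IDs below are just placeholders, not our real ones):

    # Mark each OSD on the host out; Ceph then starts draining its PGs
    for id in 120 121 122; do
        ceph osd out osd.${id}    # example IDs only
    done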
The cluster went into an error state with a few backfill_toofull messages, but this was more or less expected.
However, after leaving the system to recover, everything went back to normal. Health did not indicate any warnings or errors.
Running "ceph osd safe-to-destroy" indicated the disks could be safely removed.
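The check itself was along these lines (same placeholder IDs as above):

    # "ceph osd safe-to-destroy" only succeeds once removing the OSD would
    # no longer reduce data durability or availability
    for id in 120 121 122; do
        while ! ceph osd safe-to-destroy osd.${id}; do
            sleep 60
        done
    done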
So far so good, no problem...
Then we decided to properly remove the disks from the CRUSH map, and now the whole story starts again: backfill_toofull errors, and recovery is running again.
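Concretely, for each OSD we did roughly the following (placeholder ID again; I believe "ceph osd purge" on 14.2.x would combine these steps, but we ran them individually):

    # Remove the OSD from the CRUSH map, delete its auth key and remove it from the OSD map
    ceph osd crush remove osd.120
    ceph auth del osd.120
    ceph osd rm osd.120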
Why?
The disks were already marked out and no PGs were left on them.
Is this caused by the CRUSH map being modified, triggering a recalculation that automatically maps the PGs to different OSDs? It seems like strange behaviour, to be honest.
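If it helps, the kind of output we've been looking at while trying to understand this comes from commands like these (the PG id is just an example):

    ceph osd tree       # CRUSH WEIGHT vs. REWEIGHT per OSD / host
    ceph pg map 2.1f    # up/acting OSD set for an example PG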
Any feedback is greatly appreciated!
Regards,
Kristof Coucke