Degraded PG does not discover remapped data on originating OSD

Hi!


There seems to be an issue where an OSD is not queried for missing EC parts of objects that were remapped away from it,
even though that OSD is up. This has happened in two different scenarios for us. In both, data is stored in EC pools (8+3).


Scenario 0

To remove a broken disk (e.g. osd.22), it is weighted to 0 with ceph osd out 22. Objects are remapped as expected. If
osd.22 is restarted during the data movement (or crashes and then starts again), the bug shows up: objects become
degraded and stay degraded, because osd.22 is not queried even though it is up and running. ceph pg query shows:

    "might_have_unfound": [
      {
        "osd": "22(3)",
        "status": "not queried"
      }
    ],
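
For reference, the sequence that triggers this for us is roughly the sketch below. osd.22 is just an example id, and
the systemd unit name assumes a standard ceph-osd@<id> deployment:

    # mark the broken OSD out so its PGs are remapped elsewhere
    ceph osd out 22

    # while objects are still being moved, restart the OSD
    # (in our case it also happens when the OSD crashes and comes back by itself)
    systemctl restart ceph-osd@22

    # affected PGs now report degraded objects; inspect one of them
    # (<pgid> is a placeholder for an affected PG)
    ceph pg <pgid> query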


A workaround is to mark the broken-disk OSD in again temporarily. The OSD is then queried and the missing EC parts are
discovered. Afterwards, mark the OSD out again. No objects are degraded any more and the disk gets emptied.
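
Concretely, that workaround amounts to something like this (osd.22 again being the example id):

    # temporarily mark the OSD in again; the primary then queries it and
    # discovers the missing EC parts
    ceph osd in 22

    # watch the degraded object count drop
    ceph -s

    # once nothing is degraded any more, mark it out again and let backfill
    # drain the disk
    ceph osd out 22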


Scenario 1

Add new disks to the cluster. Data is remapped to move from the old disks (e.g. osd.19) to the new disks (e.g. osd.42).
When one of the old OSDs is restarted (or restarts after a crash), objects become degraded. The missing EC part data is
on osd.19, but again that OSD is not queried. ceph pg query shows:

    "might_have_unfound": [
      {
        "osd": "19(6)",
        "status": "not queried"
      }
    ],

Only remapped data seems to remain undiscovered: if osd.19 is taken down entirely, much more data becomes degraded.
Note that osd.19 is missing from the acting set in the current state of this PG:

    "up": [38, 36, 28, 17, 13, 39, 48, 10, 29, 5, 47],
    "acting": [36, 15, 28, 17, 13, 32, 2147483647, 10, 29, 5, 20],
    "backfill_targets": [
        "36(1)",
        "38(0)",
        "39(5)",
        "47(10)",
        "48(6)"
    ],
    "acting_recovery_backfill": [
        "5(9)",
        "10(7)",
        "13(4)",
        "15(1)",
        "17(3)",
        "20(10)",
        "28(2)",
        "29(8)",
        "32(5)",
        "36(0)",
        "36(1)",
        "38(0)",
        "39(5)",
        "47(10)",
        "48(6)"
    ],


For this scenario I have not found a workaround yet. The cluster remains degraded until recovery has restored the
missing data.
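
To see which PGs are affected and which peer the primary refuses to query, I currently look at something like the
following. The jq path is only an assumption based on how the might_have_unfound list appears in our ceph pg query
output and may differ between releases:

    # list the PGs that are currently degraded
    ceph pg ls degraded

    # for one affected PG, dump the peers the primary thinks might have the
    # missing parts, together with their query status
    ceph pg <pgid> query | jq '.recovery_state[] | .might_have_unfound? // empty'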

So, overall I suspect there is a bug which prevents remapped PG data from being discovered. The PG already knows which
OSD is the correct candidate, but does not query it.


I can try fixing this myself, but I'd need some pointers from the developers to the relevant code parts.

The OSD is stored correctly in pg->might_have_unfound, and I think it should be queried in PG::discover_all_missing, but
I'm lost there. I'd appreciate any help tracking this down.
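
For anyone who wants to look along, both identifiers can be located with a plain grep in the source tree (paths are
from a recent checkout and may differ between releases):

    # locate the discovery logic and the might_have_unfound bookkeeping
    grep -rn "discover_all_missing" src/osd/
    grep -rn "might_have_unfound" src/osd/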


-- Jonas


