Recover data from incomplete pgs

Hi,
After a major crash in which we lost a few osds, we are stuck with incomplete pgs.
At first, peering was blocked with peering_blocked_by_history_les_bound.
We therefore set osd_find_best_info_ignore_history_les to true on all osds involved in the pg and marked the primary osd down to force re-peering. This worked for one pg, which is in a replica 3 pool, but not for the 2 other pgs, which are in an erasure-coded (3+2) pool: those pgs are still incomplete.
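
For reference, what we did can be sketched roughly as follows. This is only an illustration, not the exact commands we typed: the helper name ignore_les_commands is made up, it assumes the option is set through "ceph config set" (injectargs or ceph.conf would be alternatives), and it assumes the first osd of the acting set is the primary. It only builds the command lines; it does not run anything.

```python
def ignore_les_commands(acting, option="osd_find_best_info_ignore_history_les"):
    """Return argv lists: set the option on each acting osd, then mark the primary down."""
    # Assumption: the option is applied per osd via the centralized config store.
    cmds = [["ceph", "config", "set", f"osd.{osd}", option, "true"] for osd in acting]
    # Assumption: the first entry of the acting set is the primary osd.
    cmds.append(["ceph", "osd", "down", str(acting[0])])
    return cmds


# Acting set of pg 10.2, taken from the pg query output in the PS.
for argv in ignore_les_commands([78, 105, 90, 4, 41]):
    print(" ".join(argv))
```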

We know that we will lose data, but we would like to minimize the loss and save as much as possible. This matters all the more because this pg is part of the data pool of a cephfs filesystem: files seem to be spread across many pgs, so losing objects in one pg of the data pool means losing a huge number of files!

According to https://www.spinics.net/lists/ceph-devel/msg41665.html
one way would be to:
- stop each osd involved in that pg
- export the shards with ceph-objectstore-tool
- compare the size of the shards and select the biggest one (alternatively maybe we can also look at the num_objects returned by ceph pg query ?)
- mark it as complete
- restart the osd
- wait for recovery, and finally get rid of the missing objects with ceph pg 10.2 mark_unfound_lost delete
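
To make the steps above concrete, here is a minimal sketch of how we understand them. It is a dry run that only builds command lines and ranks shards by num_objects; nothing is executed. Several things are assumptions on our side: the helper names (rank_shards, export_plan, mark_complete_cmd), the default data path /var/lib/ceph/osd/ceph-<id>, systemd unit names of the form ceph-osd@<id>, the export directory /mnt/pg-export, and the nesting of num_objects under stats -> stat_sum in the full ceph pg query output (the excerpt in the PS is trimmed).

```python
def rank_shards(query):
    """Return (peer, pgid, num_objects) tuples, largest object count first."""
    rows = []
    # Assumption: "info" (the primary) and each "peer_info" entry keep their
    # counters under stats -> stat_sum, as in a full `ceph pg <pgid> query`.
    for entry in [query.get("info", {})] + query.get("peer_info", []):
        stat_sum = entry.get("stats", {}).get("stat_sum", {})
        rows.append((entry.get("peer", "primary"), entry.get("pgid"),
                     stat_sum.get("num_objects", 0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def export_plan(shards, export_dir="/mnt/pg-export"):
    """Build the stop and export commands; shards maps osd id -> shard pgid."""
    plan = []
    for osd, pgid in sorted(shards.items()):
        data_path = f"/var/lib/ceph/osd/ceph-{osd}"  # assumption: default layout
        plan.append(["systemctl", "stop", f"ceph-osd@{osd}"])  # assumption: systemd
        plan.append(["ceph-objectstore-tool", "--data-path", data_path,
                     "--pgid", pgid, "--op", "export",
                     "--file", f"{export_dir}/{pgid}.osd{osd}.export"])
    return plan


def mark_complete_cmd(osd, pgid):
    """Build the command that marks the chosen shard complete (osd must be stopped)."""
    return ["ceph-objectstore-tool", "--data-path", f"/var/lib/ceph/osd/ceph-{osd}",
            "--pgid", pgid, "--op", "mark-complete"]


# A few values copied from the pg query excerpt in the PS, re-nested as assumed above.
sample = {
    "info": {"pgid": "10.2s0", "stats": {"stat_sum": {"num_objects": 161314}}},
    "peer_info": [
        {"peer": "4(3)", "pgid": "10.2s3", "stats": {"stat_sum": {"num_objects": 162869}}},
        {"peer": "69(0)", "pgid": "10.2s0", "stats": {"stat_sum": {"num_objects": 84063}}},
        {"peer": "105(1)", "pgid": "10.2s1", "stats": {"stat_sum": {"num_objects": 0}}},
    ],
}

for peer, pgid, num_objects in rank_shards(sample):
    print(peer, pgid, num_objects)

for argv in export_plan({78: "10.2s0", 4: "10.2s3"}):
    print(" ".join(argv))

print(" ".join(mark_complete_cmd(78, "10.2s0")))
```

Note that for an EC pool each shard (s0 to s4) holds a different chunk of every object, so comparing sizes or num_objects probably only makes sense between copies of the same shard (for example 78 and 69 for s0), which is part of what we would like to have confirmed.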

But this other source https://github.com/TheJJ/ceph-cheatsheet/blob/master/README.md and this one https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1 suggest removing the other parts (though I am not sure these threads really apply to EC pools).

Could you confirm that we can follow this procedure (or correct it, or suggest anything else)?
Thanks for your advice.
F.

PS: Here is part of the output of ceph pg 10.2 query:

    "state": "incomplete",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 434321,
    "up": [
        78,
        105,
        90,
        4,
        41
    ],
    "acting": [
        78,
        105,
        90,
        4,
        41
    ],
    "info": {
        "pgid": "10.2s0",
            "state": "incomplete",
            "last_peered": "2020-04-22 09:58:42.505638",
            "last_became_peered": "2020-04-20 11:06:07.701833",
                "num_objects": 161314,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 161314,
                "num_objects_recovered": 1290285,
    "peer_info": [
            "peer": "4(3)",
            "pgid": "10.2s3",
                "state": "active+undersized+degraded+remapped+backfilling",
                "last_peered": "2020-04-25 13:25:12.860435",
                "last_became_peered": "2020-04-22 10:45:45.520125",
                    "num_objects": 162869,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 85071,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 162869,
                    "num_objects_recovered": 1368082,
            "peer": "9(2)",
            "pgid": "10.2s2",
                "state": "down",
                "last_peered": "2020-04-25 13:25:12.860435",
                "last_became_peered": "2020-04-22 10:45:45.520125",
                    "num_objects": 162869,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 162869,
                    "num_objects_recovered": 1368082,
            "peer": "41(4)",
            "pgid": "10.2s4",
                "state": "unknown",
                "last_peered": "0.000000",
                "last_became_peered": "0.000000",
                    "num_objects": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_objects_recovered": 0,
            "peer": "46(4)",
            "pgid": "10.2s4",
                "state": "down",
                "last_peered": "2020-04-25 13:25:12.860435",
                "last_became_peered": "2020-04-22 10:45:45.520125",
                    "num_objects": 162869,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 162869,
                    "num_objects_recovered": 1368082,
            "peer": "52(3)",
            "pgid": "10.2s3",
                "state": "unknown",
                "last_peered": "0.000000",
                "last_became_peered": "0.000000",
                    "num_objects": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_objects_recovered": 0,
            "peer": "69(0)",
            "pgid": "10.2s0",
                "state": "unknown",
                "last_peered": "0.000000",
                "last_became_peered": "0.000000",
                    "num_objects": 84063,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 78807,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 84063,
                    "num_objects_recovered": 0,
            "peer": "90(2)",
            "pgid": "10.2s2",
                "state": "down",
                "last_peered": "0.000000",
                "last_became_peered": "0.000000",
                    "num_objects": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_objects_recovered": 0,
            "peer": "105(1)",
            "pgid": "10.2s1",
                "state": "incomplete",
                "last_peered": "0.000000",
                "last_became_peered": "0.000000",
                    "num_objects": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_objects_recovered": 0,
    "recovery_state": [
            "peering_blocked_by": []
    "agent_state": {}
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



