Can't fix down+incomplete PG

Scott Laird <scott@xxxxxxxxxxx> · Wed, 10 Feb 2016 05:21:46 +0000

I lost a few OSDs recently.  Now my cluster is unhealthy and I can't figure out how to get it healthy again.
OSDs 3, 7, 10, and 40 died in a power outage.  I now have 10 PGs that are down+incomplete, even though every one of them looks like it should still have surviving replicas of all its data.

I'm running Ceph 9.2.0 (Infernalis).

$ ceph health detail | grep down
pg 18.c1 is down+incomplete, acting [11,18,9]
pg 18.47 is down+incomplete, acting [11,9,22]
pg 18.1d7 is down+incomplete, acting [5,31,24]
pg 18.1d6 is down+incomplete, acting [22,11,5]
pg 18.2af is down+incomplete, acting [19,24,18]
pg 18.2dd is down+incomplete, acting [15,11,22]
pg 18.2de is down+incomplete, acting [15,17,11]
pg 18.3e is down+incomplete, acting [25,8,18]
pg 18.3d6 is down+incomplete, acting [22,39,24]
pg 18.3e6 is down+incomplete, acting [9,23,8]
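
To see at a glance which OSDs the stuck PGs share, the lines above can be tallied with a small script. This is only a rough sketch: it assumes the "pg <id> is <state>, acting [...]" line format shown here, and the names parse_health_detail and osd_counts are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: pg 18.c1 is down+incomplete, acting [11,18,9]
# (format assumed from the `ceph health detail` output quoted above)
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]*)\]$")


def parse_health_detail(text):
    """Return {pgid: (state, [acting OSD ids])} for every matching line."""
    pgs = {}
    for line in text.splitlines():
        match = PG_LINE.match(line.strip())
        if match:
            pgid, state, acting = match.groups()
            pgs[pgid] = (state, [int(x) for x in acting.split(",") if x])
    return pgs


# Two of the lines from the output above, used as sample input.
sample = """\
pg 18.c1 is down+incomplete, acting [11,18,9]
pg 18.47 is down+incomplete, acting [11,9,22]
"""

pgs = parse_health_detail(sample)

# Count how often each OSD appears in an acting set of a stuck PG.
osd_counts = Counter(osd for _, acting in pgs.values() for osd in acting)

print(sorted(pgs))
print(osd_counts.most_common())
```

The keys of the resulting dictionary are the PG ids, so the same list can be reused to run 'ceph pg <id> query' for each stuck PG in turn.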

$ ceph pg 18.c1 query
{
    "state": "down+incomplete",
    "snap_trimq": "[]",
    "epoch": 960905,
    "up": [
        11,
        18,
        9
    ],
    "acting": [
        11,
        18,
        9
    ],
    "info": {
        "pgid": "18.c1",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 595523,
            "last_epoch_started": 954170,
            "last_epoch_clean": 954170,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 959988,
            "same_interval_since": 959988,
            "same_primary_since": 959988,
            "last_scrub": "613947'7736",
            "last_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_deep_scrub": "613947'7736",
            "last_deep_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_clean_scrub_stamp": "2015-11-11 21:18:35.118057"
        },
...
            "probing_osds": [
                "9",
                "11",
                "18",
                "23",
                "25"
            ],
            "down_osds_we_would_probe": [
                7,
                10
            ],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2016-02-09 20:35:57.627376"
        }
    ],
    "agent_state": {}
}
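
The field that matters most in that output is "down_osds_we_would_probe". A minimal sketch for pulling it out of a saved query result follows. It walks the whole JSON tree rather than assuming where the field sits, because part of the output is cut above; find_key and blocking_osds are hypothetical helper names, and the "elided" key in the sample is only a placeholder for the part replaced by "...".

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def blocking_osds(query):
    """Return the sorted ids of down OSDs that peering would like to probe."""
    down = set()
    for value in find_key(query, "down_osds_we_would_probe"):
        down.update(int(osd) for osd in value)
    return sorted(down)


# Trimmed stand-in for the query output above; "elided" is a placeholder key,
# not a real field name.
sample = json.loads("""
{
    "state": "down+incomplete",
    "elided": [
        {
            "probing_osds": ["9", "11", "18", "23", "25"],
            "down_osds_we_would_probe": [7, 10],
            "peering_blocked_by": []
        },
        {"name": "Started", "enter_time": "2016-02-09 20:35:57.627376"}
    ]
}
""")

print(blocking_osds(sample))
```

With the output of 'ceph pg 18.c1 query' saved to a file, the same function can be applied to json.load() of that file to list the missing OSDs for each stuck PG.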

I tried replacing the disks.  I created new OSDs 3 and 7, but neither will come up: the ceph-osd daemon starts but never actually reaches 'up', and there is nothing obvious in the logs.  I can post logs if that helps.  Because the original OSDs were removed a few days ago, 'ceph osd lost' doesn't seem to help.

Is there a way to fix these PGs and get my cluster healthy again?

Scott