Re: Can't fix down+incomplete PG

Ah, I should have mentioned: size=3, min_size=1.

I'm pretty sure that 'down_osds_we_would_probe' is the problem, but it's not clear if there's a way to fix that.
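In case it's useful, here's a quick way to pull that field out of all ten PGs at once. This assumes jq is installed; 'ceph pg query' emits JSON, so the loop just extracts down_osds_we_would_probe from each PG's recovery_state:

$ for pg in 18.c1 18.47 18.1d7 18.1d6 18.2af 18.2dd 18.2de 18.3e 18.3d6 18.3e6; do echo -n "$pg: "; ceph pg "$pg" query | jq -c '.recovery_state[] | .down_osds_we_would_probe? // empty'; done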



On Tue, Feb 9, 2016 at 11:30 PM Arvydas Opulskis <Arvydas.Opulskis@xxxxxxxxxx> wrote:

Hi,

What is min_size for this pool? Maybe you need to decrease it for the cluster to start recovering.
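Something like this should do it; substitute the real name of pool 18 for <pool-name>, since it isn't shown in your output:

$ ceph osd pool get <pool-name> min_size
$ ceph osd pool set <pool-name> min_size 1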

 

Arvydas

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Scott Laird
Sent: Wednesday, February 10, 2016 7:22 AM
To: 'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx) <ceph-users@xxxxxxxxxxxxxx>
Subject: Can't fix down+incomplete PG

 

I lost a few OSDs recently. Now my cluster is unhealthy and I can't figure out how to get it healthy again.

 

OSDs 3, 7, 10, and 40 died in a power outage. Now I have 10 PGs that are down+incomplete, but all of them seem like they should have surviving replicas of all data.

 

I'm running 9.2.0.

 

$ ceph health detail | grep down
pg 18.c1 is down+incomplete, acting [11,18,9]
pg 18.47 is down+incomplete, acting [11,9,22]
pg 18.1d7 is down+incomplete, acting [5,31,24]
pg 18.1d6 is down+incomplete, acting [22,11,5]
pg 18.2af is down+incomplete, acting [19,24,18]
pg 18.2dd is down+incomplete, acting [15,11,22]
pg 18.2de is down+incomplete, acting [15,17,11]
pg 18.3e is down+incomplete, acting [25,8,18]
pg 18.3d6 is down+incomplete, acting [22,39,24]
pg 18.3e6 is down+incomplete, acting [9,23,8]

$ ceph pg 18.c1 query
{
    "state": "down+incomplete",
    "snap_trimq": "[]",
    "epoch": 960905,
    "up": [
        11,
        18,
        9
    ],
    "acting": [
        11,
        18,
        9
    ],
    "info": {
        "pgid": "18.c1",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 595523,
            "last_epoch_started": 954170,
            "last_epoch_clean": 954170,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 959988,
            "same_interval_since": 959988,
            "same_primary_since": 959988,
            "last_scrub": "613947'7736",
            "last_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_deep_scrub": "613947'7736",
            "last_deep_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_clean_scrub_stamp": "2015-11-11 21:18:35.118057"
        },
...
            "probing_osds": [
                "9",
                "11",
                "18",
                "23",
                "25"
            ],
            "down_osds_we_would_probe": [
                7,
                10
            ],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2016-02-09 20:35:57.627376"
        }
    ],
    "agent_state": {}
}

I tried replacing disks: I created new OSDs 3 and 7, but neither will start. The ceph-osd process launches but never actually makes it to 'up', with nothing obvious in the logs. I can post logs if that helps. And since the dead OSDs were removed a few days ago, 'ceph osd lost' doesn't seem to help.
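For reference, the form I've been trying against the IDs from 'down_osds_we_would_probe' looks like this (the --yes-i-really-mean-it flag is mandatory):

$ ceph osd lost 7 --yes-i-really-mean-it
$ ceph osd lost 10 --yes-i-really-mean-it

My guess is that because those IDs were already removed from the osdmap, there's nothing left to mark lost, which is why it has no effect.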

 

Is there a way to fix these PGs and get my cluster healthy again?

Scott

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
