Re: pgs stuck inactive

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Fri, 10 Mar 2017 10:09:04 +0200

Hello,

I was informed that due to a networking issue the ceph cluster network was affected. There was a huge packet loss, and network interfaces were flipping. That's all I got.
This outage has lasted a longer period of time. So I assume that some OSD may have been considered dead and the data from them has been moved away to other PGs (this is what ceph is supposed to do if I'm correct). Probably that was the point when the listed PGs have appeared into the picture.
From the query we can see this for one of those OSDs:
        {
            "peer": "14",
            "pgid": "3.367",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "purged_snaps": "[]",
            "history": {
                "epoch_created": 4,
                "last_epoch_started": 54899,
                "last_epoch_clean": 55143,
                "last_epoch_split": 0,
                "same_up_since": 60603,
                "same_interval_since": 60603,
                "same_primary_since": 60593,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "14",
                "reported_epoch": "59779",
                "state": "down+peering",
                "last_fresh": "2017-02-27 16:30:16.230519",
                "last_change": "2017-02-27 16:30:15.267995",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "2017-02-27 16:30:16.230519",
                "last_undegraded": "2017-02-27 16:30:16.230519",
                "last_fullsized": "2017-02-27 16:30:16.230519",
                "mapping_epoch": 60601,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 4,
                "last_epoch_clean": 55143,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": "0",
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0
                },
                "up": [
                    28,
                    35,
                    2
                ],
                "acting": [
                    28,
                    35,
                    2
                ],
                "blocked_by": [],
                "up_primary": 28,
                "acting_primary": 28
            },
            "empty": 1,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 0,
            "hit_set_history": {
                "current_last_update": "0'0",
                "current_last_stamp": "0.000000",
                "current_info": {
                    "begin": "0.000000",
                    "end": "0.000000",
                    "version": "0'0",
                    "using_gmt": "1"
                },
                "history": []
            }
        },

Where can I read more about the meaning of each parameter, some of them have quite self explanatory names, but not all (or probably we need a deeper knowledge to understand them).
Isn't there any parameter that would say when was that OSD assigned to the given PG? Also the stat_sum shows 0 for all its parameters. Why is it blocking then?

Is there a way to tell the PG to forget about that OSD?

Thank you,
Laszlo

On 10.03.2017 03:05, Brad Hubbard wrote:
Can you explain more about what happened?

The query shows progress is blocked by the following OSDs.

                "blocked_by": [
                    14,
                    17,
                    51,
                    58,
                    63,
                    64,
                    68,
                    70
                ],

Some of these OSDs are marked as "dne" (Does Not Exist).

peer": "17",
"dne": 1,
"peer": "51",
"dne": 1,
"peer": "58",
"dne": 1,
"peer": "64",
"dne": 1,
"peer": "70",
"dne": 1,

Can we get a complete background here please?

On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
Hello,

After a major network outage our ceph cluster ended up with an inactive PG:

# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1
requests are blocked > 32 sec; 1 osds have slow requests
pg 3.367 is stuck inactive for 912263.766607, current state incomplete, last
acting [28,35,2]
pg 3.367 is stuck unclean for 912263.766688, current state incomplete, last
acting [28,35,2]
pg 3.367 is incomplete, acting [28,35,2]
1 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.28
1 osds have slow requests

# ceph -s
    cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
     health HEALTH_WARN
            1 pgs incomplete
            1 pgs stuck inactive
            1 pgs stuck unclean
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at
{tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
            election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
     osdmap e60609: 72 osds: 72 up, 72 in
      pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
            490 GB used, 130 TB / 130 TB avail
                4863 active+clean
                   1 incomplete
  client io 0 B/s rd, 38465 B/s wr, 2 op/s

ceph pg repair doesn't change anything. What should I try to recover it?
Attached is the result of ceph pg query on the problem PG.

Thank you,
Laszlo

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com