Re: pgs stuck inactive

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Sun, 12 Mar 2017 11:51:12 +0200

Hello,

I have already done the export with ceph_objectstore_tool. I just have to decide which OSDs to keep.
Can you tell me why the directory structure in the OSDs is different for the same PG when checking on different OSDs?
For instance, on OSDs 2 and 63 there are NO subdirectories in 3.367_head, while OSDs 28 and 35 contain
./DIR_7/DIR_6/DIR_B/
./DIR_7/DIR_6/DIR_3/

When are these subdirectories created?

The files are identical on all the OSDs; only the way they are stored is different. It would be enough if you could point me to some documentation that explains this, and I'll read it. So far, searching for the architecture of an OSD, I could not find the gory details about these directories.

Kind regards,
Laszlo
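
On the DIR_* question, a minimal sketch, assuming these OSDs use FileStore with its hashed directory index: each directory level is named after one hex digit of the object's 32-bit name hash, least significant digit first, which would explain why every path under PG 3.367 starts with DIR_7/DIR_6. One documented trigger for creating deeper levels is a directory growing past filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. Whether a given OSD has these levels depends on that OSD's own local history, so replicas holding identical files can sit at different depths. The function names and the option values 2 and 10 used below are illustrative assumptions, not values read from this cluster.

```python
def hashed_dir_path(object_hash: int, depth: int) -> str:
    """Return the nested DIR_x path for an object hash at a given depth.

    Assumption: one hex digit of the 32-bit hash per directory level,
    starting from the least significant digit.
    """
    digits = f"{object_hash & 0xFFFFFFFF:08X}"   # e.g. 0xB67 -> "00000B67"
    nibbles = list(reversed(digits))[:depth]     # least significant digit first
    return "./" + "".join(f"DIR_{d}/" for d in nibbles)


def split_threshold(split_multiple: int, merge_threshold: int) -> int:
    """Files a directory may hold before it is split into subdirectories."""
    return split_multiple * abs(merge_threshold) * 16


# Any hash whose low hex digits are ...B67 or ...367 lands in the two
# directories quoted in the message above.
print(hashed_dir_path(0xB67, 3))
print(hashed_dir_path(0x367, 3))
print(split_threshold(2, 10))  # assumed option values, see lead-in
```

With depth 0 the function returns "./", which corresponds to the flat layout seen on OSDs 2 and 63.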

On 12.03.2017 02:12, Brad Hubbard wrote:
On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
Hello,

Thank you for your answer.

indeed the min_size is 1:

# ceph osd pool get volumes size
size: 3
# ceph osd pool get volumes min_size
min_size: 1
#
I'm going to try to find the mentioned discussions on the mailing lists and
read them. If you have a link at hand, it would be nice if you could send
it to me.

This thread is one example, there are lots more.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html

In the attached file you can see the contents of the directory containing PG
data on the different OSDs (all that have appeared in the pg query).
According to the md5sums the files are identical. What bothers me is the
directory structure (you can see the ls -R in each dir that contains files).

So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
since 68 is listed as empty in the pg query.

Where can I read about how/why those DIR# subdirectories have appeared?

Given that the files themselves are identical on the "current" OSDs
belonging to the PG, and as the osd.63 (currently not belonging to the PG)
has the same files, is it safe to stop the OSD.2, remove the 3.367_head dir,
and then restart the OSD? (all these with the noout flag set of course)

*You* need to decide which is the "good" copy and then follow the
instructions in the links I provided to try and recover the pg. Back
those known copies on 2, 28, 35 and 63 up with the
ceph_objectstore_tool before proceeding. They may well be identical
but the peering process still needs to "see" the relevant logs and
currently something is stopping it doing so.
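
A rough sketch of the backup step, for illustration only. It just builds one export command line per OSD and does not run anything. The --data-path, --journal-path, --pgid, --op export and --file options are the tool's own; the /var/lib/ceph/osd/ceph-<id> layout, the journal location inside it, the backup directory and the helper name export_command are assumptions to adjust for the actual deployment. The OSD has to be stopped before the tool is run against its store.

```python
from typing import List


def export_command(osd_id: int, pgid: str, backup_dir: str,
                   tool: str = "ceph_objectstore_tool") -> List[str]:
    """Build (but do not run) a PG export command for one stopped OSD."""
    data_path = f"/var/lib/ceph/osd/ceph-{osd_id}"   # assumed default layout
    return [
        tool,
        "--data-path", data_path,
        "--journal-path", f"{data_path}/journal",    # assumed journal location
        "--pgid", pgid,
        "--op", "export",
        "--file", f"{backup_dir}/{pgid}.osd{osd_id}.export",
    ]


if __name__ == "__main__":
    # OSDs named in this thread as holding copies of the PG.
    for osd in (2, 28, 35, 63):
        print(" ".join(export_command(osd, "3.367", "/root/pg-backup")))
```

Keeping one export file per OSD means no copy is overwritten before a decision is made about which one to keep.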

Kind regards,
Laszlo

On 11.03.2017 00:32, Brad Hubbard wrote:

So this is why it happened I guess.

pool 3 'volumes' replicated size 3 min_size 1

min_size = 1 is a recipe for disasters like this and there are plenty
of ML threads about not setting it below 2.
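
As a small illustration of that check, here is a sketch that reads pool lines in the form shown above (as they appear in "ceph osd dump") and flags any pool whose min_size is below 2. The regular expression and the function names are my own assumptions and only cover the leading fields of the line.

```python
import re
from typing import Dict, Optional

# Matches the leading fields of a pool line from "ceph osd dump".
POOL_RE = re.compile(
    r"^pool (?P<id>\d+) '(?P<name>[^']+)' (?P<type>\w+) "
    r"size (?P<size>\d+) min_size (?P<min_size>\d+)"
)


def parse_pool_line(line: str) -> Optional[Dict[str, object]]:
    """Return the pool's id, name, type, size and min_size, or None."""
    match = POOL_RE.match(line.strip())
    if match is None:
        return None
    return {
        "id": int(match.group("id")),
        "name": match.group("name"),
        "type": match.group("type"),
        "size": int(match.group("size")),
        "min_size": int(match.group("min_size")),
    }


def min_size_too_low(pool: Dict[str, object]) -> bool:
    """True when writes are accepted with fewer than two copies available."""
    return int(pool["min_size"]) < 2


pool = parse_pool_line("pool 3 'volumes' replicated size 3 min_size 1")
if pool is not None and min_size_too_low(pool):
    print(f"pool {pool['name']}: min_size {pool['min_size']} is below 2")
```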

The past intervals in the pg query show several intervals where a
single OSD may have gone rw.

How important is this data?

I would suggest checking which of these OSDs actually have the data
for this pg. From the pg query it looks like 2, 35 and 68 and possibly
28 since it's the primary. Check all OSDs in the pg query output. I
would then back up all copies and work out which copy, if any, you
want to keep and then attempt something like the following.

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17820.html

If you want to abandon the pg see

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
for a possible solution.

http://ceph.com/community/incomplete-pgs-oh-my/ may also give some ideas.

On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

The OSDs are all there.

$ sudo ceph osd stat
     osdmap e60609: 72 osds: 72 up, 72 in

and I have attached the output of the ceph osd tree and ceph osd dump commands.
I got some extra info about the network problem. A faulty network device
flooded the network, eating up all the bandwidth, so the OSDs were not able
to communicate properly with each other. This lasted for almost 1 day.

Thank you,
Laszlo

On 10.03.2017 12:19, Brad Hubbard wrote:

To me it looks like someone may have done an "rm" on these OSDs but
not removed them from the crushmap. This does not happen
automatically.

Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
paste the output.

Without knowing what exactly happened here it may be difficult to work
out how to proceed.

In order to go clean the primary needs to communicate with multiple
OSDs, some of which are marked DNE and seem to be uncontactable.

This seems to be more than a network issue (unless the outage is still
happening).

http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete

On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

I was informed that the ceph cluster network was affected by a networking
issue. There was huge packet loss, and network interfaces were flipping.
That's all I got.
The outage lasted for a longer period of time, so I assume that some OSDs
may have been considered dead and the data from them was moved away to
other PGs (this is what ceph is supposed to do, if I'm correct). That was
probably the point when the listed PGs came into the picture.
From the query we can see this for one of those OSDs:
        {
            "peer": "14",
            "pgid": "3.367",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "purged_snaps": "[]",
            "history": {
                "epoch_created": 4,
                "last_epoch_started": 54899,
                "last_epoch_clean": 55143,
                "last_epoch_split": 0,
                "same_up_since": 60603,
                "same_interval_since": 60603,
                "same_primary_since": 60593,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "14",
                "reported_epoch": "59779",
                "state": "down+peering",
                "last_fresh": "2017-02-27 16:30:16.230519",
                "last_change": "2017-02-27 16:30:15.267995",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "2017-02-27 16:30:16.230519",
                "last_undegraded": "2017-02-27 16:30:16.230519",
                "last_fullsized": "2017-02-27 16:30:16.230519",
                "mapping_epoch": 60601,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 4,
                "last_epoch_clean": 55143,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": "0",
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0
                },
                "up": [
                    28,
                    35,
                    2
                ],
                "acting": [
                    28,
                    35,
                    2
                ],
                "blocked_by": [],
                "up_primary": 28,
                "acting_primary": 28
            },
            "empty": 1,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 0,
            "hit_set_history": {
                "current_last_update": "0'0",
                "current_last_stamp": "0.000000",
                "current_info": {
                    "begin": "0.000000",
                    "end": "0.000000",
                    "version": "0'0",
                    "using_gmt": "1"
                },
                "history": []
            }
        },
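
A few of these fields can be read mechanically. The sketch below assumes the entry has been loaded with json.load() from the pg query output. The values written as 0'0 or 2852'33528 are epoch'version pairs, and the helper names are my own. It treats a last_update of 0'0 as "no updates recorded on this peer", which is my reading of the output rather than something confirmed in this thread.

```python
from typing import Dict, Tuple


def parse_eversion(text: str) -> Tuple[int, int]:
    """Split an "epoch'version" string such as "2852'33528" into two ints."""
    epoch, version = text.split("'")
    return int(epoch), int(version)


def summarize_peer(peer: Dict) -> Dict:
    """Pick out the fields that say whether a peer holds anything useful."""
    stats = peer.get("stats", {})
    last_update = parse_eversion(peer["last_update"])
    return {
        "osd": int(peer["peer"]),
        "state": stats.get("state"),
        "last_update": last_update,
        "has_updates": last_update != (0, 0),
        "empty": bool(peer.get("empty")),
        "dne": bool(peer.get("dne")),
        "num_objects": stats.get("stat_sum", {}).get("num_objects"),
    }


# Trimmed copy of the entry quoted above.
peer_14 = {
    "peer": "14",
    "last_update": "0'0",
    "stats": {"state": "down+peering", "stat_sum": {"num_objects": 0}},
    "empty": 1,
    "dne": 0,
}
print(summarize_peer(peer_14))
```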

Where can I read more about the meaning of each parameter? Some of them
have quite self-explanatory names, but not all (or we probably need a
deeper knowledge to understand them).
Isn't there any parameter that would say when that OSD was assigned to the
given PG? Also, the stat_sum shows 0 for all its parameters. Why is it
blocking then?

Is there a way to tell the PG to forget about that OSD?

Thank you,
Laszlo

On 10.03.2017 03:05, Brad Hubbard wrote:

Can you explain more about what happened?

The query shows progress is blocked by the following OSDs.

                "blocked_by": [
                    14,
                    17,
                    51,
                    58,
                    63,
                    64,
                    68,
                    70
                ],

Some of these OSDs are marked as "dne" (Does Not Exist).

"peer": "17",
"dne": 1,
"peer": "51",
"dne": 1,
"peer": "58",
"dne": 1,
"peer": "64",
"dne": 1,
"peer": "70",
"dne": 1,

Can we get a complete background here please?
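
To line the two lists up, a minimal sketch along these lines might help. It assumes the blocked_by list and the per-peer entries have already been pulled out of the pg query JSON (where exactly they sit in the output is not shown in this thread), and that peer ids are plain integers as in a replicated pool. The function name and the labels it returns are hypothetical.

```python
from typing import Dict, Iterable, List


def classify_blockers(blocked_by: Iterable[int],
                      peers: List[Dict]) -> Dict[int, str]:
    """Label each blocking OSD using the dne/empty flags of its peer entry."""
    by_id = {int(p["peer"]): p for p in peers}
    labels = {}
    for osd in blocked_by:
        peer = by_id.get(osd)
        if peer is None:
            labels[osd] = "no peer entry"
        elif peer.get("dne"):
            labels[osd] = "dne"
        elif peer.get("empty"):
            labels[osd] = "empty"
        else:
            labels[osd] = "has data"
    return labels


# Only the peer entries quoted in this message are reproduced here.
blocked_by = [14, 17, 51, 58, 63, 64, 68, 70]
peers = [{"peer": str(osd), "dne": 1} for osd in (17, 51, 58, 64, 70)]
for osd, label in classify_blockers(blocked_by, peers).items():
    print(f"osd.{osd}: {label}")
```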

On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

After a major network outage our ceph cluster ended up with an inactive PG:

# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests
pg 3.367 is stuck inactive for 912263.766607, current state incomplete, last acting [28,35,2]
pg 3.367 is stuck unclean for 912263.766688, current state incomplete, last acting [28,35,2]
pg 3.367 is incomplete, acting [28,35,2]
1 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.28
1 osds have slow requests
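
For anyone wanting to track this over time, a small sketch that pulls the "stuck" lines out of health detail output. It expects each message on a single line, as ceph prints it, and only recognises the exact wording shown above. The regular expression and the function name are assumptions of mine.

```python
import re
from typing import Dict, Optional

STUCK_RE = re.compile(
    r"^pg (?P<pgid>\S+) is stuck (?P<kind>\w+) for (?P<secs>\d+(?:\.\d+)?), "
    r"current state (?P<state>[^,\s]+), last acting \[(?P<acting>[\d,]*)\]$"
)


def parse_stuck_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one 'pg ... is stuck ...' line, or return None if it differs."""
    match = STUCK_RE.match(line.strip())
    if match is None:
        return None
    acting = [int(x) for x in match.group("acting").split(",") if x]
    seconds = float(match.group("secs"))
    return {
        "pgid": match.group("pgid"),
        "kind": match.group("kind"),
        "state": match.group("state"),
        "seconds": seconds,
        "days": seconds / 86400.0,   # 86400 seconds per day
        "acting": acting,
    }


line = ("pg 3.367 is stuck inactive for 912263.766607, "
        "current state incomplete, last acting [28,35,2]")
info = parse_stuck_line(line)
if info is not None:
    print(f"{info['pgid']} {info['kind']} for {info['days']:.1f} days, "
          f"acting {info['acting']}")
```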

# ceph -s
    cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
     health HEALTH_WARN
            1 pgs incomplete
            1 pgs stuck inactive
            1 pgs stuck unclean
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
            election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
     osdmap e60609: 72 osds: 72 up, 72 in
      pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
            490 GB used, 130 TB / 130 TB avail
                4863 active+clean
                   1 incomplete
  client io 0 B/s rd, 38465 B/s wr, 2 op/s

ceph pg repair doesn't change anything. What should I try in order to
recover it?
Attached is the result of ceph pg query on the problem PG.

Thank you,
Laszlo

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
