Re: pgs stuck inactive

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Wed, 15 Mar 2017 16:02:55 +0200

Hello,

The ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35 command crashes:

~# ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
*** Caught signal (Segmentation fault) **
 in thread 7f85b60e28c0
 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: ceph-objectstore-tool() [0xaeeaba]
 2: (()+0x10330) [0x7f85b4dca330]
 3: (()+0xa2324) [0x7f85b1cd7324]
 4: (()+0x7d23e) [0x7f85b1cb223e]
 5: (()+0x7d478) [0x7f85b1cb2478]
 6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
 7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15) [0x7f85b1c8a0e5]
 8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
 9: (main()+0x1294) [0x651134]
 10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
 11: ceph-objectstore-tool() [0x66f8b7]
2017-03-15 14:57:05.567987 7f85b60e28c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f85b60e28c0

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: ceph-objectstore-tool() [0xaeeaba]
 2: (()+0x10330) [0x7f85b4dca330]
 3: (()+0xa2324) [0x7f85b1cd7324]
 4: (()+0x7d23e) [0x7f85b1cb223e]
 5: (()+0x7d478) [0x7f85b1cb2478]
 6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
 7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15) [0x7f85b1c8a0e5]
 8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
 9: (main()+0x1294) [0x651134]
 10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
 11: ceph-objectstore-tool() [0x66f8b7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -14> 2017-03-15 14:57:05.557743 7f85b60e28c0  5 asok(0x5632000) register_command perfcounters_dump hook 0x55e6130
   -13> 2017-03-15 14:57:05.557807 7f85b60e28c0  5 asok(0x5632000) register_command 1 hook 0x55e6130
   -12> 2017-03-15 14:57:05.557818 7f85b60e28c0  5 asok(0x5632000) register_command perf dump hook 0x55e6130
   -11> 2017-03-15 14:57:05.557828 7f85b60e28c0  5 asok(0x5632000) register_command perfcounters_schema hook 0x55e6130
   -10> 2017-03-15 14:57:05.557836 7f85b60e28c0  5 asok(0x5632000) register_command 2 hook 0x55e6130
    -9> 2017-03-15 14:57:05.557841 7f85b60e28c0  5 asok(0x5632000) register_command perf schema hook 0x55e6130
    -8> 2017-03-15 14:57:05.557851 7f85b60e28c0  5 asok(0x5632000) register_command perf reset hook 0x55e6130
    -7> 2017-03-15 14:57:05.557855 7f85b60e28c0  5 asok(0x5632000) register_command config show hook 0x55e6130
    -6> 2017-03-15 14:57:05.557864 7f85b60e28c0  5 asok(0x5632000) register_command config set hook 0x55e6130
    -5> 2017-03-15 14:57:05.557868 7f85b60e28c0  5 asok(0x5632000) register_command config get hook 0x55e6130
    -4> 2017-03-15 14:57:05.557877 7f85b60e28c0  5 asok(0x5632000) register_command config diff hook 0x55e6130
    -3> 2017-03-15 14:57:05.557880 7f85b60e28c0  5 asok(0x5632000) register_command log flush hook 0x55e6130
    -2> 2017-03-15 14:57:05.557888 7f85b60e28c0  5 asok(0x5632000) register_command log dump hook 0x55e6130
    -1> 2017-03-15 14:57:05.557892 7f85b60e28c0  5 asok(0x5632000) register_command log reopen hook 0x55e6130
     0> 2017-03-15 14:57:05.567987 7f85b60e28c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f85b60e28c0

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: ceph-objectstore-tool() [0xaeeaba]
 2: (()+0x10330) [0x7f85b4dca330]
 3: (()+0xa2324) [0x7f85b1cd7324]
 4: (()+0x7d23e) [0x7f85b1cb223e]
 5: (()+0x7d478) [0x7f85b1cb2478]
 6: (rados_ioctx_create()+0x32) [0x7f85b1c89f92]
 7: (librados::Rados::ioctx_create(char const*, librados::IoCtx&)+0x15) [0x7f85b1c8a0e5]
 8: (do_import_rados(std::string, bool)+0xb7c) [0x68199c]
 9: (main()+0x1294) [0x651134]
 10: (__libc_start_main()+0xf5) [0x7f85b0c69f45]
 11: ceph-objectstore-tool() [0x66f8b7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file
--- end dump of recent events ---
Segmentation fault (core dumped)
#

Any ideas what to try?

Thank you.
Laszlo

On 15.03.2017 04:27, Brad Hubbard wrote:
Decide which copy you want to keep and export that with ceph-objectstore-tool

Delete all copies on all OSDs with ceph-objectstore-tool (not by
deleting the directory on the disk).

Use force_create_pg to recreate the pg empty.

Use ceph-objectstore-tool to do a rados import on the exported pg copy.
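The four steps above can be sketched as a dry-run plan. This is only a sketch under stated assumptions, not a tested procedure: the data and journal paths assume the default filestore layout (/var/lib/ceph/osd/ceph-N), the helper names are hypothetical, and nothing is executed. Each OSD has to be stopped while ceph-objectstore-tool works on its store, as in the stop/export/start steps described further down in the thread.

```python
import shlex


def osd_paths(osd_id):
    # Assumption: default filestore layout; adjust for your deployment.
    data = f"/var/lib/ceph/osd/ceph-{osd_id}"
    return data, f"{data}/journal"


def objectstore_cmd(osd_id, pgid, op, extra=()):
    # Build one ceph-objectstore-tool invocation for a (stopped) OSD.
    data, journal = osd_paths(osd_id)
    return ["ceph-objectstore-tool", "--data-path", data,
            "--journal-path", journal, "--pgid", pgid, "--op", op, *extra]


def recovery_plan(pgid, pool, keep_osd, all_osds, export_file):
    """Return the commands for the four steps as argument lists.

    Nothing is run here; the caller decides what to do with the plan.
    """
    plan = [objectstore_cmd(keep_osd, pgid, "export", ["--file", export_file])]
    # Remove the pg from every OSD with the tool, not by deleting the directory.
    plan += [objectstore_cmd(osd, pgid, "remove") for osd in all_osds]
    # Recreate the pg empty.
    plan.append(["ceph", "pg", "force_create_pg", pgid])
    # Import the exported copy back through rados.
    plan.append(["ceph-objectstore-tool", "import-rados", pool, export_file])
    return plan


if __name__ == "__main__":
    for cmd in recovery_plan("3.367", "volumes", 35, [2, 28, 35, 63],
                             "pg.3.367.export.OSD.35"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the plan first makes it easy to review every command before touching any OSD.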

On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
Hello,

I have tried to recover the pg using the following steps:
Preparation:
1. set noout
2. stop osd.2
3. use ceph-objectstore-tool to export from osd2
4. start osd.2
5. repeat steps 2-4 on osd.35, osd.28 and osd.63 (I did this hoping to be able to
use one of those exports to recover the PG)

First attempt:

1. stop osd.2
2. remove the 3.367_head directory
3. start osd.2
Here I was hoping that the cluster would recover the pg from the 2 other
identical OSDs. It did NOT. So I tried the following commands on the
PG:
ceph pg repair
ceph pg scrub
ceph pg deep-scrub
ceph pg force_create_pg
Nothing changed. My PG was still incomplete. So I tried to remove the PG
directory from all the OSDs that were referenced in the pg query:

1. stop osd.2
2. delete the 3.367_head directory
3. start osd2
4. repeat steps 1-3 for all the OSDs that were listed in the pg query
5. did an import from one of the exports. -> I was able to query the
pg again (that was impossible when all the 3.367_head dirs were deleted), and the
stats said that the number of objects is 6 and the size is 21M (both
correct values according to the files I was able to see before starting the
procedure). But the PG is still incomplete.

What else can I try?

Thank you,
Laszlo

On 12.03.2017 13:06, Brad Hubbard wrote:

On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

I have already done the export with ceph-objectstore-tool. I just have to
decide which OSDs to keep.
Can you tell me why the directory structure in the OSDs is different for the
same PG when checking on different OSDs?
For instance, in OSD 2 and 63 there are NO subdirectories in
3.367_head, while OSD 28 and 35 contain
./DIR_7/DIR_6/DIR_B/
./DIR_7/DIR_6/DIR_3/

When are these subdirectories created?

The files are identical on all the OSDs; only the way they are stored
is different. It would be enough if you could point me to some documentation
that explains this, and I'll read it. So far, searching for the architecture of
an OSD, I could not find the gory details about these directories.

https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h
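
As a rough illustration of what that header describes (this is my reading of it, not something stated in this thread): filestore appears to nest objects in directories named after the hex digits of the object's 32-bit hash, least-significant digit first, and only creates a deeper level once a directory is split. The function name and the sample hash values below are hypothetical.

```python
def hashindex_subdir(object_hash, depth):
    """Return the nested DIR_x path for a 32-bit object hash.

    Assumption: one directory level per hex digit, starting from the
    least-significant digit of the hash.
    """
    nibbles = f"{object_hash:08X}"[::-1]  # reverse so the low digit comes first
    return "/".join(f"DIR_{n}" for n in nibbles[:depth])


if __name__ == "__main__":
    # Two made-up hashes whose low digits end in ...B67 and ...367.
    for h in (0x12345B67, 0x00000367):
        print(hex(h), "->", hashindex_subdir(h, 3))
```

Under that assumption, a hash ending in B67 lands in DIR_7/DIR_6/DIR_B and one ending in 367 lands in DIR_7/DIR_6/DIR_3, which matches the two paths listed above; an OSD that has not split the PG directory would keep the same files flat in 3.367_head.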

Kind regards,
Laszlo

On 12.03.2017 02:12, Brad Hubbard wrote:

On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

Thank you for your answer.

indeed the min_size is 1:

# ceph osd pool get volumes size
size: 3
# ceph osd pool get volumes min_size
min_size: 1
#
I'm going to try to find the mentioned discussions on the mailing lists and
read them. If you have a link at hand, it would be nice if you could send
it to me.

This thread is one example, there are lots more.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html

In the attached file you can see the contents of the directory containing PG
data on the different OSDs (all that have appeared in the pg query).
According to the md5sums the files are identical. What bothers me is the
directory structure (you can see the ls -R in each dir that contains files).

So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
since 68 is listed as empty in the pg query.

Where can I read about how/why those DIR# subdirectories have appeared?

Given that the files themselves are identical on the "current" OSDs
belonging to the PG, and as osd.63 (currently not belonging to the PG)
has the same files, is it safe to stop osd.2, remove the 3.367_head dir,
and then restart the OSD? (all this with the noout flag set, of course)

*You* need to decide which is the "good" copy and then follow the
instructions in the links I provided to try and recover the pg. Back up
the known copies on 2, 28, 35 and 63 with
ceph-objectstore-tool before proceeding. They may well be identical
but the peering process still needs to "see" the relevant logs and
currently something is stopping it doing so.

Kind regards,
Laszlo

On 11.03.2017 00:32, Brad Hubbard wrote:

So this is why it happened I guess.

pool 3 'volumes' replicated size 3 min_size 1

min_size = 1 is a recipe for disasters like this and there are plenty
of ML threads about not setting it below 2.
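
A quick way to spot pools in this state is to scan the pool lines of "ceph osd dump". The following is a minimal sketch, assuming the pool lines have the same shape as the one quoted above; the function name and the threshold parameter are hypothetical.

```python
import re

# Matches lines such as: pool 3 'volumes' replicated size 3 min_size 1
POOL_RE = re.compile(
    r"pool (\d+) '([^']+)' replicated size (\d+) min_size (\d+)"
)


def risky_pools(osd_dump_text, floor=2):
    """Return (name, size, min_size) for replicated pools with min_size below floor."""
    risky = []
    for match in POOL_RE.finditer(osd_dump_text):
        name = match.group(2)
        size = int(match.group(3))
        min_size = int(match.group(4))
        if min_size < floor:
            risky.append((name, size, min_size))
    return risky


if __name__ == "__main__":
    sample = "pool 3 'volumes' replicated size 3 min_size 1"
    for name, size, min_size in risky_pools(sample):
        print(f"pool {name}: size {size}, min_size {min_size} is below 2")
```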

The past intervals in the pg query show several intervals where a
single OSD may have gone rw.

How important is this data?

I would suggest checking which of these OSDs actually have the data
for this pg. From the pg query it looks like 2, 35 and 68 and possibly
28 since it's the primary. Check all OSDs in the pg query output. I
would then back up all copies and work out which copy, if any, you
want to keep and then attempt something like the following.

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17820.html

If you want to abandon the pg see

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
for a possible solution.

http://ceph.com/community/incomplete-pgs-oh-my/ may also give some
ideas.

On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

The OSDs are all there.

$ sudo ceph osd stat
     osdmap e60609: 72 osds: 72 up, 72 in

and I have attached the output of the ceph osd tree and ceph osd dump commands.
I got some extra info about the network problem. A faulty network device
flooded the network, eating up all the bandwidth, so the OSDs were not able to
properly communicate with each other. This lasted for almost 1 day.

Thank you,
Laszlo

On 10.03.2017 12:19, Brad Hubbard wrote:

To me it looks like someone may have done an "rm" on these OSDs but
not removed them from the crushmap. This does not happen automatically.

Do these OSDs show up in "ceph osd tree" and "ceph osd dump"? If so,
paste the output.

Without knowing what exactly happened here it may be difficult to work
out how to proceed.

In order to go clean the primary needs to communicate with multiple
OSDs, some of which are marked DNE and seem to be uncontactable.

This seems to be more than a network issue (unless the outage is still
happening).

http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete

On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

I was informed that the ceph cluster network was affected by a networking
issue. There was huge packet loss, and network interfaces were
flapping. That's all I got.
The outage lasted for an extended period, so I assume that some OSDs
may have been considered dead and the data from them was moved away to
other PGs (this is what ceph is supposed to do, if I'm correct). Probably
that was the point when the listed PGs came into the picture.
From the query we can see this for one of those OSDs:
        {
            "peer": "14",
            "pgid": "3.367",
            "last_update": "0'0",
            "last_complete": "0'0",
            "log_tail": "0'0",
            "last_user_version": 0,
            "last_backfill": "MAX",
            "purged_snaps": "[]",
            "history": {
                "epoch_created": 4,
                "last_epoch_started": 54899,
                "last_epoch_clean": 55143,
                "last_epoch_split": 0,
                "same_up_since": 60603,
                "same_interval_since": 60603,
                "same_primary_since": 60593,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
            },
            "stats": {
                "version": "0'0",
                "reported_seq": "14",
                "reported_epoch": "59779",
                "state": "down+peering",
                "last_fresh": "2017-02-27 16:30:16.230519",
                "last_change": "2017-02-27 16:30:15.267995",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "2017-02-27 16:30:16.230519",
                "last_undegraded": "2017-02-27 16:30:16.230519",
                "last_fullsized": "2017-02-27 16:30:16.230519",
                "mapping_epoch": 60601,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 4,
                "last_epoch_clean": 55143,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "2852'33528",
                "last_scrub_stamp": "2017-02-26 02:36:55.210150",
                "last_deep_scrub": "2852'16480",
                "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
                "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": "0",
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0
                },
                "up": [
                    28,
                    35,
                    2
                ],
                "acting": [
                    28,
                    35,
                    2
                ],
                "blocked_by": [],
                "up_primary": 28,
                "acting_primary": 28
            },
            "empty": 1,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 0,
            "hit_set_history": {
                "current_last_update": "0'0",
                "current_last_stamp": "0.000000",
                "current_info": {
                    "begin": "0.000000",
                    "end": "0.000000",
                    "version": "0'0",
                    "using_gmt": "1"
                },
                "history": []
            }
        },

Where can I read more about the meaning of each parameter? Some of them have
quite self-explanatory names, but not all (or probably we need a deeper
knowledge to understand them).
Isn't there any parameter that would say when that OSD was assigned to the
given PG? Also, the stat_sum shows 0 for all its parameters. Why is it
blocking then?

Is there a way to tell the PG to forget about that OSD?

Thank you,
Laszlo

On 10.03.2017 03:05, Brad Hubbard wrote:

Can you explain more about what happened?

The query shows progress is blocked by the following OSDs.

                "blocked_by": [
                    14,
                    17,
                    51,
                    58,
                    63,
                    64,
                    68,
                    70
                ],

Some of these OSDs are marked as "dne" (Does Not Exist).

"peer": "17",
"dne": 1,
"peer": "51",
"dne": 1,
"peer": "58",
"dne": 1,
"peer": "64",
"dne": 1,
"peer": "70",
"dne": 1,
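
The same summary can be pulled out of a saved pg query with a few lines of Python. This is a minimal sketch, assuming the query JSON has a top-level "info" object and a "peer_info" list whose entries look like the peer block quoted in this thread; those two key names and the function name are assumptions, since only fragments of the output are shown here.

```python
import json


def summarize_peers(pg_query):
    """Collect blocked_by, dne and empty peers from a parsed pg query dict."""
    stats = pg_query.get("info", {}).get("stats", {})
    peers = pg_query.get("peer_info", [])
    return {
        "blocked_by": list(stats.get("blocked_by", [])),
        # Peers the cluster reports as "Does Not Exist".
        "dne": [p["peer"] for p in peers if p.get("dne")],
        # Peers that exist but hold no data for this pg.
        "empty": [p["peer"] for p in peers if p.get("empty") and not p.get("dne")],
    }


if __name__ == "__main__":
    # Hand-written sample in the shape assumed above, not real cluster output.
    sample = json.loads("""
    {
      "info": {"stats": {"blocked_by": [14, 17]}},
      "peer_info": [
        {"peer": "14", "empty": 1, "dne": 0},
        {"peer": "17", "empty": 1, "dne": 1}
      ]
    }
    """)
    print(summarize_peers(sample))
```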

Can we get a complete background here please?

On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello,

After a major network outage our ceph cluster ended up with an inactive PG:

# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests
pg 3.367 is stuck inactive for 912263.766607, current state incomplete, last acting [28,35,2]
pg 3.367 is stuck unclean for 912263.766688, current state incomplete, last acting [28,35,2]
pg 3.367 is incomplete, acting [28,35,2]
1 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.28
1 osds have slow requests

# ceph -s
    cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
     health HEALTH_WARN
            1 pgs incomplete
            1 pgs stuck inactive
            1 pgs stuck unclean
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
            election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
     osdmap e60609: 72 osds: 72 up, 72 in
      pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
            490 GB used, 130 TB / 130 TB avail
                4863 active+clean
                   1 incomplete
  client io 0 B/s rd, 38465 B/s wr, 2 op/s

ceph pg repair doesn't change anything. What should I try to recover it?
Attached is the result of ceph pg query on the problem PG.

Thank you,
Laszlo

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
