1. Decide which copy you want to keep and export it with ceph-objectstore-tool.
2. Delete all copies on all OSDs with ceph-objectstore-tool (not by deleting the directory on the disk).
3. Use force_create_pg to recreate the pg empty.
4. Use ceph-objectstore-tool to import the exported pg copy.
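A minimal sketch of how those steps might look on a Hammer-era filestore OSD, assuming pg 3.367 from this thread, the default data/journal paths, an export file under /root, and osd.2/osd.28 standing in for whichever OSDs hold copies; flag spellings vary a little between ceph-objectstore-tool builds, so check --help on yours first:

    # keep OSDs from being marked out while you work
    ceph osd set noout

    # 1. export the copy you want to keep (the OSD must be stopped
    #    before every ceph-objectstore-tool invocation)
    systemctl stop ceph-osd@2        # or "stop ceph-osd id=2" on upstart
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --journal-path /var/lib/ceph/osd/ceph-2/journal \
        --pgid 3.367 --op export --file /root/pg3.367.export
    systemctl start ceph-osd@2

    # 2. remove every copy of the pg with the tool, one OSD at a time,
    #    again with that OSD stopped (newer builds may also want --force)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 \
        --journal-path /var/lib/ceph/osd/ceph-28/journal \
        --pgid 3.367 --op remove

    # 3. recreate the pg empty
    ceph pg force_create_pg 3.367

    # 4. import the exported copy into one of the acting OSDs, start it,
    #    and let peering/backfill take it from there
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --journal-path /var/lib/ceph/osd/ceph-2/journal \
        --op import --file /root/pg3.367.export

    ceph osd unset noout

The reason for using the tool rather than rm is that the tool also removes the pg's associated metadata (omap/leveldb entries), which deleting the directory on disk leaves behind.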
On Wed, Mar 15, 2017 at 12:00 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I have tried to recover the pg using the following steps:
>
> Preparation:
> 1. set noout
> 2. stop osd.2
> 3. use ceph-objectstore-tool to export from osd.2
> 4. start osd.2
> 5. repeat steps 2-4 on osd 35, 28, 63 (I've done these hoping to be able
> to use one of those exports to recover the PG)
>
> First attempt:
> 1. stop osd.2
> 2. remove the 3.367_head directory
> 3. start osd.2
>
> Here I was hoping that the cluster would recover the pg from the 2 other
> identical osds. It did NOT. So I have tried the following commands on the PG:
> ceph pg repair
> ceph pg scrub
> ceph pg deep-scrub
> ceph pg force_create_pg
> Nothing changed. My PG was still incomplete. So I tried to remove all the
> OSDs that were referenced in the pg query:
>
> 1. stop osd.2
> 2. delete the 3.367_head directory
> 3. start osd.2
> 4. repeat steps 1-3 for all the OSDs that were listed in the pg query
> 5. did an import from one of the exports -> I was able again to query the
> pg (that was impossible when all the 3.367_head dirs were deleted) and the
> stats were saying that the number of objects is 6 and the size is 21M (all
> correct values according to the files I was able to see before starting
> the procedure). But the PG is still incomplete.
>
> What else can I try?
>
> Thank you,
> Laszlo
>
> On 12.03.2017 13:06, Brad Hubbard wrote:
>> On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>> Hello,
>>>
>>> I have already done the export with ceph_objectstore_tool. I just have
>>> to decide which OSDs to keep.
>>> Can you tell me why the directory structure in the OSDs is different
>>> for the same PG when checking on different OSDs?
>>> For instance, in OSD 2 and 63 there are NO subdirectories in
>>> 3.367_head, while OSD 28, 35 contain
>>> ./DIR_7/DIR_6/DIR_B/
>>> ./DIR_7/DIR_6/DIR_3/
>>>
>>> When are these subdirectories created?
>>>
>>> The files are identical on all the OSDs, only the way these are stored
>>> is different. It would be enough if you could point me to some
>>> documentation that explains these, I'll read it. So far, searching for
>>> the architecture of an OSD, I could not find the gory details about
>>> these directories.
>>
>> https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h
>>
>>> Kind regards,
>>> Laszlo
>>>
>>> On 12.03.2017 02:12, Brad Hubbard wrote:
>>>> On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>> Hello,
>>>>>
>>>>> Thank you for your answer.
>>>>>
>>>>> Indeed the min_size is 1:
>>>>>
>>>>> # ceph osd pool get volumes size
>>>>> size: 3
>>>>> # ceph osd pool get volumes min_size
>>>>> min_size: 1
>>>>> #
>>>>>
>>>>> I'm going to try to find the mentioned discussions on the mailing
>>>>> lists and read them. If you have a link at hand, it would be nice if
>>>>> you could send it to me.
>>>>
>>>> This thread is one example, there are lots more:
>>>>
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>>>
>>>>> In the attached file you can see the contents of the directory
>>>>> containing PG data on the different OSDs (all that have appeared in
>>>>> the pg query). According to the md5sums the files are identical. What
>>>>> bothers me is the directory structure (you can see the ls -R in each
>>>>> dir that contains files).
>>>>
>>>> So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
>>>> since 68 is listed as empty in the pg query.
>>>>
>>>>> Where can I read about how/why those DIR_# subdirectories have
>>>>> appeared?
>>>>>
>>>>> Given that the files themselves are identical on the "current" OSDs
>>>>> belonging to the PG, and as osd.63 (currently not belonging to the
>>>>> PG) has the same files, is it safe to stop osd.2, remove the
>>>>> 3.367_head dir, and then restart the OSD? (all these with the noout
>>>>> flag set of course)
>>>>
>>>> *You* need to decide which is the "good" copy and then follow the
>>>> instructions in the links I provided to try and recover the pg. Back
>>>> those known copies on 2, 28, 35 and 63 up with the
>>>> ceph_objectstore_tool before proceeding. They may well be identical
>>>> but the peering process still needs to "see" the relevant logs and
>>>> currently something is stopping it doing so.
>>>>
>>>>> Kind regards,
>>>>> Laszlo
>>>>>
>>>>> On 11.03.2017 00:32, Brad Hubbard wrote:
>>>>>> So this is why it happened I guess.
>>>>>>
>>>>>> pool 3 'volumes' replicated size 3 min_size 1
>>>>>>
>>>>>> min_size = 1 is a recipe for disasters like this and there are
>>>>>> plenty of ML threads about not setting it below 2.
>>>>>>
>>>>>> The past intervals in the pg query show several intervals where a
>>>>>> single OSD may have gone rw.
>>>>>>
>>>>>> How important is this data?
>>>>>>
>>>>>> I would suggest checking which of these OSDs actually have the data
>>>>>> for this pg. From the pg query it looks like 2, 35 and 68 and
>>>>>> possibly 28 since it's the primary. Check all OSDs in the pg query
>>>>>> output. I would then back up all copies and work out which copy, if
>>>>>> any, you want to keep and then attempt something like the following.
>>>>>>
>>>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17820.html
>>>>>>
>>>>>> If you want to abandon the pg see
>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
>>>>>> for a possible solution.
>>>>>>
>>>>>> http://ceph.com/community/incomplete-pgs-oh-my/ may also give some
>>>>>> ideas.
>>>>>>
>>>>>> On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>> The OSDs are all there.
>>>>>>>
>>>>>>> $ sudo ceph osd stat
>>>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>>>
>>>>>>> and I have attached the result of the ceph osd tree and ceph osd
>>>>>>> dump commands.
>>>>>>> I got some extra info about the network problem. A faulty network
>>>>>>> device has flooded the network, eating up all the bandwidth, so the
>>>>>>> OSDs were not able to properly communicate with each other. This
>>>>>>> has lasted for almost 1 day.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Laszlo
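Relating to the suggestion quoted above to first check which OSDs actually hold data for pg 3.367 and to compare the copies before choosing one, a rough sketch of that check; the default filestore paths and the /tmp output files are only examples. Because the DIR_N hashing layout can differ between OSDs even when the objects are identical, it compares content checksums rather than paths:

    # on each OSD host: does this OSD still have a head directory for pg 3.367?
    find /var/lib/ceph/osd/ceph-*/current -maxdepth 1 -type d -name '3.367_head'

    # per OSD: checksum the object files, ignoring the directory layout
    cd /var/lib/ceph/osd/ceph-2/current/3.367_head
    find . -type f -exec md5sum {} + | awk '{print $1}' | sort > /tmp/osd2-3.367.sums

    # then compare any two copies
    diff /tmp/osd2-3.367.sums /tmp/osd28-3.367.sums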
>>>>>>> On 10.03.2017 12:19, Brad Hubbard wrote:
>>>>>>>> To me it looks like someone may have done an "rm" on these OSDs but
>>>>>>>> not removed them from the crushmap. This does not happen
>>>>>>>> automatically.
>>>>>>>>
>>>>>>>> Do these OSDs show up in "ceph osd tree" and "ceph osd dump"? If so,
>>>>>>>> paste the output.
>>>>>>>>
>>>>>>>> Without knowing what exactly happened here it may be difficult to
>>>>>>>> work out how to proceed.
>>>>>>>>
>>>>>>>> In order to go clean, the primary needs to communicate with multiple
>>>>>>>> OSDs, some of which are marked DNE and seem to be uncontactable.
>>>>>>>>
>>>>>>>> This seems to be more than a network issue (unless the outage is
>>>>>>>> still happening).
>>>>>>>>
>>>>>>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete
>>>>>>>>
>>>>>>>> On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I was informed that due to a networking issue the ceph cluster
>>>>>>>>> network was affected. There was a huge packet loss, and network
>>>>>>>>> interfaces were flipping. That's all I got.
>>>>>>>>> This outage has lasted a longer period of time, so I assume that
>>>>>>>>> some OSDs may have been considered dead and the data from them has
>>>>>>>>> been moved away to other PGs (this is what ceph is supposed to do
>>>>>>>>> if I'm correct). Probably that was the point when the listed PGs
>>>>>>>>> have appeared in the picture.
>>>>>>>>>
>>>>>>>>> From the query we can see this for one of those OSDs:
>>>>>>>>>
>>>>>>>>>     {
>>>>>>>>>         "peer": "14",
>>>>>>>>>         "pgid": "3.367",
>>>>>>>>>         "last_update": "0'0",
>>>>>>>>>         "last_complete": "0'0",
>>>>>>>>>         "log_tail": "0'0",
>>>>>>>>>         "last_user_version": 0,
>>>>>>>>>         "last_backfill": "MAX",
>>>>>>>>>         "purged_snaps": "[]",
>>>>>>>>>         "history": {
>>>>>>>>>             "epoch_created": 4,
>>>>>>>>>             "last_epoch_started": 54899,
>>>>>>>>>             "last_epoch_clean": 55143,
>>>>>>>>>             "last_epoch_split": 0,
>>>>>>>>>             "same_up_since": 60603,
>>>>>>>>>             "same_interval_since": 60603,
>>>>>>>>>             "same_primary_since": 60593,
>>>>>>>>>             "last_scrub": "2852'33528",
>>>>>>>>>             "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "last_deep_scrub": "2852'16480",
>>>>>>>>>             "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
>>>>>>>>>             "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
>>>>>>>>>         },
>>>>>>>>>         "stats": {
>>>>>>>>>             "version": "0'0",
>>>>>>>>>             "reported_seq": "14",
>>>>>>>>>             "reported_epoch": "59779",
>>>>>>>>>             "state": "down+peering",
>>>>>>>>>             "last_fresh": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_change": "2017-02-27 16:30:15.267995",
>>>>>>>>>             "last_active": "0.000000",
>>>>>>>>>             "last_peered": "0.000000",
>>>>>>>>>             "last_clean": "0.000000",
>>>>>>>>>             "last_became_active": "0.000000",
>>>>>>>>>             "last_became_peered": "0.000000",
>>>>>>>>>             "last_unstale": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_undegraded": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "last_fullsized": "2017-02-27 16:30:16.230519",
>>>>>>>>>             "mapping_epoch": 60601,
>>>>>>>>>             "log_start": "0'0",
>>>>>>>>>             "ondisk_log_start": "0'0",
>>>>>>>>>             "created": 4,
>>>>>>>>>             "last_epoch_clean": 55143,
>>>>>>>>>             "parent": "0.0",
>>>>>>>>>             "parent_split_bits": 0,
>>>>>>>>>             "last_scrub": "2852'33528",
>>>>>>>>>             "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "last_deep_scrub": "2852'16480",
>>>>>>>>>             "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
>>>>>>>>>             "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>>>             "log_size": 0,
>>>>>>>>>             "ondisk_log_size": 0,
>>>>>>>>>             "stats_invalid": "0",
>>>>>>>>>             "stat_sum": {
>>>>>>>>>                 "num_bytes": 0,
>>>>>>>>>                 "num_objects": 0,
>>>>>>>>>                 "num_object_clones": 0,
>>>>>>>>>                 "num_object_copies": 0,
>>>>>>>>>                 "num_objects_missing_on_primary": 0,
>>>>>>>>>                 "num_objects_degraded": 0,
>>>>>>>>>                 "num_objects_misplaced": 0,
>>>>>>>>>                 "num_objects_unfound": 0,
>>>>>>>>>                 "num_objects_dirty": 0,
>>>>>>>>>                 "num_whiteouts": 0,
>>>>>>>>>                 "num_read": 0,
>>>>>>>>>                 "num_read_kb": 0,
>>>>>>>>>                 "num_write": 0,
>>>>>>>>>                 "num_write_kb": 0,
>>>>>>>>>                 "num_scrub_errors": 0,
>>>>>>>>>                 "num_shallow_scrub_errors": 0,
>>>>>>>>>                 "num_deep_scrub_errors": 0,
>>>>>>>>>                 "num_objects_recovered": 0,
>>>>>>>>>                 "num_bytes_recovered": 0,
>>>>>>>>>                 "num_keys_recovered": 0,
>>>>>>>>>                 "num_objects_omap": 0,
>>>>>>>>>                 "num_objects_hit_set_archive": 0,
>>>>>>>>>                 "num_bytes_hit_set_archive": 0
>>>>>>>>>             },
>>>>>>>>>             "up": [
>>>>>>>>>                 28,
>>>>>>>>>                 35,
>>>>>>>>>                 2
>>>>>>>>>             ],
>>>>>>>>>             "acting": [
>>>>>>>>>                 28,
>>>>>>>>>                 35,
>>>>>>>>>                 2
>>>>>>>>>             ],
>>>>>>>>>             "blocked_by": [],
>>>>>>>>>             "up_primary": 28,
>>>>>>>>>             "acting_primary": 28
>>>>>>>>>         },
>>>>>>>>>         "empty": 1,
>>>>>>>>>         "dne": 0,
>>>>>>>>>         "incomplete": 0,
>>>>>>>>>         "last_epoch_started": 0,
>>>>>>>>>         "hit_set_history": {
>>>>>>>>>             "current_last_update": "0'0",
>>>>>>>>>             "current_last_stamp": "0.000000",
>>>>>>>>>             "current_info": {
>>>>>>>>>                 "begin": "0.000000",
>>>>>>>>>                 "end": "0.000000",
>>>>>>>>>                 "version": "0'0",
>>>>>>>>>                 "using_gmt": "1"
>>>>>>>>>             },
>>>>>>>>>             "history": []
>>>>>>>>>         }
>>>>>>>>>     },
>>>>>>>>>
>>>>>>>>> Where can I read more about the meaning of each parameter? Some of
>>>>>>>>> them have quite self-explanatory names, but not all (or probably we
>>>>>>>>> need a deeper knowledge to understand them).
>>>>>>>>> Isn't there any parameter that would say when that OSD was assigned
>>>>>>>>> to the given PG? Also, the stat_sum shows 0 for all its parameters.
>>>>>>>>> Why is it blocking then?
>>>>>>>>>
>>>>>>>>> Is there a way to tell the PG to forget about that OSD?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> On 10.03.2017 03:05, Brad Hubbard wrote:
>>>>>>>>>> Can you explain more about what happened?
>>>>>>>>>>
>>>>>>>>>> The query shows progress is blocked by the following OSDs.
>>>>>>>>>>
>>>>>>>>>>     "blocked_by": [
>>>>>>>>>>         14,
>>>>>>>>>>         17,
>>>>>>>>>>         51,
>>>>>>>>>>         58,
>>>>>>>>>>         63,
>>>>>>>>>>         64,
>>>>>>>>>>         68,
>>>>>>>>>>         70
>>>>>>>>>>     ],
>>>>>>>>>>
>>>>>>>>>> Some of these OSDs are marked as "dne" (Does Not Exist).
>>>>>>>>>>
>>>>>>>>>>     "peer": "17",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "51",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "58",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "64",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>     "peer": "70",
>>>>>>>>>>     "dne": 1,
>>>>>>>>>>
>>>>>>>>>> Can we get a complete background here please?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> After a major network outage our ceph cluster ended up with an
>>>>>>>>>>> inactive PG:
>>>>>>>>>>>
>>>>>>>>>>> # ceph health detail
>>>>>>>>>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck
>>>>>>>>>>> unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests
>>>>>>>>>>> pg 3.367 is stuck inactive for 912263.766607, current state
>>>>>>>>>>> incomplete, last acting [28,35,2]
>>>>>>>>>>> pg 3.367 is stuck unclean for 912263.766688, current state
>>>>>>>>>>> incomplete, last acting [28,35,2]
>>>>>>>>>>> pg 3.367 is incomplete, acting [28,35,2]
>>>>>>>>>>> 1 ops are blocked > 268435 sec
>>>>>>>>>>> 1 ops are blocked > 268435 sec on osd.28
>>>>>>>>>>> 1 osds have slow requests
>>>>>>>>>>>
>>>>>>>>>>> # ceph -s
>>>>>>>>>>>     cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
>>>>>>>>>>>      health HEALTH_WARN
>>>>>>>>>>>             1 pgs incomplete
>>>>>>>>>>>             1 pgs stuck inactive
>>>>>>>>>>>             1 pgs stuck unclean
>>>>>>>>>>>             1 requests are blocked > 32 sec
>>>>>>>>>>>      monmap e3: 3 mons at
>>>>>>>>>>> {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
>>>>>>>>>>>             election epoch 72, quorum 0,1,2 tv-dl360-1,tv-dl360-2,tv-dl360-3
>>>>>>>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>>>>>>>       pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778 objects
>>>>>>>>>>>             490 GB used, 130 TB / 130 TB avail
>>>>>>>>>>>                 4863 active+clean
>>>>>>>>>>>                    1 incomplete
>>>>>>>>>>>   client io 0 B/s rd, 38465 B/s wr, 2 op/s
>>>>>>>>>>>
>>>>>>>>>>> ceph pg repair doesn't change anything. What should I try to
>>>>>>>>>>> recover it?
>>>>>>>>>>> Attached is the result of ceph pg query on the problem PG.
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Laszlo
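For pulling the blocking peers out of the query without scrolling through the whole JSON, something like the following may help; the paths assume the Hammer-era layout quoted above (blocked_by under info.stats, a dne flag on each peer_info entry), so adjust them if your version lays the output out differently:

    # dump the query once and inspect it offline
    ceph pg 3.367 query > /tmp/pg3.367-query.json

    # OSDs the primary reports it is blocked by
    jq '.info.stats.blocked_by' /tmp/pg3.367-query.json

    # peers the query marks as dne ("Does Not Exist")
    jq '[.peer_info[] | select(.dne == 1) | .peer]' /tmp/pg3.367-query.json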
--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com