Re: pgs stuck inactive

On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I have already done the export with ceph_objectstore_tool. I just have to
> decide which OSDs to keep.
> Can you tell me why the directory structure in the OSDs is different for the
> same PG when checking on different OSDs?
> For instance, in OSDs 2 and 63 there are NO subdirectories in
> 3.367_head, while OSDs 28 and 35 contain
> ./DIR_7/DIR_6/DIR_B/
> ./DIR_7/DIR_6/DIR_3/
>
> When are these subdirectories created?
>
> The files are identical on all the OSDs; only the way they are stored
> differs. It would be enough if you could point me to some documentation
> that explains this, and I'll read it. So far, while searching for the
> architecture of an OSD, I could not find the gory details about these
> directories.

https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h
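
In short: filestore places each object under nested DIR_<hexdigit> folders
derived from the reversed hex digits of the object's 32-bit hash, and it only
creates a deeper level when a folder grows past a split threshold (roughly
16 * filestore_split_multiple * filestore_merge_threshold files, if I recall
the defaults correctly). Each OSD splits independently, based on its own local
file counts and history, which is why the same PG can be flat on one OSD and
nested on another. A rough sketch of the placement logic (hypothetical code,
not the actual HashIndex implementation; dir_path_for is an invented helper):

```python
# Hedged sketch of filestore's hash-based placement (NOT the real
# HashIndex code): objects land under nested DIR_<digit> folders named
# after the reversed hex digits of the object's 32-bit hash.

def dir_path_for(hash32: int, levels: int) -> str:
    """Return the DIR_* path an object would live in, given how many
    levels of splitting have already happened in this PG collection."""
    # filestore consumes the hash's hex digits from least significant
    # to most significant, one digit per directory level
    digits = format(hash32, "08x")[::-1]
    return "/".join("DIR_%s" % d.upper() for d in digits[:levels])

# Two example objects in pg 3.367: hashes ending in ...367 and ...b67
# both map to this pg, so with three split levels they share
# DIR_7/DIR_6 and then diverge.
print(dir_path_for(0x1a2b4367, 3))   # DIR_7/DIR_6/DIR_3
print(dir_path_for(0x9c0fdb67, 3))   # DIR_7/DIR_6/DIR_B
print(dir_path_for(0x1a2b4367, 0))   # "" (unsplit collection: flat dir)
```

That matches your listing: the split copies on OSDs 28 and 35 show both
DIR_7/DIR_6/DIR_3 and DIR_7/DIR_6/DIR_B, consistent with object hashes ending
in ...367 and ...b67, while the unsplit copies on 2 and 63 keep everything in
the flat 3.367_head directory. The on-disk layout is a local detail and does
not affect the object data itself.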

>
> Kind regards,
> Laszlo
>
>
> On 12.03.2017 02:12, Brad Hubbard wrote:
>>
>> On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx>
>> wrote:
>>>
>>> Hello,
>>>
>>> Thank you for your answer.
>>>
>>> indeed the min_size is 1:
>>>
>>> # ceph osd pool get volumes size
>>> size: 3
>>> # ceph osd pool get volumes min_size
>>> min_size: 1
>>> #
>>> I'm going to try to find the mentioned discussions on the mailing
>>> lists and read them. If you have a link at hand, it would be nice if
>>> you could send it to me.
>>
>>
>> This thread is one example, there are lots more.
>>
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>
>>>
>>> In the attached file you can see the contents of the directory containing
>>> PG
>>> data on the different OSDs (all that have appeared in the pg query).
>>> According to the md5sums the files are identical. What bothers me is the
>>> directory structure (you can see the ls -R in each dir that contains
>>> files).
>>
>>
>> So I mixed up 63 and 68; my list should have read 2, 28, 35 and 63,
>> since 68 is listed as empty in the pg query.
>>
>>>
>>> Where can I read about how/why those DIR# subdirectories have appeared?
>>>
>>> Given that the files themselves are identical on the "current" OSDs
>>> belonging to the PG, and that osd.63 (currently not belonging to the
>>> PG) has the same files, is it safe to stop osd.2, remove the
>>> 3.367_head dir, and then restart the OSD? (all this with the noout
>>> flag set, of course)
>>
>>
>> *You* need to decide which is the "good" copy and then follow the
>> instructions in the links I provided to try to recover the pg. Back
>> up those known copies on 2, 28, 35 and 63 with
>> ceph_objectstore_tool before proceeding. They may well be identical,
>> but the peering process still needs to "see" the relevant logs, and
>> currently something is stopping it from doing so.
>>
>>>
>>> Kind regards,
>>> Laszlo
>>>
>>>
>>> On 11.03.2017 00:32, Brad Hubbard wrote:
>>>>
>>>>
>>>> So this is why it happened I guess.
>>>>
>>>> pool 3 'volumes' replicated size 3 min_size 1
>>>>
>>>> min_size = 1 is a recipe for disasters like this and there are plenty
>>>> of ML threads about not setting it below 2.
>>>>
>>>> The past intervals in the pg query show several intervals where a
>>>> single OSD may have gone rw.
>>>>
>>>> How important is this data?
>>>>
>>>> I would suggest checking which of these OSDs actually have the data
>>>> for this pg. From the pg query it looks like 2, 35 and 68 and possibly
>>>> 28 since it's the primary. Check all OSDs in the pg query output. I
>>>> would then back up all copies and work out which copy, if any, you
>>>> want to keep and then attempt something like the following.
>>>>
>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17820.html
>>>>
>>>> If you want to abandon the pg see
>>>>
>>>>
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
>>>> for a possible solution.
>>>>
>>>> http://ceph.com/community/incomplete-pgs-oh-my/ may also give some
>>>> ideas.
>>>>
>>>>
>>>> On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>
>>>>> The OSDs are all there.
>>>>>
>>>>> $ sudo ceph osd stat
>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>
>>>>> and I have attached the results of the ceph osd tree and ceph osd
>>>>> dump commands.
>>>>> I got some extra info about the network problem. A faulty network
>>>>> device flooded the network, eating up all the bandwidth, so the OSDs
>>>>> were not able to communicate properly with each other. This lasted
>>>>> for almost 1 day.
>>>>>
>>>>> Thank you,
>>>>> Laszlo
>>>>>
>>>>>
>>>>>
>>>>> On 10.03.2017 12:19, Brad Hubbard wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> To me it looks like someone may have done an "rm" on these OSDs but
>>>>>> not removed them from the crushmap. This does not happen
>>>>>> automatically.
>>>>>>
>>>>>> Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
>>>>>> paste the output.
>>>>>>
>>>>>> Without knowing what exactly happened here it may be difficult to work
>>>>>> out how to proceed.
>>>>>>
>>>>>> In order to go clean, the primary needs to communicate with multiple
>>>>>> OSDs, some of which are marked DNE and seem to be uncontactable.
>>>>>>
>>>>>> This seems to be more than a network issue (unless the outage is still
>>>>>> happening).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai
>>>>>> <laszlo@xxxxxxxxxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I was informed that, due to a networking issue, the ceph cluster
>>>>>>> network was affected. There was huge packet loss, and network
>>>>>>> interfaces were flapping. That's all I got.
>>>>>>> This outage lasted a longer period of time, so I assume that some
>>>>>>> OSDs may have been considered dead and their data moved away to
>>>>>>> other PGs (this is what ceph is supposed to do, if I'm correct).
>>>>>>> That is probably when the listed PGs appeared in the picture.
>>>>>>> From the query we can see this for one of those OSDs:
>>>>>>>         {
>>>>>>>             "peer": "14",
>>>>>>>             "pgid": "3.367",
>>>>>>>             "last_update": "0'0",
>>>>>>>             "last_complete": "0'0",
>>>>>>>             "log_tail": "0'0",
>>>>>>>             "last_user_version": 0,
>>>>>>>             "last_backfill": "MAX",
>>>>>>>             "purged_snaps": "[]",
>>>>>>>             "history": {
>>>>>>>                 "epoch_created": 4,
>>>>>>>                 "last_epoch_started": 54899,
>>>>>>>                 "last_epoch_clean": 55143,
>>>>>>>                 "last_epoch_split": 0,
>>>>>>>                 "same_up_since": 60603,
>>>>>>>                 "same_interval_since": 60603,
>>>>>>>                 "same_primary_since": 60593,
>>>>>>>                 "last_scrub": "2852'33528",
>>>>>>>                 "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>                 "last_deep_scrub": "2852'16480",
>>>>>>>                 "last_deep_scrub_stamp": "2017-02-21
>>>>>>> 00:14:08.866448",
>>>>>>>                 "last_clean_scrub_stamp": "2017-02-26
>>>>>>> 02:36:55.210150"
>>>>>>>             },
>>>>>>>             "stats": {
>>>>>>>                 "version": "0'0",
>>>>>>>                 "reported_seq": "14",
>>>>>>>                 "reported_epoch": "59779",
>>>>>>>                 "state": "down+peering",
>>>>>>>                 "last_fresh": "2017-02-27 16:30:16.230519",
>>>>>>>                 "last_change": "2017-02-27 16:30:15.267995",
>>>>>>>                 "last_active": "0.000000",
>>>>>>>                 "last_peered": "0.000000",
>>>>>>>                 "last_clean": "0.000000",
>>>>>>>                 "last_became_active": "0.000000",
>>>>>>>                 "last_became_peered": "0.000000",
>>>>>>>                 "last_unstale": "2017-02-27 16:30:16.230519",
>>>>>>>                 "last_undegraded": "2017-02-27 16:30:16.230519",
>>>>>>>                 "last_fullsized": "2017-02-27 16:30:16.230519",
>>>>>>>                 "mapping_epoch": 60601,
>>>>>>>                 "log_start": "0'0",
>>>>>>>                 "ondisk_log_start": "0'0",
>>>>>>>                 "created": 4,
>>>>>>>                 "last_epoch_clean": 55143,
>>>>>>>                 "parent": "0.0",
>>>>>>>                 "parent_split_bits": 0,
>>>>>>>                 "last_scrub": "2852'33528",
>>>>>>>                 "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>>>>>                 "last_deep_scrub": "2852'16480",
>>>>>>>                 "last_deep_scrub_stamp": "2017-02-21
>>>>>>> 00:14:08.866448",
>>>>>>>                 "last_clean_scrub_stamp": "2017-02-26
>>>>>>> 02:36:55.210150",
>>>>>>>                 "log_size": 0,
>>>>>>>                 "ondisk_log_size": 0,
>>>>>>>                 "stats_invalid": "0",
>>>>>>>                 "stat_sum": {
>>>>>>>                     "num_bytes": 0,
>>>>>>>                     "num_objects": 0,
>>>>>>>                     "num_object_clones": 0,
>>>>>>>                     "num_object_copies": 0,
>>>>>>>                     "num_objects_missing_on_primary": 0,
>>>>>>>                     "num_objects_degraded": 0,
>>>>>>>                     "num_objects_misplaced": 0,
>>>>>>>                     "num_objects_unfound": 0,
>>>>>>>                     "num_objects_dirty": 0,
>>>>>>>                     "num_whiteouts": 0,
>>>>>>>                     "num_read": 0,
>>>>>>>                     "num_read_kb": 0,
>>>>>>>                     "num_write": 0,
>>>>>>>                     "num_write_kb": 0,
>>>>>>>                     "num_scrub_errors": 0,
>>>>>>>                     "num_shallow_scrub_errors": 0,
>>>>>>>                     "num_deep_scrub_errors": 0,
>>>>>>>                     "num_objects_recovered": 0,
>>>>>>>                     "num_bytes_recovered": 0,
>>>>>>>                     "num_keys_recovered": 0,
>>>>>>>                     "num_objects_omap": 0,
>>>>>>>                     "num_objects_hit_set_archive": 0,
>>>>>>>                     "num_bytes_hit_set_archive": 0
>>>>>>>                 },
>>>>>>>                 "up": [
>>>>>>>                     28,
>>>>>>>                     35,
>>>>>>>                     2
>>>>>>>                 ],
>>>>>>>                 "acting": [
>>>>>>>                     28,
>>>>>>>                     35,
>>>>>>>                     2
>>>>>>>                 ],
>>>>>>>                 "blocked_by": [],
>>>>>>>                 "up_primary": 28,
>>>>>>>                 "acting_primary": 28
>>>>>>>             },
>>>>>>>             "empty": 1,
>>>>>>>             "dne": 0,
>>>>>>>             "incomplete": 0,
>>>>>>>             "last_epoch_started": 0,
>>>>>>>             "hit_set_history": {
>>>>>>>                 "current_last_update": "0'0",
>>>>>>>                 "current_last_stamp": "0.000000",
>>>>>>>                 "current_info": {
>>>>>>>                     "begin": "0.000000",
>>>>>>>                     "end": "0.000000",
>>>>>>>                     "version": "0'0",
>>>>>>>                     "using_gmt": "1"
>>>>>>>                 },
>>>>>>>                 "history": []
>>>>>>>             }
>>>>>>>         },
>>>>>>>
>>>>>>> Where can I read more about the meaning of each parameter? Some of
>>>>>>> them have quite self-explanatory names, but not all (or probably we
>>>>>>> need deeper knowledge to understand them).
>>>>>>> Isn't there any parameter that would say when that OSD was assigned
>>>>>>> to the given PG? Also, the stat_sum shows 0 for all its fields. Why
>>>>>>> is it blocking then?
>>>>>>>
>>>>>>> Is there a way to tell the PG to forget about that OSD?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Laszlo
>>>>>>>
>>>>>>>
>>>>>>> On 10.03.2017 03:05, Brad Hubbard wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you explain more about what happened?
>>>>>>>>
>>>>>>>> The query shows progress is blocked by the following OSDs.
>>>>>>>>
>>>>>>>>                 "blocked_by": [
>>>>>>>>                     14,
>>>>>>>>                     17,
>>>>>>>>                     51,
>>>>>>>>                     58,
>>>>>>>>                     63,
>>>>>>>>                     64,
>>>>>>>>                     68,
>>>>>>>>                     70
>>>>>>>>                 ],
>>>>>>>>
>>>>>>>> Some of these OSDs are marked as "dne" (Does Not Exist).
>>>>>>>>
>>>>>>>> peer": "17",
>>>>>>>> "dne": 1,
>>>>>>>> "peer": "51",
>>>>>>>> "dne": 1,
>>>>>>>> "peer": "58",
>>>>>>>> "dne": 1,
>>>>>>>> "peer": "64",
>>>>>>>> "dne": 1,
>>>>>>>> "peer": "70",
>>>>>>>> "dne": 1,
>>>>>>>>
>>>>>>>> Can we get a complete background here please?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 9, 2017 at 10:53 PM, Laszlo Budai
>>>>>>>> <laszlo@xxxxxxxxxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> After a major network outage our ceph cluster ended up with an
>>>>>>>>> inactive
>>>>>>>>> PG:
>>>>>>>>>
>>>>>>>>> # ceph health detail
>>>>>>>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck
>>>>>>>>> unclean;
>>>>>>>>> 1
>>>>>>>>> requests are blocked > 32 sec; 1 osds have slow requests
>>>>>>>>> pg 3.367 is stuck inactive for 912263.766607, current state
>>>>>>>>> incomplete,
>>>>>>>>> last
>>>>>>>>> acting [28,35,2]
>>>>>>>>> pg 3.367 is stuck unclean for 912263.766688, current state
>>>>>>>>> incomplete,
>>>>>>>>> last
>>>>>>>>> acting [28,35,2]
>>>>>>>>> pg 3.367 is incomplete, acting [28,35,2]
>>>>>>>>> 1 ops are blocked > 268435 sec
>>>>>>>>> 1 ops are blocked > 268435 sec on osd.28
>>>>>>>>> 1 osds have slow requests
>>>>>>>>>
>>>>>>>>> # ceph -s
>>>>>>>>>     cluster 6713d1b8-83da-11e6-aa79-525400d98c5a
>>>>>>>>>      health HEALTH_WARN
>>>>>>>>>             1 pgs incomplete
>>>>>>>>>             1 pgs stuck inactive
>>>>>>>>>             1 pgs stuck unclean
>>>>>>>>>             1 requests are blocked > 32 sec
>>>>>>>>>      monmap e3: 3 mons at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> {tv-dl360-1=10.12.193.73:6789/0,tv-dl360-2=10.12.193.74:6789/0,tv-dl360-3=10.12.193.75:6789/0}
>>>>>>>>>             election epoch 72, quorum 0,1,2
>>>>>>>>> tv-dl360-1,tv-dl360-2,tv-dl360-3
>>>>>>>>>      osdmap e60609: 72 osds: 72 up, 72 in
>>>>>>>>>       pgmap v3670252: 4864 pgs, 11 pools, 134 GB data, 23778
>>>>>>>>> objects
>>>>>>>>>             490 GB used, 130 TB / 130 TB avail
>>>>>>>>>                 4863 active+clean
>>>>>>>>>                    1 incomplete
>>>>>>>>>   client io 0 B/s rd, 38465 B/s wr, 2 op/s
>>>>>>>>>
>>>>>>>>> ceph pg repair doesn't change anything. What should I try to
>>>>>>>>> recover
>>>>>>>>> it?
>>>>>>>>> Attached is the result of ceph pg query on the problem PG.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
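
For the record, a sketch of how the per-OSD backups mentioned above might be
taken. Paths assume a default filestore layout, and the binary name follows
this thread's usage; on newer releases it is ceph-objectstore-tool, and the
OSD daemon must be stopped before the tool is run. Adjust OSD ids, paths and
the destination to your deployment.

```shell
# Sketch only: export pg 3.367 from each candidate OSD before any
# destructive step. Run on the host that holds each OSD, with that
# OSD daemon stopped first (e.g. "stop ceph-osd id=2" or
# "systemctl stop ceph-osd@2").
PGID=3.367
for OSD in 2 28 35 63; do
  ceph_objectstore_tool \
    --data-path "/var/lib/ceph/osd/ceph-$OSD" \
    --journal-path "/var/lib/ceph/osd/ceph-$OSD/journal" \
    --pgid "$PGID" --op export \
    --file "/root/pg-$PGID-osd-$OSD.export"
done
```

The matching --op import (or --op remove, for the copy you decide to discard)
follows the same pattern once you have chosen the authoritative copy.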



-- 
Cheers,
Brad


