Re: multiple OSD crash, unfound objects

Frank Schilder <frans@xxxxxx> · Fri, 16 Oct 2020 17:35:53 +0000

Dear Michael,

this is a tough nut to crack. I can't see anything obvious, but I have two hypotheses that you might consider testing.

1) Problem with 1 incomplete PG.

In the shadow hierarchy for your cluster I can see quite a lot of nodes like

        {
            "id": -135,
            "name": "node229~hdd",
            "type_id": 1,
            "type_name": "host",
            "weight": 0,
            "alg": "straw2",
            "hash": "rjenkins1",
            "items": []
        },

I would have expected that hosts without a device of a certain device class are *excluded* completely from a tree instead of having weight 0. I'm wondering if this could cause the crush algorithm to fail in the way described here: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon . This might be a long shot, but could you export your crush map and play with the tunables as described under this link to see if more tries lead to a valid mapping? Note that testing this is harmless and does not change anything on the cluster.

The hypothesis here is that buckets with weight 0 are not excluded from drawing a-priori, but a-posteriori. If there are too many draws of an empty bucket, a mapping fails. Allowing more tries should then lead to success. We should at least rule out this possibility.
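
To get a feeling for how many of these empty shadow buckets exist, something like the following could be run against a saved copy of the "ceph osd crush dump" output. This is only a minimal sketch: it assumes the dump has a top-level "buckets" list whose entries look like the node229~hdd snippet above, and the second bucket in the sample (node230~hdd, id -140, osd 7) is made up for illustration.

```python
import json


def empty_shadow_hosts(crush_dump):
    """Return the names of shadow host buckets (name contains '~') without items."""
    return sorted(
        bucket["name"]
        for bucket in crush_dump.get("buckets", [])
        if bucket.get("type_name") == "host"
        and "~" in bucket.get("name", "")
        and not bucket.get("items")
    )


# Hand-made sample in the shape of the snippet quoted above.
# node230~hdd and its item are hypothetical, not taken from the real cluster.
sample = {
    "buckets": [
        {"id": -135, "name": "node229~hdd", "type_name": "host",
         "weight": 0, "items": []},
        {"id": -140, "name": "node230~hdd", "type_name": "host",
         "weight": 65536, "items": [{"id": 7, "weight": 65536, "pos": 0}]},
    ]
}

print(empty_shadow_hosts(sample))
```

With a real dump saved to a file, the sample dict would be replaced by json.load() on that file. The script only reads the JSON and does not touch the cluster.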

2) About the incomplete PG.

I'm wondering if the problem is that the pool has exactly 1 PG. I don't have a test pool with Nautilus and cannot try this out. Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? If not, can you then increase pg_num and pgp_num to, say, 10 and see if this has any effect?

I'm wondering here if there needs to be a minimum number >1 of PGs in a pool. Again, this is more about ruling out a possibility than expecting success. As an extension to this test, you could increase pg_num and pgp_num of the pool device_health_metrics to see if this has any effect.

The crush rules and crush tree look OK to me. I can't really see why the missing OSDs are not assigned to the two PGs 1.0 and 7.39d.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 16 October 2020 15:41:29
To: Michael Thomas; ceph-users@xxxxxxx
Subject:  Re: multiple OSD crash, unfound objects

Dear Michael,

> Please mark OSD 41 as "in" again and wait for some slow ops to show up.

I forgot. "wait for some slow ops to show up" ... and then what?

Could you please go to the host of the affected OSD and look at the output of "ceph daemon osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and check what type of operations get stuck? I'm wondering if it's administrative, like peering attempts.
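
If the output is long, a rough tally by operation type may help. The sketch below assumes the JSON has a list of operations under a key named "ops" (or "Ops" in some versions) and that each entry has a "description" string starting with the message type, followed by "(". Both are assumptions about the output format, and the descriptions in the sample are placeholders, not real log content.

```python
from collections import Counter


def op_kind(description):
    """Crude classification: the text before the first '(' in the description."""
    return description.split("(", 1)[0].strip() or "unknown"


def tally_ops(dump):
    """Count operations per kind in a parsed ops / dump_historic_slow_ops dump."""
    ops = dump.get("ops") or dump.get("Ops") or []
    return Counter(op_kind(op.get("description", "")) for op in ops)


# Placeholder data; the descriptions are hypothetical.
sample = {
    "ops": [
        {"description": "osd_op(hypothetical client request)"},
        {"description": "osd_op(another hypothetical client request)"},
        {"description": "pg_notify(hypothetical peering message)"},
        {"description": ""},
    ]
}

for kind, count in tally_ops(sample).most_common():
    print(kind, count)
```

A tally dominated by peering-type messages rather than client ops would point towards the administrative case.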

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 16 October 2020 15:09:20
To: Michael Thomas; ceph-users@xxxxxxx
Subject: Re:  Re: multiple OSD crash, unfound objects

Dear Michael,

thanks for this initial work. I will need to look through the files you posted in more detail. In the meantime:

Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far as I can see, marking it "out" might have cleared hanging slow ops (there were 1000 before), but they then started piling up again. From the OSD log it looks like an operation that is sent to/from PG 1.0, which doesn't respond because it is inactive. Hence, getting PG 1.0 active should resolve this issue (later).

It's a bit strange that I see slow ops for OSD 41 in the latest health detail (https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report was created?

I think we might have misunderstood my question 6. My question was whether or not each host bucket corresponds to a physical host and vice versa, that is, each physical host has exactly 1 host bucket. I'm asking because it is possible to have multiple host buckets assigned to a single physical host and this has implications on how to manage things.

Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I can see), the problem is that it has no OSDs assigned. I need to look a bit longer at the data you uploaded to find out why. I can't see anything obvious.
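
For reference, the two symptoms (a PG with no OSDs at all, and a PG with a hole in its set) can be checked mechanically. The sketch below assumes a JSON PG mapping with "up" and "acting" lists of OSD ids, and that an unfilled slot is encoded as 2147483647 (the value that the pg dump prints as NONE). The field names are an assumption about the JSON output. The 7.39d sets are copied from the pg dump earlier in this thread; the empty sets for 1.0 reflect the description above.

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # assumed encoding of the NONE placeholder


def unmapped_slots(pg_map):
    """Report, for the up and acting sets, whether the set is empty and which slots are NONE."""
    report = {}
    for key in ("up", "acting"):
        osds = pg_map.get(key, [])
        holes = [i for i, osd in enumerate(osds) if osd == CRUSH_ITEM_NONE]
        report[key] = {"empty": not osds, "holes": holes}
    return report


pg_1_0 = {"up": [], "acting": []}
pg_7_39d = {
    "up": [116, 72, 122, 48, 45, 131, 73, 81],
    "acting": [71, 109, 99, 48, 45, 90, 73, CRUSH_ITEM_NONE],
}

print("1.0  ", unmapped_slots(pg_1_0))
print("7.39d", unmapped_slots(pg_7_39d))
```

For 7.39d this flags slot 7 of the acting set, which is the NONE entry.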

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 16 October 2020 02:08:01
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: multiple OSD crash, unfound objects

On 10/14/20 3:49 PM, Frank Schilder wrote:
> Hi Michael,
>
> it doesn't look too bad. All degraded objects are due to the undersized PG. If this is an EC pool with m>=2, data is currently not in danger.
>
> I see a few loose ends to pick up, let's hope this is something simple. For each of the steps below, please wait until all induced recovery IO has completed before attempting the next one.
>
> 1) Could you please paste the output of the following commands to pastebin (bash syntax):
>
>    ceph osd pool get device_health_metrics all

https://pastebin.com/6D83mjsV

>    ceph osd pool get fs.data.archive.frames all

https://pastebin.com/7XAaQcpC

>    ceph pg dump |& grep -i -e PG_STAT -e "^7.39d"

https://pastebin.com/tBLaq63Q

>    ceph osd crush rule ls

https://pastebin.com/6f5B778G

>    ceph osd erasure-code-profile ls

https://pastebin.com/uhAaMH1c

>    ceph osd crush dump # this is a big one, please be careful with copy-paste (see point 3 below)

https://pastebin.com/u92D23jV

> 2) I don't see any IO reported (neither user nor recovery). Could you please confirm that the command outputs were taken during a zero-IO period?

That's correct, there was no activity at this time.  Access to the
cephfs filesystem is very bursty, varying from completely idle to
multiple GB/s (read).

> 3) Something is wrong with osd.41. Can you check its health status with smartctl? If it is reported healthy, give it one more clean restart. If the slow ops do not disappear, it could be a disk failure that is not detected by health monitoring. You could set it to "out" and see if the cluster recovers to a healthy state (modulo the currently degraded objects) with no slow ops. If so, I would replace the disk.

smartctl reports no problems.

osd.41 and osd.0 were among the original OSDs used for the
device_health_metrics pool.  Early on, before I knew better, I had
removed these OSDs from the cluster, and the OSD ids got
recycled when new disks were later added.  This is when the slow ops on
osd.0 and osd.41 started getting reported.  On advice from another user
on ceph-users, I updated my crush map to remap the device_health_metrics
pool to a different set of OSDs (and the slow ops persisted).

osd.0 usually also shows slow ops.  I was a little surprised that it
didn't when I took this snapshot, but now it does.

I have now run 'ceph osd out 41', and the recovery I/O has finished.
With the exception of one less OSD marked in, the output of 'ceph
status' looks the same.

The last few lines of the osd.41 logfile are here:

https://pastebin.com/k06aArW4

How long does it take for ceph to clear the slow ops status?

> 4) In the output of "df tree" node141 shows up twice. Could you confirm that this is a copy-paste error or is this node indeed twice in the output? This is easiest to see in the pastebin when switching to "raw" view.

This was a copy/paste error.

> 5) The crush tree contains an empty host bucket (node308). Please delete this host bucket (ceph osd crush rm node308) for now and let me know if this caused any data movements (recovery IO).

This did not cause any data movement, according to 'ceph status'.

> 6) The crush tree looks a bit exotic. Do the nodes with a single OSD correspond to a physical host with 1 OSD disk? If not, could you please state how the host buckets are mapped onto physical hosts?

Each OSD corresponds to a single physical disk.  Hosts may have 1, 2 or
3 OSDs of varying types (HDD, SSD, or SSD+NVME).  There are a few
different crush types used in the cluster:

3 x replicated nvme - used for cephfs metadata
3 x replicated SSD - used for ovirt block storage
EC HDD - used for the bulk of the experiment data
EC SSD - used for frequently accessed experiment data

> 7) In case there was a change to the health status, could you please include an updated "ceph health detail"?

Looks like the only difference is a new slow MDS op, and one PG that
hasn't been deep scrubbed in the last week:

https://pastebin.com/3G3ij9ui

--Mike

> I don't expect to get the incomplete PG resolved with the above, but it will move some issues out of the way before proceeding.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <wart@xxxxxxxxxxx>
> Sent: 14 October 2020 20:52:10
> To: Andreas John; ceph-users@xxxxxxx
> Subject:  Re: multiple OSD crash, unfound objects
>
> Hello,
>
> The original cause of the OSD instability has already been fixed.  It
> was due to user jobs (via condor) consuming too much memory and causing
> the machine to swap.  The OSDs didn't actually crash, but weren't
> responding in time and were being flagged as down.
>
> In most cases, the problematic OSD servers were also not responding on
> the console and had to be physically power cycled to recover.
>
> Since adding additional memory limits to user jobs, we have only had 1
> or 2 unstable OSDs that were fixed by killing the remaining rogue user jobs.
>
> Regards,
>
> --Mike
>
> On 10/10/20 9:22 AM, Andreas John wrote:
>> Hello Mike,
>>
>> do your OSDs go down from time to time? I once had an issue with
>> unrecoverable objects, because I had only n+1 (size 2) redundancy and
>> ceph wasn't able to decide which copy of the object was correct. In my
>> case there were half-deleted snapshots in one of the copies. I used
>> ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD
>> logs? Do the OSDs go down with an obscure stack trace (and maybe get
>> restarted by systemd ...)?
>>
>> rgds,
>>
>> j.
>>
>>
>>
>> On 09.10.20 22:33, Michael Thomas wrote:
>>> Hi Frank,
>>>
>>> That was a good tip.  I was able to move the broken files out of the
>>> way and restore them for users.  However, after 2 weeks I'm still left
>>> with unfound objects.  Even more annoying, I now have 82k objects
>>> degraded (up from 74), which hasn't changed in over a week.
>>>
>>> I'm ready to claim that the auto-repair capabilities of ceph are not
>>> able to fix my particular issues, and will have to continue to
>>> investigate alternate ways to clean this up, including a pg
>>> export/import (as you suggested) and perhaps a mds backward scrub
>>> (after testing in a junk pool first).
>>>
>>> I have other tasks I need to perform on the filesystem (removing OSDs,
>>> adding new OSDs, increasing PG count), but I feel like I need to
>>> address these degraded/lost objects before risking any more damage.
>>>
>>> One particular PG is in a curious state:
>>>
>>> 7.39d    82163     82165     246734        1  344060777807            0
>>>    0   2139  active+recovery_unfound+undersized+degraded+remapped 23m
>>> 50755'112549   50766:960500       [116,72,122,48,45,131,73,81]p116
>>>        [71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500
>>> 2020-08-07T11:01:45.657036-0500
>>>
>>> Note the 'NONE' in the acting set.  I do not know which OSD this may
>>> have been, nor how to find out.  I suspect (without evidence) that
>>> this is part of the cause of no action on the degraded and misplaced
>>> objects.
>>>
>>> --Mike
>>>
>>> On 9/18/20 11:26 AM, Frank Schilder wrote:
>>>> Dear Michael,
>>>>
>>>> maybe there is a way to restore access for users and solve the issues
>>>> later. Someone else with a lost/unfound object was able to move the
>>>> affected file (or directory containing the file) to a separate
>>>> location and restore the now missing data from backup. This will
>>>> "park" the problem of cluster health for later fixing.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder <frans@xxxxxx>
>>>> Sent: 18 September 2020 15:38:51
>>>> To: Michael Thomas; ceph-users@xxxxxxx
>>>> Subject:  Re: multiple OSD crash, unfound objects
>>>>
>>>> Dear Michael,
>>>>
>>>>> I disagree with the statement that trying to recover health by deleting
>>>>> data is a contradiction.  In some cases (such as mine), the data in ceph
>>>>> is backed up in another location (eg tape library).  Restoring a few
>>>>> files from tape is a simple and cheap operation that takes a minute, at
>>>>> most.
>>>>
>>>> I would agree with that if the data was deleted using the appropriate
>>>> high-level operation. Deleting an unfound object is like marking a
>>>> sector on a disk as bad with smartctl. How should the file system
>>>> react to that? Purging an OSD is like removing a disk from a raid
>>>> set. Such operations increase inconsistencies/degradation rather than
>>>> resolving them. Cleaning this up also requires executing other
>>>> operations to remove all references to the object and, finally, the
>>>> file inode itself.
>>>>
>>>> The ls on a dir with corrupted file(s) hangs if ls calls stat on
>>>> every file. For example, when coloring is enabled, ls will stat every
>>>> file in the dir to be able to choose the color according to
>>>> permissions. If one then disables coloring, a plain "ls" will return
>>>> all names while an "ls -l" will hang due to stat calls.
>>>>
>>>> An "rm" or "rm -f" should succeed if the folder permissions allow
>>>> that. It should not stat the file itself, so it sounds a bit odd that
>>>> it's hanging. I guess in some situations it does, like "rm -i", which
>>>> will ask before removing read-only files. How does "unlink FILE" behave?
>>>>
>>>> Most admin commands on ceph are asynchronous. A command like "pg
>>>> repair" or "osd scrub" only schedules an operation. The command "ceph
>>>> pg 7.1fb mark_unfound_lost delete" probably does just the same.
>>>> Unfortunately, I don't know how to check that a scheduled operation
>>>> has started/completed/succeeded/failed. I asked this in an earlier
>>>> thread (about PG repair) and didn't get an answer. On our cluster,
>>>> the actual repair happened ca. 6-12 hours after scheduling (on a
>>>> healthy cluster!). I would conclude that (some of) these operations
>>>> have very low priority and will not start at least as long as there
>>>> is recovery going on. One might want to consider the possibility that
>>>> some of the scheduled commands have not been executed yet.
>>>>
>>>> The output of "pg query" contains the IDs of the missing objects (in
>>>> mimic) and each of these objects is on one of the peer OSDs of the PG
>>>> (I think object here refers to shard or copy). It should be possible
>>>> to find the corresponding OSD (or at least obtain confirmation that
>>>> the object is really gone) and move the object to a place where it is
>>>> expected to be found. This can probably be achieved with "PG export"
>>>> and "PG import". I don't know of any other way(s).
>>>>
>>>> I guess, in the current situation, sitting it out a bit longer might
>>>> be a good strategy. I don't know how many asynchronous commands you
>>>> executed and giving the cluster time to complete these jobs might
>>>> improve the situation.
>>>>
>>>> Sorry that I can't be of more help here. However, if you figure out a
>>>> solution (ideally non-destructive), please post it here.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Michael Thomas <wart@xxxxxxxxxxx>
>>>> Sent: 18 September 2020 14:15:53
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re:  multiple OSD crash, unfound objects
>>>>
>>>> Hi Frank,
>>>>
>>>> On 9/18/20 2:50 AM, Frank Schilder wrote:
>>>>> Dear Michael,
>>>>>
>>>>> firstly, I'm a bit confused why you started deleting data. The
>>>>> objects were unfound, but still there. That's a small issue. Now the
>>>>> data might be gone and that's a real issue.
>>>>>
>>>>> ----------------------------
>>>>> Interlude:
>>>>>
>>>>> Anyone reading this: I have seen many threads where ceph admins
>>>>> started deleting objects or PGs or even purging OSDs way too early
>>>>> from a cluster. Trying to recover health by deleting data is a
>>>>> contradiction. Ceph has bugs and sometimes it needs some help
>>>>> finding everything again. As far as I know, for most of these bugs
>>>>> there are workarounds that allow full recovery with a bit of work.
>>>>
>>>> I disagree with the statement that trying to recover health by deleting
>>>> data is a contradiction.  In some cases (such as mine), the data in ceph
>>>> is backed up in another location (eg tape library).  Restoring a few
>>>> files from tape is a simple and cheap operation that takes a minute, at
>>>> most.  For the sake of expediency, sometimes it's quicker and easier to
>>>> simply delete the affected files and restore them from the backup
>>>> system.
>>>>
>>>> This procedure has worked fine with our previous distributed filesystem
>>>> (hdfs), so I (naively?) thought that it could be used with ceph as well.
>>>> I was a bit surprised that ceph's behavior was to indefinitely block
>>>> the 'rm' operation so that the affected file could not even be removed.
>>>>
>>>> Since I have 25 unfound objects spread across 9 PGs, I used a PG with a
>>>> single unfound object to test this alternate recovery procedure.
>>>>
>>>>> First question is, did you delete the entire object or just a shard
>>>>> on one disk? Are there OSDs that might still have a copy?
>>>>
>>>> Per the troubleshooting guide
>>>> (https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
>>>>
>>>> I ran:
>>>>
>>>> ceph pg 7.1fb mark_unfound_lost delete
>>>>
>>>> So I presume that the entire object has been deleted.
>>>>
>>>>> If the object is gone for good, the file references something that
>>>>> doesn't exist - it's like a bad sector. You probably need to delete
>>>>> the file. Bit strange that the operation does not err out with a
>>>>> read error. Maybe it doesn't because it waits for the unfound
>>>>> objects state to be resolved?
>>>>
>>>> Even before the object was removed, all read operations on the file
>>>> would hang.  Even worse, attempts to stat() the file with commands such
>>>> as 'ls' or 'rm' would hang.  Even worse, attempts to 'ls' in the
>>>> directory itself would hang.  This hasn't changed after removing the
>>>> object.
>>>>
>>>> *Update*: The stat() operations may not be hanging indefinitely.  They
>>>> seem to hang for somewhere between 10 minutes and 8 hours.
>>>>
>>>>> For all the other unfound objects, they are there somewhere - you
>>>>> didn't lose a disk or something. Try pushing ceph to scan the
>>>>> correct OSDs, for example, by restarting the newly added OSDs one by
>>>>> one or something similar. Sometimes exporting and importing a PG
>>>>> from one OSD to another forces a re-scan and subsequent discovery of
>>>>> unfound objects. It is also possible that ceph will find these
>>>>> objects along the way of recovery or when OSDs scrub or check for
>>>>> objects that can be deleted.
>>>>
>>>> I have restarted the new OSDs countless times.  I've used three
>>>> different methods to restart the OSD:
>>>>
>>>> * systemctl restart ceph-osd@120
>>>>
>>>> * init 6
>>>>
>>>> * ceph osd out 120
>>>>      ...wait for repeering to finish...
>>>>      systemctl restart ceph-osd@120
>>>>      ceph osd in 120
>>>>
>>>> I've done this for all OSDs that a PG has listed in the 'not queried'
>>>> state in 'ceph pg $pgid query'.  But even when all OSDs in the PG are
>>>> back to the 'already probed' state, the missing objects remain.
>>>>
>>>> Over 90% of my PGs have not been deep scrubbed recently, due to the
>>>> amount of backfilling and importing of data into the ceph cluster.  I
>>>> plan to leave the cluster mostly idle over the weekend so that hopefully
>>>> the deep scrubs can catch up and possibly locate any missing objects.
>>>>
>>>> --Mike
>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Michael Thomas <wart@xxxxxxxxxxx>
>>>>> Sent: 17 September 2020 22:27:47
>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>> Subject: Re:  multiple OSD crash, unfound objects
>>>>>
>>>>> Hi Frank,
>>>>>
>>>>> Yes, it does sound similar to your ticket.
>>>>>
>>>>> I've tried a few things to restore the failed files:
>>>>>
>>>>> * Locate a missing object with 'ceph pg $pgid list_unfound'
>>>>>
>>>>> * Convert the hex oid to a decimal inode number
>>>>>
>>>>> * Identify the affected file with 'find /ceph -inum $inode'
>>>>>
>>>>> At this point, I know which file is affected by the missing object.  As
>>>>> expected, attempts to read the file simply hang.  Unexpectedly, attempts
>>>>> to 'ls' the file or its containing directory also hang.  I presume from
>>>>> this that the stat() system call needs some information that is
>>>>> contained in the missing object, and is waiting for the object to become
>>>>> available.
>>>>>
>>>>> Next I tried to remove the affected object with:
>>>>>
>>>>> * ceph pg $pgid mark_unfound_lost delete
>>>>>
>>>>> Now 'ceph status' shows one fewer missing object, but attempts to 'ls'
>>>>> or 'rm' the affected file continue to hang.
>>>>>
>>>>> Finally, I ran a scrub over the part of the filesystem containing the
>>>>> affected file:
>>>>>
>>>>> ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive
>>>>>
>>>>> Nothing seemed to come up during the scrub:
>>>>>
>>>>> 2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command:
>>>>> scrub status {prefix=scrub status} (starting...)
>>>>> 2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command:
>>>>> scrub start {path=/frames/postO3/hoft,prefix=scrub
>>>>> start,scrubops=[recursive]} (starting...)
>>>>> 2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub summary: active
>>>>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub queued for path: /frames/postO3/hoft
>>>>> 2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub summary: active [paths:/frames/postO3/hoft]
>>>>> 2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command:
>>>>> scrub status {prefix=scrub status} (starting...)
>>>>> 2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command:
>>>>> scrub status {prefix=scrub status} (starting...)
>>>>> 2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub summary: idle
>>>>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
>>>>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub completed for path: /frames/postO3/hoft
>>>>> 2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
>>>>> [INF] : scrub summary: idle
>>>>>
>>>>>
>>>>> After the scrub completed, access to the file (ls or rm) continues to
>>>>> hang.  The MDS reports slow requests:
>>>>>
>>>>> 2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log
>>>>> [WRN] : slow request 481.867381 seconds old, received at
>>>>> 2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309
>>>>> getattr pAsLsXsFs #0x1000005b1c0 2020-09-17T15:03:03.787602-0500
>>>>> caller_uid=0, caller_gid=0{}) currently dispatched
>>>>>
>>>>> Does anyone have any suggestions on how else to clean up from a
>>>>> permanently lost object?
>>>>>
>>>>> --Mike
>>>>>
>>>>> On 9/16/20 2:03 AM, Frank Schilder wrote:
>>>>>> Sounds similar to this one: https://tracker.ceph.com/issues/46847
>>>>>>
>>>>>> If you have or can reconstruct the crush map from before adding the
>>>>>> OSDs, you might be able to discover everything with the temporary
>>>>>> reversal of the crush map method.
>>>>>>
>>>>>> Not sure if there is another method, I never got a reply to my
>>>>>> question in the tracker.
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Michael Thomas <wart@xxxxxxxxxxx>
>>>>>> Sent: 16 September 2020 01:27:19
>>>>>> To: ceph-users@xxxxxxx
>>>>>> Subject:  multiple OSD crash, unfound objects
>>>>>>
>>>>>> Over the weekend I had multiple OSD servers in my Octopus cluster
>>>>>> (15.2.4) crash and reboot at nearly the same time.  The OSDs are part of
>>>>>> an erasure coded pool.  At the time the cluster had been busy with a
>>>>>> long-running (~week) remapping of a large number of PGs after I
>>>>>> incrementally added more OSDs to the cluster.  After bringing all of the
>>>>>> OSDs back up, I have 25 unfound objects and 75 degraded objects.  There
>>>>>> are other problems reported, but I'm primarily concerned with these
>>>>>> unfound/degraded objects.
>>>>>>
>>>>>> The pool with the missing objects is a cephfs pool.  The files stored in
>>>>>> the pool are backed up on tape, so I can easily restore individual files
>>>>>> as needed (though I would not want to restore the entire filesystem).
>>>>>>
>>>>>> I tried following the guide at
>>>>>> https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
>>>>>> I found a number of OSDs that are still 'not queried'.  Restarting a
>>>>>> sampling of these OSDs changed the state from 'not queried' to 'already
>>>>>> probed', but that did not recover any of the unfound or degraded
>>>>>> objects.
>>>>>>
>>>>>> I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
>>>>>> saw them get scrubbed.  I also tried doing a 'ceph pg force-recovery' on
>>>>>> the affected PGs, but only one seems to have been tagged accordingly
>>>>>> (see ceph -s output below).
>>>>>>
>>>>>> The guide also says "Sometimes it simply takes some time for the cluster
>>>>>> to query possible locations."  I'm not sure how long "some time" might
>>>>>> take, but it hasn't changed after several hours.
>>>>>>
>>>>>> My questions are:
>>>>>>
>>>>>> * Is there a way to force the cluster to query the possible locations
>>>>>> sooner?
>>>>>>
>>>>>> * Is it possible to identify the files in cephfs that are affected, so
>>>>>> that I could delete only the affected files and restore them from backup
>>>>>> tapes?
>>>>>>
>>>>>> --Mike
>>>>>>
>>>>>> ceph -s:
>>>>>>
>>>>>>        cluster:
>>>>>>          id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
>>>>>>          health: HEALTH_ERR
>>>>>>                  1 clients failing to respond to capability release
>>>>>>                  1 MDSs report slow requests
>>>>>>                  25/78520351 objects unfound (0.000%)
>>>>>>                  2 nearfull osd(s)
>>>>>>                  Reduced data availability: 1 pg inactive
>>>>>>                  Possible data damage: 9 pgs recovery_unfound
>>>>>>                  Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
>>>>>>                  1013 pgs not deep-scrubbed in time
>>>>>>                  1013 pgs not scrubbed in time
>>>>>>                  2 pool(s) nearfull
>>>>>>                  1 daemons have recently crashed
>>>>>>                  4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.
>>>>>>
>>>>>>        services:
>>>>>>          mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
>>>>>>          mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
>>>>>>          mds: archive:1 {0=ceph4=up:active} 3 up:standby
>>>>>>          osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs
>>>>>>
>>>>>>        task status:
>>>>>>          scrub status:
>>>>>>              mds.ceph4: idle
>>>>>>
>>>>>>        data:
>>>>>>          pools:   9 pools, 2433 pgs
>>>>>>          objects: 78.52M objects, 298 TiB
>>>>>>          usage:   412 TiB used, 545 TiB / 956 TiB avail
>>>>>>          pgs:     0.041% pgs unknown
>>>>>>                   75/626645098 objects degraded (0.000%)
>>>>>>                   135224/626645098 objects misplaced (0.022%)
>>>>>>                   25/78520351 objects unfound (0.000%)
>>>>>>                   2421 active+clean
>>>>>>                   5    active+recovery_unfound+degraded
>>>>>>                   3    active+recovery_unfound+degraded+remapped
>>>>>>                   2    active+clean+scrubbing+deep
>>>>>>                   1    unknown
>>>>>>                   1    active+forced_recovery+recovery_unfound+degraded
>>>>>>
>>>>>>        progress:
>>>>>>          PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
>>>>>>            [............................]
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx