Re: multiple OSD crash, unfound objects

Hi Frank,

That was a good tip. I was able to move the broken files out of the way and restore them for users. However, after two weeks I'm still left with unfound objects. Even more annoyingly, I now have 82k degraded objects (up from 74), a count that hasn't changed in over a week.

I'm ready to conclude that Ceph's auto-repair capabilities are unable to fix my particular issues, and I will have to continue investigating alternate ways to clean this up, including a PG export/import (as you suggested) and perhaps an MDS backward scrub (after testing in a junk pool first).

I have other tasks I need to perform on the filesystem (removing OSDs, adding new OSDs, increasing PG count), but I feel like I need to address these degraded/lost objects before risking any more damage.

One particular PG is in a curious state:

7.39d 82163 82165 246734 1 344060777807 0 0 2139 active+recovery_unfound+undersized+degraded+remapped 23m 50755'112549 50766:960500 [116,72,122,48,45,131,73,81]p116 [71,109,99,48,45,90,73,NONE]p71 2020-08-13T23:02:34.325887-0500 2020-08-07T11:01:45.657036-0500

Note the 'NONE' in the acting set. I do not know which OSD this may have been, nor how to find out. I suspect (without evidence) that this is part of the reason no progress is being made on the degraded and misplaced objects.
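One avenue I haven't fully explored: 'pg query' keeps peering history that may name the OSD behind that NONE slot. A sketch (assumes jq is installed; field names are from octopus-era 'pg query' output and may differ per release):

```shell
# Dump the PG's query output once, then dig through its recovery state.
ceph pg 7.39d query > pg-7.39d.json

# OSDs the PG thinks might still hold unfound objects, with probe status:
jq -r '.recovery_state[] | .might_have_unfound? // [] | .[]
       | "\(.osd)\t\(.status)"' pg-7.39d.json

# Down OSDs the PG would probe if they came back - a likely home of the
# missing shard:
jq -r '.recovery_state[] | .down_osds_we_would_probe? // empty' pg-7.39d.json
```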

--Mike

On 9/18/20 11:26 AM, Frank Schilder wrote:
Dear Michael,

maybe there is a way to restore access for users and solve the issues later. Someone else with a lost/unfound object was able to move the affected file (or directory containing the file) to a separate location and restore the now missing data from backup. This will "park" the problem of cluster health for later fixing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 18 September 2020 15:38:51
To: Michael Thomas; ceph-users@xxxxxxx
Subject:  Re: multiple OSD crash, unfound objects

Dear Michael,

I disagree with the statement that trying to recover health by deleting
data is a contradiction.  In some cases (such as mine), the data in ceph
is backed up in another location (eg tape library).  Restoring a few
files from tape is a simple and cheap operation that takes a minute, at
most.

I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a RAID set. Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself.

The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls.
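To illustrate the difference, here is a sketch on a throwaway directory (a stand-in for the affected CephFS dir; on a real broken directory you would substitute the actual path):

```shell
# Create a demo dir with two entries (stand-ins for a good and a broken file).
demo=$(mktemp -d)
touch "$demo/good" "$demo/broken"

# Names only: this is a readdir() without a per-entry stat(), so it should
# not hang on a file whose objects are unfound.  -f also disables sorting.
ls --color=never -f "$demo"

# By contrast, 'ls -l "$demo"' stats every entry and could block on the
# broken file.
rm -r "$demo"
```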

An "rm" or "rm -f" should succeed if the folder permissions allow it. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave?
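For reference, a quick demo of unlink(1) on a throwaway file (a stand-in for the affected CephFS path); it should issue only the unlink() syscall on the name, without prompting and without stat()ing the file first the way 'rm -i' does:

```shell
# Create and immediately unlink a scratch file.
f=$(mktemp)
unlink "$f"
# Confirm the directory entry is gone.
test ! -e "$f" && echo "unlinked"
```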

Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" probably does just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that (some of) these operations have very low priority and will not start as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet.
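One indirect way to check is to compare the scrub timestamps against the time you issued the command. A sketch (output field names vary a bit by release):

```shell
# If last_scrub_stamp / last_deep_scrub_stamp predate the 'pg scrub'
# request, the scheduled scrub has not actually run yet.
ceph pg 7.1fb query | grep -E 'last_(deep_)?scrub_stamp'

# PGs whose scrub is actually in progress right now:
ceph pg ls scrubbing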

The output of "pg query" contains the IDs of the missing objects (in mimic) and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s).
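Mechanically, the export/import route would look roughly like this (a sketch, not commands from this thread: the OSD must be stopped while ceph-objectstore-tool touches its store, for EC pools the pgid needs the shard suffix, e.g. 7.39ds0, and an import fails if the PG already exists on the target, so test on a junk pool first):

```shell
# Export the PG's data from a stopped source OSD (osd.71 as an example).
systemctl stop ceph-osd@71
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-71 \
    --pgid 7.39d --op export --file /root/pg-7.39d.export
systemctl start ceph-osd@71

# Import it on a stopped target OSD (osd.116 as an example), which can
# force a re-scan and discovery of the "unfound" objects.
systemctl stop ceph-osd@116
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-116 \
    --pgid 7.39d --op import --file /root/pg-7.39d.export
systemctl start ceph-osd@116
```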

I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you have executed; giving the cluster time to complete these jobs might improve the situation.

Sorry that I can't be of more help here. However, if you figure out a solution (ideally non-destructive), please post it here.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 18 September 2020 14:15:53
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  multiple OSD crash, unfound objects

Hi Frank,

On 9/18/20 2:50 AM, Frank Schilder wrote:
Dear Michael,

firstly, I'm a bit confused why you started deleting data. The objects were unfound, but still there. That's a small issue. Now the data might be gone and that's a real issue.

----------------------------
Interval:

Anyone reading this: I have seen many threads where ceph admins started deleting objects or PGs or even purging OSDs way too early from a cluster. Trying to recover health by deleting data is a contradiction. Ceph has bugs and sometimes it needs some help finding everything again. As far as I know, for most of these bugs there are workarounds that allow full recovery with a bit of work.

I disagree with the statement that trying to recover health by deleting
data is a contradiction.  In some cases (such as mine), the data in ceph
is backed up in another location (eg tape library).  Restoring a few
files from tape is a simple and cheap operation that takes a minute, at
most.  For the sake of expediency, sometimes it's quicker and easier to
simply delete the affected files and restore them from the backup system.

This procedure has worked fine with our previous distributed filesystem
(HDFS), so I (naively?) thought that it could be used with ceph as well.
I was a bit surprised that Ceph's behavior was to block the 'rm'
operation indefinitely, so that the affected file could not even be removed.

Since I have 25 unfound objects spread across 9 PGs, I used a PG with a
single unfound object to test this alternate recovery procedure.

First question is, did you delete the entire object or just a shard on one disk? Are there OSDs that might still have a copy?

Per the troubleshooting guide
(https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
I ran:

ceph pg 7.1fb mark_unfound_lost delete

So I presume that the entire object has been deleted.

If the object is gone for good, the file references something that doesn't exist - it's like a bad sector. You probably need to delete the file. Bit strange that the operation does not err out with a read error. Maybe it doesn't because it waits for the unfound objects state to be resolved?

Even before the object was removed, all read operations on the file
would hang.  Worse, attempts to stat() the file with commands such as
'ls' or 'rm' would hang.  Worse still, attempts to 'ls' in the directory
itself would hang.  This hasn't changed after removing the object.

*Update*: The stat() operations may not be hanging indefinitely.  It
seems to hang for somewhere between 10 minutes and 8 hours.

For all the other unfound objects, they are there somewhere - you didn't lose a disk or something. Try pushing ceph to scan the correct OSDs, for example, by restarting the newly added OSDs one by one or something similar. Sometimes exporting and importing a PG from one OSD to another forces a re-scan and subsequent discovery of unfound objects. It is also possible that ceph will find these objects along the way of recovery or when OSDs scrub or check for objects that can be deleted.

I have restarted the new OSDs countless times.  I've used three
different methods to restart the OSD:

* systemctl restart ceph-osd@120

* init 6

* ceph osd out 120
    ...wait for re-peering to finish...
    systemctl restart ceph-osd@120
    ceph osd in 120

I've done this for all OSDs that a PG has listed in the 'not queried'
state in 'ceph pg $pgid detail'.  But even when all OSDs in the PG are
back to the 'already probed' state, the missing objects remain.
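To make sure no peer gets missed, the restarts could be driven directly off the probe states reported by 'pg query'. A sketch (assumes jq; field names are from octopus output, and the EC shard suffix on the osd field is an assumption on my part):

```shell
# Restart every peer OSD that the PG still reports as 'not queried'.
pgid=7.1fb
for o in $(ceph pg "$pgid" query |
           jq -r '.recovery_state[] | .might_have_unfound? // [] | .[]
                  | select(.status=="not queried") | (.osd|tostring)'); do
    o=${o%%(*}                      # drop an EC shard suffix like "45(4)"
    echo "restarting osd.$o"
    systemctl restart "ceph-osd@$o"
done
```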

Over 90% of my PGs have not been deep scrubbed recently, due to the
amount of backfilling and importing of data into the ceph cluster.  I
plan to leave the cluster mostly idle over the weekend so that hopefully
the deep scrubs can catch up and possibly locate any missing objects.

--Mike

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 17 September 2020 22:27:47
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  multiple OSD crash, unfound objects

Hi Frank,

Yes, it does sound similar to your ticket.

I've tried a few things to restore the failed files:

* Locate a missing object with 'ceph pg $pgid list_unfound'

* Convert the hex oid to a decimal inode number

* Identify the affected file with 'find /ceph -inum $inode'
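In script form, the conversion looks like this (the oid is an example matching the inode that later shows up in the MDS slow-request message, not a full list_unfound entry):

```shell
# From unfound object name to CephFS inode: the part before the '.' is
# the inode number in hex; the suffix is the stripe/object index.
oid="1000005b1c0.00000000"          # example entry from 'list_unfound'
inode_hex=${oid%%.*}                # -> 1000005b1c0
inode_dec=$(printf '%d' "0x$inode_hex")
echo "$inode_dec"                   # -> 1099512000960
# find /ceph -inum "$inode_dec"     # then locate the file (can be slow)
```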

At this point, I know which file is affected by the missing object.  As
expected, attempts to read the file simply hang.  Unexpectedly, attempts
to 'ls' the file or its containing directory also hang.  I presume from
this that the stat() system call needs some information that is
contained in the missing object, and is waiting for the object to become
available.

Next I tried to remove the affected object with:

* ceph pg $pgid mark_unfound_lost delete

Now 'ceph status' shows one fewer missing objects, but attempts to 'ls'
or 'rm' the affected file continue to hang.

Finally, I ran a scrub over the part of the filesystem containing the
affected file:

ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive

Nothing seemed to come up during the scrub:

2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub start {path=/frames/postO3/hoft,prefix=scrub
start,scrubops=[recursive]} (starting...)
2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: active
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub queued for path: /frames/postO3/hoft
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: active [paths:/frames/postO3/hoft]
2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command:
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: idle
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub completed for path: /frames/postO3/hoft
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log
[INF] : scrub summary: idle


After the scrub completed, access to the file (ls or rm) continue to
hang.  The MDS reports slow reads:

2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log
[WRN] : slow request 481.867381 seconds old, received at
2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309
getattr pAsLsXsFs #0x1000005b1c0 2020-09-17T15:03:03.787602-0500
caller_uid=0, caller_gid=0{}) currently dispatched

Does anyone have any suggestions on how else to clean up from a
permanently lost object?

--Mike

On 9/16/20 2:03 AM, Frank Schilder wrote:
Sounds similar to this one: https://tracker.ceph.com/issues/46847

If you have or can reconstruct the crush map from before adding the OSDs, you might be able to discover everything with the temporary reversal of the crush map method.
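Mechanically, the reversal would be something like this (a sketch; the actual edit to crush.txt has to recreate the pre-expansion topology, which only works if you know or saved the old map):

```shell
ceph osd getcrushmap -o crush.bin      # current map (keep as rollback copy)
crushtool -d crush.bin -o crush.txt    # decompile to editable text
# ... edit crush.txt to restore the pre-expansion buckets/items ...
crushtool -c crush.txt -o crush-old.bin
ceph osd setcrushmap -i crush-old.bin  # apply; roll back with crush.bin
```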

Not sure if there is another method; I never got a reply to my question in the tracker.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 16 September 2020 01:27:19
To: ceph-users@xxxxxxx
Subject:  multiple OSD crash, unfound objects

Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time.  The OSDs are part of
an erasure coded pool.  At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster.  After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects.  There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.

The pool with the missing objects is a cephfs pool.  The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).

I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
I found a number of OSDs that are still 'not queried'.  Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.

I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed.  I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).

The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations."  I'm not sure how long "some time" might
take, but it hasn't changed after several hours.

My questions are:

* Is there a way to force the cluster to query the possible locations
sooner?

* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes?

--Mike

ceph -s:

      cluster:
        id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
        health: HEALTH_ERR
                1 clients failing to respond to capability release
                1 MDSs report slow requests
                25/78520351 objects unfound (0.000%)
                2 nearfull osd(s)
                Reduced data availability: 1 pg inactive
                Possible data damage: 9 pgs recovery_unfound
                Degraded data redundancy: 75/626645098 objects degraded
(0.000%), 9 pgs degraded
                1013 pgs not deep-scrubbed in time
                1013 pgs not scrubbed in time
                2 pool(s) nearfull
                1 daemons have recently crashed
                4 slow ops, oldest one blocked for 77939 sec, daemons
[osd.0,osd.41] have slow ops.

      services:
        mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
        mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
        mds: archive:1 {0=ceph4=up:active} 3 up:standby
        osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

      task status:
        scrub status:
            mds.ceph4: idle

      data:
        pools:   9 pools, 2433 pgs
        objects: 78.52M objects, 298 TiB
        usage:   412 TiB used, 545 TiB / 956 TiB avail
        pgs:     0.041% pgs unknown
                 75/626645098 objects degraded (0.000%)
                 135224/626645098 objects misplaced (0.022%)
                 25/78520351 objects unfound (0.000%)
                 2421 active+clean
                 5    active+recovery_unfound+degraded
                 3    active+recovery_unfound+degraded+remapped
                 2    active+clean+scrubbing+deep
                 1    unknown
                 1    active+forced_recovery+recovery_unfound+degraded

      progress:
        PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
          [............................]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

