Re: multiple OSD crash, unfound objects

Michael Thomas <wart@xxxxxxxxxxx> · Wed, 21 Oct 2020 15:58:47 -0500

On 10/21/20 6:47 AM, Frank Schilder wrote:
Hi Michael,

some quick thoughts.

That you can create a pool with 1 PG is a good sign: the crush rule is OK. The fact that pg query says it doesn't have PG 1.0 points in the right direction: there is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 is:

   # osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes and some surgery on some OSDs was necessary. However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed 
on new disks.

In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is

   # ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module.
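
A minimal sketch of how the per-monitor session dumps could be searched for a single client such as client.7524484. It assumes that the dump contains the client name as plain text; the function name find_client_sessions is hypothetical.

```shell
# Print only the session entries that mention the given client name.
# Reads the output of "ceph daemon mon.MON-ID sessions" on stdin.
# -F: treat the name as a fixed string; -w: match it as a whole word,
# so client.7524484 does not also match client.75244841.
find_client_sessions() {
    grep -Fw -- "$1"
}
```

Possible usage, assuming hypothetical mon hosts mon1..mon3 whose mon ID equals the host name:

   # for h in mon1 mon2 mon3; do ssh "$h" ceph daemon "mon.$h" sessions | find_client_sessions client.7524484; done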

If you google "device_health_metric pool" you should find descriptions of similar cases. It looks solvable.

Unfortunately, in Octopus you cannot disable the devicehealth manager 
module, and the manager is required for operation.  So I just went ahead 
and removed the pool with everything still running.  Fortunately, this 
did not appear to cause any problems, and the single unknown PG has 
disappeared from the ceph health output.

I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow 
OPs, but according to the mds logs the offending ops were established 
before the client was rebooted, and the offending client session (now 
defunct) has been blacklisted.  I'll check back later to see if the slow 
OPS get cleared from 'ceph status'.

Regards,

--Mike
________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:
Dear Michael,

Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping?

I meant here with crush rule replicated_host_nvme. Sorry, forgot.

Seems to have worked fine:

https://pastebin.com/PFgDE4J1

Yes, the OSD was still out when the previous health report was created.

Hmm, this is odd. If this is correct, then it did report a slow op even though it was out of the cluster:

from https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have slow ops.

Not sure what to make of that. It looks almost like you have a ghost osd.41.

I think (some of) the slow ops you are seeing are directed to the health_metrics pool and can be ignored. If it is too annoying, you could try to find out who runs the client with ID client.7524484 and disable it. Might be an MGR module.

I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.

What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.

Looking at the data you provided and also some older threads of yours (https://www.mail-archive.com/ceph-users@xxxxxxx/msg05842.html), I am starting to think that we are looking at the fallout of a past admin operation. One possibility is that an upmap for PG 1.0 exists that conflicts with the crush rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. The result is an empty set.

So far I've been unable to locate the client with the ID 7524484.  It's
not showing up in the manager dashboard -> Filesystems page, nor in the
output of 'ceph tell mds.ceph1 client ls'.

I'm digging through the compressed logs for the past week to see if I can
find the culprit.

I couldn't really find a simple command to list upmaps. The only non-destructive way seems to be to extract the osdmap and create a clean-up command file. The clean-up file should contain a command for every PG with an upmap. To check this, you can execute (see also https://docs.ceph.com/en/latest/man/8/osdmaptool/):

    # ceph osd getmap > osd.map
    # osdmaptool osd.map --upmap-cleanup cleanup.cmd

If you do this, could you please post as usual the contents of cleanup.cmd?

It was empty:

[root@ceph1 ~]# ceph osd getmap > osd.map
got osdmap epoch 52833

[root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
osdmaptool: osdmap file 'osd.map'
writing upmap command output to: cleanup.cmd
checking for upmap cleanups

[root@ceph1 ~]# wc cleanup.cmd
0 0 0 cleanup.cmd
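
As an independent cross-check, the plain-text output of 'ceph osd dump' should list any existing upmap entries on lines beginning with pg_upmap or pg_upmap_items. A minimal sketch, assuming that line format; the function name list_upmaps is hypothetical:

```shell
# Keep only the upmap entries from "ceph osd dump" output read on stdin.
# Assumes entries appear one per line, starting with "pg_upmap " or
# "pg_upmap_items " followed by the PG id.
# "|| true" keeps an empty result from being treated as an error.
list_upmaps() {
    grep -E '^pg_upmap(_items)? ' || true
}
```

Possible usage:

    # ceph osd dump | list_upmaps

No output would agree with the empty cleanup.cmd above.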

Also, with the OSD map of your cluster, you can simulate certain admin operations and check resulting PG mappings for pools and other things without having to touch the cluster; see https://docs.ceph.com/en/latest/man/8/osdmaptool/.

To dig a little bit deeper, could you please post as usual the output of:

- ceph pg 1.0 query
- ceph pg 7.39d query

Oddly, it claims that it doesn't have pgid 1.0.

https://pastebin.com/pHh33Dq7

It would also be helpful if you could post the decoded crush map. You can get the map as a txt-file as follows:

    # ceph osd getcrushmap -o crush-orig.bin
    # crushtool -d crush-orig.bin -o crush.txt

and post the contents of file crush.txt.

https://pastebin.com/EtEGpWy3
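
To look at a single rule such as replicated_host_nvme in the decoded map, the rule block can be cut out of crush.txt. A minimal sketch, assuming the usual decompiled layout in which each rule starts with a line 'rule NAME {' and ends with a line containing only '}'; the function name show_crush_rule is hypothetical:

```shell
# Print the block of one named rule from a decompiled crush map on stdin.
show_crush_rule() {
    awk -v name="$1" '
        $1 == "rule" && $2 == name { inside = 1 }   # start of the wanted rule
        inside { print }                            # copy lines of the block
        inside && $1 == "}" { inside = 0 }          # closing brace ends it
    '
}
```

Possible usage:

    # show_crush_rule replicated_host_nvme < crush.txt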

Did the slow MDS request complete by now?

Nope.

--Mike
