Re: multiple OSD crash, unfound objects

Frank Schilder <frans@xxxxxx> · Thu, 22 Oct 2020 07:32:07 +0000

Sounds good. Did you re-create the pool? If not, please do so, to give the devicehealth manager module its storage. If you don't see any IO, it might be necessary to restart the MGR to flush out a stale rados connection. I would probably give the pool 10 PGs instead of 1, but that's up to you.
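
A rough sketch of what I mean, not a tested procedure. It assumes the pool name device_health_metrics used elsewhere in this thread and an MGR name you pass in yourself (ceph status shows the active one). The helper names recreate_health_pool and fail_mgr are made up for illustration. Failing the active MGR over to a standby is just one way to get a fresh rados connection; restarting the daemon should do as well.

```shell
# Sketch only: re-create the health metrics pool and fail over the MGR.
# Assumptions: the pool name defaults to device_health_metrics, 10 PGs is
# only a preference, and both function names are hypothetical helpers.

recreate_health_pool() {
    pool="${1:-device_health_metrics}"
    pg_num="${2:-10}"
    # pg_num and pgp_num are set to the same value
    ceph osd pool create "$pool" "$pg_num" "$pg_num"
}

fail_mgr() {
    # $1: name of the active MGR daemon, as shown by "ceph status"
    [ -n "$1" ] || { echo "usage: fail_mgr MGR-NAME" >&2; return 1; }
    ceph mgr fail "$1"
}
```

For example: recreate_health_pool, then fail_mgr MGR-NAME if the module shows no IO afterwards.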

I hope to find time today to look at the incomplete PG.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
> Hi Michael,
>
> some quick thoughts.
>
>
> That you can create a pool with 1 PG is a good sign; the crush rule is OK. The fact that pg query says it doesn't have PG 1.0 points in the right direction: there is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 is:
>
>    # osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.
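
For a single PG the dump can be filtered down to one line. This is
only a sketch: it assumes each PG line of the dump starts with the PG
id as its first field, and pg_mapping is a made-up helper name.

```shell
# Sketch only: show the mapping the extracted osdmap holds for one PG.
# Assumption: each PG line of --test-map-pgs-dump starts with the PG id
# as its first whitespace-separated field.
# pg_mapping is a hypothetical helper name.
pg_mapping() {
    map_file="$1"
    pgid="$2"
    pool="${pgid%%.*}"    # pool id is the part before the dot, e.g. 1 for 1.0
    osdmaptool "$map_file" --test-map-pgs-dump --pool "$pool" |
        # append "" to force a string comparison, so 1.1 does not match 1.10
        awk -v pg="$pgid" '($1 "") == (pg "")'
}
```

Usage would be: pg_mapping osd.map 1.0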

> or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm.
>
> In principle, it looks like deleting and recreating the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes, and some surgery on some OSDs was necessary. However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.

> In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is
>
>    # ceph daemon mon.MON-ID sessions
>
> that needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module.
>
> If you google "device_health_metric pool" you should find descriptions of similar cases. It looks solvable.

Unfortunately, in Octopus you cannot disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.

> I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

> The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.
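
A small sketch of the check I plan to repeat. It assumes the slow ops
still show up under the SLOW_OPS health code, as in the health report
quoted below; slow_ops_lines is a made-up helper name.

```shell
# Sketch only: print the SLOW_OPS lines from the health report, if any.
# Assumption: slow ops are listed in "ceph health detail" under the
# SLOW_OPS health code. slow_ops_lines is a hypothetical helper name.
slow_ops_lines() {
    ceph health detail | grep 'SLOW_OPS' || echo "no SLOW_OPS reported"
}
```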

Regards,

--Mike
________________________________________
> From: Michael Thomas <wart@xxxxxxxxxxx>
> Sent: 20 October 2020 23:48:36
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re:  Re: multiple OSD crash, unfound objects
>
> On 10/20/20 1:18 PM, Frank Schilder wrote:
>> Dear Michael,
>>
>>>> Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping?
>>
>> I meant here with crush rule replicated_host_nvme. Sorry, forgot.
>
> Seems to have worked fine:
>
> https://pastebin.com/PFgDE4J1
>
>>> Yes, the OSD was still out when the previous health report was created.
>>
>> Hmm, this is odd. If this is correct, then it did report a slow op even though it was out of the cluster:
>>
>>> from https://pastebin.com/3G3ij9ui:
>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have slow ops.
>>
>> Not sure what to make of that. It looks almost like you have a ghost osd.41.
>>
>>
>> I think (some of) the slow ops you are seeing are directed to the health_metrics pool and can be ignored. If it is too annoying, you could try to find out who runs the client with ID client.7524484 and disable it. Might be an MGR module.
>
> I'm also pretty certain that the slow ops are related to the health
> metrics pool, which is why I've been ignoring them.
>
> What I'm not sure about is whether re-creating the device_health_metrics
> pool will cause any problems in the ceph cluster.
>
>> Looking at the data you provided and also some older threads of yours (https://www.mail-archive.com/ceph-users@xxxxxxx/msg05842.html), I am starting to think that we are looking at the fall-out of a past admin operation. One possibility is that an upmap for PG 1.0 exists that conflicts with the crush rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. The result is an empty set.
>
> So far I've been unable to locate the client with the ID 7524484.  It's
> not showing up in the manager dashboard -> Filesystems page, nor in the
> output of 'ceph tell mds.ceph1 client ls'.
>
> I'm digging through the compressed logs for the past week to see if I can
> find the culprit.
>
>> I couldn't really find a simple command to list upmaps. The only non-destructive way seems to be to extract the osdmap and create a clean-up command file. The clean-up file should contain a command for every PG with an upmap. To check this, you can execute (see also https://docs.ceph.com/en/latest/man/8/osdmaptool/)
>>
>>     # ceph osd getmap > osd.map
>>     # osdmaptool osd.map --upmap-cleanup cleanup.cmd
>>
>> If you do this, could you please post as usual the contents of cleanup.cmd?
>
> It was empty:
>
> [root@ceph1 ~]# ceph osd getmap > osd.map
> got osdmap epoch 52833
>
> [root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
> osdmaptool: osdmap file 'osd.map'
> writing upmap command output to: cleanup.cmd
> checking for upmap cleanups
>
> [root@ceph1 ~]# wc cleanup.cmd
> 0 0 0 cleanup.cmd
>
>> Also, with the OSD map of your cluster, you can simulate certain admin operations and check resulting PG mappings for pools and other things without having to touch the cluster; see https://docs.ceph.com/en/latest/man/8/osdmaptool/.
>>
>>
>> To dig a little bit deeper, could you please post as usual the output of:
>>
>> - ceph pg 1.0 query
>> - ceph pg 7.39d query
>
> Oddly, it claims that it doesn't have pgid 1.0.
>
> https://pastebin.com/pHh33Dq7
>
>> It would also be helpful if you could post the decoded crush map. You can get the map as a txt-file as follows:
>>
>>     # ceph osd getcrushmap -o crush-orig.bin
>>     # crushtool -d crush-orig.bin -o crush.txt
>>
>> and post the contents of file crush.txt.
>
> https://pastebin.com/EtEGpWy3
>
>> Did the slow MDS request complete by now?
>
> Nope.
>
> --Mike
>