Hi Milind,

you are absolutely right. The dump_ops_in_flight output gives a good hint
about what's happening:

{
    "ops": [
        {
            "description": "internal op exportdir:mds.5:975673",
            "initiated_at": "2023-05-23T17:49:53.030611+0200",
            "age": 60596.355186077999,
            "duration": 60596.355234167997,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": "mds.5:975673",
                "op_type": "internal_op",
                "internal_op": 5377,
                "op_name": "exportdir",
                "events": [
                    {
                        "time": "2023-05-23T17:49:53.030611+0200",
                        "event": "initiated"
                    },
                    {
                        "time": "2023-05-23T17:49:53.030611+0200",
                        "event": "throttled"
                    },
                    {
                        "time": "2023-05-23T17:49:53.030611+0200",
                        "event": "header_read"
                    },
                    {
                        "time": "2023-05-23T17:49:53.030611+0200",
                        "event": "all_read"
                    },
                    {
                        "time": "2023-05-23T17:49:53.030611+0200",
                        "event": "dispatched"
                    },
                    {
                        "time": "2023-05-23T17:49:53.030657+0200",
                        "event": "requesting remote authpins"
                    },
                    {
                        "time": "2023-05-23T17:49:53.050253+0200",
                        "event": "failed to wrlock, waiting"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}

However, the dump cache command does not seem to produce any output:

root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
root@icadmin011:~# ls /tmp
ssh-cHvP3iF611
systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i

Do you have any hint?

Best,

Emmanuel

On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:

> Emmanuel,
> You probably missed the "daemon" keyword after the "ceph" command name.
> Here's the docs for pacific:
> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>
> So, your command should've been:
> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>
> You could also dump the ops in flight with:
> # ceph daemon mds.icadmin011 dump_ops_in_flight
>
> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx>
> wrote:
>
> > Hi,
> >
> > we are running a cephfs cluster with the following version:
> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific
> > (stable)
> >
> > Several MDSs are reporting slow requests:
> > HEALTH_WARN 4 MDSs report slow requests
> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
> >
> > According to Quincy's documentation (
> > https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can be
> > investigated by issuing:
> > ceph mds.icadmin011 dump cache /tmp/dump.txt
> >
> > Unfortunately, this command fails:
> > no valid command found; 10 closest matches:
> > pg stat
> > pg getmap
> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
> > pg dump_json [all|summary|sum|pools|osds|pgs...]
> > pg dump_pools_json
> > pg ls-by-pool <poolstr> [<states>...]
> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
> > pg ls [<pool:int>] [<states>...]
> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...]
> > [<threshold:int>]
> > Error EINVAL: invalid command
> >
> > I imagine that it is related to the fact that we are running the Pacific
> > version and not the Quincy version.
> >
> > When looking at the Pacific documentation (
> > https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
> > > Use the ops admin socket command to list outstanding metadata
> > > operations.
> >
> > Unfortunately, I fail to really understand what I'm supposed to do. Can
> > someone give a pointer?
> >
> > Best,
> >
> > Emmanuel
>
> --
> Milind
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
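
The "ops" admin socket command mentioned in the Pacific health-messages page
is the same admin-socket interface Milind describes above. A rough sketch of
the commands involved, run as root on the host where the MDS daemon runs,
assuming the daemon name mds.icadmin011 from this thread (add
"--cluster floki" after "ceph", as shown earlier, when the cluster name is
not the default):

# list outstanding metadata operations, as referenced by the
# health-messages page
ceph daemon mds.icadmin011 ops

# the command Milind suggested above shows the same kind of information
ceph daemon mds.icadmin011 dump_ops_in_flight

# show only the operations blocked longer than the complaint threshold
ceph daemon mds.icadmin011 dump_blocked_ops

# dump the MDS metadata cache to a file; the file is opened by the MDS
# process itself, so the path is resolved on the MDS host
ceph daemon mds.icadmin011 dump cache /tmp/dump.txt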
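
One possible lead on the still-open question at the top of the thread (the
dump file that never appears in /tmp), not verified: the /tmp listing there
shows a systemd-private-*-ceph-mds@icadmin011.service-* directory, which
suggests the ceph-mds unit runs with systemd PrivateTmp=true. If so, the MDS
would have written /tmp/dump.txt into its own private /tmp rather than the
host's. A sketch of how to check; /var/log/ceph/mds-cache-dump.txt is only an
example of a destination outside /tmp that the ceph user can normally write
to:

# look for the dump inside the unit's private tmp directory (as root)
ls /tmp/systemd-private-*-ceph-mds@icadmin011.service-*/tmp/

# or have the MDS write the dump to a path outside /tmp
ceph --cluster floki daemon mds.icadmin011 dump cache /var/log/ceph/mds-cache-dump.txt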