I hope the daemon mds.icadmin011 is running on the same machine where you are looking for /tmp/dump.txt, since the file is created on the system that runs that daemon.

On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:

> Hi Milind,
>
> you are absolutely right.
>
> The dump_ops_in_flight output gives a good hint about what's happening:
> {
>     "ops": [
>         {
>             "description": "internal op exportdir:mds.5:975673",
>             "initiated_at": "2023-05-23T17:49:53.030611+0200",
>             "age": 60596.355186077999,
>             "duration": 60596.355234167997,
>             "type_data": {
>                 "flag_point": "failed to wrlock, waiting",
>                 "reqid": "mds.5:975673",
>                 "op_type": "internal_op",
>                 "internal_op": 5377,
>                 "op_name": "exportdir",
>                 "events": [
>                     {
>                         "time": "2023-05-23T17:49:53.030611+0200",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.030611+0200",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.030611+0200",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.030611+0200",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.030611+0200",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.030657+0200",
>                         "event": "requesting remote authpins"
>                     },
>                     {
>                         "time": "2023-05-23T17:49:53.050253+0200",
>                         "event": "failed to wrlock, waiting"
>                     }
>                 ]
>             }
>         }
>     ],
>     "num_ops": 1
> }
>
> However, dump cache does not seem to produce any output:
> root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
> root@icadmin011:~# ls /tmp
> ssh-cHvP3iF611
> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i
>
> Do you have any hint?
>
> Best,
>
> Emmanuel
>
> On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>
>> Emmanuel,
>> You probably missed the "daemon" keyword after the "ceph" command name.
>> Here are the docs for Pacific:
>> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>>
>> So, your command should've been:
>> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>>
>> You could also dump the ops in flight with:
>> # ceph daemon mds.icadmin011 dump_ops_in_flight
>>
>> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>
>> > Hi,
>> >
>> > we are running a CephFS cluster with the following version:
>> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>> >
>> > Several MDSs are reporting slow requests:
>> > HEALTH_WARN 4 MDSs report slow requests
>> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
>> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
>> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
>> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
>> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
>> >
>> > According to the Quincy documentation (https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can be investigated by issuing:
>> > ceph mds.icadmin011 dump cache /tmp/dump.txt
>> >
>> > Unfortunately, this command fails:
>> > no valid command found; 10 closest matches:
>> > pg stat
>> > pg getmap
>> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>> > pg dump_json [all|summary|sum|pools|osds|pgs...]
>> > pg dump_pools_json
>> > pg ls-by-pool <poolstr> [<states>...]
>> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>> > pg ls [<pool:int>] [<states>...]
>> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
>> > Error EINVAL: invalid command
>> >
>> > I imagine this is because we are running Pacific rather than Quincy.
>> >
>> > Looking at the Pacific documentation (https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
>> > > Use the ops admin socket command to list outstanding metadata operations.
>> >
>> > Unfortunately, I don't really understand what I'm supposed to do. Can someone give me a pointer?
>> >
>> > Best,
>> >
>> > Emmanuel
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> --
>> Milind
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Milind
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
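
[Editor's note] A minimal sketch of the check suggested at the top of this thread, assuming only stock Ceph CLI commands; the --cluster floki flag and the daemon name icadmin011 are taken from the messages above, and the target host is whatever the first command reports:

First, confirm which host is actually running the daemon mds.icadmin011 (the daemon metadata includes its hostname):
# ceph --cluster floki mds metadata icadmin011 | grep hostname

Then, on that host, query the admin socket locally. The dump file is written to /tmp on the machine running the daemon, not on the machine where the command is typed:
# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
# ls -l /tmp/dump.txt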