Absolutely! :-)

root@icadmin011:/tmp# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
root@icadmin011:/tmp# ll
total 48
drwxrwxrwt 12 root root 4096 May 24 13:23 ./
drwxr-xr-x 18 root root 4096 Jun  9  2022 ../
drwxrwxrwt  2 root root 4096 May  4 12:43 .ICE-unix/
drwxrwxrwt  2 root root 4096 May  4 12:43 .Test-unix/
drwxrwxrwt  2 root root 4096 May  4 12:43 .X11-unix/
drwxrwxrwt  2 root root 4096 May  4 12:43 .XIM-unix/
drwxrwxrwt  2 root root 4096 May  4 12:43 .font-unix/
drwx------  2 root root 4096 May 24 13:23 ssh-Sl5AiotnXp/
drwx------  3 root root 4096 May  8 13:26 'systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf '/
drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi/
drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f/
drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i/

On Wed, May 24, 2023 at 1:17 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:

> I hope the daemon mds.icadmin011 is running on the same machine on which
> you are looking for /tmp/dump.txt, since the file is created on the system
> which has that daemon running.
>
> On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>
>> Hi Milind,
>>
>> you are absolutely right.
>>
>> The dump_ops_in_flight output is giving a good hint about what's happening:
>> {
>>     "ops": [
>>         {
>>             "description": "internal op exportdir:mds.5:975673",
>>             "initiated_at": "2023-05-23T17:49:53.030611+0200",
>>             "age": 60596.355186077999,
>>             "duration": 60596.355234167997,
>>             "type_data": {
>>                 "flag_point": "failed to wrlock, waiting",
>>                 "reqid": "mds.5:975673",
>>                 "op_type": "internal_op",
>>                 "internal_op": 5377,
>>                 "op_name": "exportdir",
>>                 "events": [
>>                     {
>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>                         "event": "initiated"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>                         "event": "throttled"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>                         "event": "header_read"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>                         "event": "all_read"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>                         "event": "dispatched"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.030657+0200",
>>                         "event": "requesting remote authpins"
>>                     },
>>                     {
>>                         "time": "2023-05-23T17:49:53.050253+0200",
>>                         "event": "failed to wrlock, waiting"
>>                     }
>>                 ]
>>             }
>>         }
>>     ],
>>     "num_ops": 1
>> }
>>
>> However, the dump cache command does not seem to produce any output:
>> root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>> root@icadmin011:~# ls /tmp
>> ssh-cHvP3iF611
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i
>>
>> Do you have any hint?
>>
>> Best,
>>
>> Emmanuel
>>
>> On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>
>>> Emmanuel,
>>> You probably missed the "daemon" keyword after the "ceph" command name.
>>> Here are the docs for Pacific:
>>> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>>>
>>> So, your command should've been:
>>> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>
>>> You could also dump the ops in flight with:
>>> # ceph daemon mds.icadmin011 dump_ops_in_flight
>>>
>>> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>
>>> > Hi,
>>> >
>>> > we are running a cephfs cluster with the following version:
>>> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>>> >
>>> > Several MDSs are reporting slow requests:
>>> > HEALTH_WARN 4 MDSs report slow requests
>>> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
>>> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
>>> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
>>> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
>>> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
>>> >
>>> > According to Quincy's documentation
>>> > (https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can be
>>> > investigated by issuing:
>>> > ceph mds.icadmin011 dump cache /tmp/dump.txt
>>> >
>>> > Unfortunately, this command fails:
>>> > no valid command found; 10 closest matches:
>>> > pg stat
>>> > pg getmap
>>> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>>> > pg dump_json [all|summary|sum|pools|osds|pgs...]
>>> > pg dump_pools_json
>>> > pg ls-by-pool <poolstr> [<states>...]
>>> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>>> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>>> > pg ls [<pool:int>] [<states>...]
>>> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
>>> > Error EINVAL: invalid command
>>> >
>>> > I imagine that it is related to the fact that we are running the Pacific
>>> > version and not the Quincy version.
>>> >
>>> > When looking at Pacific's documentation
>>> > (https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
>>> > > Use the ops admin socket command to list outstanding metadata operations.
>>> >
>>> > Unfortunately, I fail to really understand what I'm supposed to do. Can
>>> > someone give a pointer?
>>> >
>>> > Best,
>>> >
>>> > Emmanuel
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>> --
>>> Milind
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Milind
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
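
For readers following this thread with a non-default cluster name: the MDS admin socket can also be addressed directly by path instead of by daemon name. A minimal sketch, assuming the default socket location under /var/run/ceph/ and the "floki" cluster name used above (the exact .asok filename is an assumption; adjust it to whatever exists on your MDS host):

# find the local MDS admin socket
ls /var/run/ceph/floki-mds.*.asok
# same commands as above, but pointed straight at the socket
ceph --admin-daemon /var/run/ceph/floki-mds.icadmin011.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/floki-mds.icadmin011.asok dump cache /tmp/dump.txt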