Try the command with the --id argument:
# ceph --id admin --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt

I presume that your keyring has an appropriate entry for the client.admin user.

On Wed, May 24, 2023 at 5:10 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:

> Absolutely! :-)
>
> root@icadmin011:/tmp# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
> root@icadmin011:/tmp# ll
> total 48
> drwxrwxrwt 12 root root 4096 May 24 13:23 ./
> drwxr-xr-x 18 root root 4096 Jun  9  2022 ../
> drwxrwxrwt  2 root root 4096 May  4 12:43 .ICE-unix/
> drwxrwxrwt  2 root root 4096 May  4 12:43 .Test-unix/
> drwxrwxrwt  2 root root 4096 May  4 12:43 .X11-unix/
> drwxrwxrwt  2 root root 4096 May  4 12:43 .XIM-unix/
> drwxrwxrwt  2 root root 4096 May  4 12:43 .font-unix/
> drwx------  2 root root 4096 May 24 13:23 ssh-Sl5AiotnXp/
> drwx------  3 root root 4096 May  8 13:26 'systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf'/
> drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi/
> drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f/
> drwx------  3 root root 4096 May  4 12:43 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i/
>
> On Wed, May 24, 2023 at 1:17 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>
>> I hope the daemon mds.icadmin011 is running on the same machine where you
>> are looking for /tmp/dump.txt, since the file is created on the system
>> that has that daemon running.
>>
>> On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>
>>> Hi Milind,
>>>
>>> you are absolutely right.
>>>
>>> The dump_ops_in_flight output gives a good hint about what's happening:
>>> {
>>>     "ops": [
>>>         {
>>>             "description": "internal op exportdir:mds.5:975673",
>>>             "initiated_at": "2023-05-23T17:49:53.030611+0200",
>>>             "age": 60596.355186077999,
>>>             "duration": 60596.355234167997,
>>>             "type_data": {
>>>                 "flag_point": "failed to wrlock, waiting",
>>>                 "reqid": "mds.5:975673",
>>>                 "op_type": "internal_op",
>>>                 "internal_op": 5377,
>>>                 "op_name": "exportdir",
>>>                 "events": [
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>                         "event": "initiated"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>                         "event": "throttled"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>                         "event": "header_read"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>                         "event": "all_read"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>                         "event": "dispatched"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.030657+0200",
>>>                         "event": "requesting remote authpins"
>>>                     },
>>>                     {
>>>                         "time": "2023-05-23T17:49:53.050253+0200",
>>>                         "event": "failed to wrlock, waiting"
>>>                     }
>>>                 ]
>>>             }
>>>         }
>>>     ],
>>>     "num_ops": 1
>>> }
>>>
>>> However, the dump cache does not seem to produce any output:
>>> root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>>> root@icadmin011:~# ls /tmp
>>> ssh-cHvP3iF611
>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i
>>>
>>> Do you have any hint?
>>>
>>> Best,
>>>
>>> Emmanuel
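If the dump file still does not appear after adding --id, two more things are worth ruling out: whether the command is actually talking to the right admin socket, and whether the MDS wrote the dump into a private /tmp (the systemd-private-...-ceph-mds@icadmin011.service-* directory in the listing above suggests the unit may run with systemd PrivateTmp). A minimal sketch, assuming the default /var/run/ceph/$cluster-$name.asok socket naming for a cluster named floki; neither path is confirmed in the thread:

# see which admin sockets actually exist on this host
ls -l /var/run/ceph/

# point "ceph daemon" at the socket file instead of the daemon name
# (path assumes the default $cluster-$name.asok naming for cluster "floki")
ceph --cluster floki daemon /var/run/ceph/floki-mds.icadmin011.asok dump cache /tmp/dump.txt

# if the ceph-mds unit runs with PrivateTmp=true, the dump lands in the
# service's private tmp directory rather than the host /tmp
ls /tmp/systemd-private-*-ceph-mds@icadmin011.service-*/tmp/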
>>>
>>> On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>>
>>>> Emmanuel,
>>>> You probably missed the "daemon" keyword after the "ceph" command name.
>>>> Here are the docs for Pacific:
>>>> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>>>>
>>>> So, your command should've been:
>>>> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>>
>>>> You could also dump the ops in flight with:
>>>> # ceph daemon mds.icadmin011 dump_ops_in_flight
>>>>
>>>> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > we are running a CephFS cluster with the following version:
>>>> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>>>> >
>>>> > Several MDSs are reporting slow requests:
>>>> > HEALTH_WARN 4 MDSs report slow requests
>>>> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
>>>> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
>>>> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
>>>> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
>>>> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
>>>> >
>>>> > According to the Quincy documentation
>>>> > (https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can be
>>>> > investigated by issuing:
>>>> > ceph mds.icadmin011 dump cache /tmp/dump.txt
>>>> >
>>>> > Unfortunately, this command fails:
>>>> > no valid command found; 10 closest matches:
>>>> > pg stat
>>>> > pg getmap
>>>> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>>>> > pg dump_json [all|summary|sum|pools|osds|pgs...]
>>>> > pg dump_pools_json
>>>> > pg ls-by-pool <poolstr> [<states>...]
>>>> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>>>> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>>>> > pg ls [<pool:int>] [<states>...]
>>>> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...] [<threshold:int>]
>>>> > Error EINVAL: invalid command
>>>> >
>>>> > I imagine that this is related to the fact that we are running the Pacific
>>>> > version and not the Quincy version.
>>>> >
>>>> > When looking at the Pacific documentation
>>>> > (https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
>>>> > > Use the ops admin socket command to list outstanding metadata operations.
>>>> >
>>>> > Unfortunately, I fail to really understand what I'm supposed to do. Can
>>>> > someone give me a pointer?
>>>> >
>>>> > Best,
>>>> >
>>>> > Emmanuel
>>>> > _______________________________________________
>>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>
>>>> --
>>>> Milind
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>
>> --
>> Milind
>>

--
Milind
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
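Regarding the "ops" admin socket command that the Pacific health-messages page points to: a minimal sketch of how it can be exercised, assuming the commands are run as root on the host where the given MDS daemon is active (only the cluster name floki and the daemon names come from the thread above):

# list outstanding metadata operations on one of the MDS daemons reporting
# slow requests; run it on the host where that daemon lives and repeat for
# icadmin015, icadmin006 and icadmin007
ceph --cluster floki daemon mds.icadmin011 ops

# "help" prints every command this admin socket accepts (dump_ops_in_flight,
# "dump cache <path>", and so on), which is handy when the documentation for
# one release and the installed binaries disagree
ceph --cluster floki daemon mds.icadmin011 help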