Hi Milind,

I finally managed to dump the cache and find the file. It generated a 1.5 GB
file with about 7 million lines. It's kind of hard to know what is out of the
ordinary…

Furthermore, I noticed that dumping the cache actually stopped the MDS. Is
that normal behavior?

Best,
Emmanuel

On Thu, May 25, 2023 at 1:19 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:

> Try the command with the --id argument:
>
> # ceph --id admin --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>
> I presume that your keyring has an appropriate entry for the client.admin
> user.
>
> On Wed, May 24, 2023 at 5:10 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>
>> Absolutely! :-)
>>
>> root@icadmin011:/tmp# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>> root@icadmin011:/tmp# ll
>> total 48
>> drwxrwxrwt 12 root root 4096 May 24 13:23  ./
>> drwxr-xr-x 18 root root 4096 Jun  9  2022  ../
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .ICE-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .Test-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .X11-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .XIM-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .font-unix/
>> drwx------  2 root root 4096 May 24 13:23  ssh-Sl5AiotnXp/
>> drwx------  3 root root 4096 May  8 13:26 'systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf'/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i/
>>
>> On Wed, May 24, 2023 at 1:17 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>
>>> I hope the daemon mds.icadmin011 is running on the same machine on which
>>> you are looking for /tmp/dump.txt, since the file is
>>> created on the system which has that daemon running.
>>>
>>> On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>
>>>> Hi Milind,
>>>>
>>>> You are absolutely right.
>>>>
>>>> The dump_ops_in_flight output gives a good hint about what's happening:
>>>> {
>>>>     "ops": [
>>>>         {
>>>>             "description": "internal op exportdir:mds.5:975673",
>>>>             "initiated_at": "2023-05-23T17:49:53.030611+0200",
>>>>             "age": 60596.355186077999,
>>>>             "duration": 60596.355234167997,
>>>>             "type_data": {
>>>>                 "flag_point": "failed to wrlock, waiting",
>>>>                 "reqid": "mds.5:975673",
>>>>                 "op_type": "internal_op",
>>>>                 "internal_op": 5377,
>>>>                 "op_name": "exportdir",
>>>>                 "events": [
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "initiated"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "throttled"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "header_read"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "all_read"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "dispatched"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030657+0200",
>>>>                         "event": "requesting remote authpins"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.050253+0200",
>>>>                         "event": "failed to wrlock, waiting"
>>>>                     }
>>>>                 ]
>>>>             }
>>>>         }
>>>>     ],
>>>>     "num_ops": 1
>>>> }
>>>>
>>>> However, the dump cache command does not seem to produce any output:
>>>> root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>> root@icadmin011:~# ls /tmp
>>>> ssh-cHvP3iF611
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i
>>>>
>>>> Do you have any hint?
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>>
>>>> On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>>>
>>>>> Emmanuel,
>>>>> You probably missed the "daemon" keyword after the "ceph" command name.
>>>>> Here are the docs for Pacific:
>>>>> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>>>>>
>>>>> So, your command should have been:
>>>>> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>>>
>>>>> You could also dump the ops in flight with:
>>>>> # ceph daemon mds.icadmin011 dump_ops_in_flight
>>>>>
>>>>> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > We are running a CephFS cluster with the following version:
>>>>> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
>>>>> > pacific (stable)
>>>>> >
>>>>> > Several MDSs are reporting slow requests:
>>>>> > HEALTH_WARN 4 MDSs report slow requests
>>>>> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
>>>>> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
>>>>> >
>>>>> > According to Quincy's documentation
>>>>> > (https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can
>>>>> > be investigated by issuing:
>>>>> > ceph mds.icadmin011 dump cache /tmp/dump.txt
>>>>> >
>>>>> > Unfortunately, this command fails:
>>>>> > no valid command found; 10 closest matches:
>>>>> > pg stat
>>>>> > pg getmap
>>>>> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>>>>> > pg dump_json [all|summary|sum|pools|osds|pgs...]
>>>>> > pg dump_pools_json
>>>>> > pg ls-by-pool <poolstr> [<states>...]
>>>>> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>>>>> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>>>>> > pg ls [<pool:int>] [<states>...]
>>>>> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...]
>>>>> > [<threshold:int>]
>>>>> > Error EINVAL: invalid command
>>>>> >
>>>>> > I imagine that this is related to the fact that we are running the
>>>>> > Pacific version and not the Quincy version.
>>>>> >
>>>>> > Looking at the Pacific documentation
>>>>> > (https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
>>>>> > > Use the ops admin socket command to list outstanding metadata
>>>>> > > operations.
>>>>> >
>>>>> > Unfortunately, I fail to really understand what I'm supposed to do.
>>>>> > Can someone give me a pointer?
>>>>> >
>>>>> > Best,
>>>>> >
>>>>> > Emmanuel
>>>>> > _______________________________________________
>>>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>
>>>>> --
>>>>> Milind
>>>>
>>>
>>> --
>>> Milind
>>>
>
> --
> Milind
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
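The dump_ops_in_flight JSON quoted in the thread can be reduced to the few fields that identify a stuck operation with standard tools. A minimal sketch, with the caveat that /tmp/ops.json is a hypothetical path and the sample below is an abridged copy of the output shown above, so the filter can be demonstrated without a live MDS:

```shell
# On the MDS host you would generate the file with:
#   ceph daemon mds.icadmin011 dump_ops_in_flight > /tmp/ops.json
# Abridged sample from the thread, so the grep below runs offline:
cat > /tmp/ops.json <<'EOF'
{
    "ops": [
        {
            "description": "internal op exportdir:mds.5:975673",
            "age": 60596.355186077999,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "op_name": "exportdir"
            }
        }
    ],
    "num_ops": 1
}
EOF

# Keep only the fields that say what the op is, how long it has been stuck,
# and where it is blocked:
grep -E '"(description|age|flag_point|op_name)"' /tmp/ops.json
```

The flag_point field ("failed to wrlock, waiting") is usually the most telling one: it names the point in the operation's state machine where progress stopped.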
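As for narrowing down a 1.5 GB cache dump: since the dump is one cache entry per line, grepping for entries with outstanding auth pins or lock waiters is one way to cut 7 million lines down to the interesting few. This is only a sketch under assumptions: the tokens ap=, w=, and waiter are typical of MDS cache-dump output but may differ by version, and the two sample lines here are synthetic stand-ins for real dump entries:

```shell
# On the MDS host you would generate the file with:
#   ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
# Two synthetic sample entries (assumed format) so the filter runs offline:
cat > /tmp/dump.txt <<'EOF'
[inode 0x10000000001 [...2,head] /home/alice/ auth v42 ap=0 ...]
[inode 0x10000000002 [...2,head] /home/bob/build/ auth v57 ap=2 (ifile excl w=1) ...]
EOF

# Keep only entries that are auth-pinned or have lock waiters; everything
# with ap=0 and no waiters is normal cached state and can be ignored:
grep -E 'ap=[1-9]|w=[1-9]|waiter' /tmp/dump.txt
```

Cross-referencing the surviving entries against the reqid from dump_ops_in_flight (mds.5:975673 in this thread) should point at the directory the stuck exportdir operation is waiting on.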