Hi Milind,

I finally managed to dump the cache and find the file. It generated a 1.5 GB
file with about 7 million lines. It's kind of hard to know what is out of the
ordinary…

Furthermore, I noticed that dumping the cache actually stopped the MDS. Is
that normal behavior?

Best,
Emmanuel

On Thu, May 25, 2023 at 1:19 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:

> Try the command with the --id argument:
>
> # ceph --id admin --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>
> I presume that your keyring has an appropriate entry for the client.admin
> user.
>
> On Wed, May 24, 2023 at 5:10 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>
>> Absolutely! :-)
>>
>> root@icadmin011:/tmp# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>> root@icadmin011:/tmp# ll
>> total 48
>> drwxrwxrwt 12 root root 4096 May 24 13:23  ./
>> drwxr-xr-x 18 root root 4096 Jun  9  2022  ../
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .ICE-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .Test-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .X11-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .XIM-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .font-unix/
>> drwx------  2 root root 4096 May 24 13:23  ssh-Sl5AiotnXp/
>> drwx------  3 root root 4096 May  8 13:26 'systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf'/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f/
>> drwx------  3 root root 4096 May  4 12:43  systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i/
>>
>> On Wed, May 24, 2023 at 1:17 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>
>>> I hope the daemon mds.icadmin011 is running on the same machine on which
>>> you are looking for /tmp/dump.txt, since the file is
>>> created on the system which has that daemon running.
>>>
>>> On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>
>>>> Hi Milind,
>>>>
>>>> You are absolutely right.
>>>>
>>>> The dump_ops_in_flight output gives a good hint about what's happening:
>>>> {
>>>>     "ops": [
>>>>         {
>>>>             "description": "internal op exportdir:mds.5:975673",
>>>>             "initiated_at": "2023-05-23T17:49:53.030611+0200",
>>>>             "age": 60596.355186077999,
>>>>             "duration": 60596.355234167997,
>>>>             "type_data": {
>>>>                 "flag_point": "failed to wrlock, waiting",
>>>>                 "reqid": "mds.5:975673",
>>>>                 "op_type": "internal_op",
>>>>                 "internal_op": 5377,
>>>>                 "op_name": "exportdir",
>>>>                 "events": [
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "initiated"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "throttled"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "header_read"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "all_read"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030611+0200",
>>>>                         "event": "dispatched"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.030657+0200",
>>>>                         "event": "requesting remote authpins"
>>>>                     },
>>>>                     {
>>>>                         "time": "2023-05-23T17:49:53.050253+0200",
>>>>                         "event": "failed to wrlock, waiting"
>>>>                     }
>>>>                 ]
>>>>             }
>>>>         }
>>>>     ],
>>>>     "num_ops": 1
>>>> }
>>>>
>>>> However, the dump cache command does not seem to produce any output:
>>>> root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>> root@icadmin011:~# ls /tmp
>>>> ssh-cHvP3iF611
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f
>>>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i
>>>>
>>>> Do you have any hint?
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>>
>>>> On Wed, May 24, 2023 at 10:30 AM Milind Changire <mchangir@xxxxxxxxxx> wrote:
>>>>
>>>>> Emmanuel,
>>>>> You probably missed the "daemon" keyword after the "ceph" command name.
>>>>> Here are the docs for Pacific:
>>>>> https://docs.ceph.com/en/pacific/cephfs/troubleshooting/
>>>>>
>>>>> So, your command should have been:
>>>>> # ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
>>>>>
>>>>> You could also dump the ops in flight with:
>>>>> # ceph daemon mds.icadmin011 dump_ops_in_flight
>>>>>
>>>>> On Wed, May 24, 2023 at 1:38 PM Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > We are running a CephFS cluster with the following version:
>>>>> > ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
>>>>> > pacific (stable)
>>>>> >
>>>>> > Several MDSs are reporting slow requests:
>>>>> > HEALTH_WARN 4 MDSs report slow requests
>>>>> > [WRN] MDS_SLOW_REQUEST: 4 MDSs report slow requests
>>>>> >     mds.icadmin011(mds.5): 1 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin015(mds.6): 2 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin006(mds.4): 8 slow requests are blocked > 30 secs
>>>>> >     mds.icadmin007(mds.2): 2 slow requests are blocked > 30 secs
>>>>> >
>>>>> > According to Quincy's documentation
>>>>> > (https://docs.ceph.com/en/quincy/cephfs/troubleshooting/), this can
>>>>> > be investigated by issuing:
>>>>> > ceph mds.icadmin011 dump cache /tmp/dump.txt
>>>>> >
>>>>> > Unfortunately, this command fails:
>>>>> > no valid command found; 10 closest matches:
>>>>> > pg stat
>>>>> > pg getmap
>>>>> > pg dump [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]
>>>>> > pg dump_json [all|summary|sum|pools|osds|pgs...]
>>>>> > pg dump_pools_json
>>>>> > pg ls-by-pool <poolstr> [<states>...]
>>>>> > pg ls-by-primary <id|osd.id> [<pool:int>] [<states>...]
>>>>> > pg ls-by-osd <id|osd.id> [<pool:int>] [<states>...]
>>>>> > pg ls [<pool:int>] [<states>...]
>>>>> > pg dump_stuck [inactive|unclean|stale|undersized|degraded...]
>>>>> > [<threshold:int>]
>>>>> > Error EINVAL: invalid command
>>>>> >
>>>>> > I imagine that this is related to the fact that we are running the
>>>>> > Pacific version and not the Quincy version.
>>>>> >
>>>>> > Looking at the Pacific documentation
>>>>> > (https://docs.ceph.com/en/pacific/cephfs/health-messages/), I should:
>>>>> > > Use the ops admin socket command to list outstanding metadata
>>>>> > > operations.
>>>>> >
>>>>> > Unfortunately, I fail to really understand what I'm supposed to do.
>>>>> > Can someone give me a pointer?
>>>>> >
>>>>> > Best,
>>>>> >
>>>>> > Emmanuel
>>>>> > _______________________________________________
>>>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>
>>>>> --
>>>>> Milind
>>>>
>>>
>>> --
>>> Milind
>>>
>
> --
> Milind
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
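The dump_ops_in_flight JSON quoted in the thread can be reduced to the few fields that identify a stuck operation with standard tools. A minimal sketch, with the caveat that /tmp/ops.json is a hypothetical path and the sample below is an abridged copy of the output shown above, so the filter can be demonstrated without a live MDS:

```shell
# On the MDS host you would generate the file with:
#   ceph daemon mds.icadmin011 dump_ops_in_flight > /tmp/ops.json
# Abridged sample from the thread, so the grep below runs offline:
cat > /tmp/ops.json <<'EOF'
{
    "ops": [
        {
            "description": "internal op exportdir:mds.5:975673",
            "age": 60596.355186077999,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "op_name": "exportdir"
            }
        }
    ],
    "num_ops": 1
}
EOF

# Keep only the fields that say what the op is, how long it has been stuck,
# and where it is blocked:
grep -E '"(description|age|flag_point|op_name)"' /tmp/ops.json
```

The flag_point field ("failed to wrlock, waiting") is usually the most telling one: it names the point in the operation's state machine where progress stopped.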
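As for narrowing down a 1.5 GB cache dump: since the dump is one cache entry per line, grepping for entries with outstanding auth pins or lock waiters is one way to cut 7 million lines down to the interesting few. This is only a sketch under assumptions: the tokens ap=, w=, and waiter are typical of MDS cache-dump output but may differ by version, and the two sample lines here are synthetic stand-ins for real dump entries:

```shell
# On the MDS host you would generate the file with:
#   ceph daemon mds.icadmin011 dump cache /tmp/dump.txt
# Two synthetic sample entries (assumed format) so the filter runs offline:
cat > /tmp/dump.txt <<'EOF'
[inode 0x10000000001 [...2,head] /home/alice/ auth v42 ap=0 ...]
[inode 0x10000000002 [...2,head] /home/bob/build/ auth v57 ap=2 (ifile excl w=1) ...]
EOF

# Keep only entries that are auth-pinned or have lock waiters; everything
# with ap=0 and no waiters is normal cached state and can be ignored:
grep -E 'ap=[1-9]|w=[1-9]|waiter' /tmp/dump.txt
```

Cross-referencing the surviving entries against the reqid from dump_ops_in_flight (mds.5:975673 in this thread) should point at the directory the stuck exportdir operation is waiting on.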