Hi Erich,

Here is how to map the client ID to some extra info:

    ceph tell mds.0 client ls id=99445

Here is how to map the inode ID to the path:

    ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path

On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
>
> Here are some of the MDS logs:
>
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 511.703289 seconds old, received at
> 2024-03-27T18:49:53.623192+0000: client_request(client.99375:459393
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 690.189459 seconds old, received at
> 2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 686.308604 seconds old, received at
> 2024-03-27T18:46:59.017876+0000: client_request(client.99445:4190508
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 686.156943 seconds old, received at
> 2024-03-27T18:46:59.169537+0000: client_request(client.99400:591887
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:26 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16631 from mon.0
> Mar 27 11:58:30 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 699.385743 secs
> Mar 27 11:58:34 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16632 from mon.0
> Mar 27 11:58:35 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 704.385896 secs
> Mar 27 11:58:38 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16633 from mon.0
> Mar 27 11:58:40 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 709.385979 secs
> Mar 27 11:58:42 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16634 from mon.0
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 78 slow requests, 5 included below; oldest blocked for >
> 714.386040 secs
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 710.189838 seconds old, received at
> 2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 706.308983 seconds old, received at
> 2024-03-27T18:46:59.017876+0000: client_request(client.99445:4190508
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 706.157322 seconds old, received at
> 2024-03-27T18:46:59.169537+0000: client_request(client.99400:591887
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 706.086751 seconds old, received at
> 2024-03-27T18:46:59.240108+0000: client_request(client.99400:591894
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.242644+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 705.196030 seconds old, received at
> 2024-03-27T18:47:00.130829+0000: client_request(client.99400:591985
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:47:00.130641+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16635 from mon.0
> Mar 27 11:58:50 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 719.386116 secs
> Mar 27 11:58:53 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16636 from mon.0
> Mar 27 11:58:55 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 724.386184 secs
> Mar 27 11:58:57 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16637 from mon.0
> Mar 27 11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 729.386333 secs
> Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0
> Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 53 slow requests, 5 included below; oldest blocked for >
> 734.386400 secs
> Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 730.190197 seconds old, received at
> 2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
>
> Can we tell which client the slow requests are coming from? It says
> stuff like "client.99445:4189994" but I don't know how to map that to a
> client...
>
> Thanks for the response!
>
> -erich
>
> On 3/27/24 21:28, Xiubo Li wrote:
> >
> > On 3/28/24 04:03, Erich Weiler wrote:
> >> Hi All,
> >>
> >> I've been battling this for a while and I'm not sure where to go from
> >> here. I have a Ceph health warning as such:
> >>
> >> # ceph -s
> >>   cluster:
> >>     id:     58bde08a-d7ed-11ee-9098-506b4b4da440
> >>     health: HEALTH_WARN
> >>             1 MDSs report slow requests
> >
> > There were slow requests. I suspect the "behind on trimming" warning
> > was caused by this.
> >
> > Could you share the logs for the slow requests? What are they?
> >
> > Thanks
> >
> >
> >>             1 MDSs behind on trimming
> >>
> >>   services:
> >>     mon: 5 daemons, quorum
> >> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
> >>     mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
> >>     mds: 1/1 daemons up, 2 standby
> >>     osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
> >>
> >>   data:
> >>     volumes: 1/1 healthy
> >>     pools:   4 pools, 1313 pgs
> >>     objects: 260.72M objects, 466 TiB
> >>     usage:   704 TiB used, 424 TiB / 1.1 PiB avail
> >>     pgs:     1306 active+clean
> >>              4    active+clean+scrubbing+deep
> >>              3    active+clean+scrubbing
> >>
> >>   io:
> >>     client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
> >>
> >> And the specifics are:
> >>
> >> # ceph health detail
> >> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
> >> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >>     mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
> >> 30 secs
> >> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >>     mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
> >> max_segments: 250, num_segments: 13884
> >>
> >> That "num_segments" number slowly keeps increasing. I suspect I just
> >> need to tell the MDS servers to trim faster but after hours of
> >> googling around I just can't figure out the best way to do it. The
> >> best I could come up with was to decrease "mds_cache_trim_decay_rate"
> >> from 1.0 to .8 (to start), based on this page:
> >>
> >> https://www.suse.com/support/kb/doc/?id=000019740
> >>
> >> But it doesn't seem to help, maybe I should decrease it further? I am
> >> guessing this must be a common issue...? I am running Reef on the MDS
> >> servers, but most clients are on Quincy.
> >>
> >> Thanks for any advice!
> >>
> >> cheers,
> >> erich
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>
> >

--
Alexander E. Patrakov
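
With dozens of slow requests in the log, it can help to tally which client IDs and inode numbers appear before running the two lookup commands from the reply. The following is a minimal sketch, not part of the original thread. The function names, the SAMPLE excerpt, and the regular expression are assumptions for illustration: the pattern is written only against the "getattr" lines quoted in this thread and may not match other request types. The script only prints the commands; it does not run them.

```python
import re
from collections import Counter

# Assumed pattern, derived from the log lines quoted in this thread, e.g.
# "client_request(client.99445:4189994 getattr AsXsFs #0x100081b9ceb ..."
SLOW_REQ = re.compile(
    r"client_request\(client\.(\d+):\d+\s+(\w+)\s+\S+\s+#(0x[0-9a-f]+)"
)

# Two entries copied from the quoted MDS log, with mail-style wrapping.
SAMPLE = """\
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 511.703289 seconds old, received at
> 2024-03-27T18:49:53.623192+0000: client_request(client.99375:459393
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
> Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 690.189459 seconds old, received at
> 2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
"""


def summarize_slow_requests(log_text):
    """Count slow-request entries per (client ID, inode) pair."""
    # Undo mail wrapping: strip leading "> " quote markers and join the
    # lines so that each client_request(...) sits on one logical line.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    counts = Counter()
    for client_id, _op, inode in SLOW_REQ.findall(flat):
        counts[(client_id, inode)] += 1
    return counts


def lookup_commands(counts, rank=0):
    """Build the two lookup commands from the reply for each ID seen."""
    cmds = []
    for client_id in sorted({c for c, _ in counts}):
        cmds.append(f"ceph tell mds.{rank} client ls id={client_id}")
    for inode in sorted({i for _, i in counts}):
        cmds.append(f"ceph tell mds.{rank} dump inode {inode} | jq -r .path")
    return cmds


if __name__ == "__main__":
    counts = summarize_slow_requests(SAMPLE)
    for (client_id, inode), n in counts.most_common():
        print(f"client.{client_id} inode {inode}: {n} slow request(s)")
    for cmd in lookup_commands(counts):
        print(cmd)
```

In the logs quoted above, every slow request targets the same inode (0x100081b9ceb), so the inode lookup only needs to be run once, while the client lookup is repeated per client ID.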