Re: MDS Behind on Trimming...

Hi,


we have similar problems from time to time. We are running Reef on the servers and the latest Ubuntu 20.04 HWE kernel on the clients.


There are probably two scenarios with slightly different observations:

1. MDS reports slow ops

Some client is holding caps for a certain file or directory and blocks other clients and the MDS from processing the requests in its queue. This usually happens when the same directory is accessed from multiple hosts in our compute cluster, but the actual root cause is still unknown. No rsync involved...

I usually check the complete output of the "dump inode" MDS command. It also contains a list of machines with active capabilities for that inode. I then iterate through the machines in that list and look for obvious problems. Usually it's a machine with a constant load but no active processes (i.e. blocked / spinning kernel threads). User processes accessing the target inode are often stalled or blocked.
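
Roughly, the lookup is something like this (the daemon name and inode number are taken from the logs quoted below; adjust them to your own output, and note that some releases may want the inode number in decimal rather than hex):

  # dump the inode, including the list of clients holding caps on it
  ceph tell mds.slugfs.pr-md-02.sbblqq dump inode 0x1000c097307

  # map the client IDs (e.g. client.99375 from the slow request lines)
  # to hostnames and mount points
  ceph tell mds.slugfs.pr-md-02.sbblqq session ls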

Possible fixes include a restart / failover of the MDS, or a restart of the compute node. Sometimes the method described in the next scenario also helps.
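
A minimal sketch of the MDS part, assuming a cephadm deployment and using the file system / daemon names from this thread:

  # fail over rank 0 of the file system to a standby MDS
  ceph mds fail slugfs:0

  # or restart the active MDS daemon via the orchestrator
  ceph orch daemon restart mds.slugfs.pr-md-02.sbblqq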

2. OSD/MDS/Clients report slow ops

In contrast to the first scenario, this one indicates a problem with data access, not metadata (although in the end, metadata access on the MDS is data access to the metadata pool). Our kernels are built with debug support, so after looking up the affected client via the MDS (as in the first scenario) I check the pending CephFS requests on that client. They are available under /sys/kernel/debug/ceph/<cluster fsid>.<client id>/. The pseudo files osdc and mdsc show the pending requests towards the OSDs and MDSs. Normally a request should not show up again if one of these files is cat(1)'ed a second time; if a request persists, it is probably blocked. If the cluster is otherwise healthy, a restart of the primary OSD (the OSD that handles the request) usually unblocks it. On the MDS side, the "dump_blocked_ops" command prints the necessary information.
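
A minimal sketch of that procedure, with placeholders for the IDs (take the fsid, client id, pool and object name from your own environment and osdc output):

  # on the client: pending OSD and MDS requests of the kernel mount
  cat /sys/kernel/debug/ceph/<fsid>.<client-id>/osdc
  cat /sys/kernel/debug/ceph/<fsid>.<client-id>/mdsc

  # a stuck osdc line names the object; find its PG and primary OSD
  ceph osd map <pool> <object-name>

  # restart the primary OSD (cephadm shown; with packages use
  # "systemctl restart ceph-osd@<id>" on the OSD host)
  ceph orch daemon restart osd.<id>

  # on the MDS side, list the blocked ops
  ceph tell mds.slugfs.pr-md-02.sbblqq dump_blocked_ops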


I've seen both problems on Pacific, Quincy and Reef. I have neither had the time for a complete investigation nor any clue where to start debugging them. In most cases the OSD restart helps, but it's a manual intervention.


Best regards,

Burkhard


On 28.03.24 23:58, Alexander E. Patrakov wrote:
Hello Erich,

What you are experiencing is definitely a bug - but possibly a client
bug. Not sure. Upgrading Ceph packages on the clients, though, will
not help, because the actual CephFS client is the kernel. You can try
upgrading it to the latest 6.8.x (or, better, trying the same workload
from different hosts with the upgraded kernel), but I doubt that it
will help.

On Fri, Mar 29, 2024 at 6:17 AM Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
Could there be an issue with the fact that the servers (MDS, MGR, MON,
OSD) are running reef and all the clients are running quincy?

I can easily enough get the new reef repo in for all our clients (Ubuntu
22.04) and upgrade the clients to reef if that might help..?

On 3/28/24 3:05 PM, Erich Weiler wrote:
I asked the user and they said no, no rsync involved.  Although I
rsync'd 500TB into this filesystem in the beginning without incident, so
hopefully it's not a big deal here.

I'm asking the user what their workflow does to try and pin this down.

Are there any other known reasons why a slow request would start on a
certain inode, then block a bunch of cache segments behind it, until the
MDS is restarted?

Once I restart the MDS daemon that is slow, it shows the cache segments
transfer to the other MDS server and very quickly drop to zero, then
everything is healthy again, the stuck directory in question responds
again and all is well.  Then a few hours later it started happening
again (not always the same directory).

I hope I'm not experiencing a bug, but I can't see what would be causing
this...

On 3/28/24 2:37 PM, Alexander E. Patrakov wrote:
Hello Erich,

Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.

On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
MDS logs show:

Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
3676.400077 secs
Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]:
mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 320 slow requests, 5 included below; oldest blocked for >
3681.400104 secs
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3668.805732 seconds old, received at
2024-03-28T19:41:25.772531+0000: client_request(client.99375:574268
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3667.883853 seconds old, received at
2024-03-28T19:41:26.694410+0000: client_request(client.99390:374844
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3663.724571 seconds old, received at
2024-03-28T19:41:30.853692+0000: client_request(client.99390:375258
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3681.399582 seconds old, received at
2024-03-28T19:41:13.178681+0000: client_request(client.99385:11712080
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to
rdlock, waiting
Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : slow request 3680.508972 seconds old, received at
2024-03-28T19:41:14.069291+0000: client_request(client.99385:11712556
getattr AsXsFs #0x1000c097307 2024-03-28T19:41:14.070764+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr

The client IDs map to several of our cluster nodes but the inode
reference always refers to the same directory in these recent logs:

/private/groups/shapirolab/brock/r2/cactus_coord

That directory does not respond to an 'ls', but other directories
directly above it do just fine.  Maybe it's a bad cache item on the MDS?

# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 1 MDSs
report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest
client/flush tid
       mds.slugfs.pr-md-02.sbblqq(mds.0): Client mustard failing to
advance its oldest client/flush tid.  client_id: 101305
       mds.slugfs.pr-md-01.xdtppo(mds.1): Client  failing to advance its
oldest client/flush tid.  client_id: 101305
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
       mds.slugfs.pr-md-02.sbblqq(mds.0): 201 slow requests are blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
       mds.slugfs.pr-md-02.sbblqq(mds.0): Behind on trimming (4786/250)
max_segments: 250, num_segments: 4786

I think that this is somehow causing the "slow requests", on the nodes
listed in the logs, as that directory is inaccessible.  And maybe the
'behind on trimming' part is also related, as it can't trim past that
inode or something?

If I restart the MDS daemon this will clear (I've done it before).  But
it just comes back.  Often somewhere in the same directory
/private/groups/shapirolab/brock/...[something].

-erich

On 3/28/24 10:11 AM, Erich Weiler wrote:
Here are some of the MDS logs:

Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 511.703289 seconds old, received at
2024-03-27T18:49:53.623192+0000: client_request(client.99375:459393
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 690.189459 seconds old, received at
2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 686.308604 seconds old, received at
2024-03-27T18:46:59.017876+0000: client_request(client.99445:4190508
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 686.156943 seconds old, received at
2024-03-27T18:46:59.169537+0000: client_request(client.99400:591887
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:26 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16631 from mon.0
Mar 27 11:58:30 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
699.385743 secs
Mar 27 11:58:34 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16632 from mon.0
Mar 27 11:58:35 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
704.385896 secs
Mar 27 11:58:38 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16633 from mon.0
Mar 27 11:58:40 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
709.385979 secs
Mar 27 11:58:42 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16634 from mon.0
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 78 slow requests, 5 included below; oldest blocked for >
714.386040 secs
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 710.189838 seconds old, received at
2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.308983 seconds old, received at
2024-03-27T18:46:59.017876+0000: client_request(client.99445:4190508
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.018864+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.157322 seconds old, received at
2024-03-27T18:46:59.169537+0000: client_request(client.99400:591887
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.170644+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 706.086751 seconds old, received at
2024-03-27T18:46:59.240108+0000: client_request(client.99400:591894
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:59.242644+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 705.196030 seconds old, received at
2024-03-27T18:47:00.130829+0000: client_request(client.99400:591985
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:47:00.130641+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr
Mar 27 11:58:45 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16635 from mon.0
Mar 27 11:58:50 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
719.386116 secs
Mar 27 11:58:53 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16636 from mon.0
Mar 27 11:58:55 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
724.386184 secs
Mar 27 11:58:57 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16637 from mon.0
Mar 27 11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
729.386333 secs
Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]:
mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0
Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : 53 slow requests, 5 included below; oldest blocked for >
734.386400 secs
Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 730.190197 seconds old, received at
2024-03-27T18:46:55.137022+0000: client_request(client.99445:4189994
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+0000
caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
getattr

Can we tell which client the slow requests are coming from?  It says
stuff like "client.99445:4189994" but I don't know how to map that to a
client...

Thanks for the response!

-erich

On 3/27/24 21:28, Xiubo Li wrote:
On 3/28/24 04:03, Erich Weiler wrote:
Hi All,

I've been battling this for a while and I'm not sure where to go from
here.  I have a Ceph health warning as such:

# ceph -s
    cluster:
      id:     58bde08a-d7ed-11ee-9098-506b4b4da440
      health: HEALTH_WARN
              1 MDSs report slow requests
There are slow requests. I suspect the behind-on-trimming warning was
caused by them.

Could you share the logs about the slow requests? What are they?

Thanks


1 MDSs behind on trimming

    services:
      mon: 5 daemons, quorum
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
      mgr: pr-md-01.jemmdf(active, since 3w), standbys:
pr-md-02.emffhz
      mds: 1/1 daemons up, 2 standby
      osd: 46 osds: 46 up (since 9h), 46 in (since 2w)

    data:
      volumes: 1/1 healthy
      pools:   4 pools, 1313 pgs
      objects: 260.72M objects, 466 TiB
      usage:   704 TiB used, 424 TiB / 1.1 PiB avail
      pgs:     1306 active+clean
               4    active+clean+scrubbing+deep
               3    active+clean+scrubbing

    io:
      client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr

And the specifics are:

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
      mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are
blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
      mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming
(13884/250)
max_segments: 250, num_segments: 13884

That "num_segments" number slowly keeps increasing.  I suspect I just
need to tell the MDS servers to trim faster but after hours of
googling around I just can't figure out the best way to do it. The
best I could come up with was to decrease "mds_cache_trim_decay_rate"
from 1.0 to .8 (to start), based on this page:

https://www.suse.com/support/kb/doc/?id=000019740

But it doesn't seem to help, maybe I should decrease it further? I am
guessing this must be a common issue...?  I am running Reef on the
MDS servers, but most clients are on Quincy.
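
For reference, roughly what I ran (assuming the centralized config
database; the values are just the ones mentioned above):

  # check the current journal / trimming settings
  ceph config get mds mds_log_max_segments
  ceph config get mds mds_cache_trim_decay_rate

  # the change mentioned above
  ceph config set mds mds_cache_trim_decay_rate 0.8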

Thanks for any advice!

cheers,
erich




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



