Re: Clients failing to respond to capability release

Hi Tim,

We've been seeing something that may be similar to your issue: concurrent MDS_CLIENT_LATE_RELEASE and MDS_SLOW_REQUEST warnings, as well as frequent MDS_CLIENT_RECALL and MDS_SLOW_METADATA_IO warnings from the same MDS referring to the same client. We run a single MDS for our non-containerised filesystem on a 5.7 PiB AlmaLinux 8.8 cluster with 29 nodes, 348 spinning OSDs for bulk data and 4 NVMe SSD OSDs for metadata, holding around 174 million files (we also have additional NVMe drive partitions as DB and WAL devices for each OSD). All of our clients use the kernel mount, including an SMB gateway which kernel-mounts the filesystem and shares it to Windows and Mac machines.

This problem seems to have various symptoms, including but not limited to: (i) a particular file or directory hanging on open, read, and/or statx system calls for all clients mounting the filesystem; (ii) all of our connected clients hanging on those system calls when performing any metadata or bulk-data I/O on the filesystem; (iii) one client being unable to stat or cd into the filesystem whilst all other clients are completely unaffected; and (iv) possibly a near-complete loss of metadata after the MDS journal swelled beyond the size of its pool. For (iv) we eventually had some success following the disaster-recovery docs (https://docs.ceph.com/en/quincy/cephfs/disaster-recovery-experts/) but opted for a backup restore instead (we were able to restore the bulk data to a read-only state but not all of the metadata; we would be more than happy to write up a "worked example" of this if the community would find it useful). We used to work around these hangs by forcefully re-mounting the filesystem on the client, roughly as sketched below, but as we run multi-user systems and have hundreds of clients this was not a sustainable option for us.
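
For what it's worth, the forced re-mount we used to fall back on looked roughly like the sketch below; the mount point, monitor names and credentials are placeholders, so treat this as illustrative rather than a recipe.

    # Force (or lazily) unmount the hung CephFS kernel mount
    umount -f /mnt/cephfs     # may still block if the MDS session is wedged
    umount -l /mnt/cephfs     # lazy unmount as a last resort

    # Re-mount with the kernel client (monitors, client name and secret file
    # are site-specific placeholders)
    mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=myclient,secretfile=/etc/ceph/myclient.secret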

We generally notice three classes of error:

 * The oldest slow or blocked request in the ceph-mds log is waiting
   to acquire locks for an inode (visible via `ceph tell mds.N
   dump_blocked_ops`). By tracing the oldest FS client which holds the
   capability for this file and seeing which OSD that client is
   waiting for, either with `watch 'cat /sys/kernel/debug/ceph/*/osdc'`
   on the client or from the output of `ceph tell mds.N objecter
   requests`, we can determine exactly which OSD is being slow (always
   a spinning disk); a sketch of this workflow is included after the
   list. These cases are down to hardware faults or the OSD crashing
   (where a restart seems to remedy the problem), which
   counterintuitively (at least for me) means these MDS warnings can
   be caused by OSD hard-drive problems. These HDDs are in a
   completely separate pool from where I would have expected the
   metadata I/O to be performed, since our metadata pool is
   exclusively SSDs.
 * The oldest, or near-oldest, blocked MDS op is an "internal op
   fragmentdir" operation, with many client_requests stuck at the
   flag_point "failed to authpin, dir is being fragmented". This looks
   like an MDS bug, as the documentation suggests it should only
   happen in a multi-MDS system, yet we have only ever run a single
   MDS. A simple restart of the MDS solves this problem.
 * A client has crashed and either has been blocklisted or should be.
   Blocklisted clients may not immediately know they have been evicted
   and thus may hang indefinitely when stat-ing the filesystem. This
   can be detected by searching the client list via `ceph tell mds.N
   client ls` and checking whether there is still an ID associated
   with the client. Whilst a manual re-mount will allow the former
   client to establish a new session, we prefer to use the
   "recover_session=clean" mount option (see the example after this
   list) so this happens automatically, and we have had no further
   incidents of this since.
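
For the first class of error, the rough sequence we follow to trace a blocked MDS op back to a slow OSD is sketched below. The MDS rank (0), the OSD ID (281) and the systemd unit name are illustrative placeholders; containerised deployments will restart the OSD differently.

    # 1. Find the oldest blocked request on the active MDS
    ceph tell mds.0 dump_blocked_ops

    # 2. On the client holding the caps for the affected inode, watch which
    #    OSD its in-flight object requests are queued against
    watch 'cat /sys/kernel/debug/ceph/*/osdc'

    # 3. Alternatively, inspect the MDS's own outstanding OSD requests
    ceph tell mds.0 objecter requests

    # 4. Locate the offending OSD's host and restart it (non-containerised)
    ceph osd find 281
    systemctl restart ceph-osd@281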

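For the third class, the "recover_session=clean" kernel mount option we now rely on can be passed at mount time or via fstab. A minimal sketch, with placeholder monitor names, client name and paths, would be:

    # Kernel CephFS mount that re-establishes a clean session automatically
    # after the client has been blocklisted (all names/paths are placeholders)
    mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs \
        -o name=myclient,secretfile=/etc/ceph/myclient.secret,recover_session=clean

    # Equivalent /etc/fstab entry
    # mon1,mon2,mon3:/  /mnt/cephfs  ceph  name=myclient,secretfile=/etc/ceph/myclient.secret,recover_session=clean,_netdev  0 0
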
Yesterday, for example, we had an incident with multiple MDS warnings: MDS_SLOW_REQUEST, MDS_TRIM, MDS_CLIENT_RECALL, and MDS_CLIENT_LATE_RELEASE. This was caused by a non-responsive hard-drive, which led to a build-up of the MDS cache and trims that could not be completed. We managed to narrow it down to the hard-drive holding the inode that the blocked client was waiting on for a rdlock, and restarted the OSD for that drive. Notably this hard-drive showed no errors in smartctl or elsewhere and only had the following slow-ops message in the OSD's systemctl status/log:

    osd.281 123892 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.49576654.0:9101 3.d9bs0 3.ee6f2d9b (undecoded) ondisk+read+known_if_redirected e123889)

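For anyone wanting to check for the same thing, that slow-ops message and the drive's health can be inspected with something like the commands below; osd.281 and the device path are examples, and `dump_historic_slow_ops` is the admin-socket command we believe gives the per-op detail:

    # Slow-ops warnings as reported in the OSD's unit status / journal
    systemctl status ceph-osd@281
    journalctl -u ceph-osd@281 | grep -i "slow ops"

    # Per-op detail via the OSD admin socket (run on the host carrying osd.281)
    ceph daemon osd.281 dump_historic_slow_ops

    # SMART health of the underlying drive (device path is an example)
    smartctl -a /dev/sdq
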
Whilst restarting the OSD for the hard-drive that the client with the oldest blocked op was waiting on did "clear" the /sys/kernel/debug/ceph/*/osdc queue, the oldest blocked MDS op then became an "internal op fragmentdir:mds.0:1", which restarting the active MDS cleared. Alas, this resulted in another blocked getattr op at the flag point "failed to authpin, dir is being fragmented", which was similarly tackled by restarting the MDS that had just taken over. In the end only two clients were still failing to respond to caps release on inodes they were holding (despite rebooting at the time); performing a "ceph tell mds.N session kill CLIENT_ID" removed them from the session map and allowed the MDS's cache to become manageable again, clearing all of these warning messages.
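
For completeness, the session clean-up at the end of that incident amounted to the following; the MDS rank and client ID shown are placeholders:

    # List sessions to find the clients that are failing to release caps
    ceph tell mds.0 client ls

    # Forcibly remove the stuck session (client ID is a placeholder); the
    # client has to establish a new session afterwards
    ceph tell mds.0 session kill 1234567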

We've had this problem since the beginning of this year, and upgrading from Octopus to Quincy has unfortunately not solved it. The only thing that has really helped has been an aggressive campaign of replacing hard-drives that were reaching the end of their lives, which has substantially reduced the number of incidents of this kind.

We would be very interested to hear about the rest of the community's experience with this, and I would recommend looking at your underlying OSDs, Tim, to see whether there are any timeout or uncorrectable errors. We would also be very eager to hear if these approaches are sub-optimal and whether anyone else has any insight into our problems. Apologies as well for resurrecting an old thread, but we thought our experiences might be helpful for others!

Kindest regards,

Ivan Clayson

On 19/09/2023 12:35, Tim Bishop wrote:
Hi,

I've seen this issue mentioned in the past, but with older releases. So
I'm wondering if anybody has any pointers.

The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
clients are working fine, with the exception of our backup server. This
is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
(so I suspect a newer Ceph version?).

The backup server has multiple (12) CephFS mount points. One of them,
the busiest, regularly causes this error on the cluster:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
     mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112

And occasionally, which may be unrelated, but occurs at the same time:

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
     mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs

The second one clears itself, but the first sticks until I can unmount
the filesystem on the client after the backup completes.

It appears that whilst it's in this stuck state there may be one or more
directory trees that are inaccessible to all clients. The backup server
is walking the whole tree but never gets stuck itself, so either the
inaccessible directory entry is caused after it has gone past, or it's
not affected. Maybe the backup server is holding a directory when it
shouldn't?

It may be that an upgrade to Quincy resolves this, since it's more
likely to be inline with the kernel client version wise, but I don't
want to knee-jerk upgrade just to try and fix this problem.

Thanks for any advice.

Tim.

[1] The reason for the newer kernel is that the backup performance from
CephFS was terrible with older kernels. This newer kernel does at least
resolve that issue.



--
Ivan Clayson
-----------------
Scientific Computing Officer
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



