Re: Clients failing to respond to capability release

Hi Ivan,

I don't think we're necessarily seeing the same issue. Mine didn't seem
to be related to OSDs, and in fact I could unblock it by killing the
backup job on our backup server and unmounting the filesystem. This
would then release all other stuck ops on the MDS.

I've been waiting to follow up, so as not to tempt fate, but I upgraded
to Quincy at the end of last week and so far I haven't had a single
issue. Whether this is due to a bugfix in the MDS code, or to running
MDS and client versions that are closer to each other (the client was
much newer), I don't know. But for now, I'm relieved.

I did find that enabling snapshots on CephFS exacerbated the problem, so
for now they're disabled. Once I'm happy things are working I'll give
them another go.

Hopefully your detailed response will be useful to others facing similar
issues.

Tim.

On Thu, Oct 12, 2023 at 09:22:03AM +0100, Ivan Clayson wrote:
> Hi Tim,
> 
> We've been seeing something that may be similar to what you describe,
> with concurrent MDS_CLIENT_LATE_RELEASE and MDS_SLOW_REQUEST warnings as
> well as frequent MDS_CLIENT_RECALL and MDS_SLOW_METADATA_IO warnings
> from the same MDS referring to the same client. We are using 1 MDS for
> our non-containerised filesystem on a 5.7 PiB Alma 8.8 cluster with 29
> nodes, 348 spinning OSDs for bulk data, and 4 OSDs on NVMe SSDs for the
> metadata, holding 174 million files (we also have additional NVMe drive
> partitions as DB and WAL devices for each OSD). All of our clients use
> the kernel mount, and we also have an SMB gateway which kernel-mounts
> the filesystem and shares it out to Windows and Mac machines.
> 
> This problem has various symptoms, including but not limited to: (i) a
> particular file or directory hanging on open, read, and/or statx system
> calls for all clients mounting the filesystem; (ii) all our connected
> clients hanging on the aforementioned system calls when performing any
> metadata or bulk data I/O on the filesystem; (iii) a single client being
> unable to stat or cd into the filesystem whilst all other clients are
> completely unaffected; and (iv) in one case a near-complete loss of
> metadata after the MDS journal swelled beyond the size of the metadata
> pool. For that last case we eventually had some success following the
> recovery docs
> (https://docs.ceph.com/en/quincy/cephfs/disaster-recovery-experts/) but
> opted for a backup restore instead; we were able to restore the bulk
> data to a read-only state but not all of the metadata, and we would be
> more than happy to write up a "worked example" of this if the community
> would find it useful. We used to work around these hangs by forcefully
> re-mounting the filesystem on the client, but as we run multi-user
> systems and have 100s of clients, this was not a sustainable option for
> us.
> 
> We generally notice 3 classes of error:
> 
>  * The oldest slow or blocked request in the ceph-mds log is waiting to
>    acquire locks for an inode (visible via `ceph tell mds.N
>    dump_blocked_ops`). By tracing the oldest FS client holding the
>    capability for that file and seeing which OSD the client is waiting
>    for, either with `watch 'cat /sys/kernel/debug/ceph/*/osdc'` or from
>    the output of `ceph tell mds.N objecter requests`, we can determine
>    exactly which OSD is being slow (always a spinning disk); a rough
>    sketch of this workflow is included after this list. These cases are
>    due to hardware faults or to the OSD crashing (a restart seems to
>    remedy the problem), and thus counterintuitively (at least for me)
>    MDS error messages can end up pointing at OSD hard-drive problems,
>    even though those HDDs are in a completely separate pool from where
>    I would have thought the metadata I/O was being performed, since our
>    metadata pool is exclusively SSDs.
>  * The oldest, or near-oldest, blocked MDS op is an "internal op
>    fragmentdir" operation, with many client_requests stuck at the
>    flag_point "failed to authpin, dir is being fragmented". This seems
>    to be an MDS bug, as the documentation suggests this should only
>    happen in a multi-MDS system, but we have been using only 1 MDS. A
>    simple restart of the MDS solves this problem.
>  * A client has crashed and either has been blocklisted or should be.
>    Blocklisted clients may not immediately know they've been evicted
>    and can hang indefinitely when stat-ing the filesystem. This can be
>    detected by searching the client list via `ceph tell mds.N client
>    ls` and checking whether there is still an ID associated with the
>    client. Whilst a manual re-mount will allow the former client to
>    establish a new session, we prefer to use the "recover_session=clean"
>    mount option (see the last line of the sketch after this list) to do
>    this automatically, and we have had no incidents of this since.
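> 
> As a concrete illustration of the first and third cases above, a rough
> sketch of the commands involved looks like the following (the MDS rank,
> OSD number, monitor address and client name are placeholders rather
> than values from a real incident):
> 
>     # Oldest blocked ops on the MDS and the inodes/locks they wait on
>     ceph tell mds.0 dump_blocked_ops
> 
>     # Sessions known to the MDS, to map capabilities back to clients
>     ceph tell mds.0 client ls
> 
>     # On the suspect client: which OSDs its requests are stuck on
>     watch 'cat /sys/kernel/debug/ceph/*/osdc'
> 
>     # Or ask the MDS which objecter requests it has outstanding
>     ceph tell mds.0 objecter requests
> 
>     # Restart the slow OSD identified above (osd.123 is a placeholder)
>     systemctl restart ceph-osd@123
> 
>     # For blocklisted clients: let the kernel client rebuild its session
>     mount -t ceph mon1:6789:/ /mnt/cephfs -o name=backup,recover_session=clean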
> 
> Yesterday, for example, we had an incident that raised multiple MDS
> warnings at once: MDS_SLOW_REQUEST, MDS_TRIM, MDS_CLIENT_RECALL, and
> MDS_CLIENT_LATE_RELEASE. This was caused by a non-responsive hard-drive,
> which led to the MDS cache building up and trims being unable to
> complete. We managed to narrow it down to the drive holding the inode
> that the blocked client was waiting on for a rdlock, and restarted the
> OSD for that drive. Notably this hard-drive had no errors from smartctl
> or elsewhere, and only showed the following slow ops message in the
> OSD's systemctl status / log output:
> 
>     osd.281 123892 get_health_metrics reporting 4 slow ops, oldest is
>     osd_op(client.49576654.0:9101 3.d9bs0 3.ee6f2d9b (undecoded)
>     ondisk+read+known_if_redirected e123889)
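> 
> For anyone wanting to dig into ops like the one above, the OSD's admin
> socket can also be queried on the OSD's host; this is a general
> illustration (reusing osd.281 from the log above) rather than a
> transcript of what we ran:
> 
>     # Ops currently in flight on this OSD
>     ceph daemon osd.281 dump_ops_in_flight
> 
>     # Recently completed ops, with per-stage timestamps
>     ceph daemon osd.281 dump_historic_ops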
> 
> Whilst restarting the OSD for the hard-drive that the client with the
> oldest blocked op was waiting on did "clear" the
> /sys/kernel/debug/ceph/*/osdc queue, the oldest blocked MDS op then
> became an "internal op fragmentdir:mds.0:1" one, which restarting the
> active MDS cleared. Alas, this resulted in another blocked getattr op at
> the flag point "failed to authpin, dir is being fragmented", which was
> similarly tackled by restarting the MDS that had just taken over. That
> left only two clients failing to respond to caps releases on inodes they
> were holding (despite having rebooted at the time); performing a "ceph
> tell mds.N session kill CLIENT_ID" for each removed them from the
> session map and allowed the MDS's cache to become manageable again,
> clearing all of these warning messages.
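> 
> Expressed as commands, that recovery sequence was roughly the following
> (the filesystem name "cephfs", the MDS rank and the client ID are
> placeholders; osd.281 is the OSD from the log above):
> 
>     # Restart the OSD backed by the unresponsive drive
>     systemctl restart ceph-osd@281
> 
>     # Fail the active MDS over to a standby (equivalent to restarting it)
>     ceph mds fail cephfs:0
> 
>     # Remove the sessions of clients failing to release their caps
>     # (the numeric ID comes from `ceph tell mds.0 client ls`)
>     ceph tell mds.0 session kill CLIENT_ID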
> 
> We've had this problem since the beginning of the year, and upgrading
> from Octopus to Quincy has unfortunately not solved it. The only thing
> that has really helped has been an aggressive campaign of replacing
> hard-drives which were reaching the end of their lives; this has
> substantially reduced the number of incidents we've had in relation to
> this.
> 
> We would be very interested to hear about the rest of the community's
> experience with this, and I would recommend looking at your underlying
> OSDs, Tim, to see whether there are any timeout or uncorrectable errors.
> We would also be very eager to hear if these approaches are sub-optimal
> and whether anyone else has any insight into our problems. Sorry as well
> for resurrecting an old thread, but we thought our experiences might be
> helpful for others!
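> 
> For checking the underlying drives, something along these lines is
> usually enough to spot timeouts or pending/uncorrectable sectors (the
> device name is a placeholder and the exact SMART attribute names vary
> by drive):
> 
>     # SMART health, attributes and error log; watch e.g.
>     # Current_Pending_Sector and Offline_Uncorrectable
>     smartctl -a /dev/sdX
> 
>     # Kernel-side I/O errors and command timeouts on the OSD host
>     dmesg -T | grep -iE 'i/o error|timeout|blk_update_request'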
> 
> Kindest regards,
> 
> Ivan Clayson
> 
> On 19/09/2023 12:35, Tim Bishop wrote:
> > Hi,
> > 
> > I've seen this issue mentioned in the past, but with older releases. So
> > I'm wondering if anybody has any pointers.
> > 
> > The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
> > clients are working fine, with the exception of our backup server. This
> > is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
> > (so I suspect a newer Ceph version?).
> > 
> > The backup server has multiple (12) CephFS mount points. One of them,
> > the busiest, regularly causes this error on the cluster:
> > 
> > HEALTH_WARN 1 clients failing to respond to capability release
> > [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
> >      mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112
> > 
> > And occasionally this, which may be unrelated but occurs at the same time:
> > 
> > [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >      mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs
> > 
> > The second one clears itself, but the first sticks until I can unmount
> > the filesystem on the client after the backup completes.
> > 
> > It appears that whilst it's in this stuck state there may be one or more
> > directory trees that are inaccessible to all clients. The backup server
> > is walking the whole tree but never gets stuck itself, so either the
> > directory entry only becomes inaccessible after the backup has gone past
> > it, or the backup simply isn't affected. Maybe the backup server is
> > holding a directory when it shouldn't?
> > 
> > It may be that an upgrade to Quincy resolves this, since that would be
> > more closely in line with the kernel client version-wise, but I don't
> > want to knee-jerk upgrade just to try and fix this problem.
> > 
> > Thanks for any advice.
> > 
> > Tim.
> > 
> > [1] The reason for the newer kernel is that the backup performance from
> > CephFS was terrible with older kernels. This newer kernel does at least
> > resolve that issue.
> -- 
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH



-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



