Re: [PATCH] ceph: force updating the msg pointer in non-split case

Hi all,

Joining in here as someone who is hit pretty badly by this and is also unable to upgrade Ceph any time soon to a version that still receives patches.

For these two reasons alone I strongly favour fixing both sides.

Extra reading, feel free to skip.

Additional reasons for fixing both sides are (1) to have more error-tolerant code - if one side breaks or regresses, the other side still knows what to do and can report back while moving on without a fatal crash - and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice.

I personally think that (1) is generally good practice: explicitly handling seemingly unexpected cases increases overall robustness (it's a bit like tidying up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason.

The main reason we actually discovered this is that, under certain rare circumstances, it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is quite possible that many if not all users have these invalid snaptrace messages on their systems, but nothing else happens, so they don't report anything.

The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS, after which we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their systems as often as we do.

Apart from that, it's not even certain that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a Ceph upgrade - extra reason no. 3.

I hope this was a useful point of view from someone suffering from the condition.

Best regards, and thanks for your efforts addressing this!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
