Hi Patrick,
On 30.11.23 03:58, Patrick Donnelly wrote:
> I've not yet fully reviewed the logs but it seems there is a bug in
> the detection logic which causes a spurious abort. This does not
> appear to be actually new damage.
We are accessing the metadata (read-only) daily. The issue only popped
up after updating to 17.2.7. Of course, this does not mean that there
was no damage there before, only that it was not detected.
> Are you using postgres?
Not on top of CephFS, no. We do use postgres on some RBD volumes.
> If you can share details about your snapshot
> workflow and general workloads that would be helpful (privately if
> desired).
Our CephFS root looks like this:
/archive
/homes
/no-snapshot
/other-snapshot
/scratch
We are running snapshots on /homes and /other-snapshot with the same
schedule. We mount the filesystem with a kernel client on one of the
Ceph hosts (not running the MDS) and mkdir / rmdir in the .snap
directories as needed (see the sketch after the list below):
- daily between 06:00 and 19:45 UTC (inclusive): create a snapshot every
15 minutes; one hour later, delete it unless it is an hourly (xx:00) one
- daily on the full hour: create a snapshot and delete the snapshot from
24 hours earlier unless it is the midnight one
- daily at midnight: delete the snapshot from 14 days ago unless it is a
Sunday
- every Sunday at midnight: delete the snapshot from 8 weeks ago
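Roughly, the rotation logic looks like the following Python sketch (the
mount point, timestamp-based snapshot names and the cron driver are
simplifications for illustration, not our exact script; CephFS exposes
snapshot creation/removal as mkdir/rmdir inside the hidden .snap
directory):

#!/usr/bin/env python3
# Sketch of the mkdir/rmdir snapshot rotation described above, intended
# to be run from cron every 15 minutes. Paths and names are assumptions.
from datetime import datetime, timedelta, timezone
from pathlib import Path

MOUNT = Path("/mnt/cephfs")               # assumed kernel-client mount point
SNAP_TREES = ["homes", "other-snapshot"]  # both trees use the same schedule


def snap_name(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d_%H%M")


def rmdir_if_exists(path: Path) -> None:
    if path.is_dir():
        path.rmdir()


def rotate(now: datetime) -> None:
    for tree in SNAP_TREES:
        snap_root = MOUNT / tree / ".snap"

        # Create a snapshot every 15 minutes between 06:00 and 19:45 UTC,
        # plus one on every full hour.
        if 6 <= now.hour <= 19 or now.minute == 0:
            (snap_root / snap_name(now)).mkdir()

        # One hour later, drop the quarter-hourly snapshot unless it is an
        # hourly (xx:00) one.
        prev = now - timedelta(hours=1)
        if prev.minute != 0:
            rmdir_if_exists(snap_root / snap_name(prev))

        # On the full hour, drop the snapshot from 24 hours ago unless it
        # is the midnight one.
        if now.minute == 0 and now.hour != 0:
            rmdir_if_exists(snap_root / snap_name(now - timedelta(hours=24)))

        # At midnight, drop the daily snapshot from 14 days ago unless it
        # was taken on a Sunday; every Sunday also drop the 8-week-old one.
        if now.hour == 0 and now.minute == 0:
            two_weeks = now - timedelta(days=14)
            if two_weeks.weekday() != 6:  # 6 == Sunday
                rmdir_if_exists(snap_root / snap_name(two_weeks))
            if now.weekday() == 6:
                rmdir_if_exists(snap_root / snap_name(now - timedelta(weeks=8)))


if __name__ == "__main__":
    rotate(datetime.now(timezone.utc).replace(second=0, microsecond=0))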
Workload is two main Samba servers (one of them only shares a
subdirectory which is generally not accessed through the other). Client
access to those servers is limited to 1 GBit/s each. Until Tuesday, we
also had a mail server with Dovecot running on top of CephFS. It was
migrated to an RBD volume on Tuesday because we had issues with hanging
access to some files / directories (interestingly only in the main tree;
access via the snapshots was fine). Additionally, we have a Nextcloud
instance with ~200 active users storing data in CephFS, as well as some
other kernel clients with little / sporadic traffic: some running Samba,
some NFS, some interactive SSH / x2go servers with direct user access,
and some specialised web applications (notably OMERO).
We run daily incremental backups of most of the CephFS content with
Bareos running on a dedicated server which has the whole CephFS tree
mounted read-only. For most data, a full backup is performed every two
months; for some data, only every six months. The affected area is
contained in this "every six months" full-backup portion of the file
system tree.
Two weeks ago we deleted a folder structure of about 6 TB with an
average file size in the range of 1 GB. The structure was under
/other-snapshot as well. This led to severe load on the MDS, especially
starting at midnight. In conjunction with the Ubuntu kernel mount, we
also had issues with unreleased capabilities preventing read access to
the /other-snapshot part.
To combat these lingering problems, we deleted all snapshots in
/other-snapshot, which led to half a dozen PGs stuck in the snaptrim
state (and a few hundred in snaptrim_wait). Updating from 17.2.6 to
17.2.7 solved that issue quickly; the affected PGs became unstuck and
the whole cluster was active+clean a few hours later.
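We watched the snaptrim backlog simply by counting PGs per state;
something along the lines of this quick sketch (it assumes the ceph CLI
is available and that `ceph pg ls -f json` reports a per-PG "state"
field; the exact JSON layout differs slightly between releases, so both
a plain list and a {"pg_stats": [...]} wrapper are handled):

#!/usr/bin/env python3
# Count placement groups by state, e.g. to watch snaptrim/snaptrim_wait.
import json
import subprocess
from collections import Counter


def pg_state_counts() -> Counter:
    out = subprocess.run(
        ["ceph", "pg", "ls", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    doc = json.loads(out)
    # Newer releases wrap the list in {"pg_stats": [...]}.
    pgs = doc.get("pg_stats", doc) if isinstance(doc, dict) else doc
    return Counter(pg["state"] for pg in pgs)


if __name__ == "__main__":
    for state, count in sorted(pg_state_counts().items()):
        print(f"{count:6d}  {state}")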
For now, I'll hold off on running first-damage.py to try to remove the
affected files / inodes. Ultimately, however, this seems to be the most
sensible solution to me, at least with regard to cluster downtime.
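As far as I understand, first-damage.py scans the dentry omap entries in
the metadata pool. Purely for illustration, this is the kind of data it
looks at; pool and object names here are placeholders, not values from
our cluster, and the actual check/removal will of course be done with
first-damage.py itself:

#!/usr/bin/env python3
# Illustrative only: list the dentry omap keys of one directory object
# in the CephFS metadata pool via librados.
import rados

POOL = "cephfs_metadata"             # assumed metadata pool name
DIR_OBJECT = "10000000000.00000000"  # hypothetical <dir inode>.<frag> object

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        with rados.ReadOpCtx() as read_op:
            # Fetch up to 1000 dentry keys (e.g. "filename_head") and
            # their encoded values for this directory fragment.
            it, _ = ioctx.get_omap_vals(read_op, "", "", 1000)
            ioctx.operate_read_op(read_op, DIR_OBJECT)
            for key, value in it:
                print(key, len(value), "bytes")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()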
> Please give me another day to review then feel free to use
> first-damage.py to cleanup. If you see new damage please upload the
> logs.
We are in no hurry and will probably run first-damage.py sometime next
week. I will report new damage if it comes in.
Cheers
Sebastian
--
Dr. Sebastian Knust | Bielefeld University
IT Administrator | Faculty of Physics
Office: D2-110 | Universitätsstr. 25
Phone: +49 521 106 5234 | 33615 Bielefeld
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx