Hi *,
over the last few weeks we noticed some strange behavior of our CephFS
data pool (not the metadata pool). Since things have worked themselves
out over time, I'm mainly asking here so that I can better understand
what to look out for in the future.
This is on a three-node Ceph Luminous (12.2.1) cluster with one active
MDS and one standby MDS. A range of machines mount that single CephFS
via kernel mounts, running various Linux kernel versions (all at least
4.4, with vendor backports).
We observed an ever-increasing number of objects and an ever-growing
space allocation on the (HDD-based, replicated) CephFS data pool, even
though actual file system usage didn't grow over that period and in
fact decreased significantly. The pool allocation went above all warn
and crit levels, forcing us to add new OSDs (our first three BlueStore
OSDs; all others are FileStore-based) to relieve the pressure, if only
for a while.
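For anyone wanting to watch the same thing, a minimal sketch for
sampling the pool usage over time with the python-rados bindings could
look like the following (the pool name 'cephfs_data' and the conffile
path are assumptions, adjust to your environment):

#!/usr/bin/env python
# Minimal sketch: sample object count and bytes of the CephFS data pool
# so its growth can be graphed over time. Pool name and conffile path
# are assumptions, adjust to your environment.
import time
import rados

POOL = 'cephfs_data'  # assumed name of the CephFS data pool

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    stats = ioctx.get_stats()  # per-pool statistics
    print('%s %s objects=%d bytes=%d' % (
        time.strftime('%Y-%m-%d %H:%M:%S'), POOL,
        stats['num_objects'], stats['num_bytes']))
finally:
    ioctx.close()
    cluster.shutdown()

Run from cron, something like this gives a simple time series to put
next to the "df" numbers of the file system.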
Part of the growth seems to be related to a large nightly compile job
that used CephFS through a kernel NFS server re-exporting the
kernel-mounted CephFS to many nodes: once we stopped that job, pool
allocation growth slowed significantly (but didn't stop).
Further diagnosis hinted that the data pool contained many orphan
objects, i.e. objects belonging to inodes we could not locate in the
live CephFS.
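The basic idea of that cross-check is to compare the inode prefixes of
the data pool objects (CephFS names its data objects
"<inode-hex>.<block-hex>") with the inodes visible in the mounted file
system. A sketch of that, not our exact procedure, is below; it assumes
the data pool is called 'cephfs_data' and the FS is kernel-mounted at
/mnt/cephfs, and since the object listing and the FS walk are not
atomic, files created or deleted during the run show up as false
positives:

#!/usr/bin/env python
# Sketch of an orphan check: collect the inode prefixes of all objects
# in the data pool and compare them against the inodes visible in the
# mounted file system. Pool name and mount point are assumptions; the
# listing is not atomic, so treat the result as a hint, not proof.
import os
import rados

POOL = 'cephfs_data'   # assumed data pool name
MOUNT = '/mnt/cephfs'  # assumed kernel mount point

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

# Inode numbers that back at least one object in the data pool.
pool_inodes = set()
for obj in ioctx.list_objects():
    pool_inodes.add(obj.key.split('.')[0])

ioctx.close()
cluster.shutdown()

# Inode numbers reachable through the live file system, in the same
# lower-case hex notation used in the object names.
fs_inodes = set()
for root, dirs, files in os.walk(MOUNT):
    for name in files:
        try:
            fs_inodes.add(format(
                os.lstat(os.path.join(root, name)).st_ino, 'x'))
        except OSError:
            pass

orphans = pool_inodes - fs_inodes
print('%d inodes in pool, %d in FS, %d candidate orphans'
      % (len(pool_inodes), len(fs_inodes), len(orphans)))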
Throughout all of this, we did not notice any significant growth of
the (SSD-based) metadata pool, nor any obvious errors in the logs
(MONs, MDSs, OSDs). Except for the fill levels, the cluster was
healthy. Restarting the MDSs did not help.
Then one of the nodes crashed due to lack of memory (the MDS was using
more than 12 GB, on top of the new BlueStore OSD and probably the
12.2.1 BlueStore memory leak).
We brought the node back online, and at first the MDS reported an
inconsistent file system, though no other errors were reported. Once
we restarted the other MDS (by then the active MDS, on another node),
that problem went away too and we were back online. We did not restart
any clients, neither CephFS mounts nor RBD clients.
The following day we noticed an ongoing, significant decrease in the
number of objects in the CephFS data pool. As we couldn't spot any
actual problems with the content of the CephFS (which was rather
stable at the time), we sat back and watched. After some hours the
pool stabilized at a total size somewhat closer to the actual CephFS
content than before the mass deletion (FS size around 630 GB per "df"
output, current data pool size about 1100 GB, peak size around 1.3 TB
before the mass deletion).
What might we have been watching there: some form of garbage
collection triggered by the node outage? Is this something we could
have triggered manually earlier, to avoid the free-space problems we
faced? Or is this something unexpected that should have happened
automatically and much more often, but for some reason didn't occur in
our environment?
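In case the answer involves the MDS stray/purge machinery: next time
we would watch those counters with something along the lines of the
sketch below (it assumes access to the admin socket on the node
running the active MDS; "a" is a placeholder for the MDS name, and the
exact counter names may differ between releases):

#!/usr/bin/env python
# Sketch: read stray/purge-related counters from the local MDS admin
# socket. Run on the node hosting the active MDS; the MDS name is a
# placeholder and counter names may vary by release.
import json
import subprocess

MDS_NAME = 'a'  # placeholder; use the local MDS daemon name

out = subprocess.check_output(
    ['ceph', 'daemon', 'mds.%s' % MDS_NAME, 'perf', 'dump'])
perf = json.loads(out)

for section in ('mds_cache', 'purge_queue'):
    for key, value in sorted(perf.get(section, {}).items()):
        if 'stray' in key or key.startswith('pq_'):
            print('%s.%s = %s' % (section, key, value))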
Thank you for any ideas and/or pointers you may share.
Regards,
J