On Wed, Dec 13, 2017 at 10:11 PM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS
> data pool (not metadata). As things have worked out over time, I'm just
> asking here so that I can better understand what to look out for in the
> future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active
> MDS and one standby MDS. We have a range of machines mounting that
> single CephFS via kernel mounts, using different versions of Linux
> kernels (all at least 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation
> on the (HDD-based, replicated) CephFS data pool, although the actual
> file system usage didn't grow over time and actually decreased
> significantly during that period. The pool allocation went above all
> warn and crit levels, forcing us to add new OSDs (our first three
> BlueStore OSDs - all others are file-based) to relieve pressure, if
> only for some time.
>
> Part of the growth seems to be related to a large nightly compile job
> that was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: once we stopped that job, pool
> allocation growth slowed significantly (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects,
> that is, objects for inodes we could not locate in the live CephFS.

It's likely that some clients held caps on unlinked inodes, which
prevents the MDS from purging the corresponding objects. When a file
gets deleted, the MDS notifies all clients, and the clients are supposed
to drop the corresponding caps if possible. You may have hit a bug in
this area, where some clients failed to drop caps for unlinked inodes.
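One way to check whether a client is pinning an unusually large number
of caps is to ask the active MDS over its admin socket. Below is a
minimal sketch under a few assumptions: it runs on the MDS host, the
daemon name "mds.node1" is a placeholder to adjust, and the "session ls"
fields used ("num_caps", "inst") match Luminous-era output and may
differ in other releases.

#!/usr/bin/env python
# Minimal sketch: list cap counts per client session on an MDS.
# Run on the host carrying the MDS daemon (needs the admin socket).
# "mds.node1" is a placeholder name; "num_caps"/"inst" are taken from
# Luminous-era "session ls" output and may vary by release.
import json
import subprocess

MDS_NAME = "mds.node1"   # hypothetical daemon name - adjust for your cluster

def session_caps(mds_name):
    out = subprocess.check_output(
        ["ceph", "daemon", mds_name, "session", "ls"])
    sessions = json.loads(out)
    # Sort clients by the number of caps they currently hold, largest first.
    return sorted(((s.get("num_caps", 0), s.get("inst", "?"))
                   for s in sessions), reverse=True)

if __name__ == "__main__":
    for caps, inst in session_caps(MDS_NAME):
        print("%10d caps  %s" % (caps, inst))

A session holding a very large cap count while its mount is mostly idle
would be a candidate for the stuck-caps behaviour described above;
remounting, upgrading, or evicting that client would be the usual
follow-up.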
> All the time, we did not notice any significant growth of the metadata
> pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs).
> Except for the fill levels, the cluster was healthy. Restarting the
> MDSs did not help.
>
> Then we had one of the nodes crash for lack of memory (the MDS was > 12 GB,
> plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory leak).
>
> We brought the node back online and at first had the MDS report an
> inconsistent file system, though no other errors were reported. Once we
> restarted the other MDS (by then the active MDS on another node), that
> problem went away, too, and we were back online. We did not restart
> clients, neither CephFS mounts nor rbd clients.
>
> The following day we noticed an ongoing, significant decrease in the
> number of objects in the CephFS data pool. As we couldn't spot any
> actual problems with the content of the CephFS (which was rather stable
> at the time), we sat back and watched - after some hours, the pool
> stabilized at a total size a bit closer to the actual CephFS content
> than before the mass deletion (FS size around 630 GB per "df" output,
> current data pool size about 1100 GB, peak size around 1.3 TB before
> the mass deletion).

There is a reconnect stage while the MDS recovers. To reduce the size of
the reconnect messages, clients aggressively trim unused inodes from
their cache. In your case, most unlinked inodes also got trimmed, so the
MDS could purge the corresponding objects after it recovered.

Regards
Yan, Zheng

> What may it have been that we were watching - some form of garbage
> collection that was triggered by the node outage? Is this something we
> could have triggered manually before, to avoid the free space problems
> we faced? Or is this something unexpected, that should have happened
> auto-magically and much more often, but that for some reason didn't
> occur in our environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
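To watch the purge Yan describes actually happening (strays being
cleaned up after the MDS recovers), one can poll the MDS perf counters.
This is a minimal sketch under the same assumptions as above: it runs on
the MDS host, "mds.node1" is a placeholder, and the counter names under
"mds_cache" ("num_strays", "num_strays_delayed") are taken from
Luminous-era "perf dump" output and may be named differently elsewhere.

#!/usr/bin/env python
# Minimal sketch: poll MDS perf counters for stray/purge progress.
# Run on the MDS host; "mds.node1" is a placeholder daemon name and the
# counter names are assumptions based on Luminous-era output.
import json
import subprocess
import time

MDS_NAME = "mds.node1"   # hypothetical daemon name - adjust for your cluster
INTERVAL = 30            # seconds between samples

def stray_counters(mds_name):
    out = subprocess.check_output(
        ["ceph", "daemon", mds_name, "perf", "dump"])
    cache = json.loads(out).get("mds_cache", {})
    return cache.get("num_strays"), cache.get("num_strays_delayed")

if __name__ == "__main__":
    while True:
        strays, delayed = stray_counters(MDS_NAME)
        print("%s  num_strays=%s  num_strays_delayed=%s"
              % (time.strftime("%H:%M:%S"), strays, delayed))
        time.sleep(INTERVAL)

A steadily falling num_strays alongside a shrinking data pool would
indicate the deferred purging is progressing, while a large, static
num_strays with a growing pool matches the stuck situation described
earlier in this thread.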