On Wed, Dec 13, 2017 at 10:11 PM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:
> Hi *,
>
> during the last weeks, we noticed some strange behavior of our CephFS
> data pool (not metadata). As things have worked out over time, I'm just
> asking here so that I can better understand what to look out for in the
> future.
>
> This is on a three-node Ceph Luminous (12.2.1) cluster with one active
> MDS and one standby MDS. We have a range of machines mounting that
> single CephFS via kernel mounts, using different versions of Linux
> kernels (all at least 4.4, with vendor backports).
>
> We observed an ever-increasing number of objects and space allocation
> on the (HDD-based, replicated) CephFS data pool, although the actual
> file system usage didn't grow over time and actually decreased
> significantly during that period. The pool allocation went above all
> warn and crit levels, forcing us to add new OSDs (our first three
> BlueStore OSDs - all others are file-based) to relieve pressure, if
> only for some time.
>
> Part of the growth seems to be related to a large nightly compile job
> that was using CephFS via an NFS server (kernel-based) exposing the
> kernel-mounted CephFS to many nodes: once we stopped that job, pool
> allocation growth slowed significantly (but didn't stop).
>
> Further diagnosis hinted that the data pool had many orphan objects,
> that is, objects for inodes we could not locate in the live CephFS.

It's likely that some clients held caps on unlinked inodes, which
prevents the MDS from purging the corresponding objects. When a file
gets deleted, the MDS notifies all clients, and the clients are supposed
to drop the corresponding caps if possible. You may have hit a bug in
this area, where some clients failed to drop caps for unlinked inodes.
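One way to check whether a client is pinning an unusually large number
of caps is to ask the active MDS over its admin socket. Below is a
minimal sketch under a few assumptions: it runs on the MDS host, the
daemon name "mds.node1" is a placeholder to adjust, and the "session ls"
fields used ("num_caps", "inst") match Luminous-era output and may
differ in other releases.

#!/usr/bin/env python
# Minimal sketch: list cap counts per client session on an MDS.
# Run on the host carrying the MDS daemon (needs the admin socket).
# "mds.node1" is a placeholder name; "num_caps"/"inst" are taken from
# Luminous-era "session ls" output and may vary by release.
import json
import subprocess

MDS_NAME = "mds.node1"   # hypothetical daemon name - adjust for your cluster

def session_caps(mds_name):
    out = subprocess.check_output(
        ["ceph", "daemon", mds_name, "session", "ls"])
    sessions = json.loads(out)
    # Sort clients by the number of caps they currently hold, largest first.
    return sorted(((s.get("num_caps", 0), s.get("inst", "?"))
                   for s in sessions), reverse=True)

if __name__ == "__main__":
    for caps, inst in session_caps(MDS_NAME):
        print("%10d caps  %s" % (caps, inst))

A session holding a very large cap count while its mount is mostly idle
would be a candidate for the stuck-caps behaviour described above;
remounting, upgrading, or evicting that client would be the usual
follow-up.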
> All the time, we did not notice any significant growth of the metadata
> pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS, OSDs).
> Except for the fill levels, the cluster was healthy. Restarting the
> MDSs did not help.
>
> Then we had one of the nodes crash for lack of memory (the MDS was > 12 GB,
> plus the new BlueStore OSD and probably the 12.2.1 BlueStore memory leak).
>
> We brought the node back online and at first had the MDS report an
> inconsistent file system, though no other errors were reported. Once we
> restarted the other MDS (by then the active MDS on another node), that
> problem went away, too, and we were back online. We did not restart
> clients, neither CephFS mounts nor rbd clients.
>
> The following day we noticed an ongoing, significant decrease in the
> number of objects in the CephFS data pool. As we couldn't spot any
> actual problems with the content of the CephFS (which was rather stable
> at the time), we sat back and watched - after some hours, the pool
> stabilized at a total size a bit closer to the actual CephFS content
> than before the mass deletion (FS size around 630 GB per "df" output,
> current data pool size about 1100 GB, peak size around 1.3 TB before
> the mass deletion).

There is a reconnect stage while the MDS recovers. To reduce the size of
the reconnect messages, clients aggressively trim unused inodes from
their cache. In your case, most unlinked inodes also got trimmed, so the
MDS could purge the corresponding objects after it recovered.

Regards
Yan, Zheng

> What may it have been that we were watching - some form of garbage
> collection that was triggered by the node outage? Is this something we
> could have triggered manually before, to avoid the free space problems
> we faced? Or is this something unexpected, that should have happened
> auto-magically and much more often, but that for some reason didn't
> occur in our environment?
>
> Thank you for any ideas and/or pointers you may share.
>
> Regards,
> J
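To watch the purge Yan describes actually happening (strays being
cleaned up after the MDS recovers), one can poll the MDS perf counters.
This is a minimal sketch under the same assumptions as above: it runs on
the MDS host, "mds.node1" is a placeholder, and the counter names under
"mds_cache" ("num_strays", "num_strays_delayed") are taken from
Luminous-era "perf dump" output and may be named differently elsewhere.

#!/usr/bin/env python
# Minimal sketch: poll MDS perf counters for stray/purge progress.
# Run on the MDS host; "mds.node1" is a placeholder daemon name and the
# counter names are assumptions based on Luminous-era output.
import json
import subprocess
import time

MDS_NAME = "mds.node1"   # hypothetical daemon name - adjust for your cluster
INTERVAL = 30            # seconds between samples

def stray_counters(mds_name):
    out = subprocess.check_output(
        ["ceph", "daemon", mds_name, "perf", "dump"])
    cache = json.loads(out).get("mds_cache", {})
    return cache.get("num_strays"), cache.get("num_strays_delayed")

if __name__ == "__main__":
    while True:
        strays, delayed = stray_counters(MDS_NAME)
        print("%s  num_strays=%s  num_strays_delayed=%s"
              % (time.strftime("%H:%M:%S"), strays, delayed))
        time.sleep(INTERVAL)

A steadily falling num_strays alongside a shrinking data pool would
indicate the deferred purging is progressing, while a large, static
num_strays with a growing pool matches the stuck situation described
earlier in this thread.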