We're running a rook-ceph cluster that has gotten stuck in "1 MDSs behind on trimming".

* 1 filesystem, three active MDS daemons, each with a standby
* Quite a few files (20M objects) and daily snapshots; this might be part of the problem?
* Ceph Pacific 16.2.4
* `ceph health detail` doesn't provide much help (see below)
* num_segments is very slowly increasing over time
* Restarting all of the MDSs brings us right back to the same point
* Moderate CPU usage on each MDS (~30% for the stuck one, ~80% of a core for the others)
* The log for the stuck MDS looks clean: it hits rejoin_joint_start and then just prints the standard "updating MDS map to version XXX" messages
* `ceph daemon mds.x ops` shows no active ops on any of the MDS daemons
* `mds_log_max_segments` is set to 128; setting it higher makes the warning go away, but the filesystem remains degraded, and setting it back to 128 shows num_segments has not changed
* I've tried playing around with other MDS settings based on various posts on this list and elsewhere, to no avail
* `cephfs-journal-tool journal inspect` for each rank says journal integrity is fine (rough invocations are in the P.S. at the end)

Something similar happened last week, and (probably by accident, by removing/adding nodes?) I got the MDSs to start recovering and the filesystem went back to healthy. I'm at a bit of a loss for what else to try.

Thanks!
Zack

`ceph health detail`:

HEALTH_WARN mons are allowing insecure global_id reclaim; 1 filesystem is degraded; 1 MDSs behind on trimming; mon x is low on available space
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.x has auth_allow_insecure_global_id_reclaim set to true
    mon.ad has auth_allow_insecure_global_id_reclaim set to true
    mon.af has auth_allow_insecure_global_id_reclaim set to true
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs myfs is degraded
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.myfs-d(mds.2): Behind on trimming (340/128) max_segments: 128, num_segments: 340
[WRN] MON_DISK_LOW: mon x is low on available space
    mon.x has 22% avail

`ceph config get mds`:

WHO     MASK  LEVEL     OPTION                              VALUE        RO
global        basic     log_file                                         *
global        basic     log_to_file                         false
mds           basic     mds_cache_memory_limit              17179869184
mds           advanced  mds_cache_trim_decay_rate           1.000000
mds           advanced  mds_cache_trim_threshold            1048576
mds           advanced  mds_log_max_segments                128
mds           advanced  mds_recall_max_caps                 5000
mds           advanced  mds_recall_max_decay_rate           2.500000
global        advanced  mon_allow_pool_delete               true
global        advanced  mon_allow_pool_size_one             true
global        advanced  mon_cluster_log_file
global        advanced  mon_pg_warn_min_per_osd             0
global        advanced  osd_pool_default_pg_autoscale_mode  on
global        advanced  osd_scrub_auto_repair               true
global        advanced  rbd_default_features                3
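
P.S. In case the exact invocations help, here is roughly what I ran for the mds_log_max_segments experiment and the per-rank journal checks. The daemon name (myfs-d, rank 2) and fs name (myfs) are from my cluster; the perf counter section/name is from memory, so treat this as a sketch rather than a literal transcript:

    # Raise the trim limit well above num_segments (the warning clears, but the
    # fs stays degraded), then drop it back; 512 here just stands in for
    # "something larger than 340"
    ceph config set mds mds_log_max_segments 512
    ceph config set mds mds_log_max_segments 128

    # Watch the journal segment count on the stuck rank. This needs the admin
    # socket, so with rook it runs inside the MDS pod; I believe the segment
    # count in the mds_log perf section is the "seg" counter.
    ceph daemon mds.myfs-d perf dump mds_log

    # Journal integrity check, one rank at a time (three active ranks: 0, 1, 2)
    cephfs-journal-tool --rank=myfs:0 journal inspect
    cephfs-journal-tool --rank=myfs:1 journal inspect
    cephfs-journal-tool --rank=myfs:2 journal inspect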