Hello,

TL;DR: We have a Nautilus cluster which has been operating without issue for quite some time. Recently one OSD experienced a relatively slow and painful death. The OSD was purged (via the dashboard), replaced, and added back as a new OSD (same ID). Since the rebuild, we have noticed the node hosting the replaced OSD experiencing extraordinary memory consumption (upwards of 95%). The remaining nodes are healthy and operating normally, and there are no adverse performance observations within the cluster. Looking for community assistance in troubleshooting and resolving this anomaly.

Detail:
- Ceph 14.2.11;
- Originally deployed using ceph-ansible 4.0 stable;
- OSD nodes run CentOS 7, have 10 drives, 256G RAM, no swap;
- The problematic OSD ramps memory consumption to 95-99% of available RAM;
- The problematic OSD node appears to ramp RAM utilization until its OSDs crash and restart, producing a continuous saw-tooth pattern (memory leak?);
- Other OSD nodes average around 30-40G steady-state consumption.

Cluster Logs:

2020-08-23 08:36:59.458072 [INF] Cluster is now healthy
2020-08-23 08:36:59.458032 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2274696/100424847 objects degraded (2.265%), 85 pgs degraded)
2020-08-23 08:36:56.352104 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 13 pgs inactive, 4 pgs down, 2 pgs peering, 15 pgs incomplete)
2020-08-23 08:36:56.352062 [WRN] Health check update: Degraded data redundancy: 2320455/100424847 objects degraded (2.311%), 87 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:55.401933 [INF] osd.16 [v2:10.1.10.7:6831/204520,v1:10.1.10.7:6833/204520] boot
2020-08-23 08:36:54.744590 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-23 08:36:53.688833 [WRN] Health check update: 1 osds down (OSD_DOWN)
2020-08-23 08:36:49.714790 [WRN] Health check update: Degraded data redundancy: 6617521/100424847 objects degraded (6.590%), 298 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:49.714769 [WRN] Health check update: Reduced data availability: 17 pgs inactive, 5 pgs down, 2 pgs peering, 41 pgs incomplete (PG_AVAILABILITY)
2020-08-23 08:36:49.701966 [INF] osd.32 [v2:10.1.10.7:6812/204477,v1:10.1.10.7:6813/204477] boot
2020-08-23 08:36:49.701868 [INF] osd.28 [v2:10.1.10.7:6820/204478,v1:10.1.10.7:6821/204478] boot
2020-08-23 08:36:48.613586 [INF] osd.24 [v2:10.1.10.7:6836/204523,v1:10.1.10.7:6837/204523] boot
2020-08-23 08:36:48.540225 [WRN] Health check update: 3 osds down (OSD_DOWN)
2020-08-23 08:36:43.755645 [WRN] Health check failed: Degraded data redundancy: 5700895/100424847 objects degraded (5.677%), 266 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:43.755611 [WRN] Health check failed: Reduced data availability: 12 pgs inactive, 5 pgs down, 37 pgs incomplete (PG_AVAILABILITY)
2020-08-23 08:36:40.049275 [INF] osd.16 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.39)
2020-08-23 08:36:40.047058 [INF] osd.24 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.6)
2020-08-23 08:36:39.452024 [WRN] Health check failed: 2 osds down (OSD_DOWN)
2020-08-23 08:36:39.399920 [INF] osd.32 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.33)
2020-08-23 08:36:39.399255 [INF] osd.28 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.18)
2020-08-23 08:04:08.089438 [INF] Cluster is now healthy
2020-08-23 08:04:08.089399 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1549004/100424847 objects degraded (1.542%), 62 pgs degraded)
2020-08-23 08:04:05.154839 [WRN] Health check update: Degraded data redundancy: 1873878/100424847 objects degraded (1.866%), 68 pgs degraded (PG_DEGRADED)
2020-08-23 08:04:02.728402 [INF] osd.24 [v2:10.1.10.7:6820/202960,v1:10.1.10.7:6821/202960] boot
2020-08-23 08:04:02.572426 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-23 08:03:53.138693 [WRN] Health check update: Degraded data redundancy: 2507272/100424847 objects degraded (2.497%), 88 pgs degraded (PG_DEGRADED)
2020-08-23 08:03:48.049569 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 7 pgs inactive, 59 pgs peering)
2020-08-23 08:03:45.170204 [WRN] Health check failed: Degraded data redundancy: 281695/100424847 objects degraded (0.281%), 15 pgs degraded (PG_DEGRADED)
2020-08-23 08:03:45.170175 [WRN] Health check failed: Reduced data availability: 7 pgs inactive, 59 pgs peering (PG_AVAILABILITY)
2020-08-23 08:03:41.355393 [WRN] Health check failed: 1 osds down (OSD_DOWN)
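
In case it helps with diagnosis, below is a rough sketch of the data I can gather from the problem node and am happy to post. These are standard Nautilus commands; osd.16 is used purely as an example id (the "ceph daemon" commands need to be run on the OSD host itself):

    # effective memory target used by the bluestore cache autotuner
    ceph config get osd.16 osd_memory_target

    # per-daemon view of the memory-related settings, via the admin socket
    ceph daemon osd.16 config show | grep memory

    # mempool breakdown - shows whether bluestore caches, pglog, osdmaps, etc. are growing
    ceph daemon osd.16 dump_mempools

    # tcmalloc heap statistics for the daemon
    ceph tell osd.16 heap stats

I can collect the same output from an OSD on a healthy node for comparison, or anything else that would be useful.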