OSD Crash, high RAM usage

Hello



TL;DR

We have a Nautilus cluster that has been operating without issue for quite
some time.  Recently one OSD experienced a relatively slow and painful
death.  The OSD was purged (via the dashboard), replaced and added back as
a new OSD (same ID).  Since the rebuild, the node hosting the replaced OSD
has been showing extraordinary memory consumption (upwards of 95% of RAM).
The remaining nodes are healthy and operating normally, and we see no
adverse performance impact within the cluster.  We are looking for
community assistance in troubleshooting and resolving this anomaly.
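
(The replacement itself was done through the dashboard; for anyone following
along, the rough CLI equivalent of what the dashboard did would be something
like the below, with <id> and /dev/<device> as placeholders rather than our
exact values.)

# remove the dead OSD from the CRUSH map, auth database and OSD map
ceph osd purge <id> --yes-i-really-mean-it

# rebuild the OSD on the replacement drive, reusing the freed ID
ceph-volume lvm create --bluestore --data /dev/<device> --osd-id <id>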



Detail

- Ceph 14.2.11

- Originally deployed using ceph-ansible 4.0 stable;

- OSD nodes run CentOS 7, have 10 drives, 256G RAM, no swap;

- The problematic OSD ramps memory consumption to 95-99% of available RAM;

- The problematic OSD node appears to ramp RAM utilization until its OSDs
crash and restart, a continuous saw-tooth pattern that looks like a memory
leak (inspection commands are sketched after this list);

- Other OSD nodes average around 30-40G steady state consumption.
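
To show where the memory is actually going on the affected node, these are
the kinds of checks that can be run there (osd.16 is just an example ID
taken from the logs below; the "ceph daemon" commands have to run on the
host owning that OSD):

# effective memory target the OSD autotunes its caches against
ceph config get osd osd_memory_target
ceph daemon osd.16 config show | grep memory_target

# per-daemon memory accounting: OSD/BlueStore mempools and the tcmalloc heap
ceph daemon osd.16 dump_mempools
ceph tell osd.16 heap stats

# whether the kernel OOM killer is what is "restarting" the daemons
dmesg -T | grep -i -E 'oom|out of memory'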



Cluster Logs:

2020-08-23 08:36:59.458072 [INF] Cluster is now healthy
2020-08-23 08:36:59.458032 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2274696/100424847 objects degraded (2.265%), 85 pgs degraded)
2020-08-23 08:36:56.352104 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 13 pgs inactive, 4 pgs down, 2 pgs peering, 15 pgs incomplete)
2020-08-23 08:36:56.352062 [WRN] Health check update: Degraded data redundancy: 2320455/100424847 objects degraded (2.311%), 87 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:55.401933 [INF] osd.16 [v2:10.1.10.7:6831/204520,v1:10.1.10.7:6833/204520] boot
2020-08-23 08:36:54.744590 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-23 08:36:53.688833 [WRN] Health check update: 1 osds down (OSD_DOWN)
2020-08-23 08:36:49.714790 [WRN] Health check update: Degraded data redundancy: 6617521/100424847 objects degraded (6.590%), 298 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:49.714769 [WRN] Health check update: Reduced data availability: 17 pgs inactive, 5 pgs down, 2 pgs peering, 41 pgs incomplete (PG_AVAILABILITY)
2020-08-23 08:36:49.701966 [INF] osd.32 [v2:10.1.10.7:6812/204477,v1:10.1.10.7:6813/204477] boot
2020-08-23 08:36:49.701868 [INF] osd.28 [v2:10.1.10.7:6820/204478,v1:10.1.10.7:6821/204478] boot
2020-08-23 08:36:48.613586 [INF] osd.24 [v2:10.1.10.7:6836/204523,v1:10.1.10.7:6837/204523] boot
2020-08-23 08:36:48.540225 [WRN] Health check update: 3 osds down (OSD_DOWN)
2020-08-23 08:36:43.755645 [WRN] Health check failed: Degraded data redundancy: 5700895/100424847 objects degraded (5.677%), 266 pgs degraded (PG_DEGRADED)
2020-08-23 08:36:43.755611 [WRN] Health check failed: Reduced data availability: 12 pgs inactive, 5 pgs down, 37 pgs incomplete (PG_AVAILABILITY)
2020-08-23 08:36:40.049275 [INF] osd.16 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.39)
2020-08-23 08:36:40.047058 [INF] osd.24 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.6)
2020-08-23 08:36:39.452024 [WRN] Health check failed: 2 osds down (OSD_DOWN)
2020-08-23 08:36:39.399920 [INF] osd.32 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.33)
2020-08-23 08:36:39.399255 [INF] osd.28 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.18)
2020-08-23 08:04:08.089438 [INF] Cluster is now healthy
2020-08-23 08:04:08.089399 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1549004/100424847 objects degraded (1.542%), 62 pgs degraded)
2020-08-23 08:04:05.154839 [WRN] Health check update: Degraded data redundancy: 1873878/100424847 objects degraded (1.866%), 68 pgs degraded (PG_DEGRADED)
2020-08-23 08:04:02.728402 [INF] osd.24 [v2:10.1.10.7:6820/202960,v1:10.1.10.7:6821/202960] boot
2020-08-23 08:04:02.572426 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-23 08:03:53.138693 [WRN] Health check update: Degraded data redundancy: 2507272/100424847 objects degraded (2.497%), 88 pgs degraded (PG_DEGRADED)
2020-08-23 08:03:48.049569 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 7 pgs inactive, 59 pgs peering)
2020-08-23 08:03:45.170204 [WRN] Health check failed: Degraded data redundancy: 281695/100424847 objects degraded (0.281%), 15 pgs degraded (PG_DEGRADED)
2020-08-23 08:03:45.170175 [WRN] Health check failed: Reduced data availability: 7 pgs inactive, 59 pgs peering (PG_AVAILABILITY)
2020-08-23 08:03:41.355393 [WRN] Health check failed: 1 osds down (OSD_DOWN)
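
Happy to provide OSD-level detail as well; on the affected host we would
gather it roughly like this (osd.16 and the timestamp are examples, paths
and unit names are the packaging defaults):

# crash/backtrace and restart history for one of the flapping OSDs
journalctl -u ceph-osd@16 --since "2020-08-23 08:30"

# the OSD's own log file
less /var/log/ceph/ceph-osd.16.log

# kernel messages around the crashes, in case the OOM killer is involved
journalctl -k --since "2020-08-23 08:30" | grep -i oom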



