Re: OSD Crash, high RAM usage

size of hard disks (OSDs)?
quantity of disks (OSDs) per server?
quantity of servers?
SSDs or spinners (OSDs)? 
quantity of pools?
are all pools on all disks?
quantity of PGs? PGPs? (per pool)
paste of ceph.conf variables?
was this a clean install, or upgrade? (previous version(s)?)
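
Most of that can be pulled straight off the cluster. A rough sketch of the
commands I'd run (standard Ceph CLI, from any admin/mon node):

    ceph versions              # confirm all daemons are on 14.2.11
    ceph osd tree              # servers, OSDs per server, device class (hdd/ssd)
    ceph osd df tree           # OSD sizes and current utilization
    ceph osd pool ls detail    # pools, pg_num/pgp_num, crush rule per pool
    ceph config dump           # runtime config overrides, alongside /etc/ceph/ceph.conf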



-Ed

> On Aug 23, 2020, at 8:17 AM, Cloud Guy <cloudguy23@xxxxxxxxx> wrote:
> 
> Hello
> 
> 
> 
> TL;DR
> 
> We have a Nautilus cluster which has been operating without issue for quite
> some time. Recently one OSD experienced a relatively slow and painful
> death. The OSD was purged (via the dashboard), replaced, and added back as a
> new OSD (same ID). After the rebuild, we noticed the node hosting the
> replaced OSD experiencing extraordinary memory consumption (upwards of 95%).
> The remaining nodes are healthy and operating normally, and we see no
> adverse performance within the cluster. Looking for community assistance in
> troubleshooting and resolving this anomaly.
> 
> 
> 
> Detail
> 
> - Ceph 14.2.11
> 
> - Originally deployed using ceph-ansible 4.0 stable;
> 
> - OSD nodes run CentOS 7, have 10 drives, 256G RAM, no swap;
> 
> - Problematic OSD ramps memory consumption to 95-99% of available RAM;
> 
> - The problematic OSD node appears to ramp RAM utilization until its OSDs
> crash and restart, resembling a continuous saw-tooth pattern (memory leak?);
> 
> - Other OSD nodes average around 30-40G steady-state consumption.
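
To see where that RAM is actually going on the affected node, the OSD admin
socket is usually the quickest check. A rough sketch (run on the node hosting
the OSD, assuming the daemons are still responsive; osd.16 is used here only
because it appears in the logs below as one of the affected OSDs):

    ceph daemon osd.16 dump_mempools                          # bluestore cache, pg log, buffer usage
    ceph daemon osd.16 config show | grep osd_memory_target   # per-OSD memory target (Nautilus default ~4G)
    ceph tell osd.16 heap stats                               # heap stats (if built with tcmalloc)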
> 
> 
> 
> Cluster Logs:
> 
> 2020-08-23 08:36:59.458072 [INF] Cluster is now healthy
> 2020-08-23 08:36:59.458032 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2274696/100424847 objects degraded (2.265%), 85 pgs degraded)
> 2020-08-23 08:36:56.352104 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 13 pgs inactive, 4 pgs down, 2 pgs peering, 15 pgs incomplete)
> 2020-08-23 08:36:56.352062 [WRN] Health check update: Degraded data redundancy: 2320455/100424847 objects degraded (2.311%), 87 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:36:55.401933 [INF] osd.16 [v2:10.1.10.7:6831/204520,v1:10.1.10.7:6833/204520] boot
> 2020-08-23 08:36:54.744590 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2020-08-23 08:36:53.688833 [WRN] Health check update: 1 osds down (OSD_DOWN)
> 2020-08-23 08:36:49.714790 [WRN] Health check update: Degraded data redundancy: 6617521/100424847 objects degraded (6.590%), 298 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:36:49.714769 [WRN] Health check update: Reduced data availability: 17 pgs inactive, 5 pgs down, 2 pgs peering, 41 pgs incomplete (PG_AVAILABILITY)
> 2020-08-23 08:36:49.701966 [INF] osd.32 [v2:10.1.10.7:6812/204477,v1:10.1.10.7:6813/204477] boot
> 2020-08-23 08:36:49.701868 [INF] osd.28 [v2:10.1.10.7:6820/204478,v1:10.1.10.7:6821/204478] boot
> 2020-08-23 08:36:48.613586 [INF] osd.24 [v2:10.1.10.7:6836/204523,v1:10.1.10.7:6837/204523] boot
> 2020-08-23 08:36:48.540225 [WRN] Health check update: 3 osds down (OSD_DOWN)
> 2020-08-23 08:36:43.755645 [WRN] Health check failed: Degraded data redundancy: 5700895/100424847 objects degraded (5.677%), 266 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:36:43.755611 [WRN] Health check failed: Reduced data availability: 12 pgs inactive, 5 pgs down, 37 pgs incomplete (PG_AVAILABILITY)
> 2020-08-23 08:36:40.049275 [INF] osd.16 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.39)
> 2020-08-23 08:36:40.047058 [INF] osd.24 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.6)
> 2020-08-23 08:36:39.452024 [WRN] Health check failed: 2 osds down (OSD_DOWN)
> 2020-08-23 08:36:39.399920 [INF] osd.32 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.33)
> 2020-08-23 08:36:39.399255 [INF] osd.28 failed (root=default,host=otr2817e02stor23) (connection refused reported by osd.18)
> 
> 2020-08-23 08:04:08.089438 [INF] Cluster is now healthy
> 2020-08-23 08:04:08.089399 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1549004/100424847 objects degraded (1.542%), 62 pgs degraded)
> 2020-08-23 08:04:05.154839 [WRN] Health check update: Degraded data redundancy: 1873878/100424847 objects degraded (1.866%), 68 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:04:02.728402 [INF] osd.24 [v2:10.1.10.7:6820/202960,v1:10.1.10.7:6821/202960] boot
> 2020-08-23 08:04:02.572426 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2020-08-23 08:03:53.138693 [WRN] Health check update: Degraded data redundancy: 2507272/100424847 objects degraded (2.497%), 88 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:03:48.049569 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 7 pgs inactive, 59 pgs peering)
> 2020-08-23 08:03:45.170204 [WRN] Health check failed: Degraded data redundancy: 281695/100424847 objects degraded (0.281%), 15 pgs degraded (PG_DEGRADED)
> 2020-08-23 08:03:45.170175 [WRN] Health check failed: Reduced data availability: 7 pgs inactive, 59 pgs peering (PG_AVAILABILITY)
> 2020-08-23 08:03:41.355393 [WRN] Health check failed: 1 osds down (OSD_DOWN)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx