Re: Whole cluster flapping

Will Marley <Will.Marley@xxxxxxxxxxxx> · Wed, 8 Aug 2018 14:13:34 +0000

Hi again Frederic,

It may be worth looking at a recovery sleep.
osd recovery sleep
    Description: Time in seconds to sleep before next recovery or backfill op. Increasing this value will slow down recovery operation while client operations will be less impacted.
    Type: Float
    Default: 0

osd recovery sleep hdd
    Description: Time in seconds to sleep before next recovery or backfill op for HDDs.
    Type: Float
    Default: 0.1

osd recovery sleep ssd
    Description: Time in seconds to sleep before next recovery or backfill op for SSDs.
    Type: Float
    Default: 0

osd recovery sleep hybrid
    Description: Time in seconds to sleep before next recovery or backfill op when osd data is on HDD and osd journal is on SSD.
    Type: Float
    Default: 0.025

(Pulled from http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/)
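
To see which of these values a given OSD is actually running with, the options can be read back over its admin socket. This is only a minimal sketch: the function name `show_recovery_sleep` and the `CEPH_BIN` override are made up for illustration, the underscore option names are assumed to be the config spellings of the options quoted above, and it has to run on the host where that OSD lives.

```shell
# Hypothetical helper: print the four recovery sleep settings of one OSD.
# "ceph daemon" talks to the local admin socket, so run this on the node
# that hosts the OSD. CEPH_BIN is an illustrative override for dry runs.
show_recovery_sleep() (
    osd="$1"    # e.g. osd.12
    for opt in osd_recovery_sleep osd_recovery_sleep_hdd \
               osd_recovery_sleep_ssd osd_recovery_sleep_hybrid; do
        ${CEPH_BIN:-ceph} daemon "$osd" config get "$opt"
    done
)

# Example: show_recovery_sleep osd.12
```

Setting `CEPH_BIN=echo` first prints the commands instead of running them, which is a cheap way to check what would be sent.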

When we faced similar issues, running the command
ceph tell osd.* injectargs '--osd-recovery-sleep 2'
allowed the OSDs to respond to heartbeats whilst taking a break between recovery operations. I’d suggest tweaking the sleep time to find a sweet spot.

This may be worth a try, so let us know how you get on.
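
For trying out different sleep values, a small wrapper around the command above can save some typing. Treat it as a sketch rather than a recipe: `set_recovery_sleep` and the `CEPH_BIN` override are hypothetical names, and the only thing it does beyond the original command is reject values that are not plain non-negative numbers.

```shell
# Hypothetical helper: inject a recovery sleep (in seconds) into all OSDs.
# CEPH_BIN is an illustrative override, e.g. CEPH_BIN=echo for a dry run.
set_recovery_sleep() (
    seconds="$1"
    case "$seconds" in
        # reject empty input, non-numeric characters, and malformed decimals
        ''|*[!0-9.]*|*.*.*|.)
            echo "usage: set_recovery_sleep <seconds>" >&2
            exit 1
            ;;
    esac
    # quote osd.* so the shell does not try to expand it as a glob
    ${CEPH_BIN:-ceph} tell 'osd.*' injectargs "--osd-recovery-sleep $seconds"
)

# Example: set_recovery_sleep 0.5
```

One way to use it is to step through a few values (say 0.1, 0.5, 1, 2) and watch whether the heartbeat timeouts stop. Values set with injectargs apply to the running daemons only, so they are gone after an OSD restart unless they are also written to the configuration.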

Regards,
Will

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of Webert de Souza Lima
Sent: 08 August 2018 15:06
To: frederic.cuza@xxxxxx
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Whole cluster flapping

So your OSDs really are too busy to respond to heartbeats.

You'll be facing this for some time, until the cluster load gets lower.

I would run `ceph osd set nodeep-scrub` until the heavy disk IO stops.

Maybe you can schedule deep scrubbing so that it is enabled during the night and disabled in the morning.
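
A rough sketch of that night/day schedule, assuming it is driven from an hourly cron job. The function name `nodeep_scrub_action` and the 22:00 to 06:00 window are made up for illustration; only `ceph osd set nodeep-scrub` and its `unset` counterpart are real commands.

```shell
# Hypothetical helper: decide what to do with the nodeep-scrub flag for a
# given hour (0-23). Deep scrubs are allowed in the night window
# [night_start, night_end), which is assumed to wrap around midnight.
nodeep_scrub_action() (
    hour=${1#0}; hour=${hour:-0}    # strip a leading zero ("08" -> 8)
    night_start=${2:-22}            # illustrative default: 22:00
    night_end=${3:-6}               # illustrative default: 06:00
    if [ "$hour" -ge "$night_start" ] || [ "$hour" -lt "$night_end" ]; then
        echo unset    # night: clear the flag so deep scrubs may run
    else
        echo set      # day: set the flag to hold deep scrubs back
    fi
)

# Example, e.g. from an hourly cron job:
#   ceph osd "$(nodeep_scrub_action "$(date +%H)")" nodeep-scrub
```

Keeping the decision in a function that only prints `set` or `unset` makes it easy to check the window logic without touching the cluster.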

Regards,

Webert Lima

DevOps Engineer at MAV Tecnologia

Belo Horizonte - Brasil

IRC NICK - WebertRLZ

On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric <frederic.cuza@xxxxxx> wrote:

Thanks for the command line. I did take a look at it, but I don’t really know what to look for, my bad.
All this flapping is due to deep-scrub: when it starts on an OSD, things start to go bad.

I set out all the OSDs that were flapping the most (one by one, after rebalancing) and it looks better, even if some OSDs keep going down/up with the same message in the logs:

1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had timed out after 90

(I raised the timeout to 90s from 15s.)

Regards,

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of Webert de Souza Lima
Sent: 07 August 2018 16:28
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Whole cluster flapping

Oops, my bad, you're right.

I don't know how much you will be able to see, but maybe you can dig around the performance counters and see what's happening on those OSDs. Try these:

~# ceph daemonperf osd.XX

~# ceph daemon osd.XX perf dump

Change XX to your OSD number.
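
To avoid typing each OSD number by hand, the ids of the OSDs on a host can be derived from their admin sockets. This is a sketch under one assumption, namely the default socket naming `/var/run/ceph/ceph-osd.<id>.asok`; the function name `osd_ids_from_sockets` is hypothetical.

```shell
# Hypothetical helper: turn admin socket paths into OSD ids.
# Assumes the default socket naming, /var/run/ceph/ceph-osd.<id>.asok.
osd_ids_from_sockets() (
    for sock in "$@"; do
        id=${sock##*/}          # drop the directory part
        id=${id#ceph-osd.}      # drop the "ceph-osd." prefix
        id=${id%.asok}          # drop the ".asok" suffix
        case "$id" in
            ''|*[!0-9]*) continue ;;   # skip anything that is not a plain number
        esac
        echo "$id"
    done
)

# Example: dump the counters of every OSD on this host.
#   for id in $(osd_ids_from_sockets /var/run/ceph/ceph-osd.*.asok); do
#       ceph daemon "osd.$id" perf dump > "perf-osd.$id.json"
#   done
```

Saving one dump per OSD makes it possible to compare a flapping OSD with a healthy one afterwards.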

Regards,

Webert Lima

DevOps Engineer at MAV Tecnologia

Belo Horizonte - Brasil

IRC NICK - WebertRLZ

On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric <frederic.cuza@xxxxxx> wrote:

Pool is already deleted and no longer present in stats.

Regards,

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of Webert de Souza Lima
Sent: 07 August 2018 15:08
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Whole cluster flapping

Frédéric,

Use `ceph df [detail]` to see whether the number of objects in the pool is decreasing.

Regards,

Webert Lima

DevOps Engineer at MAV Tecnologia

Belo Horizonte - Brasil

IRC NICK - WebertRLZ

On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric <frederic.cuza@xxxxxx> wrote:

It’s been over a week now and the whole cluster keeps flapping; it is never the same OSDs that go down.
Is there a way to see the progress of this recovery? (The pool that I deleted has not been listed for a while now.)
In fact, there is a lot of I/O activity on the server where OSDs go down.

Regards,

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of Webert de Souza Lima
Sent: 31 July 2018 16:25
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of IO operations on the disks, and the OSD processes might be too busy to respond to heartbeats, so the mons mark them as down due to no response.

Also check the OSD logs to see whether the OSDs are actually crashing and restarting, and check disk IO usage (e.g. with iostat).
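
For the iostat part, a small filter can pick out the saturated disks from extended statistics. This is a sketch only: the function name `busy_disks` and the 90% threshold are made up, and it assumes that `%util` is the last column of the device table, as it is in the sysstat versions I know of.

```shell
# Hypothetical helper: read "iostat -x" output on stdin and print the
# devices whose utilisation is at or above a threshold (default 90).
# Assumes %util is the last column of each device line.
busy_disks() (
    threshold=${1:-90}
    awk -v thr="$threshold" '
        /^Device/ { in_dev = 1; next }   # header line starts a device table
        NF == 0   { in_dev = 0; next }   # blank line ends it
        in_dev && $NF + 0 >= thr { print $1, $NF }
    '
)

# Example: three 5-second samples, showing only disks at 90% or more.
#   iostat -x 5 3 | busy_disks 90
```

Tracking whether a line belongs to the device table keeps the CPU summary line, whose last column is the idle percentage, from being reported as a busy disk.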

Regards,

Webert Lima

DevOps Engineer at MAV Tecnologia

Belo Horizonte - Brasil

IRC NICK - WebertRLZ

On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric <frederic.cuza@xxxxxx> wrote:

Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool that we had (120 TB).
Our cluster is made of 14 nodes, each with 12 OSDs (1 HDD -> 1 OSD), and we use SSDs for the journals.

After I deleted the large pool, the cluster started flapping on all OSDs.
OSDs are marked down and then marked up, as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=xxxx,host=xxxx) (8 reporters from different host after 54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 slow requests are blocked > 32 sec (REQUEST_SLOW)
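
To see which OSDs flap the most, the boot and failure events in a monitor log like the one above can be tallied per OSD. This is a sketch under stated assumptions: each event sits on a single line in the format shown here, the function name `osd_flap_events` is hypothetical, and `/var/log/ceph/ceph.log` in the example is only the assumed location of the cluster log.

```shell
# Hypothetical helper: count "boot" and "failed" events per OSD from a
# cluster log on stdin. Assumes the line layout shown above:
#   <date> <time> <mon name> [INF] osd.<id> ... boot
#   <date> <time> <mon name> [INF] osd.<id> failed (...)
osd_flap_events() (
    awk '
        $4 == "[INF]" && $5 ~ /^osd\.[0-9]+$/ {
            if ($6 == "failed")    n[$5 " failed"]++
            else if ($NF == "boot") n[$5 " boot"]++
        }
        END { for (k in n) print n[k], k }
    ' | sort -rn    # most frequent events first
)

# Example (log path is an assumption):
#   osd_flap_events < /var/log/ceph/ceph.log | head -n 20
```

Each output line has the form `<count> osd.<id> <event>`, so OSDs that fail or boot repeatedly rise to the top.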

On the OSDs that failed, the logs are full of this kind of message:
2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
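
A quick way to gauge how bad this is on a given OSD is to count these timeout messages per stuck thread. The sketch below assumes the message format shown above; the function name `count_op_tp_timeouts` is hypothetical, and the log path in the example is only the assumed default location.

```shell
# Hypothetical helper: count heartbeat_map timeout messages per thread
# from an OSD log on stdin. Assumes the format shown above, where the
# thread address follows the word "thread".
count_op_tp_timeouts() (
    awk '
        /heartbeat_map is_healthy/ && /had timed out/ {
            for (i = 1; i < NF; i++)
                if ($i == "thread") {
                    t = $(i + 1)
                    gsub(/[^0-9a-fx]/, "", t)   # strip the trailing quote
                    n[t]++
                }
        }
        END { for (t in n) print n[t], t }
    ' | sort -rn    # busiest thread first
)

# Example (log path is an assumption):
#   count_op_tp_timeouts < /var/log/ceph/ceph-osd.97.log
```

A single thread address with a large count, as in the excerpt above, suggests one op thread stuck on a long operation rather than many threads timing out independently.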

At first it seemed like a network issue, but we haven’t changed anything on the network and this cluster has been okay for months.

I can’t figure out what is happening at the moment; any help would be greatly appreciated!

Regards,

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
