Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Tom W <Tom.W@xxxxxxxxxxxx> · Tue, 17 Jul 2018 22:05:35 +0000

Hi Bryan,

That’s unusual, and not something I can really begin to unravel. As another pointer, perhaps run a PG query on some of the inactive and peering PGs to see whether it gives any useful output?
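
A minimal sketch of that PG query, wrapped in two small shell functions. The function names and the PG ID in the usage comment are placeholders, not values from this cluster; the two `ceph` subcommands are the standard ones for listing stuck PGs and querying a single PG.

```shell
# Sketch only: list stuck PGs, then query one of them.
# Function names are hypothetical helpers, not part of Ceph.

# List PGs that are stuck in an inactive state.
list_inactive_pgs() {
    ceph pg dump_stuck inactive
}

# Dump the full state of one PG as JSON. The recovery_state section
# of the output usually indicates what peering is waiting on.
query_pg() {
    ceph pg "$1" query
}

# Usage (placeholder PG ID, take a real one from list_inactive_pgs):
#   list_inactive_pgs
#   query_pg 1.2f
```

Querying two or three PGs from different pools is usually enough to see whether they are all blocked on the same OSDs.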

I suspect from what you’ve posted that most PGs are simply in a down and peering state, and they can’t peer because the OSDs are still down. The nodown flag doesn’t seem to have fixed that, but then again the PGs can’t peer if the OSDs actually are down, which nodown will mask.

Is pausing all cluster IO an option for you? My thinking here is to pause all IO, completely restart, and verify that all OSDs are back up and operational. If they fail to come up while IO is paused, that will rule out any spiking load, but this seems to be more of a network issue, as even peering would normally generate some volume of traffic as it cycles to reattempt.
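
A rough sketch of the pause/resume part, assuming the cluster-wide `pause` OSD flag is what is meant by pausing IO. The function names are made up for illustration, and the restart step is left out because it depends on how Rook manages the OSD pods.

```shell
# Sketch only: pause client IO, restart OSDs by whatever means the
# deployment uses, check the OSD summary, then resume IO.
# Function names are hypothetical helpers, not part of Ceph.

# Stop client reads and writes cluster-wide.
pause_cluster_io() {
    ceph osd set pause
}

# Allow client reads and writes again.
resume_cluster_io() {
    ceph osd unset pause
}

# One-line summary of how many OSDs are up and in.
show_osd_summary() {
    ceph osd stat
}

# Usage:
#   pause_cluster_io
#   ... restart the OSDs ...
#   show_osd_summary
#   resume_cluster_io
```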

I’m not familiar at all with Rook or Kubernetes at this stage, so I also have concerns over how the networking stack there works. MTU has been a problem in the past, but in my mind this would only affect performance and not operation. Another thing to check is whether the nodes can reach each other on the right interfaces: can you definitely traverse both the public and cluster networks successfully?
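
One way to test both reachability and MTU in a single step is a ping with the "don't fragment" bit set, sized to fill the expected MTU. This is a sketch assuming Linux iputils `ping` and IPv4; the function name and the addresses in the usage comment are placeholders. If a switch or overlay on the path has a smaller MTU than the hosts, these pings fail even though ordinary small pings succeed.

```shell
# check_path_mtu HOST MTU
# Sends three pings with "don't fragment" set, sized so the whole
# IPv4 packet equals MTU (payload = MTU - 20 byte IP header
# - 8 byte ICMP header). Hypothetical helper, not part of Ceph.
check_path_mtu() {
    host=$1
    mtu=$2
    payload=$((mtu - 28))
    ping -M do -c 3 -s "$payload" "$host"
}

# Usage (placeholder addresses): run once per network, from each OSD
# host to a peer's public address and to its cluster address.
#   check_path_mtu 192.0.2.10 9000
#   check_path_mtu 198.51.100.10 1500
```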

Tom

From: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Sent: 17 July 2018 22:36
To: Tom W <Tom.W@xxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Hi Tom,

I tried to check out the ops in flight as you suggested but this seems to just hang:
root@rook-ceph-osd-carg-kubelet-osd02-m9rhx:/# ceph --admin-daemon /var/lib/rook/osd238/rook-osd.238.asok daemon osd.238 dump_ops_in_flight

Nothing returns and I don’t get a prompt back.
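
For reference, a sketch of the same request wrapped in `timeout` so that an unresponsive admin socket cannot hold the shell. As far as the Ceph CLI documentation describes them, `ceph daemon osd.N <command>` and `ceph --admin-daemon <socket> <command>` are two alternative ways of addressing the same socket; the command above combines both, which may or may not be related to the hang. The function names and the 10 second default are assumptions for illustration.

```shell
# Sketch only: query ops in flight with a time limit.
# Function names are hypothetical helpers, not part of Ceph.

# dump_ops_with_timeout OSD_ID [SECONDS]
# Addresses the daemon by name; must run where its socket is visible.
dump_ops_with_timeout() {
    osd_id=$1
    limit=${2:-10}
    timeout "$limit" ceph daemon "osd.$osd_id" dump_ops_in_flight
}

# dump_ops_via_socket SOCKET_PATH [SECONDS]
# Same request, addressed by admin socket path instead of daemon name.
dump_ops_via_socket() {
    sock=$1
    limit=${2:-10}
    timeout "$limit" ceph --admin-daemon "$sock" dump_ops_in_flight
}

# Usage:
#   dump_ops_with_timeout 238
#   dump_ops_via_socket /var/lib/rook/osd238/rook-osd.238.asok 30
```

With GNU coreutils, `timeout` exits with status 124 when the limit is reached, which distinguishes a hung daemon from a command that returned an error.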

The cluster is somewhat new, but has been running without any major issues for more than a week or so.  We’re not even sure how this all started.

I’m happy to provide more details of our deployment if you or others need anything.

We haven’t changed anything today/recently.  I think you’re correct that unsetting ‘nodown’ will just return things to the previous state.

Thanks!
-Bryan

From: Tom W [mailto:Tom.W@xxxxxxxxxxxx]
Sent: Tuesday, July 17, 2018 4:19 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Note: External Email

Hi Bryan,

OSDs may not truly be up; this flag merely prevents them from being marked down even if they are unresponsive. It may be worth unsetting nodown as soon as you are confident, but unsetting it before anything changes will just return the cluster to the previous state. Perhaps not harmful, but I have no oversight of your deployment, nor am I an expert in any regard.
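
For completeness, a small sketch of setting, checking, and clearing the flag. The function names are hypothetical; the flag check assumes that `ceph osd dump` prints a line beginning with `flags` that lists the flags currently set.

```shell
# Sketch only: manage and inspect the nodown flag.
# Function names are hypothetical helpers, not part of Ceph.

# Stop the monitors from marking OSDs down.
set_nodown() {
    ceph osd set nodown
}

# Return to normal failure detection.
unset_nodown() {
    ceph osd unset nodown
}

# Print the flags line from the OSD map to confirm what is set.
show_osd_flags() {
    ceph osd dump | grep '^flags'
}

# Usage:
#   set_nodown
#   show_osd_flags
#   unset_nodown
```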

Find an OSD which is up and having issues peering, and perhaps try something like this:

ceph daemon osd.x dump_ops_in_flight

Replace x with the OSD number; I am curious to see what may be holding it up. I assume you have already done the usual tests to ensure it is traversing the right interface, is on the correct VLANs, and is reachable via ICMP, and perhaps even run iperf and tcpdump to be certain the flow is as expected.

Tom

From: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Sent: 17 July 2018 22:03
To: Tom W <Tom.W@xxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Hi Tom,

Decided to try your suggestion of the ‘nodown’ setting and this indeed has gotten all of the OSDs up and they haven’t failed out like before.  However, the PGs are in bad states and Ceph doesn’t seem interested in starting recovery over the last 30 minutes since the latest health message was reported:

2018-07-17 20:29:00.638398 mon.rook-ceph-mon7 [WRN] Health check update: 1/8884343 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:00.864863 mon.rook-ceph-mon7 [INF] osd.221 7.129.220.49:6957/30346 boot
2018-07-17 20:29:01.907855 mon.rook-ceph-mon7 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-07-17 20:29:02.598518 mon.rook-ceph-mon7 [INF] osd.238 7.129.220.49:6923/30330 boot
2018-07-17 20:29:02.988546 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10895 pgs inactive, 6514 pgs down, 4391 pgs peering, 2 pgs stale (PG_AVAILABILITY)
2018-07-17 20:29:04.380454 mon.rook-ceph-mon7 [WRN] Health check update: Degraded data redundancy: 10/8884349 objects degraded (0.000%), 6 pgs degraded, 80 pgs undersized (PG_DEGRADED)
2018-07-17 20:29:08.319073 mon.rook-ceph-mon7 [WRN] Health check update: 1/8884349 objects misplaced (0.000%) (OBJECT_MISPLACED)
2018-07-17 20:29:08.319103 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10893 pgs inactive, 6391 pgs down, 4515 pgs peering, 1 pg stale (PG_AVAILABILITY)
2018-07-17 20:29:13.319406 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10893 pgs inactive, 6354 pgs down, 4552 pgs peering (PG_AVAILABILITY)
2018-07-17 20:29:14.044696 mon.rook-ceph-mon7 [WRN] Health check update: 123 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:20.277493 mon.rook-ceph-mon7 [WRN] Health check update: 129 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:27.344834 mon.rook-ceph-mon7 [WRN] Health check update: 135 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 20:29:54.516115 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10899 pgs inactive, 6354 pgs down, 4552 pgs peering (PG_AVAILABILITY)
2018-07-17 20:30:03.322101 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10907 pgs inactive, 6354 pgs down, 4553 pgs peering (PG_AVAILABILITY)

Nothing since then, which was 30 min ago.  Hosts are basically idle.

I’m thinking of unsetting the ‘nodown’ flag now to see what it does, but are there any other recommendations here before I do that?

Thanks again!
-Bryan

From: Tom W [mailto:Tom.W@xxxxxxxxxxxx]
Sent: Tuesday, July 17, 2018 1:58 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Prior to the OSD being marked as down by the cluster, do you notice the PGs becoming inactive on it? Using a flag such as nodown may prevent OSDs flapping, and if it helps reduce the IO load you can see whether things stabilise, but be wary of this flag as I believe PGs using the OSD as the primary will not fail over to another OSD while nodown is set.

My thought here, although I am shooting in the dark a little with this theory, is that individual OSDs are perhaps being overloaded and not returning a heartbeat as a result of the load. When OSDs are marked as down and new maps are distributed, this would add further load, so while the cluster keeps recalculating it may be a vicious cycle, which may be alleviated if it could stabilise.
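
To see what heartbeat timing a given OSD is actually running with, its admin socket can be asked for the relevant options. This is a sketch with a hypothetical function name; `osd_heartbeat_interval` and `osd_heartbeat_grace` are the standard option names, and the "grace 20.01..." values in the monitor log further down the thread are consistent with the usual 20 second grace.

```shell
# show_heartbeat_settings OSD_ID
# Prints the heartbeat interval and grace the running OSD is using.
# Hypothetical helper, not part of Ceph; must run where the OSD's
# admin socket is reachable.
show_heartbeat_settings() {
    osd_id=$1
    for opt in osd_heartbeat_interval osd_heartbeat_grace; do
        ceph daemon "osd.$osd_id" config get "$opt"
    done
}

# Usage:
#   show_heartbeat_settings 12
```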

With networks mainly idle, do you see any spikes at all? Perhaps an OSD comes online, attempts backfill/recovery, and QoS drops the heartbeat packets if that overloads the link?

Just spitballing some ideas here until somebody more qualified may have an idea.

From: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Sent: 17 July 2018 19:18:15
To: Bryan Banister; Tom W; ceph-users@xxxxxxxxxxxxxx
Subject: RE: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

I didn’t find anything obvious in the release notes about this issue we seem to have, but I don’t really understand it.

We have seen logs indicating some kind of heartbeat issue with OSDs, but we don’t believe there are any issues with the networking between the nodes, which are mostly idle as well:

2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903878 I | osd12: 2018-07-17 17:41:32.903798 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6901 osd.221 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903880 I | osd12: 2018-07-17 17:41:32.903800 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6963 osd.222 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903884 I | osd12: 2018-07-17 17:41:32.903803 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6907 osd.224 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
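
All five silent peers in this excerpt share the address 7.129.220.44, so it may help to tally heartbeat failures per peer address across a full OSD log. The following is a sketch only: the function name is made up, the sample is two of the lines above with each entry on a single line, and it assumes the address always follows the word "from" as it does here.

```shell
# Two heartbeat_check entries copied from the log excerpt above.
sample_log='2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 -1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 2018-07-17 17:41:12.903604)'

# Reads OSD log lines on stdin and prints "ADDRESS COUNT" for every
# peer address that failed to answer heartbeats.
summarise_heartbeat_failures() {
    awk '
        /heartbeat_check: no reply from/ {
            for (i = 1; i < NF; i++) {
                if ($i == "from") {
                    # The next field is ADDRESS:PORT; keep the address.
                    split($(i + 1), addr, ":")
                    count[addr[1]]++
                    break
                }
            }
        }
        END {
            for (ip in count) print ip, count[ip]
        }
    ' | sort
}

printf '%s\n' "$sample_log" | summarise_heartbeat_failures
# prints: 7.129.220.44 2
```

If most failures cluster on one or two addresses, that points at those hosts or their network path rather than at the cluster as a whole.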

Is there a way to resolve this issue, which seems to be the root cause of the OSDs being marked as failed?

Thanks in advance for any help,
-Bryan

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Bryan Banister
Sent: Tuesday, July 17, 2018 12:08 PM
To: Tom W <Tom.W@xxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Hi Tom,

We’re apparently running ceph version 12.2.5 on a Rook based cluster.  We have EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives.

I’ll look at the release notes.

Thanks!
-Bryan

From: Tom W [mailto:Tom.W@xxxxxxxxxxxx]
Sent: Tuesday, July 17, 2018 12:05 PM
To: Bryan Banister <bbanister@xxxxxxxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Hi Bryan,

What version of Ceph are you currently running on, and do you run any erasure coded pools or bluestore OSDs? Might be worth having a quick glance over the recent changelogs:

http://docs.ceph.com/docs/master/releases/luminous/

Tom

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Bryan Banister <bbanister@xxxxxxxxxxxxxxx>
Sent: 17 July 2018 18:00:05
To: ceph-users@xxxxxxxxxxxxxx
Subject: Cluster in bad shape, seemingly endless cycle of OSDs failed, then marked down, then booted, then failed again

Hi all,

We’re still very new to managing Ceph and seem to have a cluster that is in an endless loop of failing OSDs, then marking them down, then booting them again.

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has been down for 605 seconds)
2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has been down for 605 seconds)
2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has been down for 605 seconds)
2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has been down for 605 seconds)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 7.129.218.12:6920/90761 boot
2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs degraded, 273 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 66.612054 >= grace 20.010283)
2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 osds down (OSD_DOWN)
2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs degraded, 267 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 osds down (OSD_DOWN)
2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 7.129.217.10:6833/98090 boot
2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 7.129.217.10:6940/98114 boot
2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 7.129.217.10:6899/98091 boot
2018-07-17 16:48:42.427502 mon.rook-ceph-mon7 [INF] osd.95 7.129.217.10:6901/98092 boot
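
Every "failed" report in this excerpt names host carg-kubelet-osd04, so a per-host tally over the full monitor log may show whether the failures are concentrated. This is a sketch only: the function name is hypothetical, the sample is three lines copied from the log above, and it assumes the `host=...)` format shown there.

```shell
# Three monitor log lines copied from the excerpt above.
sample_mon_log='2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491916 >= grace 20.010293)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 7.129.218.12:6920/90761 boot'

# Reads monitor log lines on stdin and prints "HOST COUNT" for every
# host that had OSDs reported as failed.
count_failures_per_host() {
    grep ' failed (' \
        | sed -n 's/.*host=\([^)]*\)).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

printf '%s\n' "$sample_mon_log" | count_failures_per_host
# prints: carg-kubelet-osd04 2
```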

Not sure this is going to fix itself.  Any ideas on how to handle this situation?

Thanks in advance!
-Bryan

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination,
 or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees
 as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of
 transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related
 purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company’s treatment of personal data, please email
datarequests@xxxxxxxxxxxxxxx. 

NOTICE AND DISCLAIMER

This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose. We may monitor all incoming
 and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com