Hey Greg,

Thanks for your quick responses. I have to leave the office now, but I'll dig deeper tomorrow to try to understand the cause of this. I'll look for other peerings between these two hosts and check those OSDs' logs for potential anomalies. I'll also review any configuration changes that might have affected the host post-reboot. I'll be back with more info once I have it tomorrow.

Thanks again!

George
________________________________________
From: Gregory Farnum [gfarnum@xxxxxxxxxx]
Sent: 08 February 2017 18:29
To: Vasilakakos, George (STFC,RAL,SC)
Cc: Ceph Users
Subject: Re: PG stuck peering after host reboot

On Wed, Feb 8, 2017 at 10:25 AM, <george.vasilakakos@xxxxxxxxxx> wrote:
> Hi Greg,
>
>> Yes, "bad crc" indicates that the checksums on an incoming message did
>> not match what was provided, i.e. the message got corrupted. You
>> shouldn't try to fix that by playing around with the peering settings,
>> as it's not a peering bug.
>> Unless there's a bug in the messaging layer causing this (very
>> unlikely), you have bad hardware or a bad network configuration
>> (people occasionally mention MTU settings). Fix that and things
>> will work; don't, and the only software tweaks you could apply are more
>> likely to result in lost data than a happy cluster.
>> -Greg
>
> I thought of the network initially, but I didn't observe packet loss between the two hosts, and neither host is having trouble talking to the rest of its peers. Only these two OSDs can't talk to each other, so I figured a network issue was unlikely. Network monitoring does show virtually non-existent inbound traffic over those links compared to the other ports on the switch, but no other peerings fail.
>
> Is there something you can suggest to drill down deeper?

Sadly, no. It being a single route is indeed weird, and hopefully somebody with more networking background can suggest a cause.
:)

> Also, am I correct in assuming that, as a last resort, I can pull one of these OSDs from the cluster to cause a remapping to a different OSD, to potentially give this a quick/temporary fix and get the cluster serving I/O properly again?

I'd expect so!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
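[Editor's note: the two operations discussed above, checking for an MTU mismatch on the path between the hosts and pulling one OSD to force a remap, can be sketched with standard Linux and Ceph CLI commands. The hostname `ceph-host-b` and OSD ID `12` below are placeholders, not values from this thread; they must be replaced with the actual peer host and the OSD ID reported in the stuck PG's acting set.]

```shell
# Probe for an MTU mismatch: send a packet sized for jumbo frames
# (9000-byte MTU minus 28 bytes of IP+ICMP headers) with the
# don't-fragment bit set. If this fails while small pings succeed,
# an MTU mismatch along the path is a likely cause of the "bad crc"
# style message failures.
ping -M do -s 8972 -c 3 ceph-host-b

# Last-resort workaround: mark the OSD out so CRUSH remaps its PGs
# to other OSDs. The daemon keeps running, so data can still
# backfill from it.
ceph osd out 12

# Watch the cluster until the affected PG goes active+clean.
ceph -w

# Once the underlying network/hardware issue is fixed, bring the
# OSD back in:
ceph osd in 12
```

Note that `ceph osd out` only changes the OSD's reweight for data placement; it does not stop the daemon, which is what makes it a reasonably safe temporary measure here.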